All the code (text_similarity.R) and the data used in this documentation (in the data folder) can be found in this folder.
Text Similarity
One of the most common data-mining problems is finding similar items in a given dataset, and this also applies to text data. Examples that come directly to mind include detecting plagiarism, news recommendation (don't show identical news articles to users), and the list goes on.
We'll use three documents as our toy example. The following section loads them in and preprocesses them with the steps below (a code sketch follows the list):
Removing any punctuation marks.
Transforming all letters to lower case.
Collapsing extra white space into a single blank, and trimming white space at the front and the end.
Splitting the text string into separate words.
A link to the dataset is provided at the end.
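Here's a minimal sketch of what those steps might look like in R (the file names are hypothetical stand-ins for the three documents in the data folder):

```{r}
# hypothetical file names standing in for the three toy documents
doc_files <- c("doc1.txt", "doc2.txt", "doc3.txt")

doc <- lapply(doc_files, function(file) {
    text <- paste(readLines(file, warn = FALSE), collapse = " ")
    text <- gsub("[[:punct:]]", "", text)  # remove punctuation marks
    text <- tolower(text)                  # transform to lower case
    text <- gsub("\\s+", " ", text)        # collapse white space runs
    text <- trimws(text)                   # trim leading/trailing blanks
    strsplit(text, " ")[[1]]               # split into separate words
})
```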
K-Shingling
The first notion we'll introduce is shingling, a common technique for representing documents as sets. Given a document, its k-shingles are all the possible consecutive substrings of length k found within it. An example with k = 3 is given below:
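Here's a minimal sketch of the shingling step, assuming `doc` is the list of tokenized word vectors produced by the preprocessing sketch above:

```{r}
# k-shingles of a tokenized document: every run of k consecutive words
shingle <- function(words, k) {
    spots <- seq_len(length(words) - k + 1)
    shingles <- vapply(spots, function(i) {
        paste(words[i:(i + k - 1)], collapse = " ")
    }, character(1))
    unique(shingles)  # a shingle set keeps each k-shingle only once
}

doc1 <- lapply(doc[1], shingle, k = 3)
doc1
```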
As you can see from the printed output of the first document, by picking k to be 3, the k-shingles of the first document consist of substrings of length 3. The first 3-shingle is `r doc1[[1]][1]`, as it's the first consecutive substring of length 3 in the document; the second 3-shingle, `r doc1[[1]][2]`, is the next substring of 3 consecutive words, obtained after excluding the first word of the document. And the list goes on.
Another thing to note is that a document's k-shingle set should only consist of unique k-shingles. For example, even if the first document above contains more than one `r doc1[[1]][1]`, it will only appear once in the set of k-shingles for that document.
Now that we have that in mind, we'll construct a "characteristic" matrix that visualizes the relationships between our three documents. The matrix will be a boolean matrix, where its
rows = the elements of the universal set (every unique shingle appearing across all documents).
columns = one column per document.
Thus the matrix will have a 1 in row i and column j if and only if document j contains shingle i, and 0 otherwise. An example is given below. (I will use the data frame data structure in place of a matrix throughout the documentation.)
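A minimal sketch of building this matrix, reusing the hypothetical `shingle` helper and tokenized documents `doc` from the sketches above:

```{r}
shingle_sets <- lapply(doc, shingle, k = 3)

# the universal set: every unique shingle across all documents
universal <- unique(unlist(shingle_sets))

# 1 in row i, column j if and only if document j contains shingle i
M <- sapply(shingle_sets, function(set) as.integer(universal %in% set))
M <- data.frame(M, row.names = universal)
colnames(M) <- c("doc_1", "doc_2", "doc_3")
M
```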
Looking at the first row of the matrix above, all three columns record a 1. This denotes that all three documents contain the 3-shingle `r rownames(M)[1]`. As for the second row, the cell values of `r M[2, ]` mean that document 2 does not contain the 3-shingle `r rownames(M)[2]`, while documents 1 and 3 do. And so on.
Note that unlike in our toy example, these "characteristic" matrices are almost always sparse. Therefore, in practice we usually represent these kinds of matrices only by the positions in which a 1 appears, as that is considered more space efficient.
Jaccard Similarity
Now that we've transformed all the documents into shingle sets and used a characteristic matrix to visualize the relationships between them, the next question is: how do we measure the similarity between documents?
One well-known measure for determining the degree of similarity is the Jaccard similarity. Given two sets of shingles, set1 and set2, the formula for this measurement is:

$$Jaccard(set_1, set_2) = \frac{|set_1 \cap set_2|}{|set_1 \cup set_2|}$$
This is equivalent to saying that for any two given sets, their Jaccard similarity is the size of their intersection divided by the size of their union. For our "characteristic" matrix, if we restrict ourselves to only two columns (documents), then the rows can be divided into three types:
Type 1 : rows have 1s in both columns.
Type 2 : rows have 1 in one of the columns and 0 in the other.
Type 3 : rows have 0s in both columns.
Applying the Jaccard similarity concept here, the similarity between the two sets will then be:

$$\frac{\text{number of Type 1 rows}}{\text{number of Type 1 rows} + \text{number of Type 2 rows}}$$

since the Type 1 rows make up the two sets' intersection, while the Type 1 and Type 2 rows together make up their union (Type 3 rows belong to neither).
The following section will calculate the pairwise Jaccard similarities for all three documents and print out the original documents' content to get a feel for the performance. One quick way to calculate pairwise distances in R is the dist function, which computes and returns the distance/similarity matrix between either the rows or the columns of a matrix/data frame. We can also define our own distance/similarity measures in the proxy library's database.
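A hedged sketch of that calculation (the measure name `jaccard_sim` registered in proxy's database is an arbitrary choice of ours, and since `dist` compares rows, we hand it the transpose of `M`):

```{r}
library(proxy)

# Jaccard similarity between two boolean (0/1) vectors
jaccard <- function(x, y) {
    sum(x & y) / sum(x | y)
}

# register the measure in proxy's database so dist() can refer to it;
# note the returned values are similarities, even though dist()
# labels the result as a distance object
pr_DB$set_entry(FUN = jaccard, names = c("jaccard_sim"))
d1 <- dist(t(M), method = "jaccard_sim")
d1
```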
The similarity matrix d1 tells us that documents 1 and 3 are the most similar among the three. Looking at their original content, this does match our intuition, as both documents contain the substrings "sun is bright" and "the sky is blue" (despite the different ordering in context).
Now that we have calculated the pairwise similarity between each document in a pretty straightforward way, we're done, right?! Well, yes and no. For small datasets, this method works out perfectly fine, but imagine we had a large number of documents to compare instead of just three, say N. We would then be forced to make $\binom{N}{2} = N(N - 1)/2$ comparisons. This is obviously not going to scale well.
Thus in the following sections, we'll discuss techniques that can help us save computation so that we can compare document similarities at a large scale. The first is MinHash.
MinHash
Recall what we've said about the "characteristic" matrix: these matrices tend to be sparse. That is, the set of unique shingles across all documents will be fairly large, making computing the Jaccard similarity between the documents a heavy burden.
This is where MinHash comes in. This algorithm provides us with a fast approximation to the Jaccard similarity. The concept is to condense the large sets of unique shingles into much smaller representations called "signatures". We will then use these signatures alone to measure the similarity between documents. Note that it is impossible for these signatures to give the exact similarity, but the estimates they provide are close (the larger the number of signatures you choose, the more accurate the estimate).
For our toy example, suppose we wish to minhash our characteristic matrix of `r nrow(M)` rows into 4 signatures. Then the first step is to generate 4 columns of randomly permuted rows (independent of each other).
An Implementation Trick:
To implement the idea of generating randomly permuted rows, we don't actually generate the random permutations, since it is not feasible to do so for large datasets. e.g. for a million shingles you would have to generate a million integers ..., not to mention that you have to do this for each signature you wish to generate. One way to avoid having to generate n permuted rows (here n is 4, because we chose the number of signatures to be 4) is to pick n hash functions of the form:

$$h(x) = (ax + b) \bmod c$$

Where:
x is the row number of your original characteristic matrix.
a and b are any random numbers smaller than or equal to the maximum value of x, though each must be unique within a signature. e.g. for signature 1, if you generated a 5 to serve as your a coefficient, you must ensure that this value does not serve as your a coefficient multiple times within signature 1, though you can still use 5 as your b coefficient in signature 1. This restriction refreshes for the next signature: you can use 5 as your a or b coefficient for signature 2, but again no multiple 5s for signature 2's a coefficient, and so on.
c is a prime number slightly larger than the total number of shingle sets. You can find a suitable prime number for your dataset here. For our example dataset, the total row count is `r nrow(M)`, thus the prime number 17 will do fine.
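A minimal sketch of this trick, using 4 signatures and the prime 17 from above (the seed and the coefficient-drawing scheme here are our own assumptions):

```{r}
set.seed(12345)  # hypothetical seed, for reproducibility only
signature_num <- 4
prime <- 17  # slightly larger than nrow(M)

# random coefficients no larger than the maximum row number; sampling
# without replacement keeps each coefficient value from repeating
a <- sample(nrow(M), signature_num)
b <- sample(nrow(M), signature_num)

# h(x) = (a * x + b) mod c, applied to every row number x
permutations <- sapply(seq_len(signature_num), function(i) {
    (a[i] * seq_len(nrow(M)) + b[i]) %% prime
})
colnames(permutations) <- paste0("hash_", seq_len(signature_num))
permutations
```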
We can see for ourselves that this simple hash function does in fact generate randomly permuted rows.
To be clear, we'll print out our `r signature_num` randomly generated hash functions:
From the output shown above, we can see that all `r signature_num` hash functions do in fact produce permuted numbers. There are 0s, but they will not affect our computation, and you'll see why later.
Using our randomly permuted rows, we'll carry out the second step of the MinHash algorithm: calculating the signatures. The signature value of any column (document) is obtained as follows: using the permuted order generated by each hash function, take the number of the first row in which the column has a 1.
See the example below. We'll combine our randomly permuted rows (generated by the hash functions) with our original characteristic matrix, and change the row names of the matrix to its row numbers, to illustrate the calculation:
Consider the matrix shown above; we'll start with our first hash function (hash_1).
According to our first hash function's permuted row order, the first row is row 5 (why row 5? because 0 is the smallest value of our randomly generated permutation, and it occurs at row 5, making it the first row). Then we'll look at row 5's entry for all three documents and ask, "which document's entry at row 5 is a 1?" Aha!! Document 1's (doc_1) row 5 is a 1, thus the signature value for document 1 generated by our first hash function is 0. But document 2's and document 3's entries at row 5 are both 0, so we'll have to keep looking.
According to our first hash function's permuted row order, the second row is row 15 (1 is the second smallest value of our randomly generated permutation, and it occurs at row 15). We apply the same concept as above and find that document 3's (doc_3) entry at row 15 is a 1, thus the signature value for document 3 generated by our first hash function is 1. Note that since we're already done with document 1, we do not need to check whether it contains a 1 anymore. But we're still not done: document 2's entry at row 15 is still a 0, so we'll have to look deeper.
Again checking the permuted row order of our first hash function, the third row is row 8. Is document 2's entry at row 8 a 1? Yes it is! Therefore, we're done calculating the signature values for all three columns using our first hash function!! They are [0, 2, 1].
We can then apply the same notion to calculate the signature value of each column (document) using the second hash function, and so on for the third, fourth, etc. A quick look at the second hash function shows that the first row according to its permuted row order is row 1, and all three documents have a 1 in row 1. Hence, the signature values generated by our second hash function for all three documents are [0, 0, 0].
As for these calculated signature values, we will store them in a signature matrix along the way, which will later replace our original characteristic matrix. The following section calculates the signature values for all `r ncol(M)` columns using all `r signature_num` hash functions, and prints out the signature matrix.
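A minimal sketch of that calculation, assuming `permutations` holds the hash function outputs from the earlier sketch:

```{r}
# for every document and every hash function, take the smallest
# permutation value among the rows where the document has a 1 --
# that is exactly the first row in permuted order containing a 1
SM <- sapply(seq_len(ncol(M)), function(j) {
    apply(permutations, 2, function(perm) min(perm[M[, j] == 1]))
})
colnames(SM) <- colnames(M)
SM
```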
Note that although our signature matrix has the same number of columns as the original characteristic matrix, it only has n rows, where n is the number of hash functions we generated (in this case `r signature_num`).
As for calculating the pairwise similarity using this condensed signature matrix: a cool way of saying it is that we calculate the fraction of hash functions that hash the two documents into the same bucket. In plain English, this is equivalent to calculating the fraction of rows in which the two columns' values are the same. e.g. for documents 1 and 2 (columns 1 and 2), the similarity would be `r mean( SM[,1] == SM[,2] )`, because they agree in only 1 row out of a total of 4 (both columns' row 2 is 0).
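A sketch of this pairwise calculation, registering the measure in proxy's database just like before:

```{r}
# the fraction of rows in which two signature columns take the same value
sig_similarity <- function(x, y) mean(x == y)

pr_DB$set_entry(FUN = sig_similarity, names = c("sig_similarity"))
d2 <- dist(t(SM), method = "sig_similarity")
d2
```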
Judging from the difference between the original Jaccard similarity and the new similarity obtained from the signatures, you might be doubting whether this is a true estimate. Well, the short answer is that, as we said right at the beginning of this section, MinHash's purpose is to provide a fast "approximation" to the true Jaccard similarity, and this example is simply way too small for the law of large numbers to assure that the estimates are close.
As for the next section: if your final goal is to compute the similarity of every possible pair (for text clustering, perhaps), then it will most likely be useless to you. However, if you simply wish to find the pairs that are most likely to be similar, then stay tuned, because in the next section we will discuss Locality Sensitive Hashing, a technique that allows us to do just that.
Locality Sensitive Hashing
While the information necessary to compute the similarity between documents has been condensed from the original sparse characteristic matrix into a much smaller signature matrix, the underlying problem of needing to perform pairwise comparisons on all the documents still exists.
The concept of locality sensitive hashing (LSH) is that, given our signature matrix of size n (row count), we partition it into b bands, so that each band contains r rows. This is equivalent to the simple formula $n = b \cdot r$, so when doing the partition be sure the b you choose divides n evenly. Using our signature matrix above, choosing the number of bands to be 2 gives:
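A minimal sketch of the banding step, assuming the 4-row signature matrix `SM` from above:

```{r}
# partition the signature matrix's rows into b bands of r rows each;
# here n = 4 rows and b = 2 bands, so r = 2 rows per band
bands <- 2
rows_per_band <- nrow(SM) / bands
band_id <- rep(seq_len(bands), each = rows_per_band)
split(as.data.frame(SM), band_id)
```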
What locality sensitive hashing tells us is: if the signature values of two documents agree in all the rows of at least one band, then these two documents are likely to be similar and should be compared (we list them as a candidate pair). Using this small set of documents is probably a really bad example ...., since none of them get chosen as a candidate pair. For instance, if the signature values of document 2 for band 1 were [0, 0] instead of the current [2, 0], then document 2 and document 1 would become a candidate pair, as both of their rows in band 1 would take the same values [0, 0].
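A hedged sketch of collecting candidate pairs under this rule (the `candidate_pair` helper is our own illustration, not a library function):

```{r}
# two documents form a candidate pair if their signature columns
# agree on every row of at least one band
candidate_pair <- function(sm, bands) {
    band_id <- rep(seq_len(bands), each = nrow(sm) / bands)
    pairs <- combn(ncol(sm), 2)
    keep <- apply(pairs, 2, function(p) {
        # per-band check: do the two columns match on all rows?
        any(tapply(sm[, p[1]] == sm[, p[2]], band_id, all))
    })
    pairs[, keep, drop = FALSE]
}

# each column of the result holds the indices of one candidate pair
candidate_pair(SM, bands = 2)
```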
Takeaways:
MinHash, or by its formal name, the min-wise independent permutations locality sensitive hashing scheme, computes a fast and accurate estimate of the Jaccard similarity coefficient between two large sets. It is fast because the number of signatures tends to be much smaller than the number of unique shingles across all documents, making the comparison operation simpler and quicker.
To note, there's also an R library, textreuse, that implements the techniques we've walked through in this documentation. You should be able to understand its vignette with no problem.
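For reference, a hedged sketch of what the textreuse workflow looks like, adapted from the package vignette (the directory path and parameter values here are hypothetical):

```{r}
library(textreuse)

# minhash the documents while building the corpus: 240 hash functions,
# word 5-grams as shingles; "data" is a hypothetical folder of text files
minhash <- minhash_generator(n = 240, seed = 3552)
corpus <- TextReuseCorpus(
    dir = "data",
    tokenizer = tokenize_ngrams, n = 5,
    minhash_func = minhash
)

# LSH with 80 bands (so 3 rows per band), then score candidate pairs
buckets <- lsh(corpus, bands = 80)
candidates <- lsh_candidates(buckets)
lsh_compare(candidates, corpus, jaccard_similarity)
```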