Path: blob/master/clustering_old/tf_idf/tf_idf.Rmd
This documentation assumes you are familiar with hierarchical clustering. To follow along, all the code (tf-idf.R) and the news data (news.csv) can be found here.
TF-IDF
tf-idf, short for term frequency–inverse document frequency, is a numeric measure used to score the importance of a word in a document based on how often it appears in that document and in a given collection of documents. The intuition behind this measure is: if a word appears frequently in a document, then it should be important and we should give that word a high score. But if a word appears in too many other documents, it's probably not a unique identifier, therefore we should assign a lower score to that word. The math formula for this measure:

$$\text{tfidf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$$
Where t denotes the terms; d denotes each document; D denotes the collection of documents.
In the following documentation, we'll break down this formula using four small documents to illustrate the idea.
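The four example documents live in a code chunk of the original tf-idf.R that isn't reproduced here. A minimal stand-in (the exact sentences are an assumption, chosen so that the word "bright" appears in three of the four documents, as discussed later) could look like this:

```r
# four toy documents; the exact sentences are assumed for illustration
doc <- c(
  "The sky is blue.",
  "The sun is bright today.",
  "The sun in the sky is bright.",
  "We can see the shining sun, the bright sun."
)
```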
TF Term Frequency
The first part of the formula simply counts the number of times each word appears in each document. Of course, as with most text mining workflows, stop words like "a" and "the" and punctuation marks are removed beforehand, and all words are converted to lowercase.
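One way to compute this step, assuming the `tm` package and the `doc` vector above (the object name `tf` is a placeholder):

```r
library(tm)

# term-document matrix: one row per term, one column per document,
# with punctuation and stop words removed and everything lowercased
corpus <- VCorpus(VectorSource(doc))
tf <- TermDocumentMatrix(
  corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 tolower = TRUE)
)
tf <- as.matrix(tf)
```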
And that's it for the term frequency part, easy peasy!!
IDF Inverse Document Frequency
Let's first write down the complete math formula for IDF.

$$\text{idf}(t, D) = \log\frac{|D|}{1 + |\{d \in D : t \in d\}|}$$

The numerator: $|D|$ refers to our document space. It can also be written as $D = \{d_1, d_2, \dots, d_n\}$, where $n$ is the number of documents in your collection. Thus for our example, the size of our document space is `r length(doc)`, since we're only using `r length(doc)` documents.

The denominator: $|\{d \in D : t \in d\}|$ is the total number of documents in which the term $t$ appears (the $d \in D$ restricts the count to documents in your current document space). Note that it doesn't matter whether a term appears 1 time or 100 times in a document, it is still counted as 1, since it simply did appear in that document. As for the plus 1, it is there to avoid division by zero.
Using our term frequency matrix, the idf weight for each term can be calculated as below.
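A sketch of that calculation, assuming the `tf` matrix sketched earlier (terms in rows, documents in columns):

```r
# |D| divided by (1 + number of documents in which each term appears), then logged
idf <- log(ncol(tf) / (1 + rowSums(tf > 0)))
```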
Now that we have our matrix with the term frequency and the idf weight, we're ready to calculate the full tf-idf weight. To do this matrix multiplication, we will also have to transform the idf vector into a diagonal matrix. Both calculations are shown below.
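Expressed as a matrix multiplication, again as a sketch under the same assumptions:

```r
# place the idf weights on the diagonal of a square matrix and multiply,
# so each term's frequency is scaled by that term's idf weight
tf_idf <- diag(idf) %*% tf
dimnames(tf_idf) <- dimnames(tf)
```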
Don't start cheering yet, there's still one more step to do for this tf-idf matrix. Recall that in the tf (term frequency) section, we're representing each term as the number of times it appeared in the document. The main issue with this representation is that it creates a bias towards long documents: a given term has a greater chance of appearing in a longer document, making such documents look more important than they actually are.
Thus the approach to resolve this issue is the good old L2 normalization. Math formula:

$$\hat{v} = \frac{v}{\|v\|_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}}$$
For each document vector, you divide it by its L2 norm (length, magnitude). The calculation is shown below.
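A sketch, assuming the `tf_idf` matrix above with one document per column:

```r
# divide each document vector (column) by its L2 norm
tf_idf_norm <- apply(tf_idf, 2, function(x) x / sqrt(sum(x^2)))
```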
And that's it: our final tf-idf matrix. Let's compare it with our original document text.
One thing you can see is that the word "bright", which appeared in 3 out of the 4 documents, is given a really low score across all the documents. This matches the intuition for tf-idf we described at the beginning. A word should be representative of a document if it shows up a lot in it, but if that word occurs too often across all the documents, then it is most likely a meaningless indicator.
Text Clustering
Now that we have this tf-idf matrix, one thing we can do with it is to perform text clustering!!
To perform document clustering using the tf-idf weight matrix, we'll use the cosine similarity to measure how close two given documents are. Math formula:

$$\cos(\theta) = \frac{v \cdot w}{\|v\|\,\|w\|} = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2}\,\sqrt{\sum_{i=1}^{n} w_i^2}}$$

Where $v$ and $w$ are the two vectors whose distance you wish to calculate; $v_i$ and $w_i$ are the components of vectors $v$ and $w$ respectively; and $n$ is the number of components. A toy example of the calculation is shown below with two simple vectors.
After calculating $\cos(\theta)$, you can also obtain the actual angle $\theta$ (in degrees) using the acos function in R. Note that the function returns radians; you have to multiply by 180 and divide by pi to obtain the actual degrees.
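A sketch of the toy example (the two vectors and the helper name `cosine_sim` are made up for illustration):

```r
# cosine similarity between two vectors
cosine_sim <- function(v, w) sum(v * w) / (sqrt(sum(v^2)) * sqrt(sum(w^2)))

v <- c(1, 2, 3)
w <- c(3, 2, 1)
cos_theta <- cosine_sim(v, w)   # 0.7142857

# acos returns radians, so convert to degrees
acos(cos_theta) * 180 / pi      # roughly 44.4 degrees
```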
As for why we're using this distance measure, remember what we said in the normalization part: since documents are usually not of equal length, simply computing the difference between two vectors with Euclidean distance has the disadvantage that documents of similar content but different length are not regarded as similar in the vector space.
For this section, we'll move on to a slightly larger dataset, since there's really no point in performing text clustering when you only have 4 documents.
These are some news articles collected from the BBC website. The data consists of `r nrow(news)` rows and `r ncol(news)` columns, where the columns are simply the title of each news article and its corresponding link (url). We'll only be representing each news article (document) by its title. The link to the data is provided at the end.
The following code:

1. Calculates the tf-idf score for this document collection.
2. Defines our cosine distance.
3. Registers this pre-defined cosine distance in R's `proxy` library database (the backbone for the `dist` function) and uses it to calculate the pairwise distance matrix.
4. Performs hierarchical clustering and visualizes the clustering result with a dendrogram.

Note that we WON'T need to normalize the tf-idf matrix before calculating the cosine distance; the cosine distance does that for us.
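A sketch of those four steps, assuming the `tm` and `proxy` packages and that news.csv has a `title` column (the column name and all object names are assumptions):

```r
library(tm)
library(proxy)

news <- read.csv("news.csv", stringsAsFactors = FALSE)

# 1. tf-idf matrix for the news titles (one document per row after transposing)
corpus <- VCorpus(VectorSource(news$title))
tf <- as.matrix(TermDocumentMatrix(
  corpus,
  control = list(removePunctuation = TRUE, stopwords = TRUE, tolower = TRUE)
))
idf <- log(ncol(tf) / (1 + rowSums(tf > 0)))
news_tf_idf <- t(diag(idf) %*% tf)

# 2. cosine distance = 1 - cosine similarity
cosine_dist <- function(x, y) 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# 3. register the distance in proxy's database so dist() can use it
pr_DB$set_entry(FUN = cosine_dist, names = "my_cosine")
d <- dist(news_tf_idf, method = "my_cosine")

# 4. hierarchical clustering, visualized as a dendrogram
hc <- hclust(d, method = "ward.D")
plot(hc)
```

If you rerun the chunk in the same session, remove the old registration first with `pr_DB$delete_entry("my_cosine")`.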
After plotting the dendrogram, I have decided that it should be partitioned into 17 clusters (simply change it if you disagree). A quick digression about the code chunk above: there's already a built-in cosine distance in R's proxy library, so you don't have to define one yourself now that you understand the implementation. Simply change the dist calculation to dist( news_tf_idf, method = "cosine" ).
We'll examine three of the clusters that the algorithm produced and print out the original news titles to determine whether the results match our intuition.
Overall, the news in the first cluster mostly refers to Taiwan. The second cluster seems to be talking about China and the UK, and the third cluster's news articles are all related to Uighurs. Not bad, huh? Especially given the fact that we're only using the news titles instead of the entire articles to represent each news item. Since clustering is an unsupervised algorithm (meaning there's probably no such thing as a one hundred percent correct answer), I'll leave it to you to decide whether the clustering results are actually acceptable.
One last thing before we wrap up this discussion: if you are going to perform text clustering on your own, try not to use K-means. You can read why in this StackOverflow post (scroll to the bottom).