Path: blob/master/clustering/tfidf/tfidf.ipynb
Table of Contents
Nearest Neighbors
When exploring a large set of text documents -- such as Wikipedia, news articles, StackOverflow, etc. -- it can be useful to get a list of related material. To find relevant documents, we typically convert each text document into a bag of words or TF-IDF representation and choose a distance metric that captures our notion of similarity. In this documentation we'll explore the tradeoffs of representing documents using bag of words versus TF-IDF.
We will be using the Wikipedia pages dataset. Each row of the dataset consists of a link to the Wikipedia article, the name of the person, and the text of the article (in lowercase). To follow along, please download the file from this dropbox link.
We start by converting the documents into bag of words format and use euclidean distance to find the nearest neighbors of the Barack Obama article.
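As a rough sketch of this step (assuming the downloaded file is named people_wiki.csv and contains name and text columns; the actual file name and column names may differ), we can build the bag-of-words matrix with CountVectorizer and query a euclidean-distance NearestNeighbors model:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# assumed file name and column names ('name', 'text')
wiki = pd.read_csv('people_wiki.csv')

# convert every article into a sparse bag-of-words (word count) matrix
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(wiki['text'])

# brute-force nearest neighbors using euclidean distance on the raw counts
nn = NearestNeighbors(n_neighbors=10, metric='euclidean')
nn.fit(word_counts)

# query with the Barack Obama article
obama_idx = wiki[wiki['name'] == 'Barack Obama'].index[0]
distances, indices = nn.kneighbors(word_counts[obama_idx])
print(wiki.iloc[indices[0]]['name'])
```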
Looking at the result, it is reassuring that all 10 of the returned people are politicians. Let's dig a bit deeper and find out why Joe Biden was considered a close neighbor of Obama by looking at the most frequently used words in both pages. To do this, we'll take the sparse count matrix we built and extract a word count dictionary for each document.
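A sketch of that extraction, building on the objects from the previous snippet (the helper name word_count_dict is just illustrative):

```python
vocab = count_vect.get_feature_names_out()

def word_count_dict(row_idx):
    """Turn one row of the sparse count matrix into a {word: count} dictionary."""
    row = word_counts[row_idx]
    return {vocab[col]: count for col, count in zip(row.indices, row.data)}

biden_idx = wiki[wiki['name'] == 'Joe Biden'].index[0]
obama_counts = word_count_dict(obama_idx)
biden_counts = word_count_dict(biden_idx)

# ten most frequent words on each page
print(sorted(obama_counts.items(), key=lambda kv: kv[1], reverse=True)[:10])
print(sorted(biden_counts.items(), key=lambda kv: kv[1], reverse=True)[:10])
```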
The next question we want to ask is, among these common words that appear in both documents, how many other documents in the Wikipedia dataset also contain them?
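Continuing the same sketch, one way to answer this is to binarize the count matrix and sum over documents to get each word's document frequency:

```python
import numpy as np

# words that appear in both the Obama and Biden articles
common_words = set(obama_counts) & set(biden_counts)

# document frequency: in how many articles does each word appear at least once?
doc_freq = np.asarray((word_counts > 0).sum(axis=0)).ravel()
word_to_col = count_vect.vocabulary_

# document frequency of the ten most common shared words
for word in sorted(common_words, key=lambda w: obama_counts[w], reverse=True)[:10]:
    print(word, doc_freq[word_to_col[word]])
```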
Given this result, we see that much of the perceived commonality between the two articles was due to occurrences of extremely frequent words such as "the", "and", and "his". All of these words appear very often in all of the documents. To retrieve articles that are more relevant, maybe we should focus more on rare words that don't appear in every article. And this is where TF-IDF (term frequency–inverse document frequency) comes in. Note that although we can remove stop words to prevent some of this behavior, a frequently appearing word is not always a stop word.
TF-IDF
TF-IDF, short for term frequency–inverse document frequency, is a numeric measure used to score the importance of a word in a document, based on how often the word appears in that document and in a given collection of documents. The intuition behind this measure is: if a word appears frequently in a document, then it should be important and we should give that word a high score; but if a word appears in too many other documents, it's probably not a unique identifier, so we should assign a lower score to that word. In short, it is a feature representation that penalizes words that are too common.
Let us consider the following toy dataset that consists of 3 documents.
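The actual documents used in the notebook are not reproduced in the text; the snippet below defines a hypothetical stand-in that is consistent with the observations made later (e.g. "and" being the first word in the vocabulary and appearing once in the third document):

```python
# hypothetical 3-document toy dataset (stand-in for the notebook's actual documents)
docs = ['the sun is shining',
        'the weather is sweet',
        'the sun is shining and the weather is sweet']
```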
Most commonly, TF-IDF is calculated as follows:

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$$

where $t$ denotes a term, $d$ denotes each document, and $D$ denotes the collection of documents.
The first part of the formula, $\text{tf}(t, d)$, stands for term frequency, which is defined as the number of times the term $t$ occurs in a document $d$.
Based on the vocabulary, the word "and" corresponds to the first column in each document's tf vector, and it appears once in the third document.
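We can check this with CountVectorizer (a sketch; the output comments assume the hypothetical toy documents above):

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vect_toy = CountVectorizer()
tf = count_vect_toy.fit_transform(docs).toarray()

print(count_vect_toy.get_feature_names_out())
# ['and' 'is' 'shining' 'sun' 'sweet' 'the' 'weather']
print(tf)
# [[0 1 1 1 0 1 0]
#  [0 1 0 0 1 1 1]
#  [1 2 1 1 1 2 1]]
```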
In order to understand the second part of the formula, inverse document frequency, let's first write down the complete math formula for IDF:

$$\text{idf}(t, D) = \log \frac{|D|}{1 + |\{d \in D : t \in d\}|}$$

The numerator: $|D|$ refers to our document space. It can also be seen as $D = \{d_1, d_2, \dots, d_n\}$, where $n$ is the number of documents in the collection. Thus for our example, the size of the document space is 3, since we have 3 documents.

The denominator: $|\{d \in D : t \in d\}|$ is the document frequency, i.e. the number of documents that contain the term $t$. Note that this implies it doesn't matter whether a term appeared 1 time or 100 times in a document; it will still be counted as 1, since the term simply did appear in the document.
The constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.
Note that there are many different variations of the formula. For example, the tf-idfs in scikit-learn (with smooth_idf=False) are calculated by:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \left( \log \frac{|D|}{|\{d \in D : t \in d\}|} + 1 \right)$$
Here, the +1 is added directly to the idf, instead of to the denominator. The effect of this is that terms with zero idf, i.e. terms that occur in all documents of a training set, will not be entirely ignored. We can demonstrate this by calculating the idfs manually using the equation above and comparing the results to the TfidfTransformer output with the settings use_idf=True, smooth_idf=False, norm=None (more on smooth_idf and norm later).
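A sketch of that comparison on the toy term-frequency matrix tf from above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

n_docs = tf.shape[0]
df = np.sum(tf > 0, axis=0)              # document frequency of each term
idf_manual = np.log(n_docs / df) + 1     # scikit-learn's idf when smooth_idf=False
tfidf_manual = tf * idf_manual

tfidf_trans = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
tfidf_sklearn = tfidf_trans.fit_transform(tf).toarray()

print(np.allclose(tfidf_manual, tfidf_sklearn))  # True
print(tfidf_trans.idf_)
```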
Next, recall that in the tf (term frequency) section, we represented each term by the number of times it appeared in the document. The main issue with this representation is that it creates a bias towards long documents: a given term has more chances to appear in a longer document, making such documents look more important than they actually are. The approach to resolve this issue is the good old L2 normalization, i.e. dividing the raw term frequency vector by its length (L2, or euclidean, norm).
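On the toy tf matrix, this normalization looks like the following sketch:

```python
import numpy as np

# divide each raw term-frequency row vector by its euclidean (L2) norm
tf_l2 = tf / np.linalg.norm(tf, axis=1, keepdims=True)
print(tf_l2.round(2))
```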
Another parameter in the TfidfTransformer is the smooth_idf, which is described as
smooth_idf : boolean, default=True Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
So, our idf would then be defined as follows:

$$\text{idf}(t, D) = \log \frac{1 + |D|}{1 + |\{d \in D : t \in d\}|} + 1$$
Again let's confirm this by computing it manually and using the library.
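A sketch of that confirmation, reusing n_docs and df from the earlier snippet and comparing against TfidfTransformer with its default settings (which also apply the L2 normalization):

```python
# smoothed idf: add one to both the document count and the document frequencies
idf_smooth = np.log((1 + n_docs) / (1 + df)) + 1
tfidf_smooth = tf * idf_smooth

# apply the L2 normalization to each row
tfidf_smooth = tfidf_smooth / np.linalg.norm(tfidf_smooth, axis=1, keepdims=True)

tfidf_default = TfidfTransformer().fit_transform(tf).toarray()
print(np.allclose(tfidf_smooth, tfidf_default))  # True
```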
To sum it up, the default settings in the TfidfTransformer are: use_idf=True, smooth_idf=True, norm='l2'.
And we can use the TfidfVectorizer to compute the TF-IDF scores from raw text in one step, without having to use CountVectorizer to convert the text to a bag of words representation and then transform it to TF-IDF with TfidfTransformer. For those interested, this link contains the full TF-IDF implemented from scratch.
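For instance, on the toy documents (continuing the earlier snippets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()                    # same defaults: smooth_idf=True, norm='l2'
tfidf_direct = tfidf_vect.fit_transform(docs).toarray()
print(np.allclose(tfidf_direct, tfidf_default))   # True
```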
Now that we've understood the inner details of TF-IDF, let's return to our initial task and use this weighting scheme instead of bag of words. We start by converting the documents into TF-IDF format and use it along with cosine distance to find the nearest neighbors of the Barack Obama article (if we L2-normalize our articles in the TF-IDF transformation, then euclidean distance becomes a monotonic function of cosine distance, hence the two produce the same nearest-neighbor rankings). Then we extract the commonly used words (now weighted by the TF-IDF score instead of the word count) in both the Joe Biden and Obama articles. Finally, we compute how many other documents in the Wikipedia dataset also contain these words. In short, we're doing the exact same thing as at the beginning, except we now use TF-IDF to represent the documents and cosine distance to measure similarity between documents.
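A sketch of the TF-IDF nearest-neighbor query (reusing the wiki dataframe and obama_idx from earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(wiki['text'])

# nearest neighbors of the Obama article under cosine distance
nn_cosine = NearestNeighbors(n_neighbors=10, metric='cosine')
nn_cosine.fit(tfidf_matrix)
distances, indices = nn_cosine.kneighbors(tfidf_matrix[obama_idx])
print(wiki.iloc[indices[0]]['name'])
```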
Notice the huge difference in this calculation using TF-IDF scores instead of raw word counts. We've eliminated noise arising from extremely common words. Happily ever after? Not so fast. To illustrate another possible issue, let's first compute the length of each Wikipedia document, and examine the document lengths for the 100 nearest neighbors to Obama's page.
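A sketch of that computation, defining document length as the number of whitespace-separated tokens and plotting the two length distributions:

```python
import matplotlib.pyplot as plt

# document length = number of words in each article
wiki['length'] = wiki['text'].str.split().str.len()

# 100 nearest neighbors of the Obama page under TF-IDF + cosine distance
nn100 = NearestNeighbors(n_neighbors=100, metric='cosine').fit(tfidf_matrix)
_, idx100 = nn100.kneighbors(tfidf_matrix[obama_idx])

plt.hist(wiki['length'], bins=50, density=True, alpha=0.5, label='all articles')
plt.hist(wiki.iloc[idx100[0]]['length'], bins=50, density=True, alpha=0.5,
         label='100 nearest neighbors of Obama')
plt.xlabel('number of words')
plt.legend()
plt.show()
```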
The visualization is basically telling us that the 100 nearest neighbors under TF-IDF weighting and cosine distance provide a sampling across articles of different lengths. The thing is, whether we choose euclidean or cosine distance to measure article similarity, both of them still ignore the document's length, which may be fine in certain situations but not in others. For instance, consider the following (admittedly contrived) tweet.
How similar is this tweet to Barack Obama's Wikipedia article? Let's transform the tweet into TF-IDF features, compute the cosine distance between the Barack Obama article and this tweet, and compare this distance to the distances between the Barack Obama article and its 10 nearest Wikipedia neighbors.
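A sketch of that comparison; the tweet text below is only a placeholder, not the tweet used in the original notebook:

```python
from sklearn.metrics.pairwise import cosine_distances

# placeholder tweet, deliberately sharing a few words with the Obama article
tweet = ['obama is the democratic president and a former law professor']
tweet_tfidf = tfidf.transform(tweet)     # reuse the vectorizer fitted on the articles

obama_tfidf = tfidf_matrix[obama_idx]
print('distance to tweet:', cosine_distances(obama_tfidf, tweet_tfidf)[0, 0])

# compare with the distances to the 10 nearest Wikipedia neighbors
dist10, _ = nn_cosine.kneighbors(obama_tfidf)
print('distances to 10 nearest articles:', dist10[0])
```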
With cosine distance, the tweet is "nearer" to Barack Obama than most of the articles. If someone is reading the Barack Obama Wikipedia page, would we want to recommend that they read this tweet? Ignoring article length completely leads to nonsensical results. In practice, it is common to enforce maximum or minimum document lengths. After all, when someone is reading a long article from The Atlantic, we wouldn't recommend a tweet to them.