TFIDFTransformer vs. TFIDFVectorizer
TfidfTransformer and TfidfVectorizer aim to do the same thing: produce a matrix of TF-IDF features for a collection of documents. The only difference is that with TfidfTransformer you work in explicit steps: first compute the word counts (with CountVectorizer), then compute the IDF values, and finally compute the TF-IDF scores.
With TfidfVectorizer, you do all three steps at once, starting from the raw documents. Under the hood, it computes the word counts, IDF values, and TF-IDF scores from the same dataset. Below you will find examples of how to use each one.
Usage examples
Imports and Data
The dataset we will be using does not reflect a real-world dataset. I kept it simple to showcase the differences. For a real-world example, please see: http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/.
TfidfTransformer
To use TfidfTransformer, you first have to create a CountVectorizer to count the words, limit your vocabulary size, apply stop words, and so on. Only then can you apply TfidfTransformer.
Initialize CountVectorizer
Let's check the shape. We should have 5 rows (5 docs) and 16 columns (16 unique words):
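A minimal sketch of this step, assuming a hypothetical five-sentence corpus (`docs`) chosen to match the 5 x 16 shape described above. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (an assumption, chosen to give 5 docs and 16 unique words).
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# Default settings: lowercases text and keeps tokens of 2+ characters,
# so the single-letter word "a" is not counted as a vocabulary term.
cv = CountVectorizer()

# Learn the vocabulary and build the sparse document-term count matrix.
word_count_vector = cv.fit_transform(docs)

print(word_count_vector.shape)  # (5, 16)
```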
Sweet, this is what we want!
Compute the IDFs
The next step computes the IDF values and prints them.
The values make sense: "the" and "mouse" appear in every document and thus have a lower IDF than the other words.
Compute the TFIDF score for your documents
We will compute the TF-IDF scores for the same 5 documents we used to compute the IDF values.
Important note: In practice, your IDF should be based on a very large corpus. This example only serves to showcase the differences between TfidfTransformer and TfidfVectorizer.
TfidfVectorizer
With TfidfVectorizer you compute the word counts, IDF values, and TF-IDF scores all at once. It's really simple.
Now let's print the TF-IDF values for the first document. Notice that these values are identical to the ones from TfidfTransformer.
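A sketch of the one-step approach on a hypothetical toy corpus (an assumption for illustration). It also rebuilds the two-step scores so the claim that both approaches agree can be checked directly. Both vectorizers sort their vocabulary alphabetically, so the columns line up.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus (an assumption, not real-world data).
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

# One step: counts, IDF values, and TF-IDF scores straight from raw text.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
vectorizer_scores = tfidf_vectorizer.fit_transform(docs)

# Two steps, for comparison: counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
transformer_scores = TfidfTransformer(use_idf=True).fit_transform(counts)

# Scores of the first document from the one-step approach.
print(vectorizer_scores[0].toarray().ravel().round(3))

# Check that both approaches produce the same matrix.
same = np.allclose(vectorizer_scores.toarray(), transformer_scores.toarray())
print(same)  # True
```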
Here's another way to do it by calling fit and transform separately.
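A sketch of the separate `fit` and `transform` calls, using a hypothetical toy corpus (an assumption for illustration). `fit` learns the vocabulary and IDF weights; `transform` then scores any list of documents with them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (an assumption, not real-world data).
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]

tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# Step 1: learn the vocabulary and the IDF weights only.
tfidf_vectorizer.fit(docs)

# Step 2: compute the TF-IDF scores with what was learned.
tfidf_matrix = tfidf_vectorizer.transform(docs)

print(tfidf_matrix.shape)  # (5, 16)
```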
So when to use what?
If you need the term counts from CountVectorizer for other tasks, use TfidfTransformer.
If you need to compute TF-IDF scores on documents within your "training" dataset, use TfidfVectorizer.
If you need to compute TF-IDF scores on documents outside your "training" dataset, use either one.
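The last case can be sketched as follows, assuming a hypothetical toy corpus and a hypothetical unseen document (both are illustrative, not real data). Both routes reuse the vocabulary and IDF weights learned from the "training" documents, and words that were never seen during fitting (here "dog") are simply ignored.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical "training" corpus (an assumption, not real-world data).
docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
    "the cat finally ate the mouse",
    "the end of the mouse story",
]
# Hypothetical unseen document; "dog" is not in the learned vocabulary.
new_docs = ["the dog saw the mouse"]

# Route 1: CountVectorizer followed by TfidfTransformer.
cv = CountVectorizer()
tfidf_transformer = TfidfTransformer(use_idf=True)
tfidf_transformer.fit(cv.fit_transform(docs))
scores_two_step = tfidf_transformer.transform(cv.transform(new_docs))

# Route 2: TfidfVectorizer alone.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tfidf_vectorizer.fit(docs)
scores_one_step = tfidf_vectorizer.transform(new_docs)

# Both routes give the same scores for the unseen document.
same = np.allclose(scores_two_step.toarray(), scores_one_step.toarray())
print(same)  # True

# Only the known terms ("the", "saw", "mouse") receive a score.
known_terms = int(np.count_nonzero(scores_one_step.toarray()))
print(known_terms)  # 3
```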