GitHub Repository: kavgan/nlp-in-practice
Path: blob/master/tfidftransformer/TFIDFTransformer vs. TFIDFVectorizer.ipynb
Kernel: Python 3

TFIDFTransformer vs. TFIDFVectorizer

Tfidftransformer and Tfidfvectorizer aim to do the same thing: convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with Tfidftransformer you work in separate steps — compute the word counts, generate the IDF values, and then compute the TF-IDF scores.

With Tfidfvectorizer, you do all three steps at once. Under the hood, it computes the word counts, IDF values, and TF-IDF scores using the same dataset. Below you will find examples of how to use each one.
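As a quick sanity check on this equivalence, here is a minimal sketch (a hypothetical two-document corpus, not the one used later in this notebook) showing that the two-step and one-step pipelines produce the same TF-IDF matrix when used with matching (default) settings:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse"]

# two-step pipeline: counts first, then tf-idf
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# one-step pipeline: TfidfVectorizer does both internally
one_step = TfidfVectorizer().fit_transform(docs)

# both pipelines yield the same tf-idf matrix
print(np.allclose(two_step.toarray(), one_step.toarray()))  # True
```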

Usage examples

Imports and Data

The dataset we will be using is not a reflection of a real-world dataset; I kept it simple to showcase the differences. For a real-world example, see http://kavita-ganesan.com/extracting-keywords-from-text-tfidf/.

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# this is a very toy example, do not try this at home unless you want to understand the usage differences
docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

Tfidftransformer

In order to use TfidfTransformer you first have to create a CountVectorizer to count the words, limit your vocabulary size, apply stop words, and so on. Only then can you apply TfidfTransformer.

Initialize CountVectorizer

cv = CountVectorizer()

# this step generates word counts for the words in your docs
word_count_vector = cv.fit_transform(docs)

Let's check the shape. We should have 5 rows (5 docs) and 16 columns (16 unique words):

word_count_vector.shape
(5, 16)

Sweet, this is what we want!
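If you are curious where the 16 comes from, you can inspect the learned vocabulary. A small sketch (note that CountVectorizer's default token pattern drops single-character tokens such as "a", which is why it does not appear):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

cv = CountVectorizer()
word_count_vector = cv.fit_transform(docs)

# the 16 columns correspond to the 16 unique words kept by the tokenizer
print(sorted(cv.vocabulary_))
```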

Compute the IDFs

This next step computes the IDF values and prints them.

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)

# print idf values; get_feature_names_out() replaces get_feature_names(),
# which was removed in scikit-learn 1.2
df_idf = pd.DataFrame(tfidf_transformer.idf_,
                      index=cv.get_feature_names_out(),
                      columns=["idf_weights"])
df_idf.sort_values(by=["idf_weights"])

The values make sense: "the" and "mouse" appear in every document and thus have a lower IDF weight than the other words.
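You can verify the printed weights by hand: with smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A small sketch of the check:

```python
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the house had a tiny little mouse",
        "the cat saw the mouse",
        "the mouse ran away from the house",
        "the cat finally ate the mouse",
        "the end of the mouse story"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)
tt = TfidfTransformer(smooth_idf=True, use_idf=True).fit(counts)

n = len(docs)
vocab = cv.vocabulary_  # maps each term to its column index

# "mouse" appears in all 5 docs -> idf = ln(6/6) + 1 = 1.0
print(tt.idf_[vocab["mouse"]])  # 1.0

# "cat" appears in 2 docs -> idf = ln(6/3) + 1 = ln(2) + 1
print(round(math.log((1 + n) / (1 + 2)) + 1, 3))  # 1.693
print(round(tt.idf_[vocab["cat"]], 3))            # 1.693
```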

Compute the TFIDF score for your documents

We will now compute the TF-IDF scores for the same 5 documents we used to fit the IDF.

Important note: in practice, your IDF should be based on a large corpus. This example only showcases the differences between Tfidftransformer and Tfidfvectorizer.

count_vector = cv.transform(docs)
tf_idf_vector = tfidf_transformer.transform(count_vector)

feature_names = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

# get the tfidf vector for the first document
first_document_vector = tf_idf_vector[0]

# print the scores
df = pd.DataFrame(first_document_vector.T.todense(),
                  index=feature_names,
                  columns=["tfidf"])
df.sort_values(by=["tfidf"], ascending=False)

Tfidfvectorizer

With Tfidfvectorizer you compute the word counts, IDF, and TF-IDF values all at once. It's really simple.

# settings that you use for count vectorizer will go here
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# just send in all your docs here
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(docs)

# get the first vector out (for the first document)
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]

Now let's print the TF-IDF values for the first document. Notice that these values are identical to the ones from Tfidftransformer.

df1 = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(),
                   index=tfidf_vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
                   columns=["tfidf"])
df1.sort_values(by=["tfidf"], ascending=False)

Here's another way to do it by calling fit and transform separately.

# settings that you use for count vectorizer will go here
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

# just send in all your docs here
fitted_vectorizer = tfidf_vectorizer.fit(docs)
tfidf_vectorizer_vectors = fitted_vectorizer.transform(docs)

# get the first vector out (for the first document)
first_vector_tfidfvectorizer = tfidf_vectorizer_vectors[0]

df2 = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(),
                   index=tfidf_vectorizer.get_feature_names_out(),
                   columns=["tfidf"])
df2.sort_values(by=["tfidf"], ascending=False)

So when to use what?

  • If you need the underlying count vectorizer (e.g. the raw term counts) for other tasks, use Tfidftransformer.

  • If you need to compute TF-IDF scores on documents within your "training" dataset, use Tfidfvectorizer.

  • If you need to compute TF-IDF scores on documents outside your "training" dataset, use either one; both will work.
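For the last point, here is a minimal sketch (hypothetical documents, not from this notebook) of scoring an unseen document with a fitted Tfidfvectorizer; the same pattern works with a fitted CountVectorizer plus TfidfTransformer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the house had a tiny little mouse",
              "the cat saw the mouse"]

# fit on the "training" corpus, then score a document it has never seen
vectorizer = TfidfVectorizer(use_idf=True).fit(train_docs)
new_docs = ["the mouse hid in the house"]
scores = vectorizer.transform(new_docs)

# the new document is projected onto the training vocabulary;
# out-of-vocabulary words ("hid", "in") are simply ignored
print(scores.shape)  # (1, 8) -- 8 unique words learned from train_docs
```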