Extracting Important Keywords from Text with TF-IDF and Python's Scikit-Learn
Back in 2006, when I had to use TF-IDF for keyword extraction in Java, I ended up writing all of the code from scratch, as neither data science nor GitHub was a thing back then and libraries were limited. The world is much different today. You have several libraries and open-source code on GitHub that provide decent implementations of TF-IDF. If you don't need a lot of control over how the TF-IDF math is computed, I would highly recommend reusing libraries from known packages such as Spark's MLlib or Python's scikit-learn.
The one problem I noticed with these libraries is that they are meant as a pre-step for other tasks like clustering, topic modeling, and text classification. TF-IDF can actually be used to extract important keywords from a document and get a sense of what characterizes it. For example, if you are dealing with Wikipedia articles, you can use TF-IDF to extract words that are unique to a given article. These keywords can serve as a very simple summary of a document, be used for text analytics (when we look at them in aggregate), act as candidate labels for a document, and more.
In this article, I will show you how you can use scikit-learn's TF-IDF modules to extract the top keywords for a given document. We will specifically do this on a Stack Overflow dataset.
Dataset
While some of my previous tutorials used pretty clean user reviews, in this example we will be using a Stack Overflow dataset, which is slightly noisier and simulates what you could be dealing with in real life. You can find this dataset in my tutorial repo. Notice that there are two files: the larger file with [20,000 posts](https://github.com/kavgan/data-science-tutorials/tree/master/tf-idf/data) is used to compute the Inverse Document Frequency (IDF), and the smaller file with 500 posts will be used as a test set for us to extract keywords from. This dataset is based on the publicly available Stack Overflow dump on Google's BigQuery.
Let's take a peek at our dataset. The code below reads one JSON string per line from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and the total number of posts. Here, lines=True simply means we are treating each line in the text file as a separate JSON string, so the JSON in line 1 is not related to the JSON in line 2.
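A minimal sketch of what that read step could look like (the printing style is my own):

```python
import pandas as pd

# read one JSON object per line into a DataFrame
df_idf = pd.read_json("data/stackoverflow-data-idf.json", lines=True)

# print the schema and the total number of posts
print("Schema:\n", df_idf.dtypes)
print("Number of posts:", df_idf.shape[0])
```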
Take note that this Stack Overflow dataset contains 19 fields including post title, body, tags, dates, and other metadata which we don't quite need for this tutorial. What we are mostly interested in are the body and title, which are our source of text. We will now create a field that combines body and title so we have everything in one place. We will also print the second text entry in our new field just to see what the text looks like.
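A sketch of the combine-and-peek step, assuming the field names title and body from the schema above:

```python
# combine title and body into a single text field
df_idf['text'] = df_idf['title'] + " " + df_idf['body']

# peek at the second entry (index 1) in the new field
print(df_idf['text'][1])
```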
Hmm, it doesn't look very pretty with all the HTML in there, but that's the point: even out of such a mess we can extract some great stuff. While you could eliminate all HTML and code from the text, we will keep those sections in this tutorial for the sake of simplicity.
Creating the IDF
CountVectorizer to create a vocabulary and generate word counts
The next step is to start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in df_idf['text'] and generate counts for each row in df_idf['text']. The result of the last two lines is a sparse matrix representation of the counts, meaning each column represents a word in the vocabulary and each row represents a document in our dataset, where the values are the word counts. Note that with this representation, the count of a word is 0 if the word did not appear in the corresponding document.
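A minimal sketch of the counting step; removing English stop words matches the vocabulary-minus-stopwords note below, but the exact vectorizer settings are an assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build a vocabulary and count word occurrences per document;
# stop_words='english' drops common English stop words
cv = CountVectorizer(stop_words='english')
word_count_vector = cv.fit_transform(df_idf['text'])
```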
Now let's check the shape of the resulting matrix. Notice that the shape below is (20000, 149391) because we have 20,000 documents in our dataset (the rows) and a vocabulary size of 149,391, meaning we have 149,391 unique words (the columns) in our dataset minus the stopwords. In some text mining applications, such as clustering and text classification, we limit the size of the vocabulary. It's really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer.
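Checking the shape:

```python
# (number of documents, vocabulary size)
print(word_count_vector.shape)  # (20000, 149391)
```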
Let's limit our vocabulary size to 10,000
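Under the same assumptions as above, re-instantiating the vectorizer with a capped vocabulary could look like this:

```python
# keep only the 10,000 most frequent terms
cv = CountVectorizer(stop_words='english', max_features=10000)
word_count_vector = cv.fit_transform(df_idf['text'])

print(word_count_vector.shape)  # (20000, 10000)
```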
Now, let's look at 10 words from our vocabulary. Sweet, these are mostly programming-related. We can also get the full vocabulary by using get_feature_names().
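A sketch of both inspection steps; note that newer scikit-learn versions (1.2+) replace get_feature_names() with get_feature_names_out():

```python
# sample 10 entries from the learned vocabulary (word -> column index)
print(list(cv.vocabulary_.keys())[:10])

# the full vocabulary, ordered by column index
feature_names = cv.get_feature_names()
```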
TfidfTransformer to Compute Inverse Document Frequency (IDF)
In the code below, we essentially take the sparse matrix from CountVectorizer to generate the IDF when we invoke fit. An extremely important point to note here is that the IDF should be based on a large corpus that is representative of the texts you will be extracting keywords from. I've seen several articles on the Web that compute the IDF using a handful of documents. To understand why the IDF should be based on a fairly large collection, please read this page from Stanford's IR book.
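A minimal sketch of the fit step; smooth_idf=True and use_idf=True match scikit-learn's defaults for IDF weighting:

```python
from sklearn.feature_extraction.text import TfidfTransformer

# learn the IDF weights from the word counts of all 20,000 posts
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
```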
Let's look at some of the IDF values:
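One way to peek at the learned weights is through the transformer's idf_ attribute (the DataFrame wrapping here is my own):

```python
import pandas as pd

# words with the lowest IDF are the most common across posts
df_idf_weights = pd.DataFrame(
    tfidf_transformer.idf_, index=feature_names, columns=["idf_weight"]
)
print(df_idf_weights.sort_values(by="idf_weight").head(10))
```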
Computing TF-IDF and Extracting Keywords
Once we have our IDF computed, we are ready to compute TF-IDF and extract the top keywords. In this example, we will extract the top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to those of data/stackoverflow-data-idf.json as we saw above. We will start by reading our test file, extracting the necessary fields (title and body), and getting the texts into a list.
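Reading the test set the same way as before (a sketch; the variable names are my own):

```python
# read the 500 test questions and build the same combined text field
df_test = pd.read_json("data/stackoverflow-test.json", lines=True)
df_test['text'] = df_test['title'] + " " + df_test['body']

# pull the texts into a plain list
docs_test = df_test['text'].tolist()
```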
The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over it to extract the top-n items along with the corresponding feature names. In the example below, we are extracting keywords for the first document in our test set.
The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index, it's really easy to look up the corresponding word, as you can see in extract_topn_from_vector(...) where we do feature_vals.append(feature_names[idx]).
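Putting it together, here is a sketch of sort_coo(...), extract_topn_from_vector(...), and the driving code; the function names come from the notebook, but the exact bodies here are my reconstruction:

```python
def sort_coo(coo_matrix):
    # pair each column index with its tf-idf score and sort by score
    # (descending), keeping the index so we can map back to the word
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    # keep only the top-n items and look up the word for each column index
    sorted_items = sorted_items[:topn]
    score_vals, feature_vals = [], []
    for idx, score in sorted_items:
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
    return dict(zip(feature_vals, score_vals))

# tf-idf vector for the first document in the test set
tf_idf_vector = tfidf_transformer.transform(cv.transform([docs_test[0]]))

# sort by score and pull out the top 10 keywords
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

for word, score in keywords.items():
    print(word, score)
```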
The top keywords above actually make sense: the question talks about eclipse, maven, integrate, war, and tomcat, which are all unique to this specific question. There are a couple of keywords that could have been eliminated, such as possibility and perhaps even project. You can do this by adding more common words to your stop list, and you can even create your own stop list, very specific to your domain, as described here.
Now let's look at keywords generated for a much longer question:
Generate keywords for a batch of documents
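To extract keywords for all 500 test questions at once, a batched version of the loop above could look like this (a sketch under the same assumptions):

```python
# tf-idf matrix for the whole test set
tf_idf_matrix = tfidf_transformer.transform(cv.transform(docs_test))

# top 10 keywords per document
results = []
for i in range(tf_idf_matrix.shape[0]):
    sorted_items = sort_coo(tf_idf_matrix[i].tocoo())
    results.append(extract_topn_from_vector(feature_names, sorted_items, 10))

print(results[0])
```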
Voila! Now you can extract important keywords from any type of text!