GitHub Repository: kavgan/nlp-in-practice
Path: blob/master/tf-idf/Keyword Extraction with TF-IDF and SKlearn.ipynb

Extracting Important Keywords from Text with TF-IDF and Python's Scikit-Learn

Back in 2006, when I had to use TF-IDF for keyword extraction in Java, I ended up writing all of the code from scratch: neither data science nor GitHub was a thing back then, and libraries were limited. The world is much different today. There are several libraries and open-source projects on GitHub that provide decent implementations of TF-IDF. If you don't need a lot of control over how the TF-IDF math is computed, I would highly recommend re-using libraries from well-known packages such as Spark's MLlib or Python's scikit-learn.

The one problem I noticed with these libraries is that they are meant as a pre-step for other tasks like clustering, topic modeling and text classification. TF-IDF can actually be used to extract important keywords from a document and get a sense of what characterizes it. For example, if you are dealing with Wikipedia articles, you can use tf-idf to extract words that are unique to a given article. These keywords can serve as a very simple summary of a document, be used for text analytics (when we look at the keywords in aggregate), act as candidate labels for a document, and more.

In this article, I will show you how you can use scikit-learn's tf-idf modules to extract the top keywords for a given document. We will specifically do this on a Stack Overflow dataset.

Dataset

Since we used some pretty clean user reviews in my previous tutorials, in this example we will be using a Stack Overflow dataset, which is slightly noisier and simulates what you could be dealing with in real life. You can find this dataset in my tutorial repo. Notice that there are two files: the larger file with [20,000 posts](https://github.com/kavgan/data-science-tutorials/tree/master/tf-idf/data) is used to compute the Inverse Document Frequency (IDF), and the smaller file with 500 posts will be used as a test set for us to extract keywords from. This dataset is based on the publicly available Stack Overflow dump on Google's BigQuery.

Let's take a peek at our dataset. The code below reads one JSON string per line from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and total number of posts. Here, lines=True simply means we are treating each line in the text file as a separate JSON string, so the JSON on line 1 is not related to the JSON on line 2.

import pandas as pd

# read json into a dataframe
df_idf = pd.read_json("data/stackoverflow-data-idf.json", lines=True)

# print schema
print("Schema:\n\n", df_idf.dtypes)
print("Number of questions,columns=", df_idf.shape)
Schema:

id                            int64
title                        object
body                         object
answer_count                  int64
comment_count                 int64
creation_date                object
last_activity_date           object
last_editor_display_name     object
owner_display_name           object
owner_user_id               float64
post_type_id                  int64
score                         int64
tags                         object
view_count                    int64
accepted_answer_id          float64
favorite_count              float64
last_edit_date               object
last_editor_user_id         float64
community_owned_date         object
dtype: object
Number of questions,columns= (20000, 19)

Take note that this Stack Overflow dataset contains 19 fields including post title, body, tags, dates and other metadata which we don't quite need for this tutorial. What we are mostly interested in is the body and title, which are our sources of text. We will now create a field that combines both body and title so we have everything in one place. We will also print the text entry at index 2 of our new field just to see what the text looks like.

import re

def pre_process(text):

    # lowercase
    text = text.lower()

    # remove html tags
    text = re.sub("</?.*?>", " <> ", text)

    # remove special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)

    return text

df_idf['text'] = df_idf['title'] + df_idf['body']
df_idf['text'] = df_idf['text'].apply(lambda x: pre_process(x))

# show the 'text' entry at index 2
df_idf['text'][2]
'gradle command line i m trying to run a shell script with gradle i currently have something like this def test project tasks create test exec commandline bash c bash c my file dir script sh the problem is that i cannot run this script because i have spaces in my dir name i have tried everything e g commandline bash c bash c my file dir script sh tokenize commandline bash c bash c my file dir script sh commandline bash c new stringbuilder append bash append c my file dir script sh commandline bash c bash c my file dir script sh file dir file c my file dir script sh commandline bash c bash dir getabsolutepath im using windows bit and if i use a path without spaces the script runs perfectly therefore the only issue as i can see is how gradle handles spaces '

Hmm, this doesn't look very pretty with all the code fragments still in there, but that's the point. Even in such a mess we can extract some great stuff. While you could eliminate the code sections from the text entirely, we will keep them for this tutorial for the sake of simplicity.
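If you did want to drop them, one simple option is to strip the <pre> and <code> sections from the raw body before pre_process runs. Below is a minimal sketch, assuming a hypothetical helper remove_code_blocks that is not part of this tutorial's pipeline:

import re

def remove_code_blocks(text):
    # drop everything inside <pre>...</pre> and <code>...</code> blocks
    text = re.sub(r"<pre>.*?</pre>", " ", text, flags=re.DOTALL)
    text = re.sub(r"<code>.*?</code>", " ", text, flags=re.DOTALL)
    return text

# usage: strip the code blocks first, then apply the same pre_process as before
# df_idf['text'] = (df_idf['title'] + df_idf['body']).apply(lambda x: pre_process(remove_code_blocks(x)))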

Creating the IDF

CountVectorizer to create a vocabulary and generate word counts

The next step is to start the counting process. We can use CountVectorizer to create a vocabulary from all the text in df_idf['text'] and generate counts for each row. The result of the last two lines is a sparse matrix representation of the counts: each column represents a word in the vocabulary, each row represents a document in our dataset, and the values are the word counts. Note that with this representation, the count for a word is 0 if the word did not appear in the corresponding document.

from sklearn.feature_extraction.text import CountVectorizer
import re

def get_stop_words(stop_file_path):
    """load stop words"""
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

# load a set of stop words
stopwords = get_stop_words("resources/stopwords.txt")

# get the text column
docs = df_idf['text'].tolist()

# create a vocabulary of words,
# ignore words that appear in 85% of documents,
# eliminate stop words
cv = CountVectorizer(max_df=0.85, stop_words=stopwords)
word_count_vector = cv.fit_transform(docs)

Now let's check the shape of the resulting matrix. Notice that the shape below is (20000, 124901) because we have 20,000 documents in our dataset (the rows) and a vocabulary size of 124,901, meaning we have 124,901 unique words (the columns) in our dataset minus the stopwords. In some text mining applications, such as clustering and text classification, we limit the size of the vocabulary. It's really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer.

word_count_vector.shape
(20000, 124901)
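To get a feel for what this sparse matrix holds, you can look up the count of a specific word in a specific document. A minimal sketch, where the word 'gradle' and row 2 (the gradle question we printed earlier) are just illustrative choices:

# cv.vocabulary_ maps a word to its column index in the sparse matrix
col = cv.vocabulary_.get('gradle')
if col is not None:
    # row 2 is the gradle question we looked at earlier
    print(word_count_vector[2, col])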

Let's limit our vocabulary size to 10,000

cv = CountVectorizer(max_df=0.85, stop_words=stopwords, max_features=10000)
word_count_vector = cv.fit_transform(docs)
word_count_vector.shape
(20000, 10000)

Now, let's look at 10 words from our vocabulary. Sweet, these are mostly programming related.

list(cv.vocabulary_.keys())[:10]
['serializing', 'private', 'struct', 'public', 'class', 'contains', 'properties', 'string', 'serialize', 'attempt']

We can also get the vocabulary by using get_feature_names()

list(cv.get_feature_names())[2000:2015]
['customization', 'customize', 'customized', 'customlog', 'customview', 'cut', 'cv', 'cv_', 'cval', 'cvc', 'cw', 'cwd', 'cx', 'cx_oracle', 'cxf']
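Note that cv.vocabulary_ and cv.get_feature_names() are two views of the same mapping: vocabulary_ goes from a word to its column index, while get_feature_names() goes from a column index back to the word. A quick sketch of the round trip, using 'serialize' (one of the words we saw above) as an example:

# word -> column index
idx = cv.vocabulary_['serialize']

# column index -> word; prints the index and 'serialize'
print(idx, cv.get_feature_names()[idx])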

TfidfTransformer to Compute Inverse Document Frequency (IDF)

In the code below, we are essentially taking the sparse matrix from CountVectorizer and generating the IDF when we invoke fit. An extremely important point to note here is that the IDF should be based on a large corpus that is representative of the texts you will be extracting keywords from. I've seen several articles on the Web that compute the IDF using a handful of documents. To understand why the IDF should be based on a fairly large collection, please read this page from Stanford's IR book.

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

Let's look at some of the IDF values:

tfidf_transformer.idf_
array([ 7.37717703, 9.80492526, 9.51724319, ..., 8.82409601, 10.21039037, 9.51724319])
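These raw IDF values are easier to interpret once they are mapped back to words. With smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, so common terms get low IDF and rare terms get high IDF. A quick sketch to peek at both ends of the spectrum:

import numpy as np

feature_names = cv.get_feature_names()

# column indices sorted from lowest to highest IDF
idf_sorted = np.argsort(tfidf_transformer.idf_)

# terms that appear in many documents (low IDF, least informative)
print("lowest idf:", [feature_names[i] for i in idf_sorted[:10]])

# terms that appear in very few documents (high IDF, most distinctive)
print("highest idf:", [feature_names[i] for i in idf_sorted[-10:]])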

Computing TF-IDF and Extracting Keywords

Once we have our IDF computed, we are ready to compute TF-IDF and extract the top keywords. In this example, we will extract the top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to those of data/stackoverflow-data-idf.json as we saw above. We will start by reading our test file, extracting the necessary fields (title and body) and getting the texts into a list.

# read test docs into a dataframe and concatenate title and body
df_test = pd.read_json("data/stackoverflow-test.json", lines=True)
df_test['text'] = df_test['title'] + df_test['body']
df_test['text'] = df_test['text'].apply(lambda x: pre_process(x))

# get test docs into a list
docs_test = df_test['text'].tolist()
docs_title = df_test['title'].tolist()
docs_body = df_test['body'].tolist()
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""

    # use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        # keep track of the feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    # create a dict of feature -> score
    results = {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]] = score_vals[idx]

    return results

The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over it to extract the top-n items with the corresponding feature names. In the example below, we are extracting keywords for the first document in our test set.

The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index, it's really easy to look up the corresponding word, as you can see in extract_topn_from_vector(...) where we do feature_vals.append(feature_names[idx]).

# you only need to do this once
feature_names = cv.get_feature_names()

# get the document that we want to extract keywords from
doc = docs_test[0]

# generate tf-idf for the given document
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

# sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())

# extract only the top n; n here is 10
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

# now print the results
print("\n=====Title=====")
print(docs_title[0])
print("\n=====Body=====")
print(docs_body[0])
print("\n===Keywords===")
for k in keywords:
    print(k, keywords[k])
=====Title=====
Integrate War-Plugin for m2eclipse into Eclipse Project

=====Body=====
<p>I set up a small web project with JSF and Maven. Now I want to deploy on a Tomcat server. Is there a possibility to automate that like a button in Eclipse that automatically deploys the project to Tomcat?</p>

<p>I read about a the <a href="http://maven.apache.org/plugins/maven-war-plugin/" rel="nofollow noreferrer">Maven War Plugin</a> but I couldn't find a tutorial how to integrate that into my process (eclipse/m2eclipse).</p>

<p>Can you link me to help or try to explain it. Thanks.</p>

===Keywords===
eclipse 0.593
war 0.317
integrate 0.281
maven 0.273
tomcat 0.27
project 0.239
plugin 0.214
automate 0.157
jsf 0.152
possibility 0.146

From the keywords above, the top keywords actually make sense: the question talks about eclipse, maven, integrate, war and tomcat, which are all unique to this specific question. There are a couple of keywords that could have been eliminated, such as possibility and perhaps even project. You can do this by adding more common words to your stop list, and you can even create your own stop list that is very specific to your domain, as described here and sketched below.
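As a rough sketch of that idea, you could extend the generic stop list with a few domain-specific words and re-fit the vectorizer and transformer. The extra words and variable names below are purely illustrative, not a tuned list:

# add a few domain-specific words to the generic stop list (example words only)
custom_stopwords = set(stopwords).union(["possibility", "project", "using", "problem"])

# re-create the vocabulary and re-fit the IDF with the enlarged stop list
cv_custom = CountVectorizer(max_df=0.85, stop_words=list(custom_stopwords), max_features=10000)
word_counts_custom = cv_custom.fit_transform(docs)
tfidf_custom = TfidfTransformer(smooth_idf=True, use_idf=True).fit(word_counts_custom)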

# put the common code into several methods
def get_keywords(idx):

    # generate tf-idf for the given document
    tf_idf_vector = tfidf_transformer.transform(cv.transform([docs_test[idx]]))

    # sort the tf-idf vectors by descending order of scores
    sorted_items = sort_coo(tf_idf_vector.tocoo())

    # extract only the top n; n here is 10
    keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

    return keywords

def print_results(idx, keywords):
    # now print the results
    print("\n=====Title=====")
    print(docs_title[idx])
    print("\n=====Body=====")
    print(docs_body[idx])
    print("\n===Keywords===")
    for k in keywords:
        print(k, keywords[k])

Now let's look at keywords generated for a much longer question:

idx = 120
keywords = get_keywords(idx)
print_results(idx, keywords)
=====Title=====
SQL Import Wizard - Error

=====Body=====
<p>I have a CSV file that I'm trying to import into SQL Management Server Studio.</p>

<p>In Excel, the column giving me trouble looks like this: <a href="https://i.stack.imgur.com/pm0uS.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/pm0uS.png" alt="enter image description here"></a></p>

<p>Tasks > import data > Flat Source File > select file</p>

<p><a href="https://i.stack.imgur.com/G4b6I.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/G4b6I.png" alt="enter image description here"></a></p>

<p>I set the data type for this column to DT_NUMERIC, adjust the DataScale to 2 in order to get 2 decimal places, but when I click over to Preview, I see that it's clearly not recognizing the numbers appropriately:</p>

<p><a href="https://i.stack.imgur.com/NZhiQ.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/NZhiQ.png" alt="enter image description here"></a></p>

<p>The column mapping for this column is set to type = decimal; precision 18; scale 2.</p>

<p>Error message: Data Flow Task 1: Data conversion failed. The data conversion for column "Amount" returned status value 2 and status text "The value could not be converted because of a potential loss of data.". (SQL Server Import and Export Wizard)</p>

<p>Can someone identify where I'm going wrong here? Thanks!</p>

===Keywords===
column 0.365
import 0.286
data 0.283
wizard 0.27
decimal 0.227
conversion 0.224
sql 0.217
status 0.164
file 0.147
appropriately 0.142

Generate keywords for a batch of documents

# generate tf-idf for all documents in your list. docs_test has 500 documents
tf_idf_vector = tfidf_transformer.transform(cv.transform(docs_test))

results = []
for i in range(tf_idf_vector.shape[0]):

    # get the vector for a single document
    curr_vector = tf_idf_vector[i]

    # sort the tf-idf vector by descending order of scores
    sorted_items = sort_coo(curr_vector.tocoo())

    # extract only the top n; n here is 10
    keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

    results.append(keywords)

# pair each test document with its extracted keywords
df = pd.DataFrame(list(zip(docs_test, results)), columns=['doc', 'keywords'])
df
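If you want to keep these keywords around, one option is to simply write the data frame to disk. A minimal sketch, where the output file name is arbitrary:

# persist the keywords alongside the source text for later inspection
df.to_csv("stackoverflow_test_keywords.csv", index=False)

# or just peek at the keywords for the first few test questions
df['keywords'].head()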

Voila! Now you can extract important keywords from any type of text!