Extracting Important Keywords from Text with TF-IDF and Python's Scikit-Learn
Back in 2006, when I had to use TF-IDF for keyword extraction in Java, I ended up writing all of the code from scratch, as neither data science nor GitHub was a thing back then and libraries were limited. The world is much different today. You have several libraries and open-source code on GitHub that provide decent implementations of TF-IDF. If you don't need a lot of control over how the TF-IDF math is computed, I would highly recommend reusing libraries from known packages such as Spark's MLlib or Python's scikit-learn.
The one problem I noticed with these libraries is that they are meant as a pre-step for other tasks like clustering, topic modeling, and text classification. TF-IDF can actually be used to extract important keywords from a document and get a sense of what characterizes it. For example, if you are dealing with Wikipedia articles, you can use TF-IDF to extract words that are unique to a given article. These keywords can serve as a very simple summary of a document, be used for text analytics (when we look at them in aggregate), act as candidate labels for a document, and more.
In this article, I will show you how you can use scikit-learn's TF-IDF modules to extract the top keywords for a given document. We will specifically do this on a Stack Overflow dataset.
Dataset
While some of my previous tutorials used pretty clean user reviews, in this example we will be using a Stack Overflow dataset, which is slightly noisier and simulates what you could be dealing with in real life. You can find this dataset in my tutorial repo. Notice that there are two files: the larger file with [20,000 posts](https://github.com/kavgan/data-science-tutorials/tree/master/tf-idf/data) is used to compute the Inverse Document Frequency (IDF), and the smaller file with 500 posts will be used as a test set for us to extract keywords from. This dataset is based on the publicly available Stack Overflow dump on Google's BigQuery.
Let's take a peek at our dataset. The code below reads one JSON string per line from data/stackoverflow-data-idf.json into a pandas data frame and prints out its schema and the total number of posts. Here, lines=True simply means we are treating each line in the text file as a separate JSON string, so the JSON in line 1 is not related to the JSON in line 2.
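A minimal sketch of what that read step could look like (the printing style is my own):

```python
import pandas as pd

# read one JSON object per line into a DataFrame
df_idf = pd.read_json("data/stackoverflow-data-idf.json", lines=True)

# print the schema and the total number of posts
print("Schema:\n", df_idf.dtypes)
print("Number of posts:", df_idf.shape[0])
```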
Take note that this Stack Overflow dataset contains 19 fields including post title, body, tags, dates, and other metadata which we don't quite need for this tutorial. What we are mostly interested in are the body and title, which are our source of text. We will now create a field that combines body and title so we have everything in one place. We will also print the second text entry in our new field just to see what the text looks like.
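A sketch of the combine-and-peek step, assuming the field names title and body from the schema above:

```python
# combine title and body into a single text field
df_idf['text'] = df_idf['title'] + " " + df_idf['body']

# peek at the second entry (index 1) in the new field
print(df_idf['text'][1])
```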
Hmm, it doesn't look very pretty with all the HTML in there, but that's the point: even out of such a mess we can extract some great stuff. While you could eliminate all HTML and code from the text, we will keep those sections in this tutorial for the sake of simplicity.
Creating the IDF
CountVectorizer to create a vocabulary and generate word counts
The next step is to start the counting process. We can use the CountVectorizer to create a vocabulary from all the text in df_idf['text'] and generate counts for each row in df_idf['text']. The result of the last two lines is a sparse matrix representation of the counts, meaning each column represents a word in the vocabulary and each row represents a document in our dataset, where the values are the word counts. Note that with this representation, the count of a word is 0 if the word did not appear in the corresponding document.
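A minimal sketch of the counting step; removing English stop words matches the vocabulary-minus-stopwords note below, but the exact vectorizer settings are an assumption:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build a vocabulary and count word occurrences per document;
# stop_words='english' drops common English stop words
cv = CountVectorizer(stop_words='english')
word_count_vector = cv.fit_transform(df_idf['text'])
```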
Now let's check the shape of the resulting matrix. Notice that the shape below is (20000, 149391) because we have 20,000 documents in our dataset (the rows) and a vocabulary size of 149,391, meaning we have 149,391 unique words (the columns) in our dataset minus the stopwords. In some text mining applications, such as clustering and text classification, we limit the size of the vocabulary. It's really easy to do this by setting max_features=vocab_size when instantiating CountVectorizer.
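Checking the shape:

```python
# (number of documents, vocabulary size)
print(word_count_vector.shape)  # (20000, 149391)
```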
Let's limit our vocabulary size to 10,000
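Under the same assumptions as above, re-instantiating the vectorizer with a capped vocabulary could look like this:

```python
# keep only the 10,000 most frequent terms
cv = CountVectorizer(stop_words='english', max_features=10000)
word_count_vector = cv.fit_transform(df_idf['text'])

print(word_count_vector.shape)  # (20000, 10000)
```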
Now, let's look at 10 words from our vocabulary. Sweet, these are mostly programming-related. We can also get the full vocabulary by using get_feature_names().
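A sketch of both inspection steps; note that newer scikit-learn versions (1.2+) replace get_feature_names() with get_feature_names_out():

```python
# sample 10 entries from the learned vocabulary (word -> column index)
print(list(cv.vocabulary_.keys())[:10])

# the full vocabulary, ordered by column index
feature_names = cv.get_feature_names()
```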
TfidfTransformer to Compute Inverse Document Frequency (IDF)
In the code below, we essentially take the sparse matrix from CountVectorizer to generate the IDF when we invoke fit. An extremely important point to note here is that the IDF should be based on a large corpus that is representative of the texts you will be extracting keywords from. I've seen several articles on the Web that compute the IDF using a handful of documents. To understand why the IDF should be based on a fairly large collection, please read this page from Stanford's IR book.
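A minimal sketch of the fit step; smooth_idf=True and use_idf=True match scikit-learn's defaults for IDF weighting:

```python
from sklearn.feature_extraction.text import TfidfTransformer

# learn the IDF weights from the word counts of all 20,000 posts
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(word_count_vector)
```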
Let's look at some of the IDF values:
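One way to peek at the learned weights is through the transformer's idf_ attribute (the DataFrame wrapping here is my own):

```python
import pandas as pd

# words with the lowest IDF are the most common across posts
df_idf_weights = pd.DataFrame(
    tfidf_transformer.idf_, index=feature_names, columns=["idf_weight"]
)
print(df_idf_weights.sort_values(by="idf_weight").head(10))
```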
Computing TF-IDF and Extracting Keywords
Once we have our IDF computed, we are ready to compute TF-IDF and extract the top keywords. In this example, we will extract the top keywords for the questions in data/stackoverflow-test.json. This data file has 500 questions with fields identical to those of data/stackoverflow-data-idf.json as we saw above. We will start by reading our test file, extracting the necessary fields (title and body), and getting the texts into a list.
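Reading the test set the same way as before (a sketch; the variable names are my own):

```python
# read the 500 test questions and build the same combined text field
df_test = pd.read_json("data/stackoverflow-test.json", lines=True)
df_test['text'] = df_test['title'] + " " + df_test['body']

# pull the texts into a plain list
docs_test = df_test['text'].tolist()
```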
The next step is to compute the tf-idf value for a given document in our test set by invoking tfidf_transformer.transform(...). This generates a vector of tf-idf scores. Next, we sort the words in the vector in descending order of tf-idf values and then iterate over it to extract the top-n items along with the corresponding feature names. In the example below, we are extracting keywords for the first document in our test set.
The sort_coo(...) method essentially sorts the values in the vector while preserving the column index. Once you have the column index, it's really easy to look up the corresponding word, as you can see in extract_topn_from_vector(...) where we do feature_vals.append(feature_names[idx]).
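Putting it together, here is a sketch of sort_coo(...), extract_topn_from_vector(...), and the driving code; the function names come from the notebook, but the exact bodies here are my reconstruction:

```python
def sort_coo(coo_matrix):
    # pair each column index with its tf-idf score and sort by score
    # (descending), keeping the index so we can map back to the word
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    # keep only the top-n items and look up the word for each column index
    sorted_items = sorted_items[:topn]
    score_vals, feature_vals = [], []
    for idx, score in sorted_items:
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
    return dict(zip(feature_vals, score_vals))

# tf-idf vector for the first document in the test set
tf_idf_vector = tfidf_transformer.transform(cv.transform([docs_test[0]]))

# sort by score and pull out the top 10 keywords
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, 10)

for word, score in keywords.items():
    print(word, score)
```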
The top keywords above actually make sense: the question talks about eclipse, maven, integrate, war, and tomcat, which are all unique to this specific question. There are a couple of keywords that could have been eliminated, such as possibility and perhaps even project. You can do this by adding more common words to your stop list, and you can even create your own stop list, very specific to your domain, as described here.
Now let's look at keywords generated for a much longer question:
Generate keywords for a batch of documents
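To extract keywords for all 500 test questions at once, a batched version of the loop above could look like this (a sketch under the same assumptions):

```python
# tf-idf matrix for the whole test set
tf_idf_matrix = tfidf_transformer.transform(cv.transform(docs_test))

# top 10 keywords per document
results = []
for i in range(tf_idf_matrix.shape[0]):
    sorted_items = sort_coo(tf_idf_matrix[i].tocoo())
    results.append(extract_topn_from_vector(feature_names, sorted_items, 10))

print(results[0])
```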
Voila! Now you can extract important keywords from any type of text!