Guided Practice with Topic Modeling and LDA
Author: Dave Yerrington (SF)
Note: this lab is intended to be completed as a guided exercise with the instructor.
In practice it would be very rare to need to build an unsupervised topic model like LDA from scratch. Luckily for us, sklearn comes with LDA topic modeling functionality. Another popular LDA implementation, which we will explore in this lab, comes from the gensim package.
Let's walk through a brief example of LDA and topic modeling using gensim. We will work with a small collection of documents represented as a list.
1. Load the packages and create the small "documents".
You may need to install the gensim package with pip or conda.
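A minimal setup sketch. The document collection below is a placeholder for illustration; the notebook's original documents may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora, models

# Placeholder "documents" -- swap in the lab's actual text.
documents = [
    "The cat chased the dog around the yard while the dog barked.",
    "My dog and my cat nap together in the sun most afternoons.",
    "The new car handles the road well, even on a long road trip.",
    "We took the car in after the road trip wore the tires down.",
    "Rain and wind made the weather miserable all week.",
    "The weather forecast calls for rain, wind, and more rain.",
]
```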
2. Load stop words either from NLTK or sklearn
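Either source works; a sketch showing both options (NLTK needs a one-time corpus download):

```python
# Option A: NLTK's stop word list (one-time download required).
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Option B: sklearn's built-in frozen set of English stop words.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = set(ENGLISH_STOP_WORDS)
```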
3. Use CountVectorizer to transform our text, taking out the stopwords.
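Something like the following, using sklearn's built-in English stop word list for simplicity:

```python
vectorizer = CountVectorizer(stop_words='english')  # or stop_words=list(stop_words)
X = vectorizer.fit_transform(documents)             # sparse document-term count matrix
```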
4. Extract the tokens that remain after stopword removal.
The .vocabulary_ attribute of the vectorizer is a dictionary mapping each term to its column index. There is also the built-in method .get_feature_names(), which returns the column names in order.
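A quick look at both, assuming the vectorizer from the previous step:

```python
# Term -> column index mapping.
print(vectorizer.vocabulary_)

# Column names in index order. (Newer sklearn renames this .get_feature_names_out().)
tokens = vectorizer.get_feature_names()
print(tokens)
```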
5. Get counts of tokens.
Convert the matrix from the vectorizer to a dense matrix, then sum by column to get the counts per term.
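One way to do it, following that recipe:

```python
import numpy as np

dense = X.todense()                             # documents x terms, dense
counts = np.asarray(dense.sum(axis=0)).ravel()  # column sums = count per term
term_counts = dict(zip(tokens, counts))
```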
6. Set up the vocabulary dictionary
First we need to set up the vocabulary. Gensim's LDA expects our vocabulary to be in a format where the dictionary keys are the column indices and the values are the words themselves.
Create this dictionary below.
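Since sklearn's .vocabulary_ maps term to index, inverting it gives the format gensim wants:

```python
# {column_index: term}, the inverse of vectorizer.vocabulary_
id2word = {idx: term for term, idx in vectorizer.vocabulary_.items()}
```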
7. Create a token-to-id mapping with gensim's corpora.Dictionary
This dictionary class is the more standard way to work with gensim models. There are a few standard steps we should go through:
7.1 Count the frequency of words.
We can do this easily with Python's defaultdict(int), which doesn't require the key to already be in the dictionary before we add to it:
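A sketch using simple lowercase whitespace tokenization:

```python
from collections import defaultdict

frequency = defaultdict(int)   # missing keys start at 0
for document in documents:
    for token in document.lower().split():
        frequency[token] += 1
```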
7.2 Remove any words that appear only once or that appear in the stop word list.
Iterate through the documents and only keep useful words/tokens.
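One pass with a list comprehension, assuming the frequency counts and stop_words set from earlier:

```python
texts = [
    [token for token in document.lower().split()
     if frequency[token] > 1 and token not in stop_words]
    for document in documents
]
```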
7.3 Create the corpora.Dictionary object with the retained tokens.
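For example:

```python
dictionary = corpora.Dictionary(texts)  # assigns an integer id to each retained token
print(dictionary.token2id)              # term -> id mapping
```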
7.4 Use the dictionary.doc2bow() function to convert the texts to bag-of-words representations.
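Applied to each document in turn:

```python
# Each document becomes a sparse list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in texts]
```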
Why should we use this process?
The main advantage is that this dictionary object comes with convenient helper functions.
There are also major performance advantages if you ever want to save your model to a file and load it at a later time. Tokenization can take a while to compute, especially when your text files are large. You can save these precomputed dictionary items to a file, then load them from disk later, which is quite a bit faster. It's also possible to add new documents to your corpus without having to re-tokenize your entire set. This is great for online systems that can take new documents on demand.
As you work with larger text datasets, this is a much better way to handle LDA and other gensim models from a performance point of view.
8. Set up the LDA model
We can create the gensim LDA model object like so:
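A sketch with assumed hyperparameters; num_topics and passes are illustrative choices to tune, not requirements:

```python
lda = models.LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=3,     # assumed topic count for this walkthrough
    passes=20,        # extra passes help convergence on a tiny corpus
    random_state=42,  # for reproducibility
)
```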
9. Look at the topics
The model has a .print_topics function that accepts the number of topics to print and the number of words per topic. The number before each word is the probability of occurrence for that word in the topic.
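For instance, to show all 3 topics with their top 5 words:

```python
for topic_id, topic in lda.print_topics(num_topics=3, num_words=5):
    print(topic_id, topic)
# Each topic prints as a weighted sum, e.g. 0.095*"rain" + 0.081*"weather" + ...
```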
10. Get the topic scores for a document
The .get_document_topics function accepts a bag-of-words representation of a document and returns the scores for each topic.
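For example, scoring the first document (minimum_probability=0.0 forces every topic to be reported):

```python
bow = dictionary.doc2bow(texts[0])
print(lda.get_document_topics(bow, minimum_probability=0.0))
# A list of (topic_id, probability) pairs, one per topic.
```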
11. Label and visualize the topics
Let's come up with some high-level labels. This is the subjective part of LDA. What do the word probabilities that represent topics mean? Let's make some up.
Plot a heatmap of the topic probabilities for each of the documents.
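A sketch using seaborn, with hypothetical topic labels; yours should reflect the top words you actually see:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

topic_labels = ['pets', 'cars', 'weather']   # made-up labels for illustration

# Build a documents x topics matrix of probabilities.
doc_topic = np.zeros((len(corpus), lda.num_topics))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[i, topic_id] = prob

sns.heatmap(pd.DataFrame(doc_topic, columns=topic_labels), annot=True, cmap='Blues')
plt.show()
```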
12. Fit an LDA model with sklearn
Sklearn's LDA model is in the decomposition submodule:
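```python
from sklearn.decomposition import LatentDirichletAllocation
```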
One of the greatest benefits of the sklearn implementation is that it comes with the familiar .fit(), .transform(), and .fit_transform() methods.
12.1 Initialize and fit an sklearn LDA with n_topics=3 on our output from the CountVectorizer.
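A sketch; note that recent sklearn releases name this parameter n_components (n_topics was the older name):

```python
sk_lda = LatentDirichletAllocation(n_components=3, random_state=42)
sk_lda.fit(X)   # X is the CountVectorizer output from step 3
```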
12.2 Print out the topic-word distributions using the .components_ attribute.
Each row of this matrix represents a topic, and the columns are the words. (These are not probabilities).
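A quick inspection, assuming the fitted model above:

```python
print(sk_lda.components_.shape)    # (3 topics, number of terms)
print(sk_lda.components_[:, :10])  # unnormalized topic-word weights
```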
12.3 Use the .transform() method to convert the matrix into the topic scores.
These are the document-topic distributions.
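For example:

```python
doc_topic_sk = sk_lda.transform(X)  # shape (n_documents, 3); each row sums to 1
print(doc_topic_sk.round(3))
```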
13. Further steps
This has been a very basic example, and LDA typically doesn't perform well on very small datasets. Try it yourself on a larger text dataset to see how it behaves. Keep in mind: finding the optimal number of topics can be tricky and subjective.
Generally, you should consider:
How well topics are applied to documents overall
The overall strength of the topics across all documents
Improving preprocessing such as stopword removal
Building a nice web interface to explore your documents (see: LDAExplorer, and pyLDAvis)
These general guidelines should help you tune K, the hyperparameter for the number of topics.