Guided Practice with Topic Modeling and LDA
Author: Dave Yerrington (SF)
Note: this lab is intended to be completed as a guided exercise with the instructor.
In practice it would be very rare to need to build an unsupervised topic model like LDA from scratch. Luckily for us, sklearn comes with LDA topic modeling functionality. Another popular LDA implementation, which we will explore in this lab, comes from the gensim package.
Let's walk through a brief example of LDA and topic modeling using gensim. We will work with a small collection of documents represented as a list.
1. Load the packages and create the small "documents".
You may need to install the gensim package with pip or conda.
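A minimal setup sketch. The document collection below is a placeholder for illustration; the notebook's original documents may differ:

```python
from sklearn.feature_extraction.text import CountVectorizer
from gensim import corpora, models

# Placeholder "documents" -- swap in the lab's actual text.
documents = [
    "The cat chased the dog around the yard while the dog barked.",
    "My dog and my cat nap together in the sun most afternoons.",
    "The new car handles the road well, even on a long road trip.",
    "We took the car in after the road trip wore the tires down.",
    "Rain and wind made the weather miserable all week.",
    "The weather forecast calls for rain, wind, and more rain.",
]
```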
2. Load stop words either from NLTK or sklearn
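Either source works; a sketch showing both options (NLTK needs a one-time corpus download):

```python
# Option A: NLTK's stop word list (one-time download required).
import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Option B: sklearn's built-in frozen set of English stop words.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stop_words = set(ENGLISH_STOP_WORDS)
```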
3. Use CountVectorizer to transform our text, taking out the stopwords.
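Something like the following, using sklearn's built-in English stop word list for simplicity:

```python
vectorizer = CountVectorizer(stop_words='english')  # or stop_words=list(stop_words)
X = vectorizer.fit_transform(documents)             # sparse document-term count matrix
```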
4. Extract the tokens that remain after stopword removal.
The .vocabulary_ attribute of the vectorizer is a dictionary mapping each term to its column index. There is also the built-in method .get_feature_names(), which returns the column names in order.
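A quick look at both, assuming the vectorizer from the previous step:

```python
# Term -> column index mapping.
print(vectorizer.vocabulary_)

# Column names in index order. (Newer sklearn renames this .get_feature_names_out().)
tokens = vectorizer.get_feature_names()
print(tokens)
```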
5. Get counts of tokens.
Convert the matrix from the vectorizer to a dense matrix, then sum by column to get the counts per term.
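One way to do it, following that recipe:

```python
import numpy as np

dense = X.todense()                             # documents x terms, dense
counts = np.asarray(dense.sum(axis=0)).ravel()  # column sums = count per term
term_counts = dict(zip(tokens, counts))
```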
6. Set up the vocabulary dictionary
First we need to set up the vocabulary. Gensim's LDA expects our vocabulary to be in a format where the dictionary keys are the column indices and the values are the words themselves.
Create this dictionary below.
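Since sklearn's .vocabulary_ maps term to index, inverting it gives the format gensim wants:

```python
# {column_index: term}, the inverse of vectorizer.vocabulary_
id2word = {idx: term for term, idx in vectorizer.vocabulary_.items()}
```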
7. Create a token-to-id mapping with gensim's corpora.Dictionary
This dictionary class is the more standard way to work with gensim models. There are a few standard steps we should go through:
7.1 Count the frequency of words.
We can do this easily with Python's defaultdict(int), which doesn't require the key to already be in the dictionary before we add to it:
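A sketch using simple lowercase whitespace tokenization:

```python
from collections import defaultdict

frequency = defaultdict(int)   # missing keys start at 0
for document in documents:
    for token in document.lower().split():
        frequency[token] += 1
```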
7.2 Remove any words that appear only once or that appear in the stop word list.
Iterate through the documents and only keep useful words/tokens.
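One pass with a list comprehension, assuming the frequency counts and stop_words set from earlier:

```python
texts = [
    [token for token in document.lower().split()
     if frequency[token] > 1 and token not in stop_words]
    for document in documents
]
```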
7.3 Create the corpora.Dictionary object with the retained tokens.
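For example:

```python
dictionary = corpora.Dictionary(texts)  # assigns an integer id to each retained token
print(dictionary.token2id)              # term -> id mapping
```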
7.4 Use the dictionary.doc2bow() function to convert the texts to bag-of-words representations.
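Applied to each document in turn:

```python
# Each document becomes a sparse list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in texts]
```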
Why should we use this process?
The main advantage is that this dictionary object comes with convenient helper functions.
There are also major performance advantages if you ever want to save your model to a file and load it at a later time. Tokenization can take a while to compute, especially when your text files are large. You can save these precomputed dictionary items to a file, then load them from disk later, which is quite a bit faster. It's also possible to add new documents to your corpus without having to re-tokenize your entire set. This is great for online systems that can take new documents on demand.
As you work with larger text datasets, this is a much better way to handle LDA and other gensim models from a performance point of view.
8. Set up the LDA model
We can create the gensim LDA model object like so:
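A sketch with assumed hyperparameters; num_topics and passes are illustrative choices to tune, not requirements:

```python
lda = models.LdaModel(
    corpus,
    id2word=dictionary,
    num_topics=3,     # assumed topic count for this walkthrough
    passes=20,        # extra passes help convergence on a tiny corpus
    random_state=42,  # for reproducibility
)
```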
9. Look at the topics
The model has a .print_topics function that accepts the number of topics to print and the number of words per topic. The number before each word is the probability of occurrence for that word in the topic.
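For instance, to show all 3 topics with their top 5 words:

```python
for topic_id, topic in lda.print_topics(num_topics=3, num_words=5):
    print(topic_id, topic)
# Each topic prints as a weighted sum, e.g. 0.095*"rain" + 0.081*"weather" + ...
```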
10. Get the topic scores for a document
The .get_document_topics function accepts a bag-of-words representation of a document and returns the scores for each topic.
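For example, scoring the first document (minimum_probability=0.0 forces every topic to be reported):

```python
bow = dictionary.doc2bow(texts[0])
print(lda.get_document_topics(bow, minimum_probability=0.0))
# A list of (topic_id, probability) pairs, one per topic.
```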
11. Label and visualize the topics
Let's come up with some high-level labels. This is the subjective part of LDA. What do the word probabilities that represent topics mean? Let's make some up.
Plot a heatmap of the topic probabilities for each of the documents.
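A sketch using seaborn, with hypothetical topic labels; yours should reflect the top words you actually see:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

topic_labels = ['pets', 'cars', 'weather']   # made-up labels for illustration

# Build a documents x topics matrix of probabilities.
doc_topic = np.zeros((len(corpus), lda.num_topics))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[i, topic_id] = prob

sns.heatmap(pd.DataFrame(doc_topic, columns=topic_labels), annot=True, cmap='Blues')
plt.show()
```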
12. Fit an LDA model with sklearn
Sklearn's LDA model is in the decomposition submodule:
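```python
from sklearn.decomposition import LatentDirichletAllocation
```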
One of the greatest benefits of the sklearn implementation is that it comes with the familiar .fit(), .transform(), and .fit_transform() methods.
12.1 Initialize and fit an sklearn LDA with n_topics=3 on our output from the CountVectorizer.
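A sketch; note that recent sklearn releases name this parameter n_components (n_topics was the older name):

```python
sk_lda = LatentDirichletAllocation(n_components=3, random_state=42)
sk_lda.fit(X)   # X is the CountVectorizer output from step 3
```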
12.2 Print out the topic-word distributions using the .components_ attribute.
Each row of this matrix represents a topic, and the columns are the words. (These are not probabilities).
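A quick inspection, assuming the fitted model above:

```python
print(sk_lda.components_.shape)    # (3 topics, number of terms)
print(sk_lda.components_[:, :10])  # unnormalized topic-word weights
```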
12.3 Use the .transform() method to convert the matrix into the topic scores.
These are the document-topic distributions.
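For example:

```python
doc_topic_sk = sk_lda.transform(X)  # shape (n_documents, 3); each row sums to 1
print(doc_topic_sk.round(3))
```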
13. Further steps
This has been a very basic example, and LDA typically doesn't perform well on very small datasets. Try it yourself on a larger text dataset to see how it behaves. Keep in mind: finding the optimal number of topics can be tricky and subjective.
Generally, you should consider:
How well topics are applied to documents overall
The overall strength of the topics across all documents
Improving preprocessing such as stopword removal
Building a nice web interface to explore your documents (see: LDAExplorer, and pyLDAvis)
These general guidelines should help you tune K, the hyperparameter for the number of topics.