Topic Modeling: Latent Dirichlet Allocation with gensim
Gensim is a specialized NLP library with a fast LDA implementation and many additional features. We will also use it in the next chapter on word vectors (see the notebook lda_with_gensim for details).
Imports & Settings
Load BBC data
Convert to DataFrame
Create Train & Test Sets
Vectorize train & test sets
LDA with gensim
Using CountVectorizer
Input
Convert sklearn DTM to gensim data structures
Gensim facilitates the conversion of the DTM produced by sklearn into gensim data structures as follows:
Train Model & Review Results
Evaluate Topic Coherence
Topic coherence measures whether the top words in a topic tend to co-occur in the same documents.
It sums a score over each distinct pair of top-ranked words.
The score for a pair is the log of the probability that a document containing at least one instance of the higher-ranked word also contains at least one instance of the lower-ranked word.
Large negative values indicate words that don't co-occur often; values closer to zero indicate that words tend to co-occur more often.
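The pairwise score described above can be sketched in plain Python. This is an illustrative implementation, not gensim's own code: the function name `umass_coherence`, the toy documents, and the additive `smoothing` term (used here to avoid taking the log of zero when a pair never co-occurs) are all assumptions made for the example.

```python
from itertools import combinations
from math import log


def umass_coherence(top_words, documents, smoothing=1.0):
    """Sum log P(lower-ranked word | higher-ranked word) over all word pairs.

    top_words: topic words ordered from highest to lowest rank.
    documents: iterable of tokenized documents (lists of strings).
    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    # combinations() preserves order, so `higher` always outranks `lower`
    for higher, lower in combinations(top_words, 2):
        score += log((doc_freq(higher, lower) + smoothing) / doc_freq(higher))
    return score


# Hypothetical tokenized documents
documents = [
    ["stock", "market", "shares"],
    ["stock", "market", "profit"],
    ["stock", "shares", "profit"],
    ["football", "match", "goal"],
]

coherent = umass_coherence(["stock", "market"], documents)
incoherent = umass_coherence(["stock", "goal"], documents)
```

In this toy corpus, "stock" and "market" share two documents while "stock" and "goal" share none, so the first pair scores closer to zero than the second. Note that with additive smoothing a pair that always co-occurs can score slightly above zero.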
Gensim can evaluate topic coherence, returning the coherence score together with the most important words for each topic: