Topic Modeling: Latent Dirichlet Allocation with sklearn
Imports & Settings
Load BBC data
As before, we work with the BBC data and train an LDA model with five topics using sklearn.decomposition.LatentDirichletAllocation.
Convert to DataFrame
Create Train & Test Sets
Vectorize train & test sets
LDA with sklearn
Persist model
The model tracks the in-sample perplexity during training and stops iterating once this measure stops improving. We can persist and load the result as usual with sklearn objects:
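The training and persistence steps described above might look roughly like the following sketch. It is not the notebook's original code: the toy corpus, the hyperparameter values, and the file name `lda_sketch.pkl` are assumptions made for illustration. Note that scikit-learn only evaluates perplexity during fitting when `evaluate_every` is positive, and it stops early once the change in perplexity falls below `perp_tol`.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in corpus (hypothetical); the notebook uses the BBC articles.
docs = [
    "shares fell as the bank reported losses",
    "the striker scored twice in the cup final",
    "the minister announced a new election pledge",
    "the film won the award for best director",
    "the new phone uses faster broadband software",
] * 4

# Build the document-term matrix of raw token counts.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=5,           # five topics, as in the text
    max_iter=50,              # upper bound on iterations (assumed value)
    learning_method="batch",
    evaluate_every=5,         # check perplexity every 5 iterations
    perp_tol=0.1,             # stop once perplexity changes by less than this
    random_state=42,
)
lda.fit(dtm)

# Persist and reload the fitted model like any other sklearn estimator.
model_path = Path(tempfile.mkdtemp()) / "lda_sketch.pkl"
joblib.dump(lda, model_path)
restored = joblib.load(model_path)

# Document-topic weights from the restored model; each row sums to 1.
doc_topics = restored.transform(dtm)
```

Writing to a temporary directory keeps the sketch free of side effects; in practice you would choose a stable path next to your other results.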
Explore topics & word distributions
Evaluate Fit on Train Set
Evaluate Fit on Test Set
Retrain until perplexity no longer decreases
Compare Train & Test Topic Assignments
Explore misclassified articles
PyLDAVis
LDAvis helps you interpret LDA results by answering three questions:
What is the meaning of each topic?
How prevalent is each topic?
How do topics relate to each other?
Topic visualization facilitates the evaluation of topic quality using human judgment. pyLDAvis is a Python port of LDAvis, which was developed in R and D3.js. We will introduce the key concepts; each LDA implementation notebook contains examples.
pyLDAvis displays the global relationships among topics while also facilitating their semantic evaluation by inspecting the terms most closely associated with each individual topic and, inversely, the topics associated with each term. It also addresses the challenge that terms that are frequent in a corpus tend to dominate the multinomial distribution over words that defines a topic. LDAvis introduces the relevance r of term w to topic t to produce a flexible ranking of key terms using a weight parameter $0 \le \lambda \le 1$.
With $\phi_{wt}$ as the model’s probability estimate of observing the term w for topic t, and $p_w$ as the marginal probability of w in the corpus:
$$r(w, t \mid \lambda) = \lambda \log(\phi_{wt}) + (1 - \lambda) \log\frac{\phi_{wt}}{p_w}$$
The first term measures the degree of association of term w with topic t, and the second term measures the lift or saliency, i.e., how much more likely the term is for the topic than in the corpus.
The tool allows the user to interactively change $\lambda$ to adjust the relevance, which updates the ranking of terms. User studies have found that $\lambda = 0.6$ produces the most plausible results.
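To make the effect of the weight parameter concrete, here is a minimal sketch that computes relevance as a weighted sum of the log topic-term probability and the log lift, then ranks terms at different settings. The probabilities are made-up numbers for a single hypothetical topic, and the helper names `relevance` and `rank_terms` are illustrative rather than part of pyLDAvis.

```python
import math


def relevance(phi_wt, p_w, lam=0.6):
    """Relevance of a term to a topic.

    phi_wt: P(term | topic), p_w: marginal P(term) in the corpus,
    lam: weight between log probability (lam=1) and log lift (lam=0).
    """
    return lam * math.log(phi_wt) + (1 - lam) * math.log(phi_wt / p_w)


# Hypothetical values for one topic: P(term | topic) and P(term).
topic_term_prob = {"said": 0.050, "year": 0.030, "film": 0.020, "oscar": 0.008}
corpus_term_prob = {"said": 0.040, "year": 0.025, "film": 0.004, "oscar": 0.001}


def rank_terms(lam):
    """Return the terms sorted by decreasing relevance for a given weight."""
    scores = {
        w: relevance(topic_term_prob[w], corpus_term_prob[w], lam)
        for w in topic_term_prob
    }
    return sorted(scores, key=scores.get, reverse=True)


# lam = 1 ranks purely by P(term | topic): frequent terms come first.
ranking_by_probability = rank_terms(1.0)  # ['said', 'year', 'film', 'oscar']

# lam = 0 ranks purely by lift: topic-specific terms come first.
ranking_by_lift = rank_terms(0.0)  # ['oscar', 'film', 'said', 'year']

# An intermediate weight blends both criteria.
ranking_blended = rank_terms(0.6)
```

With these numbers, the generic but frequent term "said" leads the ranking at a weight of 1, while the rarer, topic-specific "oscar" leads at a weight of 0.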
Refit using all data
Lambda
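$\lambda$ = 0: how exclusive a word is to a topic - words are ranked on lift P(word | topic) / P(word)
$\lambda$ = 1: how probable a word is to appear in a topic - words are ranked purely on P(word | topic)
The ranking formula is $\lambda \log P(\text{word} \mid \text{topic}) + (1 - \lambda) \log \frac{P(\text{word} \mid \text{topic})}{P(\text{word})}$
User studies suggest $\lambda = 0.6$ works for most people.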
Topics as WordClouds
Visualize topic-word associations per document
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
Japanese mogul arrested for fraud
One of Japan's best-known businessmen was arrested on Thursday on charges of falsifying shareholder information and selling shares based on the false data. Yoshiaki Tsutsumi was once ranked as the world's richest man and ran a business spanning hotels, railways, construction and a baseball team. His is the latest in a series of arrests of top executives in Japan over business scandals. He was taken away in a van outside one of his Prince hotels in Tokyo. There was a time when Mr Tsutsumi seemed untouchable. Inheriting a large property business from his father in the 1960s, he became one of Japan's most powerful industrialists, with close connections to many of the country's leading politicians. He used his wealth and influence to bring the Winter Olympic Games to Nagano in 1998. But last year, he was forced to resign from all the posts he held in his business empire, after being accused of falsifying the share-ownership structure of Seibu Railways, one of his companies. Under Japanese stock market rules, no listed company can be more than 80% owned by its 10 largest shareholders. Now Mr Tsutsumi faces criminal charges and the possibility of a prison sentence because he made it look as if the 10 biggest shareholders owned less than this amount. Seibu Railways has been delisted from the stock exchange, its share value has plunged and it is the target of a takeover bid. Mr Tsutsumi's fall from grace follows the arrests of several other top executives in Japan as the authorities try to curb the murky business practices which were once widespread in Japanese companies. His determination to stay at the top at all costs may have had its roots in his childhood. The illegitimate third son of a rich father, who made his money buying up property as Japan rebuilt after World War II, he has described the demands his father made. "I felt enormous pressure when I dined with him and it was nothing but pain," Tsutsumi told a weekly magazine in 1987. 
"He scolded me for pouring too much soy sauce or told me fruit was not for children. He didn't let me use the silk futon, saying it's a luxury." There have been corporate governance issues at some other Japanese companies too. Last year, twelve managers from Mitsubishi Motors were charged with covering up safety defects in their vehicles and three executives from Japan's troubled UFJ bank were charged with concealing the extent of the bank's bad loans.