Path: blob/master/15_topic_modeling/06_lda_earnings_calls.ipynb
Topic Modeling with Earnings Call Transcripts
Imports & Settings
Load Earnings Call Transcripts
The documents are the result of scraping the SeekingAlpha earnings call transcripts as described in Chapter 3 on Alternative Data.
The transcripts consist of individual statements by company representatives, an operator, and usually a Q&A session with analysts. We treat each of these statements as a separate document and ignore operator statements to obtain 22,766 items with mean and median word counts of 144 and 64, respectively (or as many as you were able to scrape):
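A minimal sketch of how the statement-level corpus and its word-count statistics might be assembled; the directory layout and the speaker/content column names are assumptions about the Chapter 3 scraping output, so adapt them to your data:

```python
from pathlib import Path

import pandas as pd

transcript_dir = Path('data', 'earnings_calls')  # hypothetical location

statements = []
for transcript in transcript_dir.glob('**/content.csv'):
    df = pd.read_csv(transcript)
    df = df[df.speaker.str.lower() != 'operator']  # drop operator statements
    statements.extend(df.content.dropna().tolist())

word_counts = pd.Series(statements).str.split().str.len()
print(f'{len(statements):,} statements | mean words: {word_counts.mean():.0f} '
      f'| median: {word_counts.median():.0f}')
```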
Explore Data
Tokens per document
Most frequent tokens
Preprocess Transcripts
We use spaCy to preprocess these documents as illustrated in Chapter 13 - Working with Text Data and store the cleaned and lemmatized text as a new text file.
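The cleaning step could look as follows; this sketch assumes the statements list from above and the small English model en_core_web_sm:

```python
import spacy

nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def clean(doc):
    # keep lemmatized alphabetic tokens, drop stopwords and punctuation
    return ' '.join(t.lemma_.lower() for t in doc
                    if t.is_alpha and not t.is_stop)

clean_statements = [clean(doc) for doc in nlp.pipe(statements, batch_size=100)]

# persist one cleaned statement per line
with open('earnings_calls_clean.txt', 'w') as f:
    f.write('\n'.join(clean_statements))
```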
Data exploration reveals domain-specific stopwords like 'year' and 'quarter' that we remove in a second step, where we also filter out statements with fewer than 10 words, so that some 16,150 remain.
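A sketch of this second pass; the stopword set below only shows the two terms mentioned above and is not the full list:

```python
# domain-specific stopwords surfaced during exploration (illustrative subset)
DOMAIN_STOPWORDS = {'year', 'quarter'}

filtered = []
for statement in clean_statements:
    tokens = [t for t in statement.split() if t not in DOMAIN_STOPWORDS]
    if len(tokens) >= 10:  # drop very short statements
        filtered.append(' '.join(tokens))
print(f'{len(filtered):,} statements remain')
```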
Vectorize Data
Train & Evaluate LDA Model
Vocab Settings
For illustration, we create a document-term matrix containing terms that appear in between 0.5% and 50% of documents, yielding around 1,560 features.
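A sketch of these vocabulary constraints using scikit-learn's CountVectorizer (the notebook's actual settings may differ slightly):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=.005,  # term appears in at least 0.5% of documents
                             max_df=.5)    # ...and in at most 50%
dtm = vectorizer.fit_transform(filtered)
tokens = vectorizer.get_feature_names_out()  # get_feature_names() before sklearn 1.0
print(dtm.shape)  # roughly (16150, 1560) with the settings reported above
```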
Model Settings
Training a 15-topic model using 25 passes over the corpus takes a bit over two minutes on a 4-core i7. The top 10 words per topic identify several distinct themes that range from obvious financial information to clinical trials (topic 4) and supply chain issues (topic 12).
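A training sketch using gensim; Sparse2Corpus converts the scikit-learn DTM into the streamed format gensim expects:

```python
from gensim.matutils import Sparse2Corpus
from gensim.models import LdaModel

# rows of the sklearn DTM are documents
corpus = Sparse2Corpus(dtm, documents_columns=False)
id2word = dict(enumerate(tokens))

lda = LdaModel(corpus=corpus,
               id2word=id2word,
               num_topics=15,
               passes=25,
               random_state=42)

# top 10 words per topic
for topic_id, top_words in lda.show_topics(num_topics=15, num_words=10):
    print(topic_id, top_words)
```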
Topic Coherence
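Gensim's CoherenceModel scores the trained model; a sketch using the u_mass measure, which works directly on the bag-of-words corpus (measures like c_v require the tokenized texts instead):

```python
from gensim.models import CoherenceModel

coherence_model = CoherenceModel(model=lda, corpus=corpus, coherence='u_mass')
print(f'u_mass coherence: {coherence_model.get_coherence():.3f}')
```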
pyLDAvis
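A sketch of launching the interactive pyLDAvis topic browser; note that the gensim adapter moved to pyLDAvis.gensim_models in pyLDAvis 3.x:

```python
import pyLDAvis
from gensim.corpora import Dictionary
from pyLDAvis.gensim_models import prepare  # pyLDAvis.gensim before v3.0

# pyLDAvis needs a gensim Dictionary; rebuild one from the corpus and mapping
dictionary = Dictionary.from_corpus(corpus, id2word=id2word)

pyLDAvis.enable_notebook()
pyLDAvis.display(prepare(lda, corpus, dictionary))
```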
Show documents most representative of each topic
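One way to surface them is to rank statements by their weight on each topic via the document-topic distribution; a sketch:

```python
import numpy as np

# dense document-topic matrix from the trained model
doc_topics = np.zeros((dtm.shape[0], lda.num_topics))
for i, doc in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(doc, minimum_probability=0):
        doc_topics[i, topic_id] = prob

# print the statement with the highest weight on each topic
for topic_id in range(lda.num_topics):
    best_doc = doc_topics[:, topic_id].argmax()
    print(f'\nTopic {topic_id} ({doc_topics[best_doc, topic_id]:.1%})')
    print(filtered[best_doc][:250])
```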
Review Experiment Results
To illustrate the impact of different parameter settings, we run a few hundred experiments with different DTM constraints and model parameters. More specifically, we let the min_df and max_df parameters range from 50 to 500 words and from 10% to 100% of documents, respectively, using alternately binary and absolute counts. We then train LDA models with 3 to 50 topics, using 1 and 25 passes over the corpus.
The script run_experiments.py lets you train many topic models with different hyperparameters to explore how they impact the results. The script collect_experiments.py combines the results into a results.h5 HDF store.
These results are not included in the repository due to their size, but the charts below display them, and you can rerun the experiments with earnings call transcripts or other text documents of your choice.
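For orientation, a heavily condensed sketch of such a sweep; run_experiments.py adds parallelism, coherence scoring, and result persistence on top of this basic loop:

```python
from itertools import product

from gensim.matutils import Sparse2Corpus
from gensim.models import LdaModel
from sklearn.feature_extraction.text import CountVectorizer

results = []
for min_df, max_df, binary in product([50, 100, 250, 500],
                                      [.1, .25, .5, 1.0],
                                      [False, True]):
    vec = CountVectorizer(min_df=min_df, max_df=max_df, binary=binary)
    dtm_ = vec.fit_transform(filtered)
    corpus_ = Sparse2Corpus(dtm_, documents_columns=False)
    id2word_ = dict(enumerate(vec.get_feature_names_out()))
    for num_topics, passes in product([3, 5, 10, 25, 50], [1, 25]):
        lda_ = LdaModel(corpus=corpus_, id2word=id2word_,
                        num_topics=num_topics, passes=passes)
        results.append({'min_df': min_df, 'max_df': max_df, 'binary': binary,
                        'num_topics': num_topics, 'passes': passes,
                        'perplexity': 2 ** -lda_.log_perplexity(corpus_)})
```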
Parameter Settings: Impact on Perplexity
Parameter Settings: Impact on Coherence
Hyperparameter Impact on Perplexity
Hyperparameter Impact on Topic Coherence
The following charts illustrate the results in terms of topic coherence (higher is better) and perplexity (lower is better). Coherence drops after 25-30 topics, and perplexity similarly increases.
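If you have rerun the experiments, a plot along these lines reproduces the charts; the HDF key and column names are assumptions, so inspect the store produced by collect_experiments.py for the actual layout:

```python
import matplotlib.pyplot as plt
import pandas as pd

with pd.HDFStore('results.h5') as store:
    print(store.keys())         # discover the available keys
    results = store['results']  # hypothetical key

fig, axes = plt.subplots(ncols=2, figsize=(12, 4), sharex=True)
results.groupby('num_topics').coherence.mean().plot(
    ax=axes[0], title='Topic Coherence (higher is better)')
results.groupby('num_topics').perplexity.mean().plot(
    ax=axes[1], title='Perplexity (lower is better)')
fig.tight_layout()
```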