Path: blob/master/15_topic_modeling/07_lda_financial_news.ipynb
Kernel: Python 3
Topic Modeling: Financial News
This notebook contains an example of LDA applied to financial news articles.
Imports & Settings
In [1]:
In [2]:
In [3]:
In [4]:
Helper Viz Functions
In [6]:
In [7]:
In [8]:
Load Financial News
The data is available from Kaggle.
Download and unzip the archive into the data directory in the repository's root folder, then rename the enclosing folder to us-financial-news
and the subfolders so that you get the following directory structure:
In [ ]:
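Once the files are in place, the articles can be read from the nested JSON files. The following is a minimal sketch with a hypothetical `load_articles` helper; it assumes each file is a JSON object with a `text` field, as in the Kaggle us-financial-news-articles dataset (the field name and the exact directory layout are assumptions, not taken from the notebook code).

```python
import json
from pathlib import Path

# adjust to the repo layout described above
DATA_DIR = Path('..', 'data', 'us-financial-news')

def load_articles(data_dir):
    """Yield the body text of each JSON article below data_dir.

    Assumes each file holds a JSON object with a 'text' field
    (hypothetical field name; check the dataset's schema).
    """
    for path in sorted(Path(data_dir).glob('**/*.json')):
        with path.open(encoding='utf-8') as f:
            article = json.load(f)
        text = article.get('text', '')
        if text:                     # skip empty articles
            yield text

articles = list(load_articles(DATA_DIR))
print(f'Done loading {len(articles):,} articles')
```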
We limit the article selection to the following sections in the dataset:
In [9]:
In [10]:
In [11]:
Out[11]:
Done loading 125,964 articles
In [12]:
In [13]:
Out[13]:
Preprocessing with spaCy
In [24]:
In [14]:
In [15]:
Out[15]:
[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f7b43da6fa0>)]
In [16]:
Out[16]:
['tagger', 'parser']
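The outputs above show the pipeline after `nlp.remove_pipe('ner')`: only the tagger and parser remain, since named-entity recognition is not needed for cleaning. The sketch below illustrates the cleaning step; it uses a blank English pipeline so that it runs without a downloaded model, which means it returns lower-cased surface forms rather than the lemmas a full model (e.g. en_core_web_sm) would produce as in the cleaned article shown further down.

```python
import spacy

# blank pipeline: tokenization, stop-word and alphabetic flags only;
# the notebook presumably uses a full model to obtain lemmas
nlp = spacy.blank('en')

def clean(text):
    """Keep lower-cased alphabetic, non-stop-word tokens."""
    return ' '.join(t.lower_ for t in nlp(text)
                    if t.is_alpha and not t.is_stop)

print(clean('The hackers stole personal data from Equifax.'))
```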
In [17]:
In [18]:
Out[18]:
/home/stefan/.pyenv/versions/miniconda3-latest/envs/ml4t/lib/python3.7/site-packages/spacy/language.py:771: DeprecationWarning: [W016] The keyword argument `n_threads` is now deprecated. As of v2.2.2, the argument `n_process` controls parallel inference via multiprocessing.
warnings.warn(Warnings.W016, DeprecationWarning)
[progress log: 0.79% ... 99.23%]
In [19]:
Out[19]:
333868354
Vectorize data
In [20]:
Out[20]:
125964
Explore cleaned data
In [21]:
In [25]:
Out[25]:
In [26]:
Out[26]:
count 125964.000000
mean 354.514091
std 534.782734
min 1.000000
10% 48.000000
20% 85.000000
30% 135.000000
40% 180.000000
50% 225.000000
60% 267.000000
70% 324.000000
80% 413.000000
90% 622.000000
max 17838.000000
dtype: float64
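The decile summary of tokens per cleaned document shown above can be reproduced with pandas; this is a sketch on a three-document stand-in corpus rather than the 125,964 cleaned articles.

```python
import pandas as pd

# stand-in for the cleaned articles
docs = ['treasury secretary testify', 'equifax data breach probe', 'oil']
token_count = pd.Series(docs).str.split().str.len()

# decile summary like the one shown above
print(token_count.describe(percentiles=[d / 10 for d in range(1, 10)]))
```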
In [27]:
In [28]:
Out[28]:
'washington reuters treasury secretary steven mnuchin tuesday say want know consumer financial protection bureau handling probe hack credit bureau equifax report agency acting director pull investigate matter equifax disclose september hacker steal personal datum collect million americans monday reuters report act cfpb chief mick mulvaney brakes agency equifax investigation speak director mulvaney mnuchin tell house representatives financial services committee go discuss tuesday cfpb say examine equifax breach decline detail bureau look equifax datum breach response agency say statement reuters cite people familiar matter report monday cfpb open investigation equifax mulvaney rein work begin predecessor richard cordray mulvaney order subpoenas equifax seek swear testimony executive routine step scale probe source say add cfpb shelve plan ground test equifax protect datum idea back cordray cfpb recently rebuff bank regulator federal reserve federal deposit insurance corp office comptroller currency offer help site exam credit bureau source say republican president donald trump administration seek curb power cfpb create democratic predecessor barack obama protect consumer financial industry abuse agency criticize fiercely industry mulvaney seek new operating fund agency opt instead finance slimme budget shrink reserve fund establish cordray report patrick rucker lindsay dunsmuir editing franklin paul'
Set vocab parameters
In [29]:
In [30]:
Out[30]:
(125964, 3736)
In [31]:
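The (125964, 3736) shape above is the document-term matrix produced by the vocabulary thresholds discussed under "Evaluate results": drop tokens that appear in fewer than 0.1% or more than 25% of documents. A sketch with scikit-learn's CountVectorizer on a small stand-in corpus (the tiny corpus and its words are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# stand-in corpus; the notebook fits on the 125,964 cleaned articles
docs = ['market rally tech', 'market slump oil', 'market rally bank',
        'market oil supply', 'equifax breach data', 'equifax probe data',
        'treasury testify budget', 'treasury budget deficit']

# min_df=.001 / max_df=.25 as used for the notebook's vocabulary
vectorizer = CountVectorizer(min_df=.001, max_df=.25)
dtm = vectorizer.fit_transform(docs)
print(dtm.shape)  # (8, 14): 'market' appears in >25% of docs and is pruned
```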
Train & Evaluate LDA Model
In [32]:
Train models with 5-25 topics
In [33]:
In [36]:
Out[36]:
5
10
15
20
25
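The printed values above suggest a loop over topic counts from 5 to 25. A minimal sketch using scikit-learn's LatentDirichletAllocation with a single online pass, as described under "Evaluate results" (the notebook also times gensim's LdaMulticore below; the stand-in corpus here is illustrative only):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# stand-in corpus; the notebook fits the full document-term matrix
docs = ['treasury secretary testify budget', 'equifax breach personal data',
        'market rally tech stocks', 'oil supply prices slump',
        'consumer protection bureau probe', 'bank earnings beat estimates']
dtm = CountVectorizer().fit_transform(docs)

models = {}
for n_topics in [5, 10, 15, 20, 25]:
    print(n_topics)
    models[n_topics] = LatentDirichletAllocation(
        n_components=n_topics,
        max_iter=1,                 # single pass to limit training time
        learning_method='online',
        random_state=42).fit(dtm)
```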
Evaluate results
We show results for one model trained for 20 topics with a vocabulary of 3,800 tokens based on min_df=0.1% and max_df=25%, using a single pass to avoid lengthy training times. We can use the pyLDAvis topic_info attribute to compute relevance values for lambda=0.6, which produces the following word list:
In [37]:
In [ ]:
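The relevance values mentioned above follow Sievert & Shirley (2014), the metric pyLDAvis uses to rank words per topic: relevance(w, t) = λ·log p(w|t) + (1−λ)·log(p(w|t)/p(w)). A small numpy sketch on toy probabilities (the 2×3 matrix is made up for illustration):

```python
import numpy as np

def relevance(topic_word_probs, marginal_word_probs, lam=0.6):
    """Term relevance (Sievert & Shirley 2014), as used by pyLDAvis:
    lambda * log p(w|t) + (1 - lambda) * log p(w|t)/p(w)."""
    log_phi = np.log(topic_word_probs)
    log_lift = log_phi - np.log(marginal_word_probs)
    return lam * log_phi + (1 - lam) * log_lift

# toy example: 2 topics x 3 words
phi = np.array([[.7, .2, .1],
                [.1, .3, .6]])
p_w = phi.mean(axis=0)              # uniform topic weights for simplicity
top_words = relevance(phi, p_w).argsort(axis=1)[:, ::-1]
print(top_words[:, 0])              # most relevant word per topic
```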
Perplexity
In [41]:
Out[41]:
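Perplexity for the fitted models can be computed with scikit-learn's `perplexity` method; a self-contained sketch on a stand-in corpus (the notebook presumably evaluates the models trained above, ideally on held-out documents rather than the training set):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# stand-in corpus for illustration
docs = ['treasury secretary testify budget', 'equifax breach personal data',
        'market rally tech stocks', 'oil supply prices slump']
dtm = CountVectorizer().fit_transform(docs)

for n_topics in [5, 10, 15]:
    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=1,
                                    learning_method='online',
                                    random_state=42).fit(dtm)
    perplexity = lda.perplexity(dtm)   # lower is better
    print(n_topics, round(perplexity, 1))
```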
pyLDAvis for 15 Topics
In [42]:
Out[42]:
LdaMulticore Timing
In [43]:
Out[43]:
In [45]:
Out[45]:
In [ ]: