Kernel: Python 3
Topic Modeling Workbook
In [54]:
Some helper code to load one of the prebuilt sklearn datasets
The code below loads the dataset. The data come from "newsgroups": primordial blogs where our ancestors on the internet used to go to converse about various subjects.
In [55]:
In [56]:
Out[56]:
Loading dataset...
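The loading cell itself is not shown here, so this is only a minimal sketch of how such a load might look with sklearn's `fetch_20newsgroups`. The helper name `load_posts`, the `random_state` value, and the choice to strip headers, footers and quotes are assumptions, not the notebook's actual code.

```python
from sklearn.datasets import fetch_20newsgroups


def load_posts(n_samples=5000, random_state=1):
    """Return (texts, category_names) for the first n_samples shuffled posts.

    Hypothetical helper: the notebook's own loading code is not shown.
    """
    dataset = fetch_20newsgroups(
        shuffle=True,
        random_state=random_state,
        # Assumption: drop metadata so topics reflect the message bodies.
        remove=("headers", "footers", "quotes"),
    )
    texts = dataset.data[:n_samples]
    # Map each integer target to its newsgroup name, e.g. "talk.religion.misc".
    categories = [dataset.target_names[i] for i in dataset.target[:n_samples]]
    return texts, categories
```

Calling `texts, categories = load_posts()` downloads the data on first use, so it needs network access.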
We have 5000 posts to these various newsgroups
In [57]:
Out[57]:
5000
A sample article
In [58]:
Out[58]:
talk.religion.misc
In [59]:
Out[59]:
/(hudson)
/If someone inflicts pain on themselves, whether they enjoy it or not, they
/are hurting themselves. They may be permanently damaging their body.
That is true. It is also none of your business.
Some people may also reason that by reading the bible and being a Xtian
you are permanently damaging your brain. By your logic, it would be OK
for them to come into your home, take away your bible, and send you off
to "re-education camps" to save your mind from ruin. Are you ready for
that?
/(hudson)
/And why is there nothing wrong with it? Because you say so? Who gave you
/the authority to say that, and set the standard for morality?
Why?
Because:
I am a living, thinking person able to make choices for myself.
I do not "need" you to show me what you think is the way; I have observed
too many errors in your thinking already to trust you to make up the
rules for me.
Because:
I set the standard for my *own* morality, and I permit you to do
the same for yourself. I also do not try to force you to accept my rules.
Because:
Simply because you don't like what other people are doing doesn't give you
the right to stop it, Hudson. We are all aware that you would like for
everyone to be like you. However, it is obnoxious, arrogant thinking like
yours, the "I-know-I'm-morally-right-so-I-can-force-it-on-you" bullshit
that has brought us religious wars, pogroms against Jews, gay-bashing,
and other atrocities by other people who, like you, "knew" they were
morally right.
(me)
/(hudson)
/Aren't you? Aren't you indicating that I should not tell other people what
to do? Aren't you telling me it is wrong for me to do that?
It is not a moral standard that I am presenting you with, Hudson. It is
a key to getting along in life with other people. It is also a point of
respect: I trust other people to be intelligent enough to make their
own choices, and I expect the same to be returned. You, on the other
hand, do not trust them, and want to make the choice for them--whether
they like it or not.
It is also a way to avoid an inconsistency: if you believe that you have
the right to set moral standards for others and interfere in their lives,
then you must, by logic, admit that other people have the same right of
interference in your life.
(Yes, I know; you will say that your religion is correct and tells you that
only agents acting in behalf of your religion have the right of interference.
However, other people will say that you have misinterpreted the Word of
God and that *they* are the actual true believers, and that you are
acting on your own authority. And so it goes).
(hudson)
/Who gave
/you the authority to set such a moral standard for me to tell me that I
/cannot set a moral standard for others?
You can set all the standards that you want, actually. But don't be surprised
if people don't follow you like rats after the Pied Piper.
At the most basic form, I am not going to LET you tell me what to do;
and if necessary, I will beat you to a bloody pulp before I let you actually
interfere in my life.
Let's create our count matrix
In [60]:
Out[60]:
Extracting tf-idf features...
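As a rough illustration of what a count matrix is, here is a sketch using sklearn's `CountVectorizer` on a toy corpus. The toy documents and the use of English stop words are assumptions; `max_features=10000` simply mirrors the 10000 features reported in this notebook's output.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the newsgroup posts (illustrative only).
docs = [
    "hockey hockey team",
    "the hockey season starts soon",
    "god and the bible and religion",
]

# Keep at most 10000 terms and drop common English words (assumption).
vectorizer = CountVectorizer(stop_words="english", max_features=10000)

# Sparse matrix of raw term counts: one row per document, one column per term.
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index.
hockey_col = vectorizer.vocabulary_["hockey"]
print(counts.shape[0], "documents,", counts.shape[1], "terms")
print("'hockey' appears", counts[0, hockey_col], "times in the first document")
```

Each entry is just how often a word occurs in a document, which is the kind of input LDA expects.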
It's pretty easy to fit an LDA model
In [61]:
Out[61]:
Fitting LDA models with tf features, n_samples=5000 and n_features=10000...
/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:294: DeprecationWarning: n_topics has been renamed to n_components in version 0.19 and will be removed in 0.21
DeprecationWarning)
/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:536: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
DeprecationWarning)
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.7, learning_method=None,
learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
mean_change_tol=0.001, n_components=10, n_jobs=1, n_topics=12,
perp_tol=0.1, random_state=11, topic_word_prior=None,
total_samples=1000000.0, verbose=0)
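The warnings above say that `n_topics` was renamed to `n_components` and that the default `learning_method` was changing. A sketch that avoids both warnings by setting these explicitly might look like the following; the toy corpus and the choice of two topics are assumptions made to keep it small.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only).
docs = [
    "hockey team season game",
    "team game player season",
    "god bible religion belief",
    "bible religion god church",
]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,           # number of topics (formerly n_topics)
    learning_method="batch",  # set explicitly instead of relying on the default
    max_iter=10,
    random_state=11,
)
lda.fit(counts)

# components_ holds one row of word weights per topic.
print(lda.components_.shape)
```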
Here is a bit of code to extract the actual words in our topics
Remember that these are just words - it is up to you to interpret the topics!
In [62]:
Out[62]:
Topics in LDA model:
Topic #0: edu image format pub images pt package free data version file ray university ed convert bit library jpeg use book
Topic #1: don people know just think like say really said ll ve did didn good make says tell going way time
Topic #2: year team play league new san 00 hockey season traded captain st win games nhl vs division period pittsburgh chicago
Topic #3: year good game just think time right best got team better did like didn don way hit years players player
Topic #4: god does jesus believe true bible christian religion fact life people argument point evidence question law religious atheism way example
Topic #5: people president state think states don government make going mr american rights know money support time want new work countries
Topic #6: like just new use know car don need good thanks ve problem time ground current line used does list want
Topic #7: people israel armenian jews armenians turkish said israeli arab war killed children went government human jewish turks years armenia turkey
Topic #8: greek greece henrik bm island greeks cyprus har turkey kk rockefeller judas den p2 georgia p3 magi p1 db bullets
Topic #9: graphics mail files information send file com color jpeg available thanks ftp gif edu use help code program does address
Topic #10: 25 10 55 16 11 14 12 15 20 18 13 21 17 24 19 27 23 30 37 33
Topic #11: car bike used speed use engine dod cars fast interested software oil ride drive work data power high com driving
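One way to produce a listing like the one above is to sort each topic's word weights and keep the largest. This is a sketch with a hypothetical helper `top_words` and made-up weights, not necessarily the code the notebook ran.

```python
import numpy as np


def top_words(components, feature_names, n_top_words=20):
    """Format the highest-weighted words of each topic (hypothetical helper)."""
    lines = []
    for topic_idx, weights in enumerate(components):
        # Indices of the largest weights, in descending order.
        top = np.argsort(weights)[::-1][:n_top_words]
        words = " ".join(feature_names[i] for i in top)
        lines.append(f"Topic #{topic_idx}: {words}")
    return lines


# Made-up topic-word weights: 2 topics over a 4-word vocabulary.
components = np.array([
    [0.1, 3.0, 2.0, 0.5],
    [4.0, 0.2, 0.3, 1.0],
])
feature_names = ["bible", "hockey", "team", "god"]

for line in top_words(components, feature_names, n_top_words=2):
    print(line)
# Topic #0: hockey team
# Topic #1: bible god
```

With a fitted model you would pass `lda.components_` and the vectorizer's vocabulary in column order.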
Exercise:
Look at the categories above, then look at the topics, and write a one- or two-word description of each topic in a list.
In [63]:
LDA spits out what proportion of each document is about each topic
In [64]:
Out[64]:
<5000x10000 sparse matrix of type '<class 'numpy.int64'>'
with 252582 stored elements in Compressed Sparse Row format>
In [65]:
Out[65]:
array([[4.23016138e-04, 4.97047207e-01, 4.23012612e-04, 4.23031861e-04,
1.82924850e-01, 4.23047659e-04, 4.23022698e-04, 1.58711391e-01,
4.23046526e-04, 1.57932333e-01, 4.23012153e-04, 4.23028856e-04]])
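To see where a row like this comes from, here is a small sketch: sklearn's `transform` (or `fit_transform`) returns one row per document, and each row is a distribution over topics that sums to 1. The toy corpus and the two-topic setting are assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only).
docs = [
    "hockey team season game",
    "team game player season",
    "god bible religion belief",
    "bible religion god church",
]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2, learning_method="batch", random_state=11
)

# One row per document, one column per topic.
doc_topics = lda.fit_transform(counts)

print(doc_topics.shape)
print(np.round(doc_topics.sum(axis=1), 6))  # every row sums to 1
```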
We can throw these into a data frame for easy processing
In [66]:
Out[66]:
In [67]:
Out[67]:
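As a sketch of the data-frame step, the snippet below wraps the single document-topic row printed earlier in a pandas `DataFrame`. The column names (`topic_0` through `topic_11`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# The document-topic row shown in the output above (12 topics).
row = np.array([[
    4.23016138e-04, 4.97047207e-01, 4.23012612e-04, 4.23031861e-04,
    1.82924850e-01, 4.23047659e-04, 4.23022698e-04, 1.58711391e-01,
    4.23046526e-04, 1.57932333e-01, 4.23012153e-04, 4.23028856e-04,
]])

# Hypothetical column names, one per topic.
columns = [f"topic_{i}" for i in range(row.shape[1])]
df = pd.DataFrame(row, columns=columns)

# The topic with the largest share for each document.
dominant = df.idxmax(axis=1)
print(dominant.iloc[0])  # topic_1
```

About half of this document is assigned to topic 1, with topics 4, 7 and 9 sharing most of the rest.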
Why do this? Well, now perhaps we can cluster the articles!
In [68]:
In [69]:
Out[69]:
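The clustering code is not shown, so the following is only one possible approach: k-means on the document-topic proportions. The use of `KMeans`, the number of clusters, and the synthetic document-topic matrix are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic document-topic matrix: 20 documents dominated by the first
# topic and 20 dominated by the last one (each row sums to 1).
group_a = rng.dirichlet([10, 1, 1], size=20)
group_b = rng.dirichlet([1, 1, 10], size=20)
doc_topics = np.vstack([group_a, group_b])

# Assumption: k-means with two clusters; any clustering method could be used.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_topics)

print(labels)
```

Documents with similar topic mixes end up with the same cluster label, which is what makes the topic proportions useful as features.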
Now let's compare
In [70]:
Out[70]:
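One simple way to compare cluster assignments with the true newsgroup of each post is a cross-tabulation. This sketch uses `pandas.crosstab` on a handful of made-up labels; it is not necessarily the comparison the notebook performed.

```python
import pandas as pd

# Made-up example: true newsgroup and assigned cluster for five posts.
categories = pd.Series(
    ["rec.sport.hockey", "rec.sport.hockey",
     "talk.religion.misc", "talk.religion.misc", "rec.autos"],
    name="category",
)
clusters = pd.Series([0, 0, 1, 1, 0], name="cluster")

# Rows are true categories, columns are clusters, cells are post counts.
table = pd.crosstab(categories, clusters)
print(table)
```

If the clusters track the newsgroups well, each row has most of its counts in a single column.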
In [71]:
Out[71]:
In [72]:
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x110f14d30>