YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_15/Topic Modeling Workbook - (done).ipynb
1904 views
Kernel: Python 3

Topic Modeling Workbook

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation  # there are two LDAs in sklearn
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

n_samples = 5000
n_features = 10000
n_topics = 12
n_top_words = 20

Some Helper Code to load in one of the prebuilt sklearn datasets

The code below loads in the dataset. These data are from "newsgroups" - primordial blogs, where our ancestors on the internet used to go to converse about various subjects.

categories = ['comp.graphics', 'rec.autos', 'rec.motorcycles',
              'rec.sport.baseball', 'rec.sport.hockey', 'sci.electronics',
              'talk.politics.mideast', 'talk.politics.misc',
              'talk.religion.misc', 'alt.atheism']

print("Loading dataset...")
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             categories=categories)
data_samples = dataset.data[:n_samples]
Loading dataset...

We have 5000 posts to these various newsgroups

len(data_samples)
5000

A sample article

print(dataset.target_names[dataset.target[1]])
talk.religion.misc
print(data_samples[1])
/(hudson) /If someone inflicts pain on themselves, whether they enjoy it or not, they /are hurting themselves. They may be permanently damaging their body. That is true. It is also none of your business. Some people may also reason that by reading the bible and being a Xtian you are permanently damaging your brain. By your logic, it would be OK for them to come into your home, take away your bible, and send you off to "re-education camps" to save your mind from ruin. Are you ready for that? /(hudson) /And why is there nothing wrong with it? Because you say so? Who gave you /the authority to say that, and set the standard for morality? Why? Because: I am a living, thinking person able to make choices for myself. I do not "need" you to show me what you think is the way; I have observed too many errors in your thinking already to trust you to make up the rules for me. Because: I set the standard for my *own* morality, and I permit you to do the same for yourself. I also do not try to force you to accept my rules. Because: Simply because you don't like what other people are doing doesn't give you the right to stop it, Hudson. We are all aware that you would like for everyone to be like you. However, it is obnoxious, arrogant thinking like yours, the "I-know-I'm-morally-right-so-I-can-force-it-on-you" bullshit that has brought us religious wars, pogroms against Jews, gay-bashing, and other atrocities by other people who, like you, "knew" they were morally right. (me) /(hudson) /Aren't you? Aren't you indicating that I should not tell other people what to do? Aren't you telling me it is wrong for me to do that? It is not a moral standard that I am presenting you with, Hudson. It is a key to getting along in life with other people. It is also a point of respect: I trust other people to be intelligent enough to make their own choices, and I expect the same to be returned. 
You, on the other hand, do not trust them, and want to make the choice for them--whether they like it or not. It is also a way to avoid an inconsistency: if you believe that you have the right to set moral standards for others and interfere in their lives, then you must, by logic, admit that other people have the same right of interference in your life. (Yes, I know; you will say that your religion is correct and tells you that only agents acting in behalf of your religion have the right of interference. However, other people will say that you have misinterpreted the Word of God and that *they* are the actual true believers, and that you are acting on your own authority. And so it goes). (hudson) /Who gave /you the authority to set such a moral standard for me to tell me that I /cannot set a moral standard for others? You can set all the standards that you want, actually. But don't be surprised if people don't follow you like rats after the Pied Piper. At the most basic form, I am not going to LET you tell me what to do; and if necessary, I will beat you to a bloody pulp before I let you actually interfere in my life.

Let's create our count matrix

print("Extracting tf (raw count) features...")
tf_vectorizer = CountVectorizer(max_features=n_features, stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
Extracting tf (raw count) features...

It's pretty easy to fit an LDA model

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..." % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_topics,
                                learning_method='online',
                                random_state=11)
lda.fit(tf)
Fitting LDA models with tf features, n_samples=5000 and n_features=10000...
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method='online',
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=12, n_jobs=1,
             perp_tol=0.1, random_state=11, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
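For reference, here is a minimal, self-contained sketch of the same fit-and-transform call pattern on a tiny toy corpus (the documents below are made up purely for illustration). It uses `n_components`, the name that replaced `n_topics` in scikit-learn 0.19:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, just to show the call pattern
docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "the stock market fell today",
        "investors sold stocks and bonds"]

vec = CountVectorizer(stop_words='english')
counts = vec.fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)  # one row per doc, one column per topic
print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` is that document's topic distribution and sums to 1.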

Here is a bit of code to extract the actual words in our topics

Remember that these are just words - it is up to you to interpret the topics!

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message, '\n')
    print('\n')

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Topics in LDA model:

Topic #0: edu image format pub images pt package free data version file ray university ed convert bit library jpeg use book

Topic #1: don people know just think like say really said ll ve did didn good make says tell going way time

Topic #2: year team play league new san 00 hockey season traded captain st win games nhl vs division period pittsburgh chicago

Topic #3: year good game just think time right best got team better did like didn don way hit years players player

Topic #4: god does jesus believe true bible christian religion fact life people argument point evidence question law religious atheism way example

Topic #5: people president state think states don government make going mr american rights know money support time want new work countries

Topic #6: like just new use know car don need good thanks ve problem time ground current line used does list want

Topic #7: people israel armenian jews armenians turkish said israeli arab war killed children went government human jewish turks years armenia turkey

Topic #8: greek greece henrik bm island greeks cyprus har turkey kk rockefeller judas den p2 georgia p3 magi p1 db bullets

Topic #9: graphics mail files information send file com color jpeg available thanks ftp gif edu use help code program does address

Topic #10: 25 10 55 16 11 14 12 15 20 18 13 21 17 24 19 27 23 30 37 33

Topic #11: car bike used speed use engine dod cars fast interested software oil ride drive work data power high com driving
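A note on what `print_top_words` is ranking: each row of `model.components_` holds unnormalized pseudo-counts for the words in that topic. To read them as probabilities P(word | topic), normalize each row. A small sketch with a hypothetical 2-topic, 3-word matrix:

```python
import numpy as np

# Hypothetical components_ matrix: 2 topics x 3 words of pseudo-counts
comp = np.array([[2.0, 1.0, 1.0],
                 [0.5, 0.5, 4.0]])

# Normalize each row so it sums to 1 -> P(word | topic)
word_probs = comp / comp.sum(axis=1, keepdims=True)

# Rank words within each topic, highest probability first
top_idx = word_probs.argsort(axis=1)[:, ::-1]
print(top_idx[:, 0])  # top word index per topic: [0 2]
```

The ranking is the same whether you sort the raw pseudo-counts or the normalized probabilities, which is why the helper above can sort `components_` directly.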

Exercise:

Look at the categories above, then look at the topics, and write a one- or two-word description of each topic in a list.

topics = ['software', '?', 'hockey', 'baseball', 'religion', 'politics',
          'auto mechanics', 'israel', 'greece', 'software', '?',
          'auto mechanics']

LDA spits out what fraction of each document is about each topic

tf
<5000x10000 sparse matrix of type '<class 'numpy.int64'>' with 252582 stored elements in Compressed Sparse Row format>
first_example = tf.getrow(1)
topic_probabilities = lda.transform(first_example)
topic_probabilities
array([[4.23016138e-04, 4.97047207e-01, 4.23012612e-04, 4.23031861e-04, 1.82924850e-01, 4.23047659e-04, 4.23022698e-04, 1.58711391e-01, 4.23046526e-04, 1.57932333e-01, 4.23012153e-04, 4.23028856e-04]])
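Notice that the twelve values above form a probability distribution over topics. A quick sanity check on a hypothetical row shaped like that output:

```python
import numpy as np

# Hypothetical doc-topic row, rounded from output like the one above
probs = np.array([4.23e-04, 4.97e-01, 4.23e-04, 4.23e-04,
                  1.83e-01, 4.23e-04, 4.23e-04, 1.59e-01,
                  4.23e-04, 1.58e-01, 4.23e-04, 4.23e-04])

print(probs.sum())     # close to 1
print(probs.argmax())  # 1 -> the dominant topic for this document
```

Here the document is mostly topic #1, with smaller weights on topics #4, #7, and #9, which matches a free-form argumentative post like the sample article.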

We can throw these into a data frame for easy processing

articles = []
for doc in range(tf.shape[0]):
    articles.append(list(lda.transform(tf.getrow(doc))[0]))

# this is without our topic names added in
df = pd.DataFrame.from_records(articles)  # columns = topics
df['article'] = data_samples
df.head()
articles = []
for doc in range(tf.shape[0]):
    articles.append(list(lda.transform(tf.getrow(doc))[0]))

# added in our topic names as columns
df = pd.DataFrame.from_records(articles)
df.columns = topics
df['article'] = data_samples
df.head()
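As an aside, `transform` accepts the whole sparse matrix at once, so the row-by-row loop can be replaced with a single call, which is much faster. A self-contained sketch on a hypothetical small count matrix:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 5 docs x 6 terms, all counts positive
rng = np.random.RandomState(0)
counts = csr_matrix(rng.randint(1, 5, size=(5, 6)))

lda_toy = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

# One transform call handles every document at once
doc_topic = lda_toy.transform(counts)
df_toy = pd.DataFrame(doc_topic, columns=['t0', 't1', 't2'])
print(df_toy.shape)  # (5, 3)
```

On the real data this would be `pd.DataFrame(lda.transform(tf))` instead of the loop.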

Why do this? Well now perhaps we can cluster the articles!

from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, random_state=1)
km.fit(df.drop('article', axis=1))

## Let's add these back to our data frame
df['clusters'] = km.predict(df.drop('article', axis=1))

## Let's also add in our original labels, so we can compare the clusters to the labels:
df['labels'] = dataset.target[:n_samples]
num_targets = range(len(dataset.target_names))
mapping_dict = dict(zip(num_targets, dataset.target_names))
df['labels'] = df['labels'].map(mapping_dict)
df[['article','clusters','labels']].head()

Now let's compare

results = (df[['article', 'clusters', 'labels']]
           .groupby('labels')
           .apply(lambda x: x['clusters'].value_counts()))
pd.DataFrame(results)
final_df = (pd.DataFrame(results)
            .reset_index()
            .pivot(columns='level_1', values='clusters', index='labels')
            .fillna(0))
final_df
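The groupby/pivot dance above builds a labels-by-clusters count matrix; `pd.crosstab` produces the same kind of table in one call. A sketch on hypothetical labels and cluster ids:

```python
import pandas as pd

# Hypothetical labels and cluster assignments
toy = pd.DataFrame({'labels':   ['a', 'a', 'b', 'b', 'b'],
                    'clusters': [0,   1,   1,   1,   0]})

# Rows = labels, columns = cluster ids, cells = co-occurrence counts
ct = pd.crosstab(toy['labels'], toy['clusters'])
print(ct)
```

On the real data frame this would be `pd.crosstab(df['labels'], df['clusters'])`, which could feed straight into the heatmap below.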
%matplotlib inline
import seaborn as sns

sns.heatmap(final_df)
<matplotlib.axes._subplots.AxesSubplot at 0x110f14d30>
[Heatmap of cluster counts per newsgroup label]
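Beyond eyeballing the heatmap, cluster/label agreement can be scored with a single number such as the adjusted Rand index, which is 1.0 for a perfect match and near 0 for random assignments. A sketch with hypothetical labels and cluster ids:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical true labels vs. cluster ids
labels   = ['auto', 'auto', 'hockey', 'hockey', 'religion']
clusters = [0, 0, 1, 1, 2]

score = adjusted_rand_score(labels, clusters)
print(score)  # 1.0 -> the partitions agree exactly
```

For the real data, `adjusted_rand_score(df['labels'], df['clusters'])` would quantify how well the LDA-based KMeans clusters recover the original newsgroup categories.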