YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_15/Topic Modeling Workbook - (done).ipynb
1904 views
Kernel: Python 3

Topic Modeling Workbook

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation  # there are two LDAs in sklearn
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

n_samples = 5000
n_features = 10000
n_topics = 12
n_top_words = 20

Some Helper Code to load in one of the prebuilt sklearn datasets

The code below loads in the dataset. These data are from "newsgroups" - primordial blogs, where our ancestors on the internet used to go to converse about various subjects.

categories = ['comp.graphics', 'rec.autos', 'rec.motorcycles',
              'rec.sport.baseball', 'rec.sport.hockey', 'sci.electronics',
              'talk.politics.mideast', 'talk.politics.misc',
              'talk.religion.misc', 'alt.atheism']

print("Loading dataset...")
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             categories=categories)
data_samples = dataset.data[:n_samples]
Loading dataset...

We have 5000 posts to these various newsgroups

len(data_samples)
5000

A sample article

print(dataset.target_names[dataset.target[1]])
talk.religion.misc
print(data_samples[1])
/(hudson) /If someone inflicts pain on themselves, whether they enjoy it or not, they /are hurting themselves. They may be permanently damaging their body. That is true. It is also none of your business. Some people may also reason that by reading the bible and being a Xtian you are permanently damaging your brain. By your logic, it would be OK for them to come into your home, take away your bible, and send you off to "re-education camps" to save your mind from ruin. Are you ready for that? /(hudson) /And why is there nothing wrong with it? Because you say so? Who gave you /the authority to say that, and set the standard for morality? Why? Because: I am a living, thinking person able to make choices for myself. I do not "need" you to show me what you think is the way; I have observed too many errors in your thinking already to trust you to make up the rules for me. Because: I set the standard for my *own* morality, and I permit you to do the same for yourself. I also do not try to force you to accept my rules. Because: Simply because you don't like what other people are doing doesn't give you the right to stop it, Hudson. We are all aware that you would like for everyone to be like you. However, it is obnoxious, arrogant thinking like yours, the "I-know-I'm-morally-right-so-I-can-force-it-on-you" bullshit that has brought us religious wars, pogroms against Jews, gay-bashing, and other atrocities by other people who, like you, "knew" they were morally right. (me) /(hudson) /Aren't you? Aren't you indicating that I should not tell other people what to do? Aren't you telling me it is wrong for me to do that? It is not a moral standard that I am presenting you with, Hudson. It is a key to getting along in life with other people. It is also a point of respect: I trust other people to be intelligent enough to make their own choices, and I expect the same to be returned. 
You, on the other hand, do not trust them, and want to make the choice for them--whether they like it or not. It is also a way to avoid an inconsistency: if you believe that you have the right to set moral standards for others and interfere in their lives, then you must, by logic, admit that other people have the same right of interference in your life. (Yes, I know; you will say that your religion is correct and tells you that only agents acting in behalf of your religion have the right of interference. However, other people will say that you have misinterpreted the Word of God and that *they* are the actual true believers, and that you are acting on your own authority. And so it goes). (hudson) /Who gave /you the authority to set such a moral standard for me to tell me that I /cannot set a moral standard for others? You can set all the standards that you want, actually. But don't be surprised if people don't follow you like rats after the Pied Piper. At the most basic form, I am not going to LET you tell me what to do; and if necessary, I will beat you to a bloody pulp before I let you actually interfere in my life.

Let's create our count matrix

print("Extracting tf (raw count) features...")
tf_vectorizer = CountVectorizer(max_features=n_features, stop_words='english')
tf = tf_vectorizer.fit_transform(data_samples)
Extracting tf (raw count) features...

It's pretty easy to fit an LDA model

print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..." % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_topics,
                                learning_method='online',
                                random_state=11)
lda.fit(tf)
Fitting LDA models with tf features, n_samples=5000 and n_features=10000...
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method='online',
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=12, n_jobs=1,
             perp_tol=0.1, random_state=11, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
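For reference, here is a minimal, self-contained sketch of the same fit-and-transform call pattern on a tiny toy corpus (the documents below are made up purely for illustration). It uses `n_components`, the name that replaced `n_topics` in scikit-learn 0.19:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, just to show the call pattern
docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "the stock market fell today",
        "investors sold stocks and bonds"]

vec = CountVectorizer(stop_words='english')
counts = vec.fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(counts)  # one row per doc, one column per topic
print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` is that document's topic distribution and sums to 1.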

Here is a bit of code to extract the actual words in our topics

Remember that these are just words - it is up to you to interpret the topics!

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message, '\n')
    print('\n')

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
Topics in LDA model:

Topic #0: edu image format pub images pt package free data version file ray university ed convert bit library jpeg use book

Topic #1: don people know just think like say really said ll ve did didn good make says tell going way time

Topic #2: year team play league new san 00 hockey season traded captain st win games nhl vs division period pittsburgh chicago

Topic #3: year good game just think time right best got team better did like didn don way hit years players player

Topic #4: god does jesus believe true bible christian religion fact life people argument point evidence question law religious atheism way example

Topic #5: people president state think states don government make going mr american rights know money support time want new work countries

Topic #6: like just new use know car don need good thanks ve problem time ground current line used does list want

Topic #7: people israel armenian jews armenians turkish said israeli arab war killed children went government human jewish turks years armenia turkey

Topic #8: greek greece henrik bm island greeks cyprus har turkey kk rockefeller judas den p2 georgia p3 magi p1 db bullets

Topic #9: graphics mail files information send file com color jpeg available thanks ftp gif edu use help code program does address

Topic #10: 25 10 55 16 11 14 12 15 20 18 13 21 17 24 19 27 23 30 37 33

Topic #11: car bike used speed use engine dod cars fast interested software oil ride drive work data power high com driving
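A note on what `print_top_words` is ranking: each row of `model.components_` holds unnormalized pseudo-counts for the words in that topic. To read them as probabilities P(word | topic), normalize each row. A small sketch with a hypothetical 2-topic, 3-word matrix:

```python
import numpy as np

# Hypothetical components_ matrix: 2 topics x 3 words of pseudo-counts
comp = np.array([[2.0, 1.0, 1.0],
                 [0.5, 0.5, 4.0]])

# Normalize each row so it sums to 1 -> P(word | topic)
word_probs = comp / comp.sum(axis=1, keepdims=True)

# Rank words within each topic, highest probability first
top_idx = word_probs.argsort(axis=1)[:, ::-1]
print(top_idx[:, 0])  # top word index per topic: [0 2]
```

The ranking is the same whether you sort the raw pseudo-counts or the normalized probabilities, which is why the helper above can sort `components_` directly.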

Exercise:

Look at the categories above, then look at the topics, and write a one- or two-word description of each topic in a list.

topics = ['software', '?', 'hockey', 'baseball', 'religion', 'politics',
          'auto mechanics', 'israel', 'greece', 'software', '?',
          'auto mechanics']

LDA spits out what fraction of each document is about each topic

tf
<5000x10000 sparse matrix of type '<class 'numpy.int64'>' with 252582 stored elements in Compressed Sparse Row format>
first_example = tf.getrow(1)
topic_probabilities = lda.transform(first_example)
topic_probabilities
array([[4.23016138e-04, 4.97047207e-01, 4.23012612e-04, 4.23031861e-04, 1.82924850e-01, 4.23047659e-04, 4.23022698e-04, 1.58711391e-01, 4.23046526e-04, 1.57932333e-01, 4.23012153e-04, 4.23028856e-04]])
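Notice that the twelve values above form a probability distribution over topics. A quick sanity check on a hypothetical row shaped like that output:

```python
import numpy as np

# Hypothetical doc-topic row, rounded from output like the one above
probs = np.array([4.23e-04, 4.97e-01, 4.23e-04, 4.23e-04,
                  1.83e-01, 4.23e-04, 4.23e-04, 1.59e-01,
                  4.23e-04, 1.58e-01, 4.23e-04, 4.23e-04])

print(probs.sum())     # close to 1
print(probs.argmax())  # 1 -> the dominant topic for this document
```

Here the document is mostly topic #1, with smaller weights on topics #4, #7, and #9, which matches a free-form argumentative post like the sample article.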

We can throw these into a data frame for easy processing

articles = []
for doc in range(tf.shape[0]):
    articles.append(list(lda.transform(tf.getrow(doc))[0]))

# this is without our topic names added in
df = pd.DataFrame.from_records(articles)  # columns = topics
df['article'] = data_samples
df.head()
articles = []
for doc in range(tf.shape[0]):
    articles.append(list(lda.transform(tf.getrow(doc))[0]))

# added in our topic names as columns
df = pd.DataFrame.from_records(articles)
df.columns = topics
df['article'] = data_samples
df.head()
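As an aside, `transform` accepts the whole sparse matrix at once, so the row-by-row loop can be replaced with a single call, which is much faster. A self-contained sketch on a hypothetical small count matrix:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 5 docs x 6 terms, all counts positive
rng = np.random.RandomState(0)
counts = csr_matrix(rng.randint(1, 5, size=(5, 6)))

lda_toy = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

# One transform call handles every document at once
doc_topic = lda_toy.transform(counts)
df_toy = pd.DataFrame(doc_topic, columns=['t0', 't1', 't2'])
print(df_toy.shape)  # (5, 3)
```

On the real data this would be `pd.DataFrame(lda.transform(tf))` instead of the loop.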

Why do this? Well now perhaps we can cluster the articles!

from sklearn.cluster import KMeans

km = KMeans(n_clusters=10, random_state=1)
km.fit(df.drop('article', axis=1))

## Let's add these back to our data frame
df['clusters'] = km.predict(df.drop('article', axis=1))

## Let's also add in our original labels, so we can compare the clusters to the labels:
df['labels'] = dataset.target[:n_samples]
num_targets = range(len(dataset.target_names))
mapping_dict = dict(zip(num_targets, dataset.target_names))
df['labels'] = df['labels'].map(mapping_dict)
df[['article','clusters','labels']].head()

Now let's compare

results = (df[['article', 'clusters', 'labels']]
           .groupby('labels')
           .apply(lambda x: x['clusters'].value_counts()))
pd.DataFrame(results)
final_df = (pd.DataFrame(results)
            .reset_index()
            .pivot(columns='level_1', values='clusters', index='labels')
            .fillna(0))
final_df
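The groupby/pivot dance above builds a labels-by-clusters count matrix; `pd.crosstab` produces the same kind of table in one call. A sketch on hypothetical labels and cluster ids:

```python
import pandas as pd

# Hypothetical labels and cluster assignments
toy = pd.DataFrame({'labels':   ['a', 'a', 'b', 'b', 'b'],
                    'clusters': [0,   1,   1,   1,   0]})

# Rows = labels, columns = cluster ids, cells = co-occurrence counts
ct = pd.crosstab(toy['labels'], toy['clusters'])
print(ct)
```

On the real data frame this would be `pd.crosstab(df['labels'], df['clusters'])`, which could feed straight into the heatmap below.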
%matplotlib inline
import seaborn as sns

sns.heatmap(final_df)
<matplotlib.axes._subplots.AxesSubplot at 0x110f14d30>
[Heatmap of cluster counts per newsgroup label]
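Beyond eyeballing the heatmap, cluster/label agreement can be scored with a single number such as the adjusted Rand index, which is 1.0 for a perfect match and near 0 for random assignments. A sketch with hypothetical labels and cluster ids:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical true labels vs. cluster ids
labels   = ['auto', 'auto', 'hockey', 'hockey', 'religion']
clusters = [0, 0, 1, 1, 2]

score = adjusted_rand_score(labels, clusters)
print(score)  # 1.0 -> the partitions agree exactly
```

For the real data, `adjusted_rand_score(df['labels'], df['clusters'])` would quantify how well the LDA-based KMeans clusters recover the original newsgroup categories.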