GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_13/practice/solution-code/intro_to_nlp-lab-solutions.ipynb
Kernel: Python 2

Natural Language Processing Lab

Authors: Dave Yerrington (SF)


In this lab we will further explore sklearn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

# Standard Data Science Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Getting that SKLearn Dataset
from sklearn.datasets import fetch_20newsgroups

1. Use the fetch_20newsgroups function to download a training and testing set.

Look up the function documentation for how to grab the data.

You should pull these categories:

  • alt.atheism

  • talk.religion.misc

  • comp.graphics

  • sci.space

Also remove the headers, footers, and quotes using the remove keyword argument of the function.

# Extracting information from the data's dictionary-like format

# Categories of newsgroup posts we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))
Downloading 20news dataset. This may take a few minutes. Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note we were able to call 'train' and 'test' in subset).

Let's inspect them.

  1. What data type is data_train?

  • Is it like a list, like a dictionary, or something else?

  • How many data points does it contain?

  • Inspect the first data point, what does it look like?
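The questions above can be tried out first on a small stand-in object, without downloading anything. The sketch below assumes the returned object behaves like `sklearn.utils.Bunch`, a dictionary whose keys are also readable as attributes; the toy documents and labels are hypothetical, not taken from the newsgroup data.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical toy stand-in for the object returned by fetch_20newsgroups
toy = Bunch(
    data=["first post", "second post", "third post"],
    target=np.array([1, 0, 1]),
    target_names=["alt.atheism", "comp.graphics"],
)

# Key access and attribute access return the same underlying object
same_object = toy["data"] is toy.data

# Count documents per class and map the integer labels back to their names
counts = np.bincount(toy.target, minlength=len(toy.target_names))
per_class = dict(zip(toy.target_names, counts.tolist()))

print(same_object, per_class)
```

The same `np.bincount` call applied to `data_train['target']` would show how balanced the four categories are.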

type(data_train)
sklearn.utils.Bunch
list(data_train.keys())
['data', 'filenames', 'target_names', 'target', 'DESCR', 'description']
# Making sure our Data and Target columns are equal length
len(data_train['data'])
2034
len(data_train['target'])
2034
# Let's check out what our data actually looks like.
data_train['data'][0]
"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation. But if you save\nto a .PRJ file their positions/orientation are preserved. Does anyone\nknow why this information is not stored in the .3DS file? Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

3. Bag of Words model

Let's train a model using a simple count vectorizer.

  1. Initialize a standard CountVectorizer and fit the training data

  • how big is the feature dictionary?

  • repeat eliminating english stop words

  • is the dictionary smaller?

  • transform the training data using the trained vectorizer

  • evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer

    • you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it

BONUS:

  • try a couple modifications:

    • restrict the max_features

    • change max_df and min_df
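A minimal sketch of how the three bonus parameters shrink the vocabulary, using a tiny hypothetical corpus rather than the newsgroup data. The helper `vocab_size` is introduced here only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical corpus, not the newsgroup data
docs = [
    "the shuttle reached orbit",
    "the shuttle launch was delayed",
    "the image file format is jpeg",
    "the orbit of the moon",
]

def vocab_size(**kwargs):
    """Fit a CountVectorizer with the given options and return its vocabulary size."""
    return len(CountVectorizer(**kwargs).fit(docs).vocabulary_)

full = vocab_size()                      # every token of two or more characters
top3 = vocab_size(max_features=3)        # keep only the 3 most frequent terms
no_ubiquitous = vocab_size(max_df=0.75)  # drop terms found in more than 75% of documents
repeated_only = vocab_size(min_df=2)     # drop terms found in fewer than 2 documents

print(full, top3, no_ubiquitous, repeated_only)
```

Here `max_df` removes only "the", which occurs in every document, while `min_df=2` keeps only the terms that appear in at least two documents.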

# What does the target variable look like
data_train['target']
array([1, 3, 2, ..., 1, 0, 1])
# NLP Using a count vectorizer.
from sklearn.feature_extraction.text import CountVectorizer

# Setting the vectorizer just like we would set a model
cvec = CountVectorizer()

# Fitting the vectorizer on our training data
cvec.fit(data_train['data'])
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
# Let's check how many features (vocabulary terms) the fitted vectorizer produced.
# Note: scikit-learn 1.0 and later expose this method as get_feature_names_out().
len(cvec.get_feature_names())
26879
# Let's use the stop_words argument to remove words like "and, the, a"
cvec = CountVectorizer(stop_words='english')

# Fit our vectorizer using our train data
cvec.fit(data_train['data'])

# and check out the length of the vectorized data after
len(cvec.get_feature_names())
26576
# Transforming our x_train data using our fit cvec.
# And converting the result to a DataFrame.
X_train = pd.DataFrame(cvec.transform(data_train['data']).todense(),
                       columns=cvec.get_feature_names())
# We still have the same number of rows, but the vectorizer has turned every word
# (or whatever it treats as a word) from our training data into a feature. Think of
# dummy-coded variables for words, except holding counts rather than just occurrences.
X_train.shape
(2034, 26576)
# Which words appear the most?
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending=False).head(20)
space       1061
people       793
god          745
don          730
like         682
just         675
does         600
know         592
think        584
time         546
image        534
edu          501
use          468
good         449
data         444
nasa         419
graphics     414
jesus        411
say          409
way          387
dtype: int64
names = data_train['target_names']
names
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
# What are we trying to predict
y_train = data_train['target']
# Let's look at the most common words in each category
common_words = []
for i in range(4):
    word_count = X_train[y_train == i].sum(axis=0)
    print(names[i], "most common words")
    cw = word_count.sort_values(ascending=False).head(20)
    print(cw)
    common_words.extend(cw.index)
    print()
alt.atheism most common words
god         405
people      330
don         262
think       215
just        209
does        207
atheism     199
say         174
believe     163
like        162
atheists    162
religion    156
jesus       155
know        154
argument    148
time        135
said        131
true        131
bible       121
way         120
dtype: int64

comp.graphics most common words
image        484
graphics     410
edu          297
jpeg         267
file         265
use          225
data         219
files        217
images       212
software     212
program      199
ftp          189
available    185
format       178
color        174
like         167
know         165
pub          161
gif          160
does         157
dtype: int64

sci.space most common words
space        989
nasa         374
launch       267
earth        222
like         222
data         216
orbit        201
time         197
shuttle      192
just         189
satellite    187
lunar        182
moon         168
new          158
program      156
don          151
year         146
people       142
mission      141
use          134
dtype: int64

talk.religion.misc most common words
god          329
people       267
jesus        256
don          162
bible        160
just         159
think        151
christian    151
say          149
know         149
does         147
did          132
like         131
good         131
life         118
way          118
believe      117
said         103
point        101
time          99
dtype: int64
# Converting our vectorized test data to a DataFrame,
# using the cvec which we fit earlier (transform only, no re-fitting)
X_test = pd.DataFrame(cvec.transform(data_test['data']).todense(),
                      columns=cvec.get_feature_names())
# Getting our Y test information
y_test = data_test['target']
# Import and fit our logistic regression and test it too
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
0.74501108647450109
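The cells above convert the vectorizer output to a dense DataFrame, which is convenient for inspecting word counts. For fitting, the sparse matrix can be passed to the model directly. The sketch below illustrates this on a hypothetical toy corpus, and also shows that transforming test text with the already-fitted vectorizer keeps the column count fixed. The names `train_docs`, `test_docs`, `X_train_sp`, and `X_test_sp` are illustrative assumptions.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpora standing in for data_train / data_test
train_docs = ["nasa launch orbit", "jpeg image file",
              "orbit moon launch", "image format gif"]
train_y = [0, 1, 0, 1]
test_docs = ["moon orbit telescope", "gif image viewer"]

cvec = CountVectorizer()
X_train_sp = cvec.fit_transform(train_docs)  # learn the vocabulary on training text only
X_test_sp = cvec.transform(test_docs)        # reuse it; words unseen in training are dropped

lr = LogisticRegression()
lr.fit(X_train_sp, train_y)                  # sparse input is accepted directly
pred = lr.predict(X_test_sp)

print(X_train_sp.shape, X_test_sp.shape, pred)
```

Re-fitting the vectorizer on the test documents would produce a different vocabulary, so the test columns would no longer line up with the coefficients learned during training.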

4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

  1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features

  • does the score improve with respect to the count vectorizer?

  • print out the number of features for this model

  • Initialize a TF-IDF Vectorizer and repeat the analysis above

  • print out the number of features for this model

BONUS:

  • Change the parameters of either (or both!) models to improve your score
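One way to approach the bonus is a cross-validated grid search over the pipeline's parameters, so that tuning uses only the training data and the test set stays untouched until the end. This is a sketch on a hypothetical toy corpus; the grid values are arbitrary illustrations, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four "space" documents and four "graphics" documents
docs = [
    "nasa shuttle launch orbit", "moon orbit mission launch",
    "satellite orbit earth nasa", "lunar mission shuttle earth",
    "jpeg image file format", "gif image color file",
    "graphics software image format", "color graphics file software",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

# make_pipeline names each step after its lowercased class name
param_grid = {
    "tfidfvectorizer__sublinear_tf": [True, False],
    "tfidfvectorizer__ngram_range": [(1, 1), (1, 2)],
    "logisticregression__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

print(search.best_params_, search.best_score_)
```

On the real data the same pattern would be `search.fit(data_train['data'], y_train)`, followed by a single evaluation of `search.best_estimator_` on the test set.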

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
# A pipeline chains a vectorizer and a model into a single estimator,
# so the same sequence of steps runs every time we use it.
# Calling model.fit fits the vectorizer and then the classifier;
# later calls such as model.predict reuse both fitted steps.
model = make_pipeline(
    HashingVectorizer(stop_words='english', non_negative=True, n_features=2**16),
    LogisticRegression(),
)
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print("Number of features:", 2**16)
/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/sklearn/feature_extraction/hashing.py:94: DeprecationWarning: the option non_negative=True has been deprecated in 0.19 and will be removed in version 0.21.
  " in version 0.21.", DeprecationWarning)
0.743532889874
Number of features: 65536
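Unlike CountVectorizer, a HashingVectorizer learns no vocabulary: each token is mapped to a column by a hash function, so the number of columns is whatever `n_features` says. The sketch below shows this on two hypothetical toy documents. It uses `alternate_sign=False`, which to my knowledge is the closest replacement for the deprecated `non_negative=True` option in newer scikit-learn releases; treat that substitution as an assumption rather than an exact equivalent.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["nasa launch orbit", "jpeg image file"]  # hypothetical toy documents

# No fit step is needed: the column index comes from a hash of each token.
# norm=None keeps raw counts so the values are easy to read.
hv = HashingVectorizer(n_features=2**16, alternate_sign=False, norm=None)
X = hv.transform(docs)

n_columns = X.shape[1]                       # always n_features, whatever the corpus
all_non_negative = X.min() >= 0              # alternate_sign=False avoids negative entries
has_vocabulary = hasattr(hv, "vocabulary_")  # nothing is learned, so nothing is stored

print(n_columns, all_non_negative, has_vocabulary)
```

Because no vocabulary is stored, there is no way to map a column back to a word, which is why the solution prints `2**16` directly instead of counting feature names.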
model = make_pipeline(
    TfidfVectorizer(stop_words='english', sublinear_tf=True, max_df=0.5, max_features=1000),
    LogisticRegression(),
)
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print("Number of features:", len(model.steps[0][1].get_feature_names()))
0.728011825573
Number of features: 1000
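To make the TF-IDF weighting less of a black box, the sketch below checks the inverse document frequency that TfidfVectorizer learns on a hypothetical three-document corpus. It assumes the default settings (`smooth_idf=True`), under which scikit-learn documents the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["space launch", "space orbit", "space image"]  # hypothetical toy corpus

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
tfidf.fit(docs)

# Pair each term with its learned idf weight (idf_ is ordered by column index)
terms_by_index = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
idf = dict(zip(terms_by_index, tfidf.idf_))

n_docs = len(docs)
expected_space = np.log((1 + n_docs) / (1 + 3)) + 1  # "space" is in all 3 documents
expected_orbit = np.log((1 + n_docs) / (1 + 1)) + 1  # "orbit" is in 1 document

print(idf["space"], idf["orbit"])
```

A word that appears everywhere gets the minimum weight, while rarer words are weighted up. Note that the run above also caps the vocabulary at 1000 features, so its lower score cannot be attributed to TF-IDF weighting alone.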