GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_13/practice/solution-code/intro_to_nlp-lab-solutions.ipynb

Kernel: Python 2

Natural Language Processing Lab

Authors: Dave Yerrington (SF)

In this lab we will further explore sklearn and NLTK's capabilities for processing text. We will use the 20 Newsgroup dataset, which is provided by sklearn.

In [1]:

# Standard Data Science Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

# Getting that SKLearn Dataset
from sklearn.datasets import fetch_20newsgroups

1. Use the `fetch_20newsgroups` function to download a training and testing set.

Look up the function documentation for how to grab the data.

You should pull these categories:

alt.atheism
talk.religion.misc
comp.graphics
sci.space

Also remove the headers, footers, and quotes using the remove keyword argument of the function.

In [3]:

# Extracting information from the data's dictionary-like format
# Categories of newsgroup posts we want
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

Out[3]:

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)

2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note we were able to call 'train' and 'test' in subset).

Let's inspect them.

What data type is data_train?

Is it like a list, like a dictionary, or something else?
How many data points does it contain?
Inspect the first data point. What does it look like?

In [4]:

type(data_train)

Out[4]:

sklearn.utils.Bunch

In [5]:

list(data_train.keys())

Out[5]:

['data', 'filenames', 'target_names', 'target', 'DESCR', 'description']
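The `Bunch` type shown above behaves like a dictionary whose keys can also be read as attributes. A minimal sketch, using a hand-built `Bunch` with made-up contents (the name `toy` and its values are assumptions for illustration, not the real dataset):

```python
from sklearn.utils import Bunch

# Toy stand-in for the object returned by fetch_20newsgroups.
# The contents are invented for illustration only.
toy = Bunch(
    data=["first post", "second post"],
    target=[0, 1],
    target_names=["alt.atheism", "comp.graphics"],
)

# A Bunch is a dict subclass, so the usual dict methods work.
print(isinstance(toy, dict))
print(sorted(toy.keys()))

# Key access and attribute access return the same object.
print(toy["data"] is toy.data)

# Data and target should line up one-to-one.
print(len(toy.data) == len(toy.target))
```

This is why both `data_train['data']` and `data_train.data` work in the cells that follow.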

In [6]:

# Making sure our data and target have equal length
len(data_train['data'])

Out[6]:

2034

In [7]:

len(data_train['target'])

Out[7]:

2034

In [8]:

# Let's check out what our data actually looks like.
data_train['data'][0]

Out[8]:

"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

3. Bag of Words model

Let's train a model using a simple count vectorizer.

Initialize a standard CountVectorizer and fit the training data

How big is the feature dictionary?
Repeat, eliminating English stop words
Is the dictionary smaller?
Transform the training data using the trained vectorizer
Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
- You will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it

BONUS:

Try a couple of modifications:
- restrict the max_features
- change max_df and min_df
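Before running this on the newsgroup data, the steps above can be sketched on a tiny corpus. The documents and variable names below are invented for illustration; the sketch only shows how a fitted vocabulary is reused on unseen text and how `min_df`, `max_df`, and `max_features` prune it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora invented for illustration; they are not part of the lab data.
train_docs = [
    "the rocket reached orbit",
    "the image file is a jpeg",
    "the rocket image",
]
test_docs = ["a new rocket and a new telescope"]

# Learn the vocabulary from the training documents only.
vec = CountVectorizer(stop_words="english")
vec.fit(train_docs)

# transform() reuses that vocabulary: words that were never seen during
# fitting (such as "telescope") get no column and are silently dropped.
X_test_toy = vec.transform(test_docs)
print(X_test_toy.shape[1] == len(vec.vocabulary_))
print("telescope" in vec.vocabulary_)

# min_df / max_df prune terms by document frequency:
# keep terms found in at least 2 documents but in no more than 90% of them.
pruned = CountVectorizer(min_df=2, max_df=0.9).fit(train_docs)
print(sorted(pruned.vocabulary_))

# max_features keeps only the most frequent terms across the corpus.
capped = CountVectorizer(max_features=2).fit(train_docs)
print(len(capped.vocabulary_))
```

Re-fitting the vectorizer on the test set would produce a different vocabulary, and the columns would no longer match what the classifier was trained on.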

In [9]:

# What does the target variable look like
data_train['target']

Out[9]:

array([1, 3, 2, ..., 1, 0, 1])

In [10]:

# NLP Using a count vectorizer.  
from sklearn.feature_extraction.text import CountVectorizer

In [11]:

# Instantiating the vectorizer just like we would instantiate a model
cvec = CountVectorizer()

# Fitting the vectorizer on our training data
cvec.fit(data_train['data'])

Out[11]:

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [12]:

# Let's check the size of the vocabulary the vectorizer learned
# (on scikit-learn >= 1.0 the method is get_feature_names_out())
len(cvec.get_feature_names())

Out[12]:

26879

In [13]:

# Let's use the stop_words argument to remove words like "and", "the", "a"
cvec = CountVectorizer(stop_words='english')

# Fit our vectorizer using our train data
cvec.fit(data_train['data'])

# and check the size of the vocabulary afterwards
len(cvec.get_feature_names())

Out[13]:

26576

In [14]:

# Transforming our training data using our fitted cvec
# and converting the result to a DataFrame.
X_train = pd.DataFrame(cvec.transform(data_train['data']).todense(),
                       columns=cvec.get_feature_names())

In [15]:

# We still have the same number of rows, but the vectorization has converted every word,
# or what is believed to be a word, from our training data into a feature. Like dummy-coded
# variables for words (except counts rather than just occurrences).

In [16]:

X_train.shape

Out[16]:

(2034, 26576)
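A dense matrix of this shape holds about 54 million cells, almost all of them zero. The vectorizer itself returns a SciPy sparse matrix, and scikit-learn's `LogisticRegression` accepts that directly, so the `.todense()` step is only needed for the pandas-based inspection in this lab. A minimal sketch on invented toy documents (all names and data here are assumptions for illustration):

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy documents and labels invented for illustration.
docs = [
    "rocket launch orbit",
    "jpeg image file",
    "orbit shuttle launch",
    "image color format",
]
labels = [0, 1, 0, 1]

# fit_transform returns a sparse matrix: only non-zero counts are stored.
X_sparse = CountVectorizer().fit_transform(docs)
n_cells = X_sparse.shape[0] * X_sparse.shape[1]
print(sparse.issparse(X_sparse))
print("cells:", n_cells, "stored values:", X_sparse.nnz)

# The classifier can be trained on the sparse matrix without densifying it.
clf = LogisticRegression().fit(X_sparse, labels)
print(clf.predict(X_sparse))
```

On the real data the gap between total cells and stored values is far larger, since each post uses only a small fraction of the vocabulary.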

In [17]:

# Which words appear the most?
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

Out[17]:

space       1061
people       793
god          745
don          730
like         682
just         675
does         600
know         592
think        584
time         546
image        534
edu          501
use          468
good         449
data         444
nasa         419
graphics     414
jesus        411
say          409
way          387
dtype: int64

In [18]:

names = data_train['target_names']
names

Out[18]:

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [19]:

# What are we trying to predict
y_train = data_train['target']

In [20]:

# Let's look through each category's most common words
common_words = []
for i in range(4):
    word_count = X_train[y_train==i].sum(axis=0)
    print(names[i], "most common words")
    cw = word_count.sort_values(ascending = False).head(20)
    print(cw)
    common_words.extend(cw.index)
    print()

Out[20]:

alt.atheism most common words
god         405
people      330
don         262
think       215
just        209
does        207
atheism     199
say         174
believe     163
like        162
atheists    162
religion    156
jesus       155
know        154
argument    148
time        135
said        131
true        131
bible       121
way         120
dtype: int64

comp.graphics most common words
image        484
graphics     410
edu          297
jpeg         267
file         265
use          225
data         219
files        217
images       212
software     212
program      199
ftp          189
available    185
format       178
color        174
like         167
know         165
pub          161
gif          160
does         157
dtype: int64

sci.space most common words
space        989
nasa         374
launch       267
earth        222
like         222
data         216
orbit        201
time         197
shuttle      192
just         189
satellite    187
lunar        182
moon         168
new          158
program      156
don          151
year         146
people       142
mission      141
use          134
dtype: int64

talk.religion.misc most common words
god          329
people       267
jesus        256
don          162
bible        160
just         159
think        151
christian    151
say          149
know         149
does         147
did          132
like         131
good         131
life         118
way          118
believe      117
said         103
point        101
time          99
dtype: int64

In [21]:

# Converting our vectorized test data to a DataFrame
# using the cvec which we fit earlier (transform only, no re-fitting)
X_test = pd.DataFrame(cvec.transform(data_test['data']).todense(),
                      columns=cvec.get_feature_names())

In [22]:

# Getting our Y test information
y_test = data_test['target']

In [23]:

# Import and fit our logistic regression, then score it on the test set
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

Out[23]:

0.74501108647450109
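For scikit-learn classifiers, `score` returns mean accuracy: the fraction of test samples whose predicted label equals the true label. A small sketch with made-up label arrays (the values are assumptions, not results from this lab) shows the same number computed two ways:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration; 4 of the 5 predictions are correct.
y_true_toy = np.array([1, 3, 2, 1, 0])
y_pred_toy = np.array([1, 3, 0, 1, 0])

# Accuracy via scikit-learn's metric function.
acc_sklearn = accuracy_score(y_true_toy, y_pred_toy)

# The same quantity computed by hand: mean of element-wise matches.
acc_manual = (y_true_toy == y_pred_toy).mean()

print(acc_sklearn, acc_manual)
```

With four roughly balanced classes, a score near 0.75 is well above the 0.25 expected from random guessing.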

4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

Initialize a HashingVectorizer and repeat the test with no restriction on the number of features

Does the score improve with respect to the count vectorizer?
Print out the number of features for this model
Initialize a TF-IDF Vectorizer and repeat the analysis above
Print out the number of features for this model

BONUS:

Change the parameters of either (or both!) models to improve your score
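A hashing vectorizer differs from a count vectorizer in that it learns no vocabulary: each token is mapped to a column by a hash function, so the output width is fixed by `n_features` and no fitting is required. The sketch below uses invented toy documents and a deliberately tiny `n_features`. It passes `alternate_sign=False`, which is the option newer scikit-learn releases provide for non-negative output in place of the deprecated `non_negative=True`.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents invented for illustration.
docs = ["rocket launch orbit", "jpeg image file"]

# Tiny n_features so the output is easy to inspect; the lab uses 2**16.
hv = HashingVectorizer(n_features=2**4, alternate_sign=False)

# The vectorizer is stateless, so transform() works without a prior fit().
X_hashed = hv.transform(docs)

print(X_hashed.shape)                # width is always n_features
print(X_hashed.min() >= 0)           # no negative values with alternate_sign=False
print(hasattr(hv, "vocabulary_"))    # nothing was learned from the data
```

The trade-off is that different tokens can collide in the same column, and columns cannot be mapped back to words, so there is no feature-name list to inspect.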

In [24]:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

In [25]:

# A pipeline chains a vectorizer and a model into a single estimator,
# so the same steps are applied consistently every time.
# Calling model.fit fits the vectorizer and then the classifier.
# Calling model.predict transforms the input with the fitted vectorizer
# before passing it to the fitted classifier.
model = make_pipeline(HashingVectorizer(stop_words='english',
                                        non_negative=True,
                                        n_features=2**16),
                      LogisticRegression(),
                      )
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print("Number of features:", 2**16)

Out[25]:

/Users/ricky.hennessy/anaconda/lib/python3.6/site-packages/sklearn/feature_extraction/hashing.py:94: DeprecationWarning: the option non_negative=True has been deprecated in 0.19 and will be removed in version 0.21.
  " in version 0.21.", DeprecationWarning)

0.743532889874
Number of features: 65536


In [26]:

model = make_pipeline(TfidfVectorizer(stop_words='english',
                                      sublinear_tf=True,
                                      max_df=0.5,
                                      max_features=1000),
                      LogisticRegression(),
                      )
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print("Number of features:", len(model.steps[0][1].get_feature_names()))

Out[26]:

0.728011825573
Number of features: 1000
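TF-IDF reweights raw counts so that terms appearing in many documents contribute less than terms concentrated in a few. A minimal sketch on invented toy documents (names and data are assumptions for illustration) inspects the learned inverse document frequencies with the default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents invented for illustration: "space" occurs in every one.
docs = ["space shuttle launch", "space image file", "space orbit data"]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# Map each term to its learned inverse document frequency.
idf = {term: tfidf.idf_[col] for term, col in tfidf.vocabulary_.items()}

# A term found in every document gets the lowest possible weight;
# a term found in only one document gets a higher weight.
print(round(float(idf["space"]), 3), round(float(idf["shuttle"]), 3))

# With the default norm="l2", every row is scaled to unit length.
row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1))).ravel()
print(np.allclose(row_norms, 1.0))
```

In the cell above, `max_df=0.5` goes one step further and drops any term that appears in more than half of the documents, while `max_features=1000` caps the vocabulary, which may explain part of the lower score.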
