
Natural Language Processing Lab

Author: Dave Yerrington (SF)


In this lab we will further explore sklearn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

# Standard Data Science Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Getting that SKLearn Dataset
from sklearn.datasets import fetch_20newsgroups

1. Use the fetch_20newsgroups function to download a training and testing set.

Look up the function's documentation to see how to grab the data.

You should pull these categories:

  • alt.atheism

  • talk.religion.misc

  • comp.graphics

  • sci.space

Also remove the headers, footers, and quotes using the remove keyword argument of the function.

# A:
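A minimal sketch of one possible solution, assuming we store the results in variables named data_train and data_test (names of our choosing, used in later exercises):

from sklearn.datasets import fetch_20newsgroups

# The four categories requested above
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']

# subset='train' / subset='test' select the pre-split sets;
# remove strips headers, footers, and quoted reply text.
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               remove=('headers', 'footers', 'quotes'))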

2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers, and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we selected them by passing 'train' and 'test' to the subset keyword argument).

Let's inspect them.

  1. What data type is data_train?

  • Is it like a list? A dictionary? Something else?

  • How many data points does it contain?

  • Inspect the first data point. What does it look like?

# A:
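One way to inspect it, assuming the data_train variable from the previous step. fetch_20newsgroups returns a sklearn Bunch, a dictionary-like object that also allows attribute access:

print(type(data_train))        # sklearn Bunch: dict-like, not a list
print(data_train.keys())       # 'data', 'filenames', 'target_names', 'target', ...
print(len(data_train.data))    # number of documents (data points)
print(data_train.data[0])      # the first data point is a plain text string
print(data_train.target[0])    # its label: an index into data_train.target_names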

3. Bag of Words model

Let's train a model using a simple count vectorizer.

  1. Initialize a standard CountVectorizer and fit it on the training data.

  • How big is the feature dictionary?

  • Repeat, eliminating English stop words.

  • Is the dictionary smaller?

  • Transform the training data using the trained vectorizer.

  • Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.

    • You will have to transform the test set too. Be careful to use the trained vectorizer without re-fitting it.

BONUS:

  • Try a couple of modifications:

    • Restrict max_features.

    • Change max_df and min_df.

# A:
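A sketch of one way through these steps, reusing data_train and data_test from above (exact scores and vocabulary sizes will vary):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Plain CountVectorizer: the fitted vocabulary is the feature dictionary.
cvec = CountVectorizer()
cvec.fit(data_train.data)
print(len(cvec.vocabulary_))    # size of the feature dictionary

# Removing English stop words shrinks the dictionary.
cvec = CountVectorizer(stop_words='english')
X_train = cvec.fit_transform(data_train.data)
print(len(cvec.vocabulary_))    # smaller than before

# Transform the test set with the already-trained vectorizer -- no re-fitting.
X_test = cvec.transform(data_test.data)

model = LogisticRegression()
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))

# BONUS idea: compare scores with e.g. CountVectorizer(max_features=1000)
# or CountVectorizer(min_df=2, max_df=0.5).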

4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

  1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.

  • Does the score improve with respect to the CountVectorizer?

  • Print out the number of features for this model.

  • Initialize a TF-IDF vectorizer (TfidfVectorizer) and repeat the analysis above.

  • Print out the number of features for this model.

BONUS:

  • Change the parameters of either (or both!) models to improve your score.

# A:
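A sketch of the comparison, again reusing data_train and data_test (scores will vary; the parameter choices in the bonus comments are only suggestions):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# HashingVectorizer is stateless: no vocabulary is fit, and the feature
# space is fixed by the size of the hash table (2**20 columns by default).
hvec = HashingVectorizer()
X_train = hvec.transform(data_train.data)
X_test = hvec.transform(data_test.data)
print(X_train.shape[1])    # number of (hashed) features

model = LogisticRegression()
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))

# TF-IDF reweights raw counts by inverse document frequency.
tvec = TfidfVectorizer(stop_words='english')
X_train = tvec.fit_transform(data_train.data)
X_test = tvec.transform(data_test.data)
print(len(tvec.vocabulary_))    # number of features for this model

model = LogisticRegression()
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))

# BONUS idea: try TfidfVectorizer(ngram_range=(1, 2), min_df=2) or
# HashingVectorizer(n_features=2**16) and compare the scores.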