Natural Language Processing Lab
Authors: Dave Yerrington (SF)
In this lab we will further explore sklearn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.
1. Use the fetch_20newsgroups function to download a training and testing set.
Look up the function documentation for how to grab the data.
You should pull these categories:
alt.atheism
talk.religion.misc
comp.graphics
sci.space
Also remove the headers, footers, and quotes using the remove keyword argument of the function.
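A minimal sketch of this download step, assuming sklearn's fetch_20newsgroups (the variable names data_train and data_test are illustrative):

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# Download the pre-split train and test subsets, stripping metadata
# that would make classification artificially easy.
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               remove=('headers', 'footers', 'quotes'))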
2. Data inspection
We have downloaded a few newsgroup categories and removed headers, footers and quotes.
Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we were able to request 'train' and 'test' via the subset keyword argument).
Let's inspect them.
What data type is data_train? Is it like a list, or like a dictionary, or something else?
How many data points does it contain?
Inspect the first data point. What does it look like?
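One way to answer these questions, assuming the data_train object from the previous step:

# data_train is an sklearn Bunch: a dictionary-like container with
# attribute access (data, target, target_names, ...).
print(type(data_train))
print(data_train.keys())

# Number of documents in the training set.
print(len(data_train.data))

# The first data point is a plain string containing the post text.
print(data_train.data[0])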
3. Bag of Words model
Let's train a model using a simple count vectorizer.
Initialize a standard CountVectorizer and fit the training data.
How big is the feature dictionary?
Repeat, eliminating English stop words.
Is the dictionary smaller?
Transform the training data using the trained vectorizer.
Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
You will have to transform the test set too. Be careful to use the trained vectorizer without re-fitting it.
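A sketch of this count-vectorizer workflow; the exact vocabulary sizes and accuracy depend on your scikit-learn version and its defaults:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit on the training text and check the vocabulary size.
cvec = CountVectorizer()
cvec.fit(data_train.data)
print(len(cvec.vocabulary_))

# Refit with English stop words removed; the vocabulary shrinks.
cvec = CountVectorizer(stop_words='english')
cvec.fit(data_train.data)
print(len(cvec.vocabulary_))

# Transform train and test with the same fitted vectorizer
# (transform only -- no re-fitting on the test set).
X_train = cvec.transform(data_train.data)
X_test = cvec.transform(data_test.data)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))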
BONUS:
Try a couple of modifications:
Restrict max_features.
Change max_df and min_df.
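One possible set of modifications, as a sketch; the specific values below are arbitrary starting points, not tuned results:

from sklearn.feature_extraction.text import CountVectorizer

# Cap the vocabulary at the 1000 most frequent terms.
cvec = CountVectorizer(stop_words='english', max_features=1000)

# Drop terms appearing in more than 50% of documents
# or in fewer than 2 documents.
cvec = CountVectorizer(stop_words='english', max_df=0.5, min_df=2)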
4. Hashing and TF-IDF
Let's see if Hashing or TF-IDF improves the accuracy.
Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
Does the score improve with respect to the CountVectorizer?
Print out the number of features for this model.
Initialize a TfidfVectorizer and repeat the analysis above.
Print out the number of features for this model.
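A sketch of both vectorizers, reusing the same logistic regression evaluation as in the previous step:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# HashingVectorizer maps terms into a fixed-size feature space
# (2**20 = 1048576 columns by default) instead of building a vocabulary.
hvec = HashingVectorizer(stop_words='english')
X_train = hvec.transform(data_train.data)
X_test = hvec.transform(data_test.data)
print(X_train.shape[1])  # number of features

model = LogisticRegression(max_iter=1000)
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))

# TfidfVectorizer weights counts by inverse document frequency.
tvec = TfidfVectorizer(stop_words='english')
X_train = tvec.fit_transform(data_train.data)
X_test = tvec.transform(data_test.data)
print(len(tvec.vocabulary_))  # number of features

model = LogisticRegression(max_iter=1000)
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))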
BONUS:
Change the parameters of either (or both!) models to improve your score.
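A few parameters worth trying, shown as an illustrative sketch rather than tuned values:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Adding bigrams and sublinear term-frequency scaling often helps TF-IDF.
tvec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
                       sublinear_tf=True, min_df=2)

# For the hashing model, a smaller feature space and binary term
# indicators are common knobs.
hvec = HashingVectorizer(stop_words='english', n_features=2**18, binary=True)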