
Natural Language Processing Lab

Author: Dave Yerrington (SF)


In this lab we will further explore sklearn's and NLTK's capabilities for processing text. We will use the 20 Newsgroups dataset, which is provided by sklearn.

# Standard Data Science Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Getting that SKLearn Dataset
from sklearn.datasets import fetch_20newsgroups

1. Use the fetch_20newsgroups function to download a training and testing set.

Look up the function's documentation to see how to grab the data.

You should pull these categories:

  • alt.atheism

  • talk.religion.misc

  • comp.graphics

  • sci.space

Also remove the headers, footers, and quotes using the remove keyword argument of the function.

# A:
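A minimal sketch of one possible solution, assuming we store the results in variables named data_train and data_test (names of our choosing, used in later exercises):

from sklearn.datasets import fetch_20newsgroups

# The four categories requested above
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']

# subset='train' / subset='test' select the pre-split sets;
# remove strips headers, footers, and quoted reply text.
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               remove=('headers', 'footers', 'quotes'))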

2. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers, and quotes.

Because this is an sklearn dataset, it comes with pre-split train and test sets (note that we selected them by passing 'train' and 'test' to the subset keyword argument).

Let's inspect them.

  1. What data type is data_train?

  • Is it like a list? A dictionary? Something else?

  • How many data points does it contain?

  • Inspect the first data point. What does it look like?

# A:
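One way to inspect it, assuming the data_train variable from the previous step. fetch_20newsgroups returns a sklearn Bunch, a dictionary-like object that also allows attribute access:

print(type(data_train))        # sklearn Bunch: dict-like, not a list
print(data_train.keys())       # 'data', 'filenames', 'target_names', 'target', ...
print(len(data_train.data))    # number of documents (data points)
print(data_train.data[0])      # the first data point is a plain text string
print(data_train.target[0])    # its label: an index into data_train.target_names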

3. Bag of Words model

Let's train a model using a simple count vectorizer.

  1. Initialize a standard CountVectorizer and fit it on the training data.

  • How big is the feature dictionary?

  • Repeat, eliminating English stop words.

  • Is the dictionary smaller?

  • Transform the training data using the trained vectorizer.

  • Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.

    • You will have to transform the test set too. Be careful to use the trained vectorizer without re-fitting it.

BONUS:

  • Try a couple of modifications:

    • Restrict max_features.

    • Change max_df and min_df.

# A:
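A sketch of one way through these steps, reusing data_train and data_test from above (exact scores and vocabulary sizes will vary):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Plain CountVectorizer: the fitted vocabulary is the feature dictionary.
cvec = CountVectorizer()
cvec.fit(data_train.data)
print(len(cvec.vocabulary_))    # size of the feature dictionary

# Removing English stop words shrinks the dictionary.
cvec = CountVectorizer(stop_words='english')
X_train = cvec.fit_transform(data_train.data)
print(len(cvec.vocabulary_))    # smaller than before

# Transform the test set with the already-trained vectorizer -- no re-fitting.
X_test = cvec.transform(data_test.data)

model = LogisticRegression()
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))

# BONUS idea: compare scores with e.g. CountVectorizer(max_features=1000)
# or CountVectorizer(min_df=2, max_df=0.5).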

4. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

  1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.

  • Does the score improve with respect to the CountVectorizer?

  • Print out the number of features for this model.

  • Initialize a TF-IDF vectorizer (TfidfVectorizer) and repeat the analysis above.

  • Print out the number of features for this model.

BONUS:

  • Change the parameters of either (or both!) models to improve your score.

# A:
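A sketch of the comparison, again reusing data_train and data_test (scores will vary; the parameter choices in the bonus comments are only suggestions):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# HashingVectorizer is stateless: no vocabulary is fit, and the feature
# space is fixed by the size of the hash table (2**20 columns by default).
hvec = HashingVectorizer()
X_train = hvec.transform(data_train.data)
X_test = hvec.transform(data_test.data)
print(X_train.shape[1])    # number of (hashed) features

model = LogisticRegression()
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))

# TF-IDF reweights raw counts by inverse document frequency.
tvec = TfidfVectorizer(stop_words='english')
X_train = tvec.fit_transform(data_train.data)
X_test = tvec.transform(data_test.data)
print(len(tvec.vocabulary_))    # number of features for this model

model = LogisticRegression()
model.fit(X_train, data_train.target)
print(accuracy_score(data_test.target, model.predict(X_test)))

# BONUS idea: try TfidfVectorizer(ngram_range=(1, 2), min_df=2) or
# HashingVectorizer(n_features=2**16) and compare the scores.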