Path: blob/master/lessons/lesson_13/practice/solution-code/nlp_review-lab-solutions.ipynb
Natural Language Processing (NLP) Review Lab
Author: Joseph Nelson (DC)
Note: This lab is intended to be done as a walkthrough with the instructor.
Introduction
Adapted from NLP Crash Course by Charlie Greenbacker, Introduction to NLP by Dan Jurafsky, Kevin Markham's Data School Curriculum
What is NLP?
Using computers to process (analyze, understand, generate) natural human languages
Most knowledge created by humans is unstructured text, and we need a way to make sense of it
Build probabilistic model using data about a language
What are some of the higher level task areas?
Information retrieval: Find relevant results and similar results
Information extraction: Structured information from unstructured documents
Machine translation: One language to another
Text simplification: Preserve the meaning of text, but simplify the grammar and vocabulary
Predictive text input: Faster or easier typing
Sentiment analysis: Attitude of speaker
Automatic summarization: Extractive or abstractive summarization
Natural Language Generation: Generate text from data
Speech recognition and generation: Speech-to-text, text-to-speech
Question answering: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
What are some of the lower level components?
Tokenization: breaking text into tokens (words, sentences, n-grams)
Stopword removal: a/an/the
Stemming and lemmatization: root word
TF-IDF: word importance
Part-of-speech tagging: noun/verb/adjective
Named entity recognition: person/organization/location
Spelling correction: "New Yrok City"
Word sense disambiguation: "buy a mouse"
Segmentation: "New York City subway"
Language detection: "translate this page"
Machine learning
Why is NLP hard?
Ambiguity:
Hospitals are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half
Non-standard English: text messages
Idioms: "throw in the towel"
Newly coined words: "retweet"
Tricky entity names: "Where is A Bug's Life playing?"
World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"
NLP requires an understanding of the language and the world.
Part 1: Reading in the Yelp Reviews
"corpus" = collection of documents
"corpora" = plural form of corpus
Note: running the import cell raised a ModuleNotFoundError because the textblob package was not installed; install it (for example with pip install textblob) before running the notebook. The imports shown in that cell were:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
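A minimal sketch of reading in the reviews, assuming the data is a CSV named yelp.csv with text and stars columns (the file name and column names are assumptions, not confirmed by the original notebook):

import pandas as pd

# read the Yelp reviews into a DataFrame
yelp = pd.read_csv('yelp.csv')
yelp.head()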
1.1 Subset the reviews to best and worst.
Select only 5-star and 1-star reviews.
The text will be the features, the stars will be the target.
Create a train-test split.
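One possible solution sketch, assuming the yelp DataFrame and column names from Part 1:

from sklearn.model_selection import train_test_split

# keep only the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# define X (the review text) and y (the star rating)
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)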
Part 2: Tokenization
What: Separate text into units such as sentences or words
Why: Gives structure to previously unstructured text
Notes: Relatively easy with English language text, not easy with some languages
2.1 Use CountVectorizer to convert the training and testing text data.
lowercase: boolean, True by default
Convert all characters to lowercase before tokenizing.
ngram_range: tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
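One way to do this, as a sketch that reuses the X_train/X_test split from Part 1 and keeps the default lowercase and ngram_range settings:

from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer with default options (lowercase=True, ngram_range=(1, 1))
vect = CountVectorizer()

# learn the training vocabulary and build document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)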
2.2 Predict the star rating with the new features from CountVectorizer.
Validate on the test set.
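A minimal sketch using logistic regression (imported earlier) on the document-term matrices from 2.1:

# fit a logistic regression model on the document-term matrix
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

# predict on the test set and check accuracy
y_pred = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred))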
Part 3: Stopword Removal
What: Remove common words that will likely appear in any text
Why: They don't tell you much about your text
3.1 Recreate your features with CountVectorizer removing stopwords.
stop_words: string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
3.2 Validate your model using the features with stopwords removed.
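A sketch covering 3.1 and 3.2 together, reusing the objects from Part 2:

# remove English stop words while building the document-term matrix
vect = CountVectorizer(stop_words='english')
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# refit the model and validate on the test set
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))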
Part 4: Other CountVectorizer Options
4.1 Shrink the maximum number of features and re-test the model.
max_features: int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
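For example (the value 300 below is an arbitrary illustration, not taken from the original solution):

# limit the vocabulary to the 300 most frequent training terms
vect = CountVectorizer(max_features=300)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# refit and score the model as in Part 2.2
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))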
4.2 Change the minimum document frequency for terms and test the model's performance.
min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the value represents a proportion of documents; if an integer, an absolute count.
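A sketch with min_df=2, i.e. ignore terms that appear in only one training document (the threshold is an assumption for illustration):

# ignore terms that appear in fewer than 2 training documents
vect = CountVectorizer(min_df=2)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))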
Part 5: Introduction to TextBlob
TextBlob: "Simplified Text Processing"
5.1 Use TextBlob to convert the text in the first review in the dataset.
5.2 List the words in the TextBlob object.
5.3 List the sentences in the TextBlob object.
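A sketch for 5.1 through 5.3, assuming the yelp_best_worst DataFrame and text column from Part 1:

# 5.1: create a TextBlob object from the first review
review = TextBlob(yelp_best_worst.text.iloc[0])

# 5.2: the individual word tokens
print(review.words)

# 5.3: the review split into sentences
print(review.sentences)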
Part 6: Stemming and Lemmatization
Stemming:
What: Reduce a word to its base/stem/root form
Why: Often makes sense to treat related words the same way
Notes:
Uses a "simple" and fast rule-based approach
Stemmed words are usually not shown to users (used for analysis/indexing)
Some search engines treat words with the same stem as synonyms
6.1 Initialize the SnowballStemmer and stem the words in the first review.
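A sketch using NLTK's SnowballStemmer (imported earlier) on the review TextBlob from Part 5:

# initialize the stemmer for English
stemmer = SnowballStemmer('english')

# stem each word of the first review
print([stemmer.stem(word) for word in review.words])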
6.2 Use the built-in lemmatize function on the words of the first review (parsed by TextBlob).
Lemmatization
What: Derive the canonical form ('lemma') of a word
Why: Can be better than stemming
Notes: Uses a dictionary-based approach (slower than stemming)
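A sketch for 6.2, again assuming the review TextBlob from Part 5; Word.lemmatize() requires the NLTK WordNet corpus to be downloaded:

# lemmatize each word of the first review (treats words as nouns by default)
print([word.lemmatize() for word in review.words])

# lemmatizing as verbs can give different results, e.g. 'went' -> 'go'
print([word.lemmatize('v') for word in review.words])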
6.3 Write a function that uses TextBlob and lemmatize to lemmatize text.
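One possible helper, as a sketch (the function name split_into_lemmas is an assumption, not the original solution's name):

def split_into_lemmas(text):
    """Tokenize a document with TextBlob and return the lemma of each word."""
    words = TextBlob(str(text).lower()).words
    return [word.lemmatize() for word in words]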
6.4 Provide your function to CountVectorizer as the analyzer and test the performance of your model.
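Passing the helper above as the analyzer and refitting the model, as a sketch:

# use the custom lemmatizing function instead of the default analyzer
vect = CountVectorizer(analyzer=split_into_lemmas)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))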
Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)
What: Computes "relative frequency" that a word appears in a document compared to its frequency across all documents
Why: More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
Notes: Used for search engine scoring, text summarization, document clustering
7.1 Build a simple TF-IDF using CountVectorizer
Term Frequency can be calculated with the default CountVectorizer.
Inverse Document Frequency can be calculated with CountVectorizer and the argument binary=True.
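A sketch of the idea on a tiny made-up corpus (the three example strings are illustrative only; get_feature_names_out assumes scikit-learn 1.0 or later, older versions used get_feature_names):

# a toy corpus for illustration
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# term frequency: raw counts of each term per document
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(),
                  columns=vect.get_feature_names_out())

# document frequency: binary=True counts each term at most once per document
vect = CountVectorizer(binary=True)
df = pd.Series(vect.fit_transform(simple_train).toarray().sum(axis=0),
               index=vect.get_feature_names_out())

# a simple TF-IDF-like score: term frequency divided by document frequency
print(tf / df)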
More details: TF-IDF is about what matters
Part 8: Using TF-IDF to Summarize a Yelp Review
Note: Reddit's autotldr uses the SMMRY algorithm, which is based on TF-IDF!
8.1 Build a TF-IDF predictor matrix excluding stopwords with TfidfVectorizer
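A sketch building the TF-IDF matrix over the full set of review texts (the DataFrame and column names follow the assumptions from Part 1):

from sklearn.feature_extraction.text import TfidfVectorizer

# build a document-term matrix of TF-IDF scores, excluding English stop words
tfidf_vect = TfidfVectorizer(stop_words='english')
dtm = tfidf_vect.fit_transform(yelp.text)
features = tfidf_vect.get_feature_names_out()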
8.2 Write a function to pull out the top 5 words by TF-IDF score from a review
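One way to write the summarizer, sketched on top of the dtm and features objects above (the helper name summarize_review is an assumption):

def summarize_review(row_index):
    """Print the 5 terms with the highest TF-IDF scores for one review."""
    # TF-IDF scores for this review, as a Series indexed by term
    scores = pd.Series(dtm[row_index, :].toarray().ravel(), index=features)
    top_terms = scores.sort_values(ascending=False).head(5)
    print(yelp.text.iloc[row_index][:300])   # preview the review text
    print(top_terms)

summarize_review(0)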
Part 9: Sentiment Analysis
9.1 Extract sentiment from a review parsed with TextBlob
Sentiment polarity ranges from -1 (most negative) to 1 (most positive). A parsed TextBlob object has a sentiment attribute, which can be accessed as shown below:
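For example, a sketch using the review TextBlob from Part 5:

# sentiment is a namedtuple of (polarity, subjectivity)
print(review.sentiment)
print(review.sentiment.polarity)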
9.2 Calculate the sentiment for every review in the full Yelp dataset as a new column.
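A sketch applying TextBlob to every row of the full DataFrame (the yelp DataFrame and text column are the assumptions from Part 1):

def detect_sentiment(text):
    """Return the polarity of a piece of text, from -1 (negative) to 1 (positive)."""
    return TextBlob(str(text)).sentiment.polarity

# add the polarity of each review as a new column
yelp['sentiment'] = yelp.text.apply(detect_sentiment)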
9.3 Create a boxplot of sentiment by star rating
9.4 Print reviews with the highest and lowest sentiment.
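Sketches for 9.3 and 9.4, assuming the sentiment column created above:

# 9.3: boxplot of sentiment grouped by star rating
yelp.boxplot(column='sentiment', by='stars')

# 9.4: reviews with the most positive and most negative sentiment
print(yelp.loc[yelp.sentiment.idxmax(), 'text'])
print(yelp.loc[yelp.sentiment.idxmin(), 'text'])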
Part 10 [Bonus]: Explore fun TextBlob features
10.1 Correct spelling with .correct()
10.2 Perform spellchecking with .spellcheck()
10.3 Extract definitions with .define()
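Quick sketches of the three bonus features; the example strings are illustrative only, and .spellcheck() and .define() rely on NLTK corpora being downloaded:

# 10.1: correct the spelling of an entire blob
print(TextBlob('15 minuets late').correct())

# 10.2: spellcheck a single Word (returns candidate spellings with confidences)
print(Word('parot').spellcheck())

# 10.3: WordNet definitions of a Word
print(Word('bank').define())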
Conclusion
NLP is a gigantic field
Understanding the basics broadens the types of data you can work with
Simple techniques go a long way
Use scikit-learn for NLP whenever possible