Natural Language Processing
Authors: Kiefer Katovich (San Francisco), Joseph Nelson (Washington, D.C.)
Learning Objectives
Discuss the major tasks involved with natural language processing.
Discuss, on a low level, the components of natural language processing.
Identify why natural language processing is difficult.
Demonstrate text classification.
Demonstrate common text preprocessing techniques.
How Do We Use NLP in Data Science?
In data science, we are often asked to analyze unstructured text or make a predictive model using it. Unfortunately, most data science techniques require numeric data. NLP libraries provide a tool set of methods to convert unstructured text into meaningful numeric data.
Analysis: NLP techniques provide tools to allow us to understand and analyze large amounts of text. For example:
Analyze the positivity/negativity of comments on different websites.
Extract key words from meeting notes and visualize how meeting topics change over time.
Vectorizing for machine learning: When building a machine learning model, we typically must transform our data into numeric features. This process of transforming non-numeric data such as natural language into numeric features is called vectorization. For example:
Understanding related words. Using stemming, NLP lets us know that "swim", "swims", and "swimming" all refer to the same base word. This allows us to reduce the number of features used in our model.
Identifying important and unique words. Using TF-IDF (term frequency-inverse document frequency), we can identify which words are most likely to be meaningful in a document.
Install TextBlob
The TextBlob Python library provides a simplified interface for exploring common NLP tasks including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
To proceed with the lesson, first install TextBlob, as explained below. We prefer Anaconda-based installations, since they are tested against our other Anaconda packages.
To install TextBlob, run:
conda install -c conda-forge textblob
Or:
pip install textblob
python -m textblob.download_corpora lite
Lesson Guide
Introduction
Adapted from NLP Crash Course by Charlie Greenbacker and Introduction to NLP by Dan Jurafsky
Introduction
What Is Natural Language Processing (NLP)?
Using computers to process (analyze, understand, generate) natural human languages.
Making sense of human knowledge stored as unstructured text.
Building probabilistic models using data about a language.
What Are Some of the Higher-Level Task Areas?
Objective: Discuss the major tasks involved with natural language processing.
We often hope that computers can solve many high-level problems involving natural language. Unfortunately, due to the difficulty of understanding human language, many of these problems are still not well solved. That said, existing solutions to these problems all involve utilizing the lower-level components of NLP discussed in the next section. Some higher-level tasks include:
Chatbots: Understand natural language from the user and return intelligent responses.
Information retrieval: Find relevant results and similar results.
Information extraction: Structured information from unstructured documents.
Machine translation: One language to another.
Text simplification: Preserve the meaning of text, but simplify the grammar and vocabulary.
Predictive text input: Faster or easier typing.
Sentiment analysis: Attitude of speaker.
Automatic summarization: Extractive or abstractive summarization.
Natural language generation: Generate text from data.
Speech recognition and generation: Speech-to-text, text-to-speech.
Question answering: Determine the intent of the question, match query with knowledge base, evaluate hypotheses.
What Are Some of the Lower-Level Components?
Objective: Discuss, on a low level, the components of natural language processing.
Unfortunately, the NLP programming libraries typically do not provide direct solutions for the high-level tasks above. Instead, they provide low-level building blocks that enable us to craft our own solutions. These include:
Tokenization: Breaking text into tokens (words, sentences, n-grams)
Stop-word removal: a/an/the
Stemming and lemmatization: root word
TF-IDF: word importance
Part-of-speech tagging: noun/verb/adjective
Named entity recognition: person/organization/location
Spelling correction: "New Yrok City"
Word sense disambiguation: "buy a mouse"
Segmentation: "New York City subway"
Language detection: "translate this page"
Machine learning: specialized models that work well with text
Why Is NLP Hard?
Objective: Identify why natural language processing is difficult.
Natural language processing requires an understanding of the language and the world. Several limitations of NLP are:
Ambiguity:
Hospitals Are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half
Non-standard English: text messages
Idioms: "throw in the towel"
Newly coined words: "retweet"
Tricky entity names: "Where is A Bug's Life playing?"
World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"
Throughout this lesson, we will use Yelp reviews to practice and discover common low-level NLP techniques.
You should be familiar with these terms, as they are frequently used in NLP:
corpus: a collection of documents (derived from the Latin word for "body")
corpora: plural form of corpus
Throughout this lesson, we will use a model that is very popular for text classification called Naive Bayes (the "NB" in BernoulliNB and MultinomialNB below). If you are unfamiliar with it, know that it works the same way as all other models in scikit-learn! We will look extensively at the mechanics behind Naive Bayes later in the course. However, see the appendix at the end of this notebook for a quick introduction.
As you proceed through this section, note that text classification is done in the same way as all other classification models. First, the text is vectorized into a set of numeric features. Then, a standard machine learning classifier is applied. NLP libraries often include vectorizers and ML models that work particularly well with text.
We will refer to each piece of text we are trying to classify as a document.
For example, a document could refer to an email, book chapter, tweet, article, or text message.
Text classification is the task of predicting which category or topic a text sample is from.
We may want to identify:
Is an article a sports or business story?
Does an email have positive or negative sentiment?
Is the rating of a recipe 1, 2, 3, 4, or 5 stars?
Predictions are often made by using the words as features and the label as the target output.
Starting out, we will make each unique word (across all documents) a single feature. In any given corpus, we may have hundreds of thousands of unique words, so we may have hundreds of thousands of features!
For a given document, the numeric value of each feature could be the number of times the word appears in the document.
So, most features will have a value of zero, resulting in a sparse matrix of features.
This technique for vectorizing text is referred to as a bag-of-words model.
It is called bag of words because the document's structure is lost — as if the words are all jumbled up in a bag.
The first step to creating a bag-of-words model is to create a vocabulary of all possible words in the corpus.
Alternatively, we could make each column an indicator column, which is 1 if the word is present in the document (no matter how many times) and 0 if not. This vectorization could be used to reduce the importance of repeated words. For example, a website search engine would be susceptible to spammers who load websites with repeated words. So, the search engine might use indicator columns as features rather than word counts.
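Below is a minimal sketch of both flavors of bag-of-words vectorization using scikit-learn's CountVectorizer; the three toy documents are made up for illustration.

```python
# Bag-of-words vectorization: word counts vs. indicator (binary) columns.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The food was great, great service",
    "The service was slow",
    "Great food!",
]

# Count vectorization: each column is a word, each value is how many times it appears.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)      # sparse document-term matrix
print(sorted(count_vect.vocabulary_))        # the learned vocabulary (column order)
print(counts.toarray())

# Indicator vectorization: 1 if the word appears at all, 0 otherwise.
binary_vect = CountVectorizer(binary=True)
print(binary_vect.fit_transform(docs).toarray())
```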
We need to consider several things to decide if bag-of-words is appropriate.
Does order of words matter?
Does punctuation matter?
Does upper or lower case matter?
Demo: Text Processing in scikit-learn
Objective: Demonstrate text classification.
One common method of reducing the number of features is converting all text to lowercase before generating features. Note that to a computer, "aPPle" is a different token/"word" than "apple". Converting both to lowercase ensures that fewer features will be generated. It might be useful not to convert to lowercase if capitalization carries meaning.
Our model achieved ~92% accuracy, an improvement over the 82% baseline accuracy we would get if we always predicted 5 stars.
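A minimal sketch of the kind of count-vectorizer plus Naive Bayes workflow that produces numbers like these; the file name yelp.csv and the column names text and stars are assumptions for illustration, not the notebook's exact code.

```python
# Text classification sketch: vectorize the text, then fit a standard classifier.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

yelp = pd.read_csv("yelp.csv")                 # hypothetical file name
X = yelp["text"]                               # assumed review-text column
y = yelp["stars"]                              # assumed star-rating column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

vect = CountVectorizer(lowercase=True)         # lowercasing reduces the feature count
X_train_dtm = vect.fit_transform(X_train)      # learn the vocabulary on training data only
X_test_dtm = vect.transform(X_test)            # reuse that vocabulary on the test data

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
print(accuracy_score(y_test, nb.predict(X_test_dtm)))
```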
Let's look more into how the vectorizer works.
N-grams are features that consist of N consecutive words. They are useful because, in a bag-of-words model, treating "data scientist" as a single feature carries more meaning than having two independent features "data" and "scientist"!
Example:
ngram_range: tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
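For instance, here is a quick sketch of how ngram_range changes the vocabulary on a single made-up sentence.

```python
# Comparing unigram-only features with unigram + bigram features.
from sklearn.feature_extraction.text import CountVectorizer

doc = ["the data scientist trained the model"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
print(sorted(unigrams.vocabulary_))    # single words only

uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)
print(sorted(uni_and_bi.vocabulary_))  # now also includes pairs such as 'data scientist'
```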
We can start to see how supplementing our features with n-grams can lead to more feature columns. When we produce n-grams of length n from a document containing m words, we add at most m - n + 1 additional features. That said, be careful — when we compute n-grams from an entire corpus, the number of unique n-grams could be vastly higher than the number of unique unigrams! This could cause an undesired feature explosion.
Although we sometimes add important new features that have meaning such as data scientist
, many of the new features will just be noise. So, particularly if we do not have much data, adding n-grams can actually decrease model performance. This is because if each n-gram is only present once or twice in the training set, we are effectively adding mostly noisy features to the mix.
stop_words: string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
If None, no stop words will be used.
max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms. (If max_df = 0.7, then any word that appears in more than 70% of documents will not be included in the feature set!)
Stop-Word Removal
What: This process is used to remove common words that will likely appear in any text.
Why: Because common words exist in most documents, they likely only add noise to your model and should be removed.
What are stop words? Stop words are some of the most common words in a language. They are used so that a sentence makes sense grammatically, such as prepositions and determiners, e.g., "to," "the," "and." However, they are so commonly used that they are generally worthless for predicting the class of a document. Since "a" appears in spam and non-spam emails, for example, it would only contribute noise to our model.
Example:
Original sentence: "The dog jumped over the fence"
After stop-word removal: "dog jumped over fence"
The fact that there is a fence and a dog jumped over it can be derived with or without stop words.
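Here is a minimal sketch of both stop-word strategies (the built-in English list and max_df); the two toy documents are made up.

```python
# Removing stop words with the built-in list vs. detecting them with max_df.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The dog jumped over the fence",
    "The cat sat on the mat",
]

# Built-in English stop-word list: words like 'the', 'over', and 'on' are dropped.
vect = CountVectorizer(stop_words="english")
print(sorted(vect.fit(docs).vocabulary_))

# max_df=0.7: drop any term that appears in more than 70% of documents,
# a corpus-specific way to detect stop words.
vect = CountVectorizer(max_df=0.7)
print(sorted(vect.fit(docs).vocabulary_))
```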
max_features: int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This allows us to keep the more common words and n-grams and remove ones that may appear only once. If we include words that occur only once, those features can become highly associated with a single class and cause overfitting.
Just like with all other models, more features does not mean a better model. So, we must tune our feature generator to remove features whose predictive capability is none or very low.
In this case, there is roughly a 1.6% increase in accuracy when we double the n-gram size and increase our max features by 1,000-fold. Note that if we restrict it to only unigrams, then the accuracy increases even more! So, bigrams were very likely adding more noise than signal.
In the end, by only using 16,000 unigram features we came away with a much smaller, simpler, and easier-to-think-about model which also resulted in higher accuracy.
min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute counts.
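A short sketch of both of these parameters on made-up documents:

```python
# Limiting the vocabulary with max_features and min_df.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great food great service",
    "great food, terrible service",
    "the service was terrible",
]

# Keep only the 3 most frequent terms across the whole corpus.
top_terms = CountVectorizer(max_features=3)
print(sorted(top_terms.fit(docs).vocabulary_))

# Ignore terms that appear in fewer than 2 documents (rare, likely noisy terms).
frequent_terms = CountVectorizer(min_df=2)
print(sorted(frequent_terms.fit(docs).vocabulary_))
```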
Introduction to TextBlob
You should already have installed TextBlob, a Python library used to explore common NLP tasks. If you haven't, please return to the installation step above for instructions. We'll be using it to organize our corpora for analysis.
As mentioned earlier, you can read more on the TextBlob website.
Stemming and Lemmatization
Stemming is a crude process of removing common endings from words, such as "s", "es", "ly", "ing", and "ed".
What: Reduce a word to its base/stem/root form.
Why: This intelligently reduces the number of features by grouping together (hopefully) related words.
Notes:
Stemming uses a simple and fast rule-based approach.
Stemmed words are usually not shown to users (used for analysis/indexing).
Some search engines treat words with the same stem as synonyms.
Some examples you can see are "excellent" stemmed to "excel" and "amazing" stemmed to "amaz".
Lemmatization is a more refined process that uses specific language and grammar rules to derive the root of a word.
This is useful for words that do not share an obvious root such as "better" and "best".
What: Lemmatization derives the canonical form ("lemma") of a word.
Why: It can be better than stemming.
Notes: Uses a dictionary-based approach (slower than stemming).
Some examples you can see are "filled" lemmatized to "fill" and "was" lemmatized to "wa".
Some examples you can see are "was" lemmatized to "be" and "arrived" lemmatized to "arrive".
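A minimal sketch of both techniques, assuming the TextBlob corpora (which include WordNet) were downloaded during installation; PorterStemmer comes from NLTK, which is installed alongside TextBlob.

```python
# Stemming (crude, rule-based) vs. lemmatization (dictionary- and grammar-based).
from nltk.stem import PorterStemmer
from textblob import Word

stemmer = PorterStemmer()
print(stemmer.stem("swimming"))       # 'swim'  -- suffix stripped
print(stemmer.stem("amazing"))        # 'amaz'  -- stems are not always real words

print(Word("was").lemmatize("v"))     # 'be'    -- lemmatized as a verb
print(Word("better").lemmatize("a"))  # 'good'  -- lemmatized as an adjective
```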
More Lemmatization and Stemming Examples
Lemmatization | Stemming |
---|---|
shouted → shout | badly → bad |
best → good | computing → comput |
better → good | computed → comput |
good → good | wipes → wip |
wiping → wipe | wiped → wip |
hidden → hide | wiping → wip |
Activity: Knowledge Check
What other words or phrases might cause problems with stemming? Why?
What other words or phrases might cause problems with lemmatization? Why?
With all the available options for CountVectorizer(), you may wonder how to decide which to use! It's true that you can sometimes reason about which preprocessing techniques might work best. However, you will often not know for sure without trying out many different combinations and comparing their accuracies.
Keep in mind that you should constantly be thinking about the result of each preprocessing step instead of blindly applying them. Does each type of preprocessing "make sense" with the input data you are using? Is it likely to keep the signal intact and remove noise?
Term Frequency–Inverse Document Frequency (TF–IDF)
While a Count Vectorizer simply totals up the number of times a "word" appears in a document, the more complex TF-IDF Vectorizer analyzes the uniqueness of words between documents to find distinguishing characteristics.
What: Term frequency–inverse document frequency (TF–IDF) computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents.
Why: It's more useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents).
Notes: It's used for search-engine scoring, text summarization, and document clustering.
The higher the TF–IDF value, the more "important" the word is to that specific document. Here, "cab" is the most important and unique word in document 1, while "please" is the most important and unique word in document 2. TF–IDF is often used for training as a replacement for word count.
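A minimal sketch of TF-IDF in scikit-learn, using two made-up documents rather than the Yelp data:

```python
# TF-IDF weighting: shared words get low weights, distinctive words get high weights.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the pizza was great",
    "the pizza arrived late",
]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(docs)

# Words unique to one document ('great', 'late') score higher than
# words shared by both documents ('the', 'pizza').
print(pd.DataFrame(scores.toarray(), columns=sorted(tfidf.vocabulary_)))
```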
More details: TF–IDF is about what matters
Using TF–IDF to Summarize a Yelp Review
Reddit's autotldr uses the SMMRY algorithm, which is based on TF–IDF.
Sentiment Analysis
Understanding how positive or negative a review is. There are many ways in practice to compute a sentiment value. For example:
Have a list of "positive" words and a list of "negative" words and count how many occur in a document.
Train a classifier given many examples of "positive" documents and "negative" documents.
Note that training a classifier is often just an automated way to derive the first approach (e.g., using bag-of-words with logistic regression, a coefficient is assigned to each word!).
For the most accurate sentiment analysis, you will want to train a custom sentiment model based on documents that are particular to your application. Generic models (such as the one we are about to use!) often do not work as well as hoped.
As we will do below, always make sure you double-check that the algorithm is working by manually verifying that scores correctly correspond to positive/negative reviews! Otherwise, you may be using numbers that are not accurate.
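Here is a minimal sketch of TextBlob's generic, pattern-based sentiment scorer on two made-up reviews; it returns both a polarity and a subjectivity score.

```python
# TextBlob's built-in sentiment analyzer (a generic, pre-trained model).
from textblob import TextBlob

print(TextBlob("The food was absolutely wonderful!").sentiment)
print(TextBlob("Terrible service, I will never come back.").sentiment)

# Each result is a namedtuple: polarity ranges from -1.0 (negative) to 1.0 (positive),
# and subjectivity ranges from 0.0 (objective) to 1.0 (subjective).
```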
Here, we will add additional features to our CountVectorizer()-generated feature set to hopefully improve our model.
To make the best models, you will want to supplement the auto-generated features with new features you think might be important. After all, CountVectorizer() typically lowercases text and removes all associations between words. You may also have metadata to add in addition to the text itself.
Remember: Although you may have hundreds of thousands of features, each data point is extremely sparse. So, if you add in a new feature, e.g., one that detects if the text is all capital letters, this new feature can still have a huge effect on the model outcome!
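A minimal sketch of appending one hand-made feature (an "all caps" flag, chosen purely for illustration) to a sparse CountVectorizer matrix:

```python
# Combining auto-generated bag-of-words features with a custom feature column.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = ["GREAT FOOD", "the food was fine", "WORST PLACE EVER"]

vect = CountVectorizer()
X_counts = vect.fit_transform(docs)          # sparse bag-of-words features

# Hypothetical extra feature: 1.0 if the document is written entirely in capitals.
all_caps = np.array([[1.0 if d.isupper() else 0.0] for d in docs])

# hstack keeps everything sparse while appending the new column.
X_combined = hstack([X_counts, csr_matrix(all_caps)])
print(X_combined.shape)                      # (3, number_of_vocabulary_words + 1)
```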
Appendix: Intro to Naive Bayes and Text Classification
Later in the course, we will explore in-depth how to use the Naive Bayes classifier with text. Naive Bayes is a very popular classifier because it has minimal storage requirements, is fast, can be tuned easily with more data, and has found very useful applications in text classification. For example, Paul Graham originally proposed using Naive Bayes to detect spam in his Plan for Spam.
Earlier we experimented with text classification using a Naive Bayes model. What exactly are Naive Bayes classifiers?
What is Bayes? Bayes, or Bayes' Theorem, is a different way to assess probability. It considers prior information in order to more accurately assess the situation.
Example: You are playing roulette.
As you approach the table, you see that the last number the ball landed on was Red-3. With a frequentist mindset, you know that the ball is just as likely to land on Red-3 again given that every slot on the wheel has an equal opportunity of 1 in 37.
Given that you started believing that the ball can land in each slot with an equal likelihood and that you have only seen one throw previously, you rationally believe that there would be no difference between picking Red a second time now or picking Black -- ideally they would happen with the same likelihood!
However, as you sit and watch the roulette table, you begin to notice something strange. The ball is always landing on red. Every single time the ball is thrown, it lands in a red slot. Even though your past beliefs stated that red and black were equally likely, every time it lands in red, you change those beliefs a little more towards a biased roulette table.
This is what Bayes is all about — adjusting probabilities as more data is gathered!
Below is the equation for Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$P(A \mid B)$: Probability of Event A occurring given Event B has occurred.
$P(B \mid A)$: Probability of Event B occurring given Event A has occurred.
$P(A)$: Probability of Event A occurring.
$P(B)$: Probability of Event B occurring.
Applying Naive Bayes Classification to Spam Filtering
Let's pretend we have an email with three words: "Send money now." We'll use Naive Bayes to classify it as ham or spam. ("Ham" just means not spam. It can include emails that look like spam but that you opt into!)
By assuming that the features (the words) are conditionally independent, we can simplify the likelihood function:

$$P(\text{spam} \mid \text{send money now}) = \frac{P(\text{send} \mid \text{spam})\,P(\text{money} \mid \text{spam})\,P(\text{now} \mid \text{spam})\,P(\text{spam})}{P(\text{send money now})}$$
Note that each conditional probability in the numerator is easily calculated directly from the training data!
So, we can calculate all of the values in the numerator by examining a corpus of spam email. For example:

$$P(\text{send} \mid \text{spam}) = \frac{\text{number of spam emails containing "send"}}{\text{total number of spam emails}}$$

We would repeat this process with a corpus of ham email.
All we care about is whether spam or ham has the higher probability, and so we predict that the email is spam.
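To make the comparison concrete, here is a toy version of that calculation in Python; all of the probabilities below are made-up illustrative numbers, not estimates from a real corpus.

```python
# A worked toy example of the Naive Bayes comparison above (hypothetical numbers).
p_spam = 0.3                             # prior: P(spam)
p_words_given_spam = 0.2 * 0.4 * 0.1     # P(send|spam) * P(money|spam) * P(now|spam)
spam_score = p_words_given_spam * p_spam # numerator for the spam class

p_ham = 0.7                              # prior: P(ham)
p_words_given_ham = 0.05 * 0.01 * 0.1    # P(send|ham) * P(money|ham) * P(now|ham)
ham_score = p_words_given_ham * p_ham    # numerator for the ham class

# The shared denominator P("send money now") can be ignored:
# we only need to know which unnormalized score is larger.
print("spam" if spam_score > ham_score else "ham")   # -> 'spam'
```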
Key Takeaways
The "naive" assumption of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.
The normalization constant (the denominator) can be ignored since it's the same for all classes.
The prior probability is much less relevant once you have a lot of features.
Comparing Naive Bayes With Other Models
Advantages of Naive Bayes:
Model training and prediction are very fast.
It's somewhat interpretable.
No tuning is required.
Features don't need scaling.
It's insensitive to irrelevant features (with enough observations).
It performs better than logistic regression when the training set is very small.
Disadvantages of Naive Bayes:
If "spam" is dependent on non-independent combinations of individual words, it may not work well.
Predicted probabilities are not well calibrated.
Correlated features can be problematic (due to the independence assumption).
It can't handle negative features (with Multinomial Naive Bayes).
It has a higher "asymptotic error" than logistic regression.
Conclusion
NLP is a gigantic field.
Understanding the basics broadens the types of data you can work with.
Simple techniques go a long way.
Use scikit-learn for NLP whenever possible.
While we used scikit-learn and TextBlob today, another popular Python NLP library is spaCy.