
Natural Language Processing

Authors: Kiefer Katovich (San Francisco), Joseph Nelson (Washington, D.C.)


Learning Objectives

  • Discuss the major tasks involved with natural language processing.

  • Discuss, on a low level, the components of natural language processing.

  • Identify why natural language processing is difficult.

  • Demonstrate text classification.

  • Demonstrate common text preprocessing techniques.

How Do We Use NLP in Data Science?

In data science, we are often asked to analyze unstructured text or make a predictive model using it. Unfortunately, most data science techniques require numeric data. NLP libraries provide a tool set of methods to convert unstructured text into meaningful numeric data.

  • Analysis: NLP techniques provide tools to allow us to understand and analyze large amounts of text. For example:

    • Analyze the positivity/negativity of comments on different websites.

    • Extract key words from meeting notes and visualize how meeting topics change over time.

  • Vectorizing for machine learning: When building a machine learning model, we typically must transform our data into numeric features. This process of transforming non-numeric data such as natural language into numeric features is called vectorization. For example:

    • Understanding related words. Using stemming, NLP lets us know that "swim", "swims", and "swimming" all refer to the same base word. This allows us to reduce the number of features used in our model.

    • Identifying important and unique words. Using TF-IDF (term frequency-inverse document frequency), we can identify which words are most likely to be meaningful in a document.
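To make the two bullets above concrete, here is a minimal sketch using NLTK's SnowballStemmer and scikit-learn's TfidfVectorizer (both are imported later in this notebook); the toy documents are invented for illustration.

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer('english')
print([stemmer.stem(w) for w in ['swim', 'swims', 'swimming']])  # All three reduce to 'swim'.

# TF-IDF up-weights words that are frequent in a document but rare across the corpus.
docs = ['the cat sat on the mat', 'the dog chased the cat', 'stocks fell sharply today']
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).shape)  # (3 documents, number of unique words)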

Install TextBlob

The TextBlob Python library provides a simplified interface for exploring common NLP tasks including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

To proceed with the lesson, first install TextBlob as explained below. We prefer the Anaconda-based installation, since conda-forge packages are generally tested against the other Anaconda packages we use.

To install textblob run:

conda install -c conda-forge textblob

Or:

pip install textblob

python -m textblob.download_corpora lite
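Once the corpora are downloaded, a quick sanity check can confirm the installation worked (a minimal sketch; the example sentence is arbitrary):

from textblob import TextBlob

blob = TextBlob("TextBlob makes basic NLP tasks simple.")
print(blob.sentiment)  # Sentiment(polarity=..., subjectivity=...) -- a positive polarity is expected here.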

Introduction

Adapted from NLP Crash Course by Charlie Greenbacker and Introduction to NLP by Dan Jurafsky

What Is Natural Language Processing (NLP)?

  • Using computers to process (analyze, understand, generate) natural human languages.

  • Making sense of human knowledge stored as unstructured text.

  • Building probabilistic models using data about a language.

What Are Some of the Higher-Level Task Areas?

  • Objective: Discuss the major tasks involved with natural language processing.

We often hope that computers can solve many high-level problems involving natural language. Unfortunately, due to the difficulty of understanding human language, many of these problems are still not well solved. That said, existing solutions to these problems all involve utilizing the lower-level components of NLP discussed in the next section. Some higher-level tasks include machine translation, question answering, automatic summarization, and building conversational agents.

What Are Some of the Lower-Level Components?

  • Objective: Discuss, on a low level, the components of natural language processing.

Unfortunately, the NLP programming libraries typically do not provide direct solutions for the high-level tasks above. Instead, they provide low-level building blocks that enable us to craft our own solutions. These include:

  • Tokenization: Breaking text into tokens (words, sentences, n-grams)

  • Stop-word removal: a/an/the

  • Stemming and lemmatization: root word

  • TF-IDF: word importance

  • Part-of-speech tagging: noun/verb/adjective

  • Named entity recognition: person/organization/location

  • Spelling correction: "New Yrok City"

  • Word sense disambiguation: "buy a mouse"

  • Segmentation: "New York City subway"

  • Language detection: "translate this page"

  • Machine learning: specialized models that work well with text
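A few of these building blocks can be tried directly with TextBlob. This is a minimal sketch under the installation above; the example sentences are made up.

from textblob import TextBlob, Word

blob = TextBlob("The quick brown fox jumped over the lazy dogs.")
print(blob.words)                     # Tokenization into words
print(blob.tags)                      # Part-of-speech tagging, e.g., ('quick', 'JJ')
print(Word('jumped').lemmatize('v'))  # Lemmatization of a verb -> 'jump'
print(TextBlob("New Yrok City").correct())  # Attempts spelling correction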

Why Is NLP Hard?

  • Objective: Identify why natural language processing is difficult.

Natural language processing requires an understanding of the language and the world. Several limitations of NLP are:

  • Ambiguity:

    • Hospitals Are Sued by 7 Foot Doctors

    • Juvenile Court to Try Shooting Defendant

    • Local High School Dropouts Cut in Half

  • Non-standard English: text messages

  • Idioms: "throw in the towel"

  • Newly coined words: "retweet"

  • Tricky entity names: "Where is A Bug's Life playing?"

  • World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"

Reading in the Yelp Reviews

Throughout this lesson, we will use Yelp reviews to practice and discover common low-level NLP techniques.

You should be familiar with these terms, as they are frequently used in NLP:

  • corpus: a collection of documents (derived from the Latin word for "body")

  • corpora: plural form of corpus

Throughout this lesson, we will use a model very popular for text classification called Naive Bayes (the "NB" in MultinomialNB below). If you are unfamiliar with it, know that it works exactly the same as all other models in scikit-learn! We will look extensively at the mechanics behind Naive Bayes later in the course. However, see the appendix at the end of this notebook for a quick introduction.

!conda install -y -c conda-forge textblob
Solving environment: done ==> WARNING: A newer version of conda exists. <== current version: 4.4.10 latest version: 4.5.4 Please update conda by running $ conda update -n base conda ## Package Plan ## environment location: /anaconda3 added / updated specs: - textblob The following packages will be downloaded: package | build ---------------------------|----------------- textblob-0.15.1 | py_0 597 KB conda-forge certifi-2018.1.18 | py36_0 143 KB conda-forge ------------------------------------------------------------ Total: 741 KB The following NEW packages will be INSTALLED: textblob: 0.15.1-py_0 conda-forge The following packages will be UPDATED: certifi: 2018.1.18-py36_0 --> 2018.1.18-py36_0 conda-forge Downloading and Extracting Packages textblob 0.15.1: ####################################################### | 100% certifi 2018.1.18: ##################################################### | 100% Preparing transaction: done Verifying transaction: done Executing transaction: done
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
# Read yelp.csv into a DataFrame.
path = r'./data/yelp.csv'
yelp = pd.read_csv(path)

# Create a new DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# Define X and y.
X = yelp_best_worst.text
y = yelp_best_worst.stars

# Split the new DataFrame into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# `|` compares the two Series element-wise with an OR condition.
pd.Series([True, True, False, False]) | pd.Series([True, False, True, False])
0 True 1 True 2 True 3 False dtype: bool
# `&` compares the two Series element-wise with an AND condition.
pd.Series([True, True, False, False]) & pd.Series([True, False, True, False])
0 True 1 False 2 False 3 False dtype: bool
# The head of the original data
yelp.head()

Introduction: Text Classification

As you proceed through this section, note that text classification is done in the same way as all other classification models. First, the text is vectorized into a set of numeric features. Then, a standard machine learning classifier is applied. NLP libraries often include vectorizers and ML models that work particularly well with text.
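The two steps can also be chained with scikit-learn's Pipeline. The following is a sketch of that pattern, not the approach used in the rest of this notebook (which keeps the steps separate for clarity):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1 (vectorize raw text) and step 2 (classify) combined into a single estimator.
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
# text_clf.fit(X_train, y_train) would then accept raw text directly.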

We will refer to each piece of text we are trying to classify as a document.

  • For example, a document could refer to an email, book chapter, tweet, article, or text message.

Text classification is the task of predicting which category or topic a text sample is from.

We may want to identify:

  • Is an article a sports or business story?

  • Does an email have positive or negative sentiment?

  • Is the rating of a recipe 1, 2, 3, 4, or 5 stars?

Predictions are often made by using the words as features and the label as the target output.

Starting out, we will make each unique word (across all documents) a single feature. In any given corpus, we may have hundreds of thousands of unique words, so we may have hundreds of thousands of features!

  • For a given document, the numeric value of each feature could be the number of times the word appears in the document.

    • So, most features will have a value of zero, resulting in a sparse matrix of features.

  • This technique for vectorizing text is referred to as a bag-of-words model.

    • It is called bag of words because the document's structure is lost — as if the words are all jumbled up in a bag.

    • The first step to creating a bag-of-words model is to create a vocabulary of all possible words in the corpus.

Alternatively, we could make each column an indicator column, which is 1 if the word is present in the document (no matter how many times) and 0 if not. This vectorization could be used to reduce the importance of repeated words. For example, a website search engine would be susceptible to spammers who load websites with repeated words. So, the search engine might use indicator columns as features rather than word counts.
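CountVectorizer supports both representations: with binary=True, counts are capped at 1 to form indicator columns. A minimal sketch on two made-up documents:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['spam spam spam buy now', 'meeting notes for tuesday']

counts = CountVectorizer()
print(counts.fit_transform(docs).toarray())      # The 'spam' column holds a 3 for the first document.

indicators = CountVectorizer(binary=True)
print(indicators.fit_transform(docs).toarray())  # The 'spam' column is capped at 1.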

We need to consider several things to decide if bag-of-words is appropriate.

  • Does order of words matter?

  • Does punctuation matter?

  • Does upper or lower case matter?

Demo: Text Processing in scikit-learn

  • Objective: Demonstrate text classification.

Creating Features Using CountVectorizer

  • What: Converts each document into a set of words and their counts.

  • Why: To use a machine learning model, we must convert unstructured text into numeric features.

  • Notes: Relatively easy with English language text, not as easy with some languages.

# Use CountVectorizer to create document-term matrices from X_train and X_test.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # Fit and transform in one step on the training data.
X_test_dtm = vect.transform(X_test)        # Only transform the test data.
vect.transform(X_train)
<3064x16825 sparse matrix of type '<class 'numpy.int64'>' with 237720 stored elements in Compressed Sparse Row format>
# 3064 rows
# 16825 words
237720 / 3064 / 16825  # Fraction of values in the matrix that are nonzero
0.004611284184063408
vect.transform(X_train).todense()
matrix([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]])
# Rows are documents, columns are terms (aka "tokens" or "features", individual words in this situation).
X_train_dtm.shape
(3064, 16825)
# Last 50 features
print((vect.get_feature_names()[-50:]))
['yyyyy', 'z11', 'za', 'zabba', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zero', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zihuatenejo', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zippers', 'zipps', 'ziti', 'zoe', 'zombi', 'zombies', 'zone', 'zones', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel', 'zzed', 'éclairs', 'école', 'ém']
pd.DataFrame(X_train_dtm.todense(), columns=vect.get_feature_names())
# Show vectorizer options.
vect
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)

One common method of reducing the number of features is converting all text to lowercase before generating features. To a computer, "aPPle" is a different token ("word") than "apple", so lowercasing both ensures fewer features are generated. It can be useful to skip lowercasing when capitalization matters.

# Don't convert to lowercase.
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
print(X_train_dtm.shape)
vect.get_feature_names()[-10:]
(3064, 20838)
['zoning', 'zoo', 'zucchini', 'zuchinni', 'zupa', 'zwiebel', 'zzed', 'École', 'éclairs', 'ém']
X_train.head()
6841 FILLY-B's!!!!! only 8 reviews?? NINE now!!!\n... 1728 My husband and I absolutely LOVE this restaura... 3853 We went today after lunch. I got my usual of l... 671 Totally dissapointed. I had purchased a coupo... 4920 Costco Travel - My husband and I recently retu... Name: text, dtype: object
X_train.loc[0]
'My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I\'ve ever had. I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'
X_train_dtm.todense()[:5, :]  # Equivalent to .head() for a scipy sparse matrix
matrix([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
vect.vocabulary_
{'FILLY': 2376, 'only': 15338, 'reviews': 17176, 'NINE': 4360, 'now': 15195, 'wow': 20695, 'do': 10716, 'miss': 14737, 'THIS': 6193, 'place': 16020, '24hrs': 138, 'drive': 10863, 'thru': 19445, 'or': 15383, 'walk': 20329, 'up': 20039, 'ridiculously': 17210, 'cheap': 9111, 'tasty': 19188, 'of': 15276, 'course': 9900, 'the': 19346, 'arizona': 7682, 'burritos': 8704, 'are': 7671, 'good': 12370, 'everything': 11312, 'is': 13548, 'used': 20085, 'to': 19518, 'LOVE': 3677, 'one': 15331, 'combos': 9488, 'you': 20778, 'get': 12263, 'beef': 8122, 'burrito': 8703, 'taco': 19096, 'rice': 17193, 'and': 7507, 'beans': 8092, 'for': 11932, 'UNDER': 6576, 'color': 9469, 'me': 14518, 'silly': 17978, 'call': 8799, 'sally': 17425, 'they': 19378, 'have': 12743, 'bomb': 8394, 'horchata': 12998, 'too': 19563, 'really': 16775, 'fresh': 12035, 'flautas': 11801, 'rolled': 17282, 'tacos': 19097, 'breakfast': 8532, 'damn': 10175, 'here': 12846, 'whether': 20503, 'drunk': 10889, 'sober': 18233, 'My': 4335, 'husband': 13116, 'absolutely': 7127, 'this': 19403, 'restaurant': 17123, 'Anytime': 573, 'find': 11717, 'myself': 14968, 'craving': 9967, 'Mexican': 4159, 'food': 11912, 'first': 11753, 'that': 19341, 'pops': 16166, 'in': 13241, 'my': 14966, 'head': 12756, 'Salsa': 5596, 'Blanca': 970, 'We': 6858, 'always': 7455, 'encountered': 11141, 'friendly': 12051, 'welcoming': 20469, 'staff': 18543, 'amazing': 7467, 'fulfilling': 12115, 'What': 6891, 'more': 14860, 'could': 9878, 'ask': 7744, 'went': 20474, 'today': 19526, 'after': 7318, 'lunch': 14266, 'got': 12395, 'usual': 20091, 'lime': 14067, 'basil': 8053, 'real': 16766, 'mint': 14706, 'chip': 9202, 'which': 20504, 'love': 14231, 'leaves': 13951, 'hubby': 13063, 'chocolate': 9217, 'guiness': 12580, 'four': 11988, 'peaks': 15761, 'hop': 12987, 'knot': 13783, 'Best': 920, 'ice': 13143, 'cream': 9976, 'Phoenix': 4917, 'The': 6356, 'super': 18942, 'nice': 15089, 'They': 6371, 'give': 12298, 'us': 20082, 'bags': 7951, 'take': 19115, 'our': 15441, 'home': 12949, 'Love': 3849, 'Totally': 6460, 'dissapointed': 10679, 'had': 12620, 'purchased': 16568, 'coupon': 9897, 'from': 12070, 'TravelZoo': 6494, 'try': 19777, 'out': 15444, 'given': 12300, 'its': 13566, 'location': 14160, 'would': 20689, 'thought': 19414, 'it': 13560, 'was': 20383, 'going': 12357, 'be': 8087, 'very': 20200, 'upscale': 20065, 'we': 20427, 'were': 20475, 'expecting': 11403, 'WHOLE': 6781, 'LOT': 3674, 'MORE': 3930, 'Service': 5725, 'great': 12476, 'but': 8727, 'not': 15167, 'worth': 20685, 'itself': 13567, 'outdated': 15448, '80': 294, 'crab': 9933, 'cakes': 8790, 'appertizer': 7611, 'cold': 9450, 'when': 20498, 'served': 17742, 'told': 19539, 'waiter': 20317, 'who': 20531, 'turn': 19800, 'chef': 9148, 'message': 14612, 'passed': 15687, 'back': 7920, 'she': 17816, 'appologized': 7632, 'Husband': 3235, 'ordred': 15403, '12oz': 54, 'New': 4446, 'York': 7050, 'Strip': 6070, 'ala': 7388, 'carte': 8937, '29': 155, '95': 319, 'oz': 15543, 'pure': 16572, 'fat': 11572, 'price': 16352, 'expect': 11398, 'meat': 14534, 'than': 19332, 'ordered': 15397, 'wild': 20562, 'mushroom': 14947, 'pizza': 16012, 'OK': 4541, 'This': 6382, 'needs': 15037, 'an': 7499, 'over': 15477, 'haul': 12737, 'major': 14332, 'way': 20424, 'if': 13171, 'want': 20353, 'make': 14335, 'any': 7562, 'money': 14826, 'at': 7798, 'Saturday': 5634, 'night': 15098, '7pm': 292, 'empty': 11125, 'think': 19393, 'highlight': 12876, 'meal': 14520, 'bottle': 8458, 'cabinet': 8772, 'Costco': 1695, 'Travel': 6493, 'recently': 16805, 'returned': 17156, 'trip': 19746, 
'Big': 945, 'Island': 3356, 'HI': 2959, 'arranged': 7696, 'throught': 19439, 'After': 445, 'shopping': 17884, 'around': 7692, 'found': 11982, 'their': 19354, 'prices': 16355, 'best': 8201, 'phone': 15908, 'able': 7116, 'all': 7406, 'arrangements': 7698, 'airfair': 7370, 'condo': 9644, 'car': 8883, 'didn': 10520, 'with': 20617, 'Bs': 1135, 'on': 15329, 'Hilo': 3126, 'adjust': 7248, 'travel': 19693, 'dates': 10210, 'Everything': 2322, 'according': 7170, 'plan': 16029, 'being': 8155, 'accurate': 7177, 'Condo': 1645, 'outstanding': 15472, 'value': 20126, 'gas': 12203, 'did': 10518, 'some': 18277, 'souvenier': 18349, 'Kona': 3615, 'again': 7327, 'saved': 17515, 've': 20148, 'shopped': 17881, 'years': 20745, 'becoming': 8113, 'groupie': 12538, 'Not': 4508, 'And': 540, 'been': 8124, 'couple': 9894, 'There': 6367, 'advantages': 7280, 'rarely': 16714, 'busy': 8726, 'larger': 13869, 'groups': 12542, 'But': 1181, 'last': 13879, 'few': 11657, 'times': 19489, 'there': 19370, 'experienced': 11411, 'average': 7878, 'lousy': 14229, 'service': 17747, 'When': 6893, 'joining': 13638, 'already': 7440, 'seated': 17653, 'group': 12537, '28': 154, 'high': 12871, 'school': 17572, 'hostess': 13020, 'wagged': 20311, 'her': 12840, 'finger': 11727, 'face': 11487, 'like': 14056, 'East': 2197, 'German': 2763, 'border': 8437, 'guard': 12564, 'waffle': 20307, 'house': 13040, 'waitress': 20320, 'poked': 16119, 'shoulder': 17899, 'attention': 7831, 'wife': 20552, 'chicken': 9168, 'dish': 10637, 'sent': 17721, 'as': 7731, 'measly': 14530, 'pieces': 15950, 'Only': 4615, 'beers': 8128, 'tap': 19150, 'Garbage': 2727, 'muffler': 14922, 'shop': 17877, 'town': 19630, 'VERY': 6656, 'shops': 17885, 'trust': 19771, 'Mighty': 4184, 'Muffler': 4319, 'them': 19357, 'Greg': 2872, 'has': 12723, 'worked': 20664, 'multiple': 14930, 'cars': 8935, 'done': 10757, 'job': 13629, 'If': 3280, 'ever': 11304, 'issues': 13558, 'he': 12755, 'takes': 19120, 'care': 8902, 'problem': 16397, 'no': 15121, 'questions': 16629, 'asked': 7745, 'just': 13684, 'start': 18591, 'off': 15277, 'by': 8754, 'saying': 17530, 'egg': 11042, 'salad': 17409, 'sandwiches': 17468, 'probably': 16395, 'tried': 19736, 'sandwich': 17466, 'anywhere': 7574, 'serves': 17746, 'Sacks': 5567, 'BY': 773, 'FAR': 2366, 'BEST': 729, 'entitled': 11208, 'Dali': 1892, 'life': 14036, 'live': 14126, 'North': 4502, 'will': 20569, 'literally': 14122, 'many': 14396, 'miles': 14670, 'eat': 10977, 'On': 4610, 'top': 19575, 'wonderful': 20642, 'menu': 14596, 'whenever': 20499, 'order': 15396, 'comes': 9492, 'little': 14124, 'cookie': 9794, 'cookies': 9795, 'can': 8830, 'purchase': 16567, 'dough': 10784, 'also': 7442, 'other': 15433, 'delicious': 10352, 'dessert': 10460, 'bars': 8038, 'salads': 17410, 'sale': 17415, 'well': 20470, 'eating': 10986, 'least': 13947, '14': 58, 'hope': 12988, 'NEVER': 4351, 'go': 12342, 'away': 7897, 'Saw': 5644, 'Triple': 6509, 'so': 18226, 'decided': 10271, 'expectations': 11401, 'arrived': 7704, 'shocked': 17863, 'Thursday': 6396, 'morning': 14864, 'line': 14080, 'extra': 11456, 'long': 14180, 'Had': 2995, 'wait': 20315, 'patiently': 15720, 'don': 10752, 'needless': 15036, 'say': 17528, 'because': 8108, 'quickly': 16637, 'pork': 16176, 'chop': 9237, 'eggs': 11049, 'hash': 12724, 'browns': 8610, 'fabulous': 11485, 'toast': 19519, 'made': 14300, 'grape': 12446, 'jelly': 13608, 'recommend': 16829, 'anyone': 7567, 'downtown': 10802, 'area': 7672, 'Make': 3999, 'sure': 18976, 'bacon': 7934, 'hit': 12915, 'Wine': 6932, 'Down': 2097, 'Wednesday': 6863, 'happenin': 12688, 
'Tastings': 6287, 'courtesy': 9906, 'KYOT': 3515, 'Here': 3105, 'what': 20486, 'wrong': 20719, 'event': 11301, 'minutes': 14712, 'maybe': 14510, 'spot': 18479, 'case': 8945, 'hoppin': 12995, 'walked': 20331, 'looks': 14192, 'possibly': 16206, 'manager': 14367, 'owner': 15534, 'then': 19362, 'pointed': 16110, 'towards': 19624, 'room': 17298, 'where': 20500, 'held': 12818, 'about': 7119, 'tables': 19088, 'people': 15808, 'Talk': 6254, 'early': 10957, 'bird': 8255, 'gets': 12265, 'worm': 20675, 'finish': 11735, 'table': 19086, 'sit': 18019, 'patio': 15722, 'She': 5752, 'said': 17403, 'yes': 20764, 'right': 17215, 'help': 12824, 'In': 3301, 'meantime': 14529, 'another': 7539, 'woman': 20636, 'friend': 12047, 'took': 19565, 'nearby': 15022, '20': 109, 'later': 13887, 'looking': 14191, 'came': 8816, 'helped': 12825, 'yet': 20766, 'left': 13962, 'return': 17155, 'small': 18138, 'happy': 12696, 'hour': 13038, 'flier': 11820, 'handed': 12653, 'distance': 10686, 'even': 11298, 'come': 9489, 'know': 13786, 're': 16751, 'wondering': 20645, 'why': 20544, '15': 62, 'wanted': 20355, 'see': 17679, 'how': 13053, 'bad': 7936, 'frankly': 12006, 'wine': 20585, 'It': 3361, 'same': 17443, 'full': 12116, 'detailed': 10470, 'menus': 14598, 'deals': 10236, 'ready': 16765, 'Um': 6605, 'gave': 12219, '10': 16, 'seconds': 17662, 'ago': 7340, 'By': 1193, 'enjoy': 11174, 'look': 14189, 'app': 7592, 'specialty': 18402, 'appetizer': 7615, 'much': 14917, 'whatever': 20487, 'deal': 10230, 'Then': 6364, 'drink': 10855, 'section': 17669, 'claim': 9306, 'glasses': 12314, 'excited': 11353, 'new': 15072, 'wines': 20586, 'reality': 16768, 'consisted': 9706, 'three': 19426, 'different': 10534, 'weren': 20476, 'women': 20637, 'outside': 15470, 'leave': 13950, 'having': 12746, 'enough': 11183, 'shenanigans': 17835, '30': 164, 'since': 17994, 'next': 15084, 'door': 10766, 'Sprouts': 5993, 'apps': 7653, 'own': 15532, 'bet': 8204, 'plenty': 16082, 'experiences': 11412, 'continue': 9751, 'visit': 20255, 'happily': 12694, 'experience': 11410, 'caused': 8984, 'write': 20712, 'usually': 20092, 'instance': 13414, 'wish': 20612, 'actually': 7216, 'review': 17171, 'poor': 16156, 'substandard': 18853, 'giving': 12303, 'understand': 19909, 'isn': 13554, 'better': 8207, 'talented': 19123, 'others': 15434, 'However': 3215, 'business': 8715, 'should': 17898, 'customer': 10136, 'cliched': 9353, 'less': 13998, 'star': 18579, 'Oh': 4591, 'yeah': 20740, 'slapped': 18080, 'down': 10791, 'You': 7053, 'Yelped': 7024, 'card': 8894, 'Take': 6246, 'pepperoni': 15816, 'AMAZED': 362, 'herb': 12841, 'vegetable': 20155, 'garden': 12191, 'front': 12072, 'FRESH': 2395, 'ingredients': 13360, 'these': 19377, 'days': 10220, 'dishes': 10638, 'cheese': 9138, 'platters': 16057, 'mouth': 14899, 'watering': 20411, 'speak': 18385, 'day': 10217, 'll': 14137, 'keep': 13704, 'adding': 7235, 'STAY': 5543, 'TUNED': 6227, 'BTW': 761, 'decor': 10289, 'reminds': 16991, 'VIG': 6660, 'definitely': 10322, 'spacious': 18359, 'dining': 10563, 'sushi': 19002, 'time': 19483, 'garlic': 12195, 'knots': 13785, 'favorite': 11591, 'ALOT': 356, 'Panda': 4778, 'Garden': 2729, 'family': 11532, 'run': 17369, 'Chinese': 1496, 'Food': 2563, 'Restaurant': 5312, 'most': 14874, 'importantly': 13220, 'style': 18830, 'meals': 14521, 'personal': 15861, 'Empress': 2255, 'Chicken': 1478, 'sweeter': 19036, 'Orange': 4629, 'flavor': 11803, 'melts': 14574, 'your': 20782, 'Kung': 3638, 'Pao': 4784, 'Two': 6560, 'Shrimp': 5795, 'leftovers': 13964, 'thanks': 19339, 'generous': 12246, 'portions': 16187, 
'selection': 17695, 'sake': 17407, 'beer': 8127, 'everyone': 11310, 'else': 11083, 'checks': 9129, 'Man': 4009, 'oh': 15298, 'Since': 5817, 'leaving': 13952, 'six': 18027, 'half': 12635, 'search': 17642, 'gotten': 12400, 'point': 16109, 'settled': 17759, 'decent': 10266, 'work': 20663, 'colleague': 9459, 'suggested': 18903, 'dine': 10556, 'corporate': 9840, 'meeting': 14560, 'While': 6899, 'realized': 16770, 'native': 15012, 'Metro': 4157, 'Area': 610, 'never': 15071, 'thing': 19388, 'trusted': 19772, 'his': 12910, 'palate': 15588, 'sense': 17715, 'pleased': 16076, 'saw': 17525, 'diners': 10559, 'predominately': 16272, 'Asian': 653, 'knew': 13773, 'market': 14425, 'things': 19390, 'signaled': 17963, 'thrilled': 19432, 'familiar': 11530, 'Dim': 2022, 'Sum': 6107, 'carts': 8941, 'rolling': 17284, 'through': 19437, 'party': 15682, 'shared': 17803, 'fantastic': 11547, 'sum': 18915, 'succulent': 18876, 'Our': 4660, 'orders': 15400, 'equally': 11234, 'veggies': 20160, 'brightly': 8574, 'colored': 9471, 'tastes': 19180, 'distinctive': 10692, 'Hot': 3203, 'Sour': 5932, 'Soup': 5930, 'perfect': 15826, 'filled': 11695, 'kinds': 13754, 'mushrooms': 14948, 'etc': 11278, 'delighted': 10358, 'reminisced': 16992, 'dinners': 10567, 'Chinatown': 1495, 'NY': 4386, 'comparable': 9543, 'close': 9376, 'regular': 16920, 'Chandler': 1427, 'clear': 9339, 'side': 17949, 'World': 6963, 'figure': 11683, 'Marco': 4028, 'Polo': 5008, 'Palace': 4764, 'REAL': 5169, 'Thing': 6374, 'Friendly': 2628, 'prompt': 16463, 'rounded': 17332, 'Hooray': 3181, 'stayed': 18621, 'hotel': 13028, 'Tuscany': 6551, 'Delicious': 1964, 'Scallops': 5650, 'pasta': 15699, 'Also': 508, 'excellent': 11341, 'server': 17743, 'Pure': 5116, 'Bliss': 983, 'haven': 12744, 'quality': 16608, 'dinner': 10566, 'stupendous': 18822, 'every': 11306, 'cent': 9019, 'MUST': 3941, 'Yin': 7035, 'Yang': 7013, 'Martini': 4062, 'creative': 9989, 'presented': 16315, 'smiles': 18160, 'Smores': 5866, 'Chocolate': 1506, 'desert': 10433, 'hands': 12666, 'connosiuer': 9686, 'critical': 10029, 'So': 5876, 'college': 9465, 'kids': 13737, 'RA': 5154, 'mediocre': 14554, 'list': 14108, 'irk': 13532, 'during': 10931, 'co': 9411, 'workers': 20666, 'PACKED': 4687, 'Good': 2827, 'WRONG': 6802, 'Extremely': 2356, 'meet': 14559, 'coworkers': 9926, 'tired': 19510, 'chill': 9189, 'past': 15698, 'friends': 12052, 'alone': 7432, 'quite': 16647, 'bit': 8268, 'servings': 17751, 'OBVIOUSLY': 4533, 'LOW': 3680, 'Zen': 7086, '32': 175, 'choice': 9224, 'SakeBomber': 5581, 'BETTER': 730, 'Be': 866, 'loud': 14220, 'noises': 15129, 'trying': 19778, 'talk': 19125, 'louder': 14221, 'music': 14951, 'person': 15859, 'Just': 3501, 'overall': 15478, 'tip': 19504, 'priced': 16353, 'works': 20671, 'Horrible': 3195, 'Twice': 6555, 'gone': 12367, 'messing': 14616, 'Starbucks': 6016, 'hard': 12701, 'Believe': 900, 'All': 493, 'attitude': 7836, 'One': 4612, 'lady': 13830, 'doesn': 10727, 'ordering': 15398, 'someone': 18282, 'fix': 11768, 'known': 13791, 'Ya': 7010, 'employees': 11119, 'rude': 17354, 'makes': 14339, 'anymore': 7566, 'stopped': 18710, 'still': 18676, 'mom': 14818, 'goes': 12356, 'coffee': 9442, 'As': 643, 'east': 10973, 'coast': 9416, 'deli': 10343, 'west': 20478, 'Appalachians': 582, 'creams': 9981, 'Pricy': 5075, 'apparently': 7599, 'free': 12019, 'summer': 18919, 'kept': 13712, 'stars': 18590, 'Experience': 2344, 'Hostess': 3202, 'frown': 12079, 'birthday': 8259, 'spirits': 18447, 'written': 20717, 'clarity': 9315, 'margaritas': 14408, 'strong': 18789, 'check': 9118, 'ok': 15306, 'shrimp': 
17926, '18': 77, 'two': 19833, 'mahi': 14313, 'each': 10949, '24': 137, 'RANCID': 5157, 'Yup': 7069, 'fajitas': 11517, 'jumbo': 13673, 'charge': 9091, 'entree': 11213, 'pay': 15745, 'onions': 15336, 'Poor': 5016, 'Mediocre': 4125, 'Never': 4445, 'once': 15330, 'week': 20447, 'Asada': 644, 'Enchilada': 2260, 'avocado': 7884, 'lover': 14236, 'ripest': 17230, 'avocados': 7885, 'F724': 2359, 'bother': 8453, 'asking': 7747, 'entering': 11193, 'reply': 17047, 'spoke': 18466, 'acknowledged': 7189, 'presence': 16312, 'ticket': 19458, 'happened': 12687, 'effort': 11039, 'tell': 19263, 'horrible': 13002, 'scathing': 17552, 'deplorable': 10412, 'Sadly': 5571, 'DMV': 1850, 'modeled': 14785, 'agonizing': 7341, 'depths': 10421, 'Hell': 3092, 'With': 6941, 'musac': 14940, 'score': 17590, 'playing': 16065, 'voracious': 20287, 'stench': 18649, 'air': 7368, 'spastic': 18380, 'children': 9181, 'running': 17372, 'version': 20197, 'politicians': 16134, 'campaigning': 8823, 'late': 13885, 'complete': 9583, 'emissions': 11107, 'test': 19315, 'Those': 6385, 'DMVs': 1851, 'open': 15350, 'Saturdays': 5635, 'map': 14397, 'How': 3212, 'precious': 16265, 'fortunate': 11971, 'near': 15021, 'Seriously': 5719, 'Three': 6388, 'entire': 11206, 'Valley': 6676, 'receiving': 16803, 'such': 18878, 'magnificent': 14309, 'welcome': 20466, 'among': 7487, 'painful': 15569, 'To': 6423, 'simplify': 17989, 'let': 14002, 'feel': 11624, 'constitutes': 9717, 'acceptable': 7143, 'behavior': 8150, 'public': 16521, 'Bathe': 857, 'combination': 9482, 'water': 20407, 'soap': 18230, 'along': 7433, 'massaging': 14469, 'skin': 18058, 'creates': 9985, 'outcome': 15447, 'pleasing': 16077, 'scrub': 17624, 'behind': 8151, 'ears': 10963, 'reference': 16871, 'website': 20439, 'notorious': 15183, 'http': 13061, 'www': 20724, 'craigslist': 9950, ...}

Using CountVectorizer in a Model

DTM

# Use default options for CountVectorizer.
vect = CountVectorizer()

# Create document-term matrices.
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Use Naive Bayes to predict the star rating.
nb = MultinomialNB()  # MultinomialNB is well suited to word-count features.
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# Calculate accuracy.
print((metrics.accuracy_score(y_test, y_pred_class)))
0.9187866927592955
from sklearn.ensemble import RandomForestClassifier

# Use default options for CountVectorizer.
vect = CountVectorizer()

# Create document-term matrices.
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Use a random forest to predict the star rating.
rf = RandomForestClassifier()
rf.fit(X_train_dtm, y_train)
y_pred_class = rf.predict(X_test_dtm)

# Calculate accuracy.
print((metrics.accuracy_score(y_test, y_pred_class)))
0.8688845401174168
from sklearn.linear_model import LogisticRegressionCV

# Use default options for CountVectorizer.
vect = CountVectorizer()

# Create document-term matrices.
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Use logistic regression (with built-in cross-validation) to predict the star rating.
lr = LogisticRegressionCV()
lr.fit(X_train_dtm, y_train)
y_pred_class = lr.predict(X_test_dtm)

# Calculate accuracy.
print((metrics.accuracy_score(y_test, y_pred_class)))
0.9266144814090019
print(metrics.classification_report(y_test, y_pred_class))
precision recall f1-score support 1 0.80 0.79 0.79 184 5 0.95 0.96 0.96 838 avg / total 0.93 0.93 0.93 1022
metrics.confusion_matrix(y_test, y_pred_class)
array([[145, 39], [ 36, 802]])
y_test.value_counts()
5 838 1 184 Name: stars, dtype: int64
y_test.value_counts(normalize=True)  # normalize=True returns proportions; equivalent to the calculation below
5 0.819961 1 0.180039 Name: stars, dtype: float64
y_test.value_counts() / y_test.value_counts().sum()
5 0.819961 1 0.180039 Name: stars, dtype: float64
# Calculate null accuracy.
y_test_binary = np.where(y_test==5, 1, 0)  # Five-star reviews become 1, one-star reviews become 0.
print('Percent 5 Stars:', y_test_binary.mean())
print('Percent 1 Stars:', 1 - y_test_binary.mean())
Percent 5 Stars: 0.8199608610567515 Percent 1 Stars: 0.18003913894324852

Our model achieved roughly 92% accuracy, an improvement over the 82% null-accuracy baseline (the accuracy we would get by always predicting 5 stars).

Let's look more into how the vectorizer works.

# Notice how the data was transformed into this sparse matrix with 3,064 datapoints and 16,825 features!
# - Recall that vectorizations of text will be mostly zeros, since only a few unique words are in each document.
# - For that reason, instead of storing all the zeros we only store non-zero values (inside the 'sparse matrix' data structure!).
# - We have 3,064 Yelp reviews in our training set.
# - 16,825 unique words were found across all documents.
X_train_dtm
<3064x16825 sparse matrix of type '<class 'numpy.int64'>' with 237720 stored elements in Compressed Sparse Row format>
# Let's take a look at the vocabulary that was generated, containing 16,825 unique words.
# 'vocabulary_' is a dictionary that maps each word to its column index in the sparse matrix.
# - For example, the word "four" maps to column 6089 in the output below.
vect.vocabulary_
{'filly': 5773, 'only': 10362, 'reviews': 12465, 'nine': 10069, 'now': 10180, 'wow': 16612, 'do': 4631, 'miss': 9578, 'this': 15093, 'place': 11186, '24hrs': 136, 'drive': 4809, 'thru': 15136, 'or': 10413, 'walk': 16195, 'up': 15834, 'ridiculously': 12514, 'cheap': 2789, 'tasty': 14838, 'of': 10286, 'course': 3679, 'the': 15032, 'arizona': 1018, 'burritos': 2286, 'are': 1003, 'good': 6571, 'everything': 5342, 'is': 7956, 'used': 15885, 'to': 15228, 'love': 8899, 'one': 10354, 'combos': 3233, 'you': 16727, 'get': 6433, 'beef': 1564, 'burrito': 2285, 'taco': 14720, 'rice': 12489, 'and': 805, 'beans': 1528, 'for': 6028, 'under': 15683, 'color': 3213, 'me': 9301, 'silly': 13462, 'call': 2398, 'sally': 12808, 'they': 15067, 'have': 7023, 'bomb': 1902, 'horchata': 7345, 'too': 15281, 'really': 12038, 'fresh': 6154, 'flautas': 5882, 'rolled': 12622, 'tacos': 14721, 'breakfast': 2069, 'damn': 3999, 'here': 7149, 'whether': 16394, 'drunk': 4837, 'sober': 13745, 'my': 9868, 'husband': 7476, 'absolutely': 353, 'restaurant': 12410, 'anytime': 889, 'find': 5790, 'myself': 9871, 'craving': 3757, 'mexican': 9440, 'food': 6006, 'first': 5828, 'that': 15027, 'pops': 11347, 'in': 7619, 'head': 7042, 'salsa': 12814, 'blanca': 1780, 'we': 16306, 'always': 735, 'encountered': 5144, 'friendly': 6171, 'welcoming': 16355, 'staff': 14107, 'amazing': 752, 'fulfilling': 6251, 'what': 16377, 'more': 9735, 'could': 3655, 'ask': 1101, 'went': 16361, 'today': 15237, 'after': 563, 'lunch': 8945, 'got': 6605, 'usual': 15893, 'lime': 8707, 'basil': 1477, 'real': 12029, 'mint': 9540, 'chip': 2896, 'which': 16396, 'leaves': 8570, 'hubby': 7420, 'chocolate': 2914, 'guiness': 6820, 'four': 6089, 'peaks': 10866, 'hop': 7329, 'knot': 8348, 'best': 1657, 'ice': 7503, 'cream': 3766, 'phoenix': 11045, 'super': 14545, 'nice': 10041, 'give': 6481, 'us': 15879, 'bags': 1345, 'take': 14743, 'our': 10493, 'home': 7283, 'totally': 15327, 'dissapointed': 4585, 'had': 6868, 'purchased': 11785, 'coupon': 3676, 'from': 6195, 'travelzoo': 15435, 'try': 15520, 'out': 10496, 'given': 6483, 'its': 7980, 'location': 8814, 'would': 16606, 'thought': 15104, 'it': 7971, 'was': 16252, 'going': 6551, 'be': 1522, 'very': 16032, 'upscale': 15861, 'were': 16362, 'expecting': 5440, 'whole': 16429, 'lot': 8881, 'service': 13193, 'great': 6693, 'but': 2312, 'not': 10148, 'worth': 16601, 'itself': 7981, 'outdated': 10500, '80': 284, 'crab': 3721, 'cakes': 2387, 'appertizer': 933, 'cold': 3192, 'when': 16389, 'served': 13188, 'told': 15254, 'waiter': 16181, 'who': 16425, 'turn': 15556, 'chef': 2828, 'message': 9416, 'passed': 10776, 'back': 1310, 'she': 13272, 'appologized': 957, 'ordred': 10434, '12oz': 52, 'new': 10013, 'york': 16724, 'strip': 14358, 'ala': 646, 'carte': 2573, '29': 153, '95': 307, 'oz': 10602, 'pure': 11789, 'fat': 5622, 'price': 11555, 'expect': 5435, 'meat': 9318, 'than': 15015, 'ordered': 10428, 'wild': 16461, 'mushroom': 9844, 'pizza': 11174, 'ok': 10322, 'needs': 9972, 'an': 797, 'over': 10532, 'haul': 7014, 'major': 9048, 'way': 16300, 'if': 7534, 'want': 16220, 'make': 9051, 'any': 879, 'money': 9689, 'at': 1157, 'saturday': 12905, 'night': 10052, '7pm': 282, 'empty': 5125, 'think': 15082, 'highlight': 7186, 'meal': 9303, 'bottle': 1981, 'cabinet': 2365, 'costco': 3635, 'travel': 15428, 'recently': 12073, 'returned': 12444, 'trip': 15483, 'big': 1704, 'island': 7958, 'hi': 7168, 'arranged': 1038, 'throught': 15130, 'shopping': 13361, 'around': 1033, 'found': 6083, 'their': 15039, 'prices': 11559, 'phone': 11049, 'able': 340, 'all': 
679, 'arrangements': 1040, 'airfair': 621, 'condo': 3394, 'car': 2505, 'didn': 4411, 'with': 16526, 'bs': 2197, 'on': 10352, 'hilo': 7206, 'adjust': 485, 'dates': 4053, 'according': 398, 'plan': 11196, 'being': 1599, 'accurate': 405, 'outstanding': 10527, 'value': 15939, 'gas': 6356, 'did': 4409, 'some': 13798, 'souvenier': 13885, 'kona': 8363, 'again': 573, 'saved': 12923, 've': 15968, 'shopped': 13358, 'years': 16679, 'becoming': 1555, 'groupie': 6766, 'been': 1567, 'couple': 3672, 'there': 15057, 'advantages': 520, 'rarely': 11967, 'busy': 2311, 'larger': 8478, 'groups': 6770, 'last': 8490, 'few': 5720, 'times': 15190, 'experienced': 5448, 'average': 1253, 'lousy': 8897, 'joining': 8119, 'already': 719, 'seated': 13085, 'group': 6765, '28': 152, 'high': 7181, 'school': 12990, 'hostess': 7369, 'wagged': 16174, 'her': 7142, 'finger': 5800, 'face': 5528, 'like': 8695, 'east': 4943, 'german': 6428, 'border': 1953, 'guard': 6800, 'waffle': 16169, 'house': 7390, 'waitress': 16184, 'poked': 11295, 'shoulder': 13376, 'attention': 1197, 'wife': 16449, 'chicken': 2855, 'dish': 4540, 'sent': 13162, 'as': 1080, 'measly': 9314, 'pieces': 11099, 'beers': 1571, 'tap': 14792, 'garbage': 6337, 'muffler': 9815, 'shop': 13354, 'town': 15357, 'shops': 13362, 'trust': 15513, 'mighty': 9482, 'them': 15042, 'greg': 6714, 'has': 6999, 'worked': 16580, 'multiple': 9824, 'cars': 2571, 'done': 4687, 'job': 8105, 'ever': 5333, 'issues': 7968, 'he': 7041, 'takes': 14748, 'care': 2525, 'problem': 11605, 'no': 10084, 'questions': 11856, 'asked': 1102, 'just': 8189, 'start': 14159, 'off': 10287, 'by': 2346, 'saying': 12940, 'egg': 5023, 'salad': 12788, 'sandwiches': 12862, 'probably': 11603, 'tried': 15473, 'sandwich': 12860, 'anywhere': 892, 'serves': 13192, 'sacks': 12762, 'far': 5600, 'entitled': 5217, 'dali': 3991, 'life': 8673, 'live': 8775, 'north': 10136, 'will': 16470, 'literally': 8770, 'many': 9120, 'miles': 9493, 'eat': 4949, 'top': 15293, 'wonderful': 16554, 'menu': 9396, 'whenever': 16390, 'order': 10427, 'comes': 3237, 'little': 8773, 'cookie': 3558, 'cookies': 3559, 'can': 2435, 'purchase': 11784, 'dough': 4722, 'also': 721, 'other': 10485, 'delicious': 4220, 'dessert': 4342, 'bars': 1458, 'salads': 12789, 'sale': 12796, 'well': 16356, 'eating': 4958, 'least': 8566, '14': 56, 'hope': 7330, 'never': 10012, 'go': 6532, 'away': 1275, 'saw': 12934, 'triple': 15485, 'so': 13738, 'decided': 4131, 'expectations': 5438, 'arrived': 1048, 'shocked': 13339, 'thursday': 15145, 'morning': 9742, 'line': 8724, 'extra': 5494, 'long': 8842, 'wait': 16179, 'patiently': 10810, 'don': 4682, 'needless': 9971, 'say': 12938, 'because': 1548, 'quickly': 11864, 'pork': 11358, 'chop': 2938, 'eggs': 5030, 'hash': 7000, 'browns': 2168, 'fabulous': 5526, 'toast': 15229, 'made': 8998, 'grape': 6659, 'jelly': 8061, 'recommend': 12097, 'anyone': 885, 'downtown': 4743, 'area': 1004, 'sure': 14583, 'bacon': 1324, 'hit': 7232, 'wine': 16492, 'down': 4732, 'wednesday': 16325, 'happenin': 6955, 'tastings': 14837, 'courtesy': 3685, 'kyot': 8401, 'wrong': 16639, 'event': 5330, 'minutes': 9546, 'maybe': 9272, 'spot': 14035, 'case': 2583, 'hoppin': 7339, 'walked': 16197, 'looks': 8855, 'possibly': 11391, 'manager': 9088, 'owner': 10590, 'then': 15047, 'pointed': 11285, 'towards': 15350, 'room': 12641, 'where': 16391, 'held': 7114, 'about': 343, 'tables': 14711, 'people': 10927, 'talk': 14757, 'early': 4926, 'bird': 1735, 'gets': 6435, 'worm': 16591, 'finish': 5808, 'table': 14709, 'sit': 13510, 'patio': 10812, 'said': 12778, 'yes': 16698, 
'right': 12522, 'help': 7124, 'meantime': 9312, 'another': 850, 'woman': 16548, 'friend': 6167, 'took': 15283, 'nearby': 9954, '20': 107, 'later': 8498, 'looking': 8853, 'came': 2417, 'helped': 7125, 'yet': 16701, 'left': 8583, 'return': 12443, 'small': 13638, 'happy': 6963, 'hour': 7388, 'flier': 5905, 'handed': 6914, 'distance': 4592, 'even': 5327, 'come': 3234, 'know': 8351, 're': 12013, 'wondering': 16557, 'why': 16440, '15': 60, 'wanted': 16222, 'see': 13113, 'how': 7404, 'bad': 1326, 'frankly': 6118, 'same': 12833, 'full': 6252, 'detailed': 4352, 'menus': 9398, 'deals': 4093, 'ready': 12028, 'um': 15635, 'gave': 6375, '10': 16, 'seconds': 13096, 'ago': 587, 'enjoy': 5179, 'look': 8851, 'app': 913, 'specialty': 13945, 'appetizer': 938, 'much': 9807, 'whatever': 16378, 'deal': 4087, 'drink': 4801, 'section': 13103, 'claim': 3029, 'glasses': 6497, 'excited': 5387, 'wines': 16493, 'reality': 12031, 'consisted': 3467, 'three': 15116, 'different': 4427, 'weren': 16363, 'women': 16549, 'outside': 10525, 'leave': 8569, 'having': 7026, 'enough': 5189, 'shenanigans': 13295, '30': 162, 'since': 13482, 'next': 10031, 'door': 4699, 'sprouts': 14064, 'apps': 978, 'own': 10588, 'bet': 1660, 'plenty': 11250, 'experiences': 5449, 'continue': 3514, 'visit': 16107, 'happily': 6961, 'experience': 5447, 'caused': 2628, 'write': 16632, 'usually': 15894, 'instance': 7807, 'wish': 16521, 'actually': 447, 'review': 12460, 'poor': 11337, 'substandard': 14446, 'giving': 6486, 'understand': 15696, 'isn': 7962, 'better': 1667, 'talented': 14753, 'others': 10486, 'however': 7407, 'business': 2300, 'should': 13375, 'customer': 3946, 'cliched': 3082, 'less': 8626, 'star': 14146, 'oh': 10310, 'yeah': 16674, 'slapped': 13575, 'yelped': 16689, 'card': 2516, 'pepperoni': 10937, 'amazed': 749, 'herb': 7143, 'vegetable': 15976, 'garden': 6340, 'front': 6197, 'ingredients': 7747, 'these': 15065, 'days': 4067, 'dishes': 4541, 'cheese': 2818, 'platters': 11225, 'mouth': 9782, 'watering': 16285, 'speak': 13926, 'day': 4064, 'll': 8788, 'keep': 8244, 'adding': 469, 'stay': 14186, 'tuned': 15547, 'btw': 2198, 'decor': 4149, 'reminds': 12272, 'vig': 16066, 'definitely': 4188, 'spacious': 13898, 'dining': 4462, 'sushi': 14612, 'time': 15184, 'garlic': 6346, 'knots': 8350, 'favorite': 5643, 'alot': 716, 'panda': 10687, 'family': 5581, 'run': 12728, 'chinese': 2894, 'most': 9757, 'importantly': 7596, 'style': 14421, 'meals': 9304, 'personal': 10987, 'empress': 5122, 'sweeter': 14655, 'orange': 10415, 'flavor': 5884, 'melts': 9371, 'your': 16732, 'kung': 8396, 'pao': 10705, 'two': 15598, 'shrimp': 13403, 'leftovers': 8585, 'thanks': 15024, 'generous': 6407, 'portions': 11369, 'selection': 13131, 'sake': 12784, 'beer': 1570, 'everyone': 5340, 'else': 5078, 'checks': 2809, 'man': 9083, 'leaving': 8571, 'six': 13519, 'half': 6889, 'search': 13073, 'gotten': 6610, 'point': 11283, 'settled': 13205, 'decent': 4125, 'work': 16579, 'colleague': 3202, 'suggested': 14496, 'dine': 4454, 'corporate': 3610, 'meeting': 9348, 'while': 16398, 'realized': 12033, 'native': 9933, 'metro': 9435, 'thing': 15077, 'trusted': 15514, 'his': 7227, 'palate': 10656, 'sense': 13156, 'pleased': 11243, 'diners': 4458, 'predominately': 11467, 'asian': 1098, 'knew': 8337, 'market': 9165, 'things': 15079, 'signaled': 13445, 'thrilled': 15123, 'familiar': 5579, 'dim': 4448, 'sum': 14510, 'carts': 2578, 'rolling': 12624, 'through': 15128, 'party': 10768, 'shared': 13254, 'fantastic': 5596, 'succulent': 14469, 'orders': 10431, 'equally': 5244, 'veggies': 15981, 
'brightly': 2120, 'colored': 3215, 'tastes': 14830, 'distinctive': 4598, 'hot': 7374, 'sour': 13871, 'soup': 13868, 'perfect': 10949, 'filled': 5766, 'kinds': 8310, 'mushrooms': 9845, 'etc': 5301, 'delighted': 4226, 'reminisced': 12273, 'dinners': 4466, 'chinatown': 2893, 'ny': 10224, 'comparable': 3290, 'close': 3108, 'regular': 12197, 'chandler': 2741, 'clear': 3067, 'side': 13428, 'world': 16588, 'figure': 5750, 'marco': 9133, 'polo': 11314, 'palace': 10653, 'prompt': 11671, 'rounded': 12678, 'hooray': 7326, 'stayed': 14189, 'hotel': 7377, 'tuscany': 15566, 'scallops': 12951, 'pasta': 10788, 'excellent': 5374, 'server': 13189, 'bliss': 1820, 'haven': 7024, 'quality': 11833, 'dinner': 4465, 'stupendous': 14413, 'every': 5336, 'cent': 2674, 'must': 9856, 'yin': 16705, 'yang': 16663, 'martini': 9195, 'creative': 3779, 'presented': 11514, 'smiles': 13664, 'smores': 13684, 'desert': 4315, 'hands': 6928, 'connosiuer': 3444, 'critical': 3821, 'college': 3209, 'kids': 8288, 'ra': 11887, 'mediocre': 9341, 'list': 8755, 'irk': 7938, 'during': 4891, 'co': 3145, 'workers': 16582, 'packed': 10616, 'extremely': 5503, 'meet': 9347, 'coworkers': 3708, 'tired': 15214, 'chill': 2878, 'past': 10787, 'friends': 6172, 'alone': 710, 'quite': 11877, 'bit': 1749, 'servings': 13197, 'obviously': 10256, 'low': 8911, 'zen': 16785, '32': 172, 'choice': 2922, 'sakebomber': 12785, 'loud': 8886, 'noises': 10098, 'trying': 15521, 'louder': 8887, 'music': 9848, 'person': 10985, 'overall': 10533, 'tip': 15207, 'priced': 11556, 'works': 16587, 'horrible': 7351, 'twice': 15586, 'gone': 6564, 'messing': 9420, 'starbucks': 14148, 'hard': 6969, 'believe': 1604, 'attitude': 1202, 'lady': 8427, 'doesn': 4648, 'ordering': 10429, 'someone': 13803, 'fix': 5846, 'known': 8356, 'ya': 16655, 'employees': 5117, 'rude': 12708, 'makes': 9055, 'anymore': 884, 'stopped': 14290, 'still': 14252, 'mom': 9679, 'goes': 6549, 'coffee': 3182, 'coast': 3151, 'deli': 4210, 'west': 16365, 'appalachians': 914, 'creams': 3771, 'pricy': 11566, 'apparently': 921, 'free': 6134, 'summer': 14515, 'kept': 8256, 'stars': 14158, 'frown': 6207, 'birthday': 1739, 'spirits': 13998, 'written': 16637, 'clarity': 3042, 'margaritas': 9140, 'strong': 14376, 'check': 2797, '18': 75, 'mahi': 9025, 'each': 4917, '24': 135, 'rancid': 11948, 'yup': 16765, 'fajitas': 5564, 'jumbo': 8173, 'charge': 2763, 'entree': 5222, 'pay': 10844, 'onions': 10360, 'once': 10353, 'week': 16329, 'asada': 1081, 'enchilada': 5138, 'avocado': 1260, 'lover': 8905, 'ripest': 12542, 'avocados': 1261, 'f724': 5521, 'bother': 1975, 'asking': 1104, 'entering': 5201, 'reply': 12331, 'spoke': 14020, 'acknowledged': 419, 'presence': 11511, 'ticket': 15153, 'happened': 6954, 'effort': 5020, 'tell': 14925, 'scathing': 12969, 'deplorable': 4288, 'sadly': 12768, 'dmv': 4628, 'modeled': 9641, 'agonizing': 588, 'depths': 4297, 'hell': 7117, 'musac': 9837, 'score': 13009, 'playing': 11233, 'voracious': 16143, 'stench': 14221, 'air': 619, 'spastic': 13921, 'children': 2869, 'running': 12731, 'version': 16028, 'politicians': 11310, 'campaigning': 2426, 'late': 8496, 'complete': 3331, 'emissions': 5104, 'test': 14990, 'those': 15101, 'dmvs': 4629, 'open': 10379, 'saturdays': 12906, 'map': 9121, 'precious': 11458, 'fortunate': 6071, 'near': 9953, 'seriously': 13180, 'entire': 5215, 'valley': 15936, 'receiving': 12071, 'such': 14471, 'magnificent': 9018, 'welcome': 16352, 'among': 781, 'painful': 10633, 'simplify': 13476, 'let': 8630, 'feel': 5678, 'constitutes': 3479, 'acceptable': 371, 'behavior': 1594, 
'public': 11732, 'bathe': 1492, 'combination': 3227, 'water': 16280, 'soap': 13742, 'along': 711, 'massaging': 9221, 'skin': 13553, 'creates': 3775, 'outcome': 10499, 'pleasing': 11244, 'scrub': 13050, 'behind': 1595, 'ears': 4932, 'reference': 12142, 'website': 16318, 'notorious': 10164, 'http': 7416, 'www': 16645, 'craigslist': 3740, 'org': 10443, 'htf': 7414, '755891987': 278, 'html': 7415, 'waiting': 16183, 'tapping': 14800, 'foot': 6022, 'rocking': 12609, 'forth': 6070, 'chair': 2712, 'bouncing': 1999, 'leg': 8586, 'aren': 1006, 'faster': 5620, 'watching': 16279, 'clock': 3102, 'anxious': 877, 'leash': 8563, 'cannot': 2465, 'quietly': 11868, 'winter': 16506, 'fine': 5794, 'cute': 3953, 'monster': 9702, 'aisles': 634, 'honey': 7307, 'bun': 2261, 'squeezing': 14083, 'between': 1671, 'fingers': 5805, 'fake': 5566, 'falls': 5575, 'five': 5845, 'row': 12686, 'maintain': 9039, 'quiet': 11866, 'tones': 15272, 'ensure': 5197, 'appropriate': 973, 'conversations': 3541, 'taking': 14749, 'making': 9060, 'calls': 2404, 'sorry': 13851, 'stripping': 14366, 'popped': 11342, 'into': 7877, 'mind': 9515, 'failed': 5550, 'audition': 1218, 'le': 8539, 'girl': 6470, 'continued': 3515, 'conversation': 3539, 'woodworker': 16571, 'acknowledge': 418, 'humans': 7440, 'fact': 5540, 'sideways': 13435, 'disappear': 4490, 'quit': 11876, 'bumping': 2259, 'hitting': 7235, 'unless': 15770, 'flirting': 5917, 'need': 9966, 'level': 8636, 'brilliant': 2122, 'recipe': 12083, 'dashing': 4050, 'dreams': 4778, 'gutting': 6850, 'humanity': 7439, 'manage': 9084, 'painfully': 10634, 'lit': 8766, 'reeks': 12137, 'brother': 2158, 'old': 10328, 'gym': 6853, 'shoes': 13345, 'dad': 3979, 'dirty': 4486, 'underwear': 15703, 'award': 1271, ...}
# Finally, let's convert the sparse matrix to a typical ndarray using .toarray().
# - Remember, this takes up a lot more memory than the sparse matrix! However, this conversion is sometimes necessary.
X_test_dtm.toarray()
array([[0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], ..., [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0], [0, 0, 0, ..., 0, 0, 0]])
# We will use this function below for simplicity.
# Define a function that accepts a vectorizer and calculates the accuracy.
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print(('Features: ', X_train_dtm.shape[1]))
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print(('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class)))
# min_df=2 ignores terms that appear in fewer than two documents ('df' means "document frequency").
vect = CountVectorizer(min_df=2, max_features=10000)
tokenize_test(vect)
('Features: ', 8783) ('Accuracy: ', 0.9246575342465754)

Let's take a look next at other ways of preprocessing text!

  • Objective: Demonstrate common text preprocessing techniques.

N-Grams

N-grams are features that consist of N consecutive words. They are useful because, in a bag-of-words model, treating "data scientist" as a single feature carries more meaning than having the two independent features "data" and "scientist"!

Example:

my cat is awesome

Unigrams (1-grams): 'my', 'cat', 'is', 'awesome'
Bigrams (2-grams): 'my cat', 'cat is', 'is awesome'
Trigrams (3-grams): 'my cat is', 'cat is awesome'
4-grams: 'my cat is awesome'
  • ngram_range: tuple (min_n, max_n)

  • The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
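We can verify the example above directly with CountVectorizer (a small sketch; note that newer scikit-learn releases rename get_feature_names to get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

toy = ['my cat is awesome']
vect_ngram = CountVectorizer(ngram_range=(1, 3))
vect_ngram.fit(toy)
print(vect_ngram.get_feature_names())
# ['awesome', 'cat', 'cat is', 'cat is awesome', 'is', 'is awesome', 'my', 'my cat', 'my cat is']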

# Include 1-grams and 2-grams.
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
(3064, 169847)

We can start to see how supplementing our features with n-grams leads to many more feature columns. A single document with W words contributes at most W - n + 1 distinct n-grams of length n (for example, a 100-word review yields at most 99 bigrams). That said, be careful: when we compute n-grams over an entire corpus, the number of unique n-grams can be vastly higher than the number of unique unigrams. This can cause an undesired feature explosion.

Although we sometimes add important new features that have meaning such as data scientist, many of the new features will just be noise. So, particularly if we do not have much data, adding n-grams can actually decrease model performance. This is because if each n-gram is only present once or twice in the training set, we are effectively adding mostly noisy features to the mix.

# Last 50 features
print((vect.get_feature_names()[-50:]))
['zone out', 'zone when', 'zones', 'zones dolls', 'zoning', 'zoning issues', 'zoo', 'zoo and', 'zoo is', 'zoo not', 'zoo the', 'zoo ve', 'zoyo', 'zoyo for', 'zucca', 'zucca appetizer', 'zucchini', 'zucchini and', 'zucchini bread', 'zucchini broccoli', 'zucchini carrots', 'zucchini fries', 'zucchini pieces', 'zucchini strips', 'zucchini veal', 'zucchini very', 'zucchini with', 'zuchinni', 'zuchinni again', 'zuchinni the', 'zumba', 'zumba class', 'zumba or', 'zumba yogalates', 'zupa', 'zupa flavors', 'zuzu', 'zuzu in', 'zuzu is', 'zuzu the', 'zwiebel', 'zwiebel kräuter', 'zzed', 'zzed in', 'éclairs', 'éclairs napoleons', 'école', 'école lenôtre', 'ém', 'ém all']
# Show vectorizer options.
vect
CountVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.int64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 2), preprocessor=None, stop_words=None, strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
  • stop_words: string {english}, list, or None (default)

  • If english, a built-in stop word list for English is used.

  • If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.

  • If None, no stop words will be used. Alternatively, max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms. (For example, if max_df=0.7 and a word appears in more than 70% of documents, it will not be included in the feature set.)
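For example, max_df can filter corpus-specific "stop words" without a hand-built list. A minimal sketch on a made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the pizza was great', 'the service was slow',
        'the staff was friendly', 'great pizza downtown']

# Drop any term appearing in more than 70% of documents ('the' and 'was' appear in 3 of 4).
vect_maxdf = CountVectorizer(max_df=0.7)
vect_maxdf.fit(docs)
print(vect_maxdf.get_feature_names())  # ['downtown', 'friendly', 'great', 'pizza', 'service', 'slow', 'staff']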

Stop-Word Removal

  • What: This process is used to remove common words that will likely appear in any text.

  • Why: Because common words exist in most documents, they likely only add noise to your model and should be removed.

What are stop words? Stop words are some of the most common words in a language. They are used so that a sentence makes sense grammatically, such as prepositions and determiners, e.g., "to," "the," "and." However, they are so commonly used that they are generally worthless for predicting the class of a document. Since "a" appears in spam and non-spam emails, for example, it would only contribute noise to our model.

Example:

  1. Original sentence: "The dog jumped over the fence"

  2. After stop-word removal: "dog jumped over fence"

The fact that there is a fence and a dog jumped over it can be derived with or without stop words.
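The same filtering can be done by hand with scikit-learn's built-in English stop-word list (a small sketch; note that this particular list also contains "over", so the result is slightly more aggressive than the example above):

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "The dog jumped over the fence"
print([w for w in sentence.lower().split() if w not in ENGLISH_STOP_WORDS])
# ['dog', 'jumped', 'fence']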

# Remove English stop words.
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)
vect.get_params()
('Features: ', 16528) ('Accuracy: ', 0.9158512720156555)
{'analyzer': 'word', 'binary': False, 'decode_error': 'strict', 'dtype': numpy.int64, 'encoding': 'utf-8', 'input': 'content', 'lowercase': True, 'max_df': 1.0, 'max_features': None, 'min_df': 1, 'ngram_range': (1, 1), 'preprocessor': None, 'stop_words': 'english', 'strip_accents': None, 'token_pattern': '(?u)\\b\\w\\w+\\b', 'tokenizer': None, 'vocabulary': None}
# Set of stop words
print((vect.get_stop_words()))
frozenset({'nevertheless', 'then', 'along', 'there', 'third', 'hasnt', 'hence', 'into', 'below', 'call', 'anything', 'wherever', 'become', 'wherein', 'not', 'find', 'always', 'thereby', 'an', 'toward', 'what', 'none', 'be', 'both', 'have', 'nine', 'otherwise', 'when', 'bottom', 'co', 'amongst', 'will', 'couldnt', 'becomes', 'so', 'anyway', 'also', 'ie', 'some', 'upon', 'that', 'throughout', 'why', 'detail', 'perhaps', 'above', 'give', 'everyone', 'mill', 'she', 'except', 'was', 'they', 'your', 'name', 'down', 'much', 'around', 'con', 'often', 'others', 'back', 'we', 'whatever', 'describe', 'himself', 'without', 'had', 'under', 'them', 'please', 'for', 'ten', 'thick', 'one', 'are', 'neither', 'cannot', 'during', 'the', 'latter', 'these', 'hereafter', 'ever', 'every', 're', 'any', 'inc', 'most', 'beyond', 'seeming', 'rather', 'how', 'in', 'ltd', 'and', 'fifty', 'while', 'within', 'been', 'yourself', 'became', 'however', 'mine', 'were', 'next', 'very', 'you', 'forty', 'whereafter', 'other', 'go', 'alone', 'still', 'moreover', 'myself', 'seems', 'thereupon', 'somehow', 'first', 'can', 'my', 'over', 'hereby', 'see', 'take', 'than', 'us', 'among', 'whoever', 'its', 'too', 'mostly', 'everywhere', 'bill', 'now', 'ourselves', 'another', 'seem', 'anyone', 'herself', 'indeed', 'whereas', 'should', 'fifteen', 'nor', 'system', 'meanwhile', 'found', 'itself', 'which', 'towards', 'until', 'as', 'cant', 'interest', 'nowhere', 'several', 'afterwards', 'since', 'whom', 'serious', 'latterly', 'whenever', 'me', 'her', 'namely', 'all', 'already', 'fill', 'amount', 'due', 'between', 'eleven', 'a', 'fire', 'further', 'noone', 'de', 'or', 'this', 'almost', 'two', 'whole', 'less', 'off', 'do', 'yet', 'never', 'last', 'again', 'before', 'herein', 'therein', 'could', 'even', 'against', 'three', 'eight', 'whereupon', 'front', 'at', 'else', 'hundred', 'of', 'six', 'across', 'eg', 'out', 'up', 'is', 'anyhow', 'same', 'only', 'it', 'thin', 'would', 'anywhere', 'enough', 'per', 'well', 'our', 'whether', 'twelve', 'nothing', 'done', 'beside', 'he', 'to', 'if', 'either', 'full', 'yourselves', 'here', 'must', 'him', 'part', 'side', 'etc', 'whose', 'may', 'everything', 'ours', 'no', 'show', 'someone', 'former', 'formerly', 'nobody', 'has', 'empty', 'from', 'being', 'whereby', 'with', 'because', 'made', 'via', 'top', 'twenty', 'more', 'elsewhere', 'on', 'own', 'get', 'sometime', 'who', 'becoming', 'something', 'keep', 'might', 'about', 'put', 'five', 'am', 'besides', 'move', 'but', 'onto', 'yours', 'those', 'their', 'cry', 'thru', 'least', 'behind', 'thereafter', 'where', 'amoungst', 'though', 'whence', 'sometimes', 'un', 'few', 'seemed', 'hereupon', 'themselves', 'each', 'thence', 'once', 'together', 'therefore', 'such', 'through', 'i', 'by', 'four', 'hers', 'his', 'many', 'sincere', 'somewhere', 'thus', 'whither', 'sixty', 'although', 'after', 'beforehand'})

Other CountVectorizer Options

  • max_features: int or None, default=None

  • If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus. This lets us keep the most common terms and drop ones that appear only once or twice; words that occur only once can become spuriously associated with a class and cause overfitting.

# Remove English stop words and only keep 100 features.
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)
('Features: ', 100) ('Accuracy: ', 0.8698630136986302)
# All 100 features
print(vect.get_feature_names())
['amazing', 'area', 'atmosphere', 'awesome', 'bad', 'bar', 'best', 'better', 'big', 'came', 'cheese', 'chicken', 'clean', 'coffee', 'come', 'day', 'definitely', 'delicious', 'did', 'didn', 'dinner', 'don', 'eat', 'excellent', 'experience', 'favorite', 'feel', 'food', 'free', 'fresh', 'friendly', 'friends', 'going', 'good', 'got', 'great', 'happy', 'home', 'hot', 'hour', 'just', 'know', 'like', 'little', 'll', 'location', 'long', 'looking', 'lot', 'love', 'lunch', 'make', 'meal', 'menu', 'minutes', 'need', 'new', 'nice', 'night', 'order', 'ordered', 'people', 'perfect', 'phoenix', 'pizza', 'place', 'pretty', 'prices', 'really', 'recommend', 'restaurant', 'right', 'said', 'salad', 'sandwich', 'sauce', 'say', 'service', 'staff', 'store', 'sure', 'table', 'thing', 'things', 'think', 'time', 'times', 'took', 'town', 'tried', 'try', 've', 'wait', 'want', 'way', 'went', 'wine', 'work', 'worth', 'years']

As with all other models, more features do not necessarily mean a better model. So, we must tune our feature generator to remove features with little or no predictive capability.

In this case, compared with the 100-feature model above, there is roughly a 1.6% increase in accuracy when we add bigrams and increase max_features 1,000-fold. Note that if we then restrict the vocabulary to unigrams only, the accuracy increases even more! So, the bigrams were very likely adding more noise than signal.

In the end, using only about 16,800 unigram features, we came away with a much smaller, simpler, easier-to-reason-about model that also achieved higher accuracy.

# Include 1-grams and 2-grams, and limit the number of features.
print('1-grams and 2-grams, up to 100K features:')
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)
print()

print('1-grams only, up to 100K features:')
vect = CountVectorizer(ngram_range=(1, 1), max_features=100000)
tokenize_test(vect)
1-grams and 2-grams, up to 100K features: ('Features: ', 100000) ('Accuracy: ', 0.8855185909980431) 1-grams only, up to 100K features: ('Features: ', 16825) ('Accuracy: ', 0.9187866927592955)
  • min_df: Float in range [0.0, 1.0] or int, default=1

  • When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold (also called the cut-off in the literature). If a float, the value represents a proportion of documents; if an integer, an absolute document count.

# Include 1-grams and 2-grams, and only include terms that appear at least two times.
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)
('Features: ', 43957) ('Accuracy: ', 0.9324853228962818)

Introduction to TextBlob

You should already have installed TextBlob, a Python library for exploring common NLP tasks. If you haven't, return to the installation instructions near the top of this notebook. We'll be using it to organize our corpora for analysis.

As mentioned earlier, you can read more on the TextBlob website.


# Print the first review.
print(yelp_best_worst.text[0])
My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I've ever had. Anyway, I can't wait to go back!
# Save it as a TextBlob object.
review = TextBlob(yelp_best_worst.text[0])

# List the words.
review.words
WordList(['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'The', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'was', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'looks', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'and', 'it', 'was', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', 'pieces', 'of', 'their', 'griddled', 'bread', 'with', 'was', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', 'was', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back'])
# List the sentences.
review.sentences
[Sentence("My wife took me here on my birthday for breakfast and it was excellent."), Sentence("The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure."), Sentence("Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning."), Sentence("It looked like the place fills up pretty quickly so the earlier you get here the better."), Sentence("Do yourself a favor and get their Bloody Mary."), Sentence("It was phenomenal and simply the best I've ever had."), Sentence("I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it."), Sentence("It was amazing."), Sentence("While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious."), Sentence("It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete."), Sentence("It was the best "toast" I've ever had."), Sentence("Anyway, I can't wait to go back!")]
# Some string methods are available.
review.lower()
TextBlob("my wife took me here on my birthday for breakfast and it was excellent. the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. our waitress was excellent and our food arrived quickly on the semi-busy saturday morning. it looked like the place fills up pretty quickly so the earlier you get here the better. do yourself a favor and get their bloody mary. it was phenomenal and simply the best i've ever had. i'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. it was amazing. while everything on the menu looks excellent, i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. it was the best "toast" i've ever had. anyway, i can't wait to go back!")

Stemming and Lemmatization

Stemming is a crude process that removes common endings from words, such as "s", "es", "ly", "ing", and "ed".

  • What: Reduce a word to its base/stem/root form.

  • Why: This intelligently reduces the number of features by grouping together (hopefully) related words.

  • Notes:

    • Stemming uses a simple and fast rule-based approach.

    • Stemmed words are usually not shown to users (used for analysis/indexing).

    • Some search engines treat words with the same stem as synonyms.

# Initialize stemmer.
stemmer = SnowballStemmer('english')

# Stem each word.
print([stemmer.stem(word) for word in review.words])
['my', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excel', 'the', 'weather', 'was', 'perfect', 'which', 'made', 'sit', 'outsid', 'overlook', 'their', 'ground', 'an', 'absolut', 'pleasur', 'our', 'waitress', 'was', 'excel', 'and', 'our', 'food', 'arriv', 'quick', 'on', 'the', 'semi-busi', 'saturday', 'morn', 'it', 'look', 'like', 'the', 'place', 'fill', 'up', 'pretti', 'quick', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'bloodi', 'mari', 'it', 'was', 'phenomen', 'and', 'simpli', 'the', 'best', 'i', 've', 'ever', 'had', 'i', "'m", 'pretti', 'sure', 'they', 'onli', 'use', 'ingredi', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'it', 'was', 'amaz', 'while', 'everyth', 'on', 'the', 'menu', 'look', 'excel', 'i', 'had', 'the', 'white', 'truffl', 'scrambl', 'egg', 'veget', 'skillet', 'and', 'it', 'was', 'tasti', 'and', 'delici', 'it', 'came', 'with', '2', 'piec', 'of', 'their', 'griddl', 'bread', 'with', 'was', 'amaz', 'and', 'it', 'absolut', 'made', 'the', 'meal', 'complet', 'it', 'was', 'the', 'best', 'toast', 'i', 've', 'ever', 'had', 'anyway', 'i', 'ca', "n't", 'wait', 'to', 'go', 'back']

Some examples you can see are "excellent" stemmed to "excel" and "amazing" stemmed to "amaz".

Lemmatization is a more refined process that uses specific language and grammar rules to derive the root of a word.

This is useful for words that do not share an obvious root such as "better" and "best".

  • What: Lemmatization derives the canonical form ("lemma") of a word.

  • Why: It usually returns a real dictionary word and can handle irregular forms (e.g., "better" → "good"), so it is often more accurate than stemming.

  • Notes: Uses a dictionary-based approach (slower than stemming).
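Picking up the "better"/"best" example above, here is a small sketch (reusing the SnowballStemmer defined earlier and TextBlob's Word; exact results depend on your WordNet data): lemmatization can map "better" to "good" when we supply the adjective part-of-speech tag, while stemming leaves it untouched.

from textblob import Word

# Lemmatize "better" as an adjective (pos='a'); WordNet typically maps it to "good".
print(Word('better').lemmatize('a'))

# The rule-based stemmer has no notion of irregular forms, so "better" usually comes back unchanged.
print(stemmer.stem('better'))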

# Assume every word is a noun.
print([word.lemmatize() for word in review.words])
['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'wa', 'excellent', 'The', 'weather', 'wa', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'wa', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'wa', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredient', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'wa', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'look', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'egg', 'vegetable', 'skillet', 'and', 'it', 'wa', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', 'piece', 'of', 'their', 'griddled', 'bread', 'with', 'wa', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', 'wa', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']

Some examples: "fills" is lemmatized to "fill", and "was" becomes "wa", an artifact of treating every word as a noun (the noun lemmatizer simply strips a trailing "s").

# Assume every word is a verb.
print([word.lemmatize(pos='v') for word in review.words])
['My', 'wife', 'take', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'be', 'excellent', 'The', 'weather', 'be', 'perfect', 'which', 'make', 'sit', 'outside', 'overlook', 'their', 'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'be', 'excellent', 'and', 'our', 'food', 'arrive', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'look', 'like', 'the', 'place', 'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'be', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'have', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'be', 'amaze', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'look', 'excellent', 'I', 'have', 'the', 'white', 'truffle', 'scramble', 'egg', 'vegetable', 'skillet', 'and', 'it', 'be', 'tasty', 'and', 'delicious', 'It', 'come', 'with', '2', 'piece', 'of', 'their', 'griddle', 'bread', 'with', 'be', 'amaze', 'and', 'it', 'absolutely', 'make', 'the', 'meal', 'complete', 'It', 'be', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'have', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']

Some examples you can see are "was" lemmatized to "be" and "arrived" lemmatized to "arrive".

More Lemmatization and Stemming Examples

Lemmatization        Stemming
shouted → shout      badly → bad
best → good          computing → comput
better → good        computed → comput
good → good          wipes → wip
wiping → wipe        wiped → wip
hidden → hide        wiping → wip

Activity: Knowledge Check

  • What other words or phrases might cause problems with stemming? Why?

  • What other words or phrases might cause problems with lemmatization? Why?
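While you think about these questions, a couple of quick probes can help (reusing the stemmer and TextBlob's Word from above; exact outputs depend on your stemmer and WordNet versions): unrelated words may collapse to the same stem, and lemmatization results change with the assumed part of speech.

# Unrelated words can collapse to the same stem:
for w in ['university', 'universe', 'organization', 'organ']:
    print(w, '->', stemmer.stem(w))

# Lemmatization depends on the part-of-speech tag we assume:
print(Word('meeting').lemmatize())     # as a noun, usually 'meeting'
print(Word('meeting').lemmatize('v'))  # as a verb, usually 'meet'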


# Define a function that accepts text and returns a list of lemmas.
def split_into_lemmas(text):
    text = text.lower()
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]
# Use split_into_lemmas as the feature extraction function (Warning: SLOW!).
vect = CountVectorizer(analyzer=split_into_lemmas, decode_error='replace')
tokenize_test(vect)
('Features: ', 16452) ('Accuracy: ', 0.9207436399217221)
# Last 50 features
print(vect.get_feature_names()[-50:])
['yuyuyummy', 'yuzu', 'z', 'z-grill', 'z11', 'zach', 'zam', 'zanella', 'zankou', 'zappos', 'zatsiki', 'zen', 'zen-like', 'zero', 'zero-star', 'zest', 'zexperience', 'zha', 'zhou', 'zia', 'zilch', 'zin', 'zinburger', 'zinburgergeist', 'zinc', 'zinfandel', 'zing', 'zip', 'zipcar', 'zipper', 'zipps', 'ziti', 'zoe', 'zombi', 'zombie', 'zone', 'zoning', 'zoo', 'zoyo', 'zucca', 'zucchini', 'zuchinni', 'zumba', 'zupa', 'zuzu', 'zwiebel-kräuter', 'zzed', 'éclairs', 'école', 'ém']

With all the available options for CountVectorizer(), you may wonder how to decide which to use! It's true that you can sometimes reason about which preprocessing techniques might work best. However, you will often not know for sure without trying out many different combinations and comparing their accuracies.

Keep in mind that you should be thinking about the result of each preprocessing step rather than applying steps blindly. Does each type of preprocessing make sense for the input data you are using? Is it likely to keep the signal intact and remove noise?

Term Frequency–Inverse Document Frequency (TF–IDF)

While a Count Vectorizer simply totals up the number of times a "word" appears in a document, the more complex TF-IDF Vectorizer analyzes the uniqueness of words between documents to find distinguishing characteristics.

  • What: Term frequency–inverse document frequency (TF–IDF) computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents.

  • Why: It's more useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents).

  • Notes: It's used for search-engine scoring, text summarization, and document clustering.
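As a rough sketch of the weighting (implementations differ in the details; for example, scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each document vector by default), the score for term t in document d is:

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log \frac{N}{\text{df}(t)}

where tf(t, d) counts how often t appears in d, df(t) counts how many documents contain t, and N is the total number of documents.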

# Example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# Term frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
tf

# Document frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names())

# Term frequency–inverse document frequency (simple version)
tf/df

The higher the TF–IDF value, the more "important" the word is to that specific document. Here, "cab" is the most important and distinctive word in the second document ("Call me a cab"), while "please" is the most important and distinctive word in the third ("please call me... PLEASE!"). TF–IDF weights are often used in place of raw word counts when training models.

# TfidfVectorizer
# The IDF term includes a logarithm, which dampens the weight given to words that
# appear in many documents; scikit-learn also L2-normalizes each row by default.
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())

Using TF–IDF to Summarize a Yelp Review

Reddit's autotldr uses the SMMRY algorithm, which is based on TF–IDF.

# Create a document-term matrix using TF–IDF.
vect = TfidfVectorizer(stop_words='english')

# Fit and transform the Yelp data.
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape
(10000, 28880)
features
['00', '000', '007', '00a', '00am', '00pm', '01', '02', '03', '03342', '04', '05', '06', '07', '08', '09', '0buxoc0crqjpvkezo3bqog', '0l', '0tzg', '10', '100', '1000', '1000x', '1001', '100lbs', '100s', '100th', '101', '102', '102729', '1030', '104', '105', '1070', '107f', '108', '109', '10am', '10ish', '10k', '10min', '10mins', '10minutes', '10oz', '10p', '10pm', '10th', '10x', '10yo', '11', '110', '1100', '111', '111th', '112', '113', '1130', '114', '1145', '115', '115th', '116', '117', '118', '11a', '11am', '11p', '11pm', '11th', '11year', '12', '120', '1200', '12000', '1202', '123', '124', '125', '128i', '129', '12a', '12am', '12k', '12oz', '12pm', '12th', '13', '130', '1300', '13331', '135', '13th', '13yr', '14', '140', '147', '149', '14lbs', '15', '150', '1500', '150k', '150mm', '157', '15am', '15ft', '15min', '15mins', '15pm', '15th', '16', '160', '1600', '162', '165', '1664', '169', '16oz', '16th', '16thh', '17', '170', '175', '177', '17cents', '17p', '17th', '18', '180', '1800', '184', '1892', '1895', '1899', '18th', '19', '1900', '1910', '1913', '1920', '1920s', '1926', '1928', '1929', '1930s', '1940', '19485', '1950', '1950s', '1952', '1955', '1956', '1960', '1961', '1962', '1965', '1968', '1969', '1970', '1970s', '1973', '1978', '1980', '1980s', '1983', '1987', '199', '1990', '1990s', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '19chptx8ahwztpc5xiarfq', '19th', '1am', '1b', '1cent', '1hour', '1jzambgdea9yyvasa8rukq', '1k', '1min', '1oz', '1p', '1paote8ys9ujup3u4djhiq', '1pm', '1st', '1star', '20', '200', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '200lb', '200lbs', '201', '2010', '20100421', '2011', '2012', '2013', '202', '205', '20mbs', '20miles', '20min', '20minutes', '20oz', '20pm', '20s', '20th', '20x', '21', '215', '21st', '22', '220', '2240', '22oz', '23', '230', '23a', '23rd', '24', '2400', '24hours', '24hr', '24hrs', '24st', '24th', '25', '250', '2500hd', '25b', '25min', '25oz', '25th', '26', '260', '2600', '2608', '2669', '26oz', '26th', '27', '270', '272', '275', '27th', '28', '284', '2852', '29', '2939', '2983024079', '299', '29th', '2am', '2k', '2lb', '2mbps', '2nd', '2pac', '2pm', '2rd', '2wice', '2x', '2x4', '2xu', '30', '300', '3000', '30a', '30am', '30ish', '30k', '30min', '30mins', '30p', '30pm', '30s', '30something', '30th', '30x', '31', '311', '312', '316', '32', '320', '325', '32nd', '32oz', '33', '33lb', '33rd', '34', '340', '3400', '341', '34th', '35', '350', '3500', '350ib', '35c', '35th', '36', '360', '365', '37', '370', '38', '38th', '39', '399', '39th', '3am', '3chicken', '3d', '3g', '3k', '3l', '3lb', '3lbs', '3n9u549zse8up', '3oz', '3p', '3pm', '3rd', '3s', '3tacos', '3x', '40', '400', '4000', '400s', '40ish', '40k', '40lm', '40min', '40pm', '40s', '40th', '41', '411', '411247', '4113416766', '42', '420', '43', '43rd', '44', '4458', '44oz', '44th', '45', '450', '453990', '45am', '45min', '45mins', '45pm', '45sec', '46', '4655', '46th', '47', '470', '475', '48', '480', '48th', '49', '490', '49er', '49ers', '49lb', '49th', '4am', '4b', '4cxbhzxxtmexf9krjmfviq', '4f', '4hr', '4ish', '4k', '4oz', '4p', '4peaks', '4pm', '4s', '4stars', '4th', '4ths', '4x', '4x6', '50', '500', '500sq', '505', '50cents', '50ft', '50lb', '50lm', '50s', '50x', '51', '5130', '51pm', '51st', '52', '5231', '53', '530', '53pm', '54', '547', '55', '5500', '56', '57', '57th', '58', '59', '59th', '5am', '5gb', '5h', '5ish', '5k', '5l', '5min', '5minutes', '5nzmttmomgj_svixktm51q', '5oz', '5p', '5pm', '5s', '5star', 
'5stars', '5th', '5yo', '60', '600', '6000', '602', '60f', '60miles', '60s', '60th', '61', '61st', '62', '62010', '623', '63', '630am', '64', '64oz', '64th', '65', '66', '6600', '67', '68', '680', '68th', '69', '696', '6am', '6ivet3g9ew', '6k', '6oz', '6p', '6pm', '6th', '6ths', '70', '700', '7000', '70s', '70th', '71', '71st', '72', '730', '730pm', '74', '747', '74th', '75', '750', '755891987', '757', '75cents', '75th', '76', '767', '777', '78', '7807', '79', '7a', '7am', '7ish', '7p', '7pm', '7th', '80', '800', '8000hp', '801', '80f', '80s', '80th', '81', '82', '83', '832', '8330', '83rd', '84', '845', '85', '85154658', '85340', '86', '87', '88', '89', '8am', '8inch', '8ish', '8oz', '8pc', '8pm', '8ppl', '8th', '8v', '8x10', '8x10s', '8xzd9ms7yvnaeavn1irgsq', '8yo', '90', '9000', '90210', '90min', '90minute', '90s', '90yr', '91', '911', '91st', '92', '92nd', '93', '94', '945am', '9495', '95', '96', '96th', '97', '977', '98', '99', '9999', '99cent', '99shillings', '99th', '9am', '9ers', '9ga', '9ish', '9oz', '9p', '9pm', '9th', '9year', '9yo', '_4xhxtuykqnyphmylm', '______', '_______', '_______________', '____berto', '_accommodating', '_affordable', '_c', '_finally_', '_gyib8ea4hdfylss17zc_g', '_l7o1zhq9edno1lhv9b10g', '_reasonable', '_she', '_third_', '_us_', '_very', 'a1', 'a2', 'aa', 'aaa', 'aaaaaalright', 'aaaamazing', 'aaammmazzing', 'aaand', 'aah', 'aand', 'aaron', 'aarp', 'ab', 'aback', 'abacus', 'abandon', 'abandoned', 'abandoning', 'abba', 'abbaye', 'abbey', 'abbreviate', 'abbreviated', 'abbreviations', 'abby', 'abc', 'abdomen', 'abe', 'aberration', 'abhor', 'abides', 'abiding', 'abilities', 'ability', 'abilty', 'abita', 'able', 'abnormally', 'abode', 'abodoba', 'abogado', 'abou', 'abound', 'abrasion', 'abrasive', 'abreast', 'abridged', 'abroad', 'abrupt', 'abruptly', 'abs', 'absence', 'absense', 'absent', 'absinthe', 'abslutely', 'absoloutely', 'absolut', 'absolute', 'absolutely', 'absolutley', 'absolutly', 'absorb', 'absorbed', 'absorption', 'abstain', 'abstained', 'abstract', 'absurd', 'absurdly', 'absynthe', 'abt', 'abuelita', 'abuelo', 'abuelos', 'abundance', 'abundant', 'abundantly', 'abuse', 'abused', 'abusive', 'abysmal', 'ac', 'acacia', 'academie', 'academy', 'acadia', 'acai', 'acapulco', 'accelerometer', 'accent', 'accented', 'accents', 'accept', 'acceptable', 'accepted', 'accepting', 'accepts', 'accesible', 'accesories', 'access', 'accessbile', 'accessed', 'accessibility', 'accessible', 'accessibly', 'accessories', 'accessorize', 'accessory', 'accident', 'accidental', 'accidentally', 'accidentily', 'accidently', 'accidents', 'acclaimed', 'acclimated', 'acclimating', 'accolades', 'accomidating', 'accommodate', 'accommodated', 'accommodates', 'accommodating', 'accommodation', 'accommodations', 'accomodate', 'accomodated', 'accomodates', 'accomodating', 'accomodation', 'accomodations', 'accompanied', 'accompanies', 'accompaniment', 'accompaniments', 'accompany', 'accompanying', 'accomplice', 'accomplish', 'accomplished', 'accomplishment', 'accomplishments', 'according', 'accordingly', 'account', 'accountability', 'accountable', 'accountant', 'accounting', 'accounts', 'accoutrement', 'accoutrements', 'accredited', 'accross', 'accumulate', 'accumulated', 'accuracy', 'accurate', 'accurately', 'accusation', 'accused', 'accustom', 'accustomed', 'accutemp', 'ace', 'aces', 'acess', 'acetone', 'ache', 'aches', 'achieve', 'achieved', 'achievement', 'achieves', 'achieving', 'aching', 'acid', 'acidic', 'acknowledge', 'acknowledged', 'acknowledgement', 'acknowledging', 
'acknowledgment', 'ackward', 'acme', 'acne', 'acommodations', 'acoustic', 'acoustician', 'acoustics', 'acquaintance', 'acquainted', 'acquaintence', 'acquire', 'acquired', 'acquisition', 'acres', 'acrid', 'acrimonious', 'acrylic', 'acrylics', 'act', 'actaully', 'acted', 'acting', 'action', 'actions', 'activation', 'active', 'actively', 'activism', 'activities', 'activity', 'actor', 'actors', 'actress', 'acts', 'actual', 'actuality', 'actually', 'actualy', 'actuators', 'actully', 'acuity', 'acupuncturist', 'acute', 'acxupuncturist', 'acy', 'ad', 'ada', 'adage', 'adam', 'adamant', 'adams', 'adapted', 'adapter', 'adaptive', 'adaquate', 'adaquet', 'add', 'addage', 'addd', 'added', 'addendum', 'addict', 'addicted', 'addicting', 'addictingly', 'addiction', 'addictionovercome', 'addictions', 'addictive', 'addicts', 'addidas', 'adding', 'addition', 'additional', 'additionally', 'additions', 'additive', 'additives', 'address', 'addressed', 'addresses', 'addressing', 'adds', 'addtl', 'ade', 'adelman', 'adequate', 'adequately', 'adhere', 'adhered', 'adherence', 'adhesive', 'adidas', 'adios', 'adjacent', 'adjective', 'adjectives', 'adjoining', 'adjunct', 'adjust', 'adjusted', 'adjusting', 'adjustment', 'adjustments', 'administration', 'administrative', 'administrators', 'admirable', 'admire', 'admired', 'admiring', 'admission', 'admissions', 'admit', 'admits', 'admitted', 'admittedly', 'admitting', 'admonishment', 'ado', 'adobada', 'adobe', 'adobo', 'adolescence', 'adolescent', 'adopt', 'adopted', 'adopting', 'adoption', 'adoptions', 'adorable', 'adorably', 'adorama', 'adoration', 'adore', 'adored', 'adores', 'adorn', 'adorned', 'adorning', 'adovada', 'adquate', 'adrenaline', 'adria', 'adrian', 'adriana', 'adrianne', 'adriatica', 'adrienne', 'ads', 'adult', 'adulthood', 'adults', 'adults_night_out', 'advance', 'advanced', 'advancing', 'advantage', 'advantages', 'advent', 'adventure', 'adventurer', 'adventures', 'adventuresome', 'adventurous', 'adventurousness', 'adverse', 'adversity', 'advertise', 'advertised', 'advertisement', 'advertisements', 'advertises', 'advertising', 'advertisments', 'advertized', 'adverts', 'advice', 'advise', 'advised', 'adviser', 'advising', 'advisor', 'advisors', 'advocate', 'advocated', 'ae', 'aea', 'aeg', 'aegean', 'aerators', 'aerobic', 'aerobics', 'aeropostale', 'aeropress', 'aerosol', 'aesthetic', 'aesthetically', 'aesthetician', 'aestheticians', 'aesthetics', 'afar', 'affair', 'affect', 'affected', 'affectionately', 'affects', 'afficianados', ...]
pd.DataFrame(dtm.todense(), columns=features)
def summarize():
    # Choose a random review that is at least 300 characters.
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = yelp.text[review_id]
        review_length = len(review_text)

    # Create a dictionary of words and their TF–IDF scores.
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]

    # Print the words with the top five TF–IDF scores.
    print('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print(word)

    # Print five random words.
    print('\n' + 'RANDOM WORDS:')
    random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)
    for word in random_words:
        print(word)

    # Print the review.
    print('\n' + review_text)
summarize()
TOP SCORING WORDS:
blanca
frequenting
shell
basket
delicious

RANDOM WORDS:
street
way
sauces
mexican
carne

Wow, love that a place like this moved in down the street. Inexpensive, fresh, delicious - everything I want in a mexican restaurant. My soft shell carne asada taco was fantastic -- freshly prepared and topped with guacamole, which by the way is super creamy. Basket of chips are free, as is the toppings bar -- every one of the 6 or sauces were delicious. I plan on frequenting Salsa Blanca ; )

Sentiment Analysis

Sentiment analysis means estimating how positive or negative a piece of text is. In practice, there are many ways to compute a sentiment value. For example:

  • Have a list of "positive" words and a list of "negative" words and count how many of each occur in a document (see the sketch after this list).

  • Train a classifier given many examples of "positive" documents and "negative" documents.

    • Note that the second technique is often just an automated way of deriving the first: e.g., with bag-of-words and logistic regression, a coefficient is effectively assigned to each word!
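For illustration, a minimal version of the first approach might look like the sketch below; the word lists are made up for this example rather than taken from any library.

# Hypothetical positive/negative word lists (for illustration only).
positive_words = {'excellent', 'amazing', 'delicious', 'great', 'friendly'}
negative_words = {'horrible', 'awful', 'nasty', 'rude', 'bad'}

def word_list_sentiment(text):
    # Positive count minus negative count; above zero suggests a positive document.
    words = TextBlob(text.lower()).words
    return sum(w in positive_words for w in words) - sum(w in negative_words for w in words)

print(word_list_sentiment('The food was amazing but the service was awful.'))  # 0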

For the most accurate sentiment analysis, you will want to train a custom sentiment model based on documents that are particular to your application. Generic models (such as the one we are about to use!) often do not work as well as hoped.

As we will do below, always double-check that the algorithm is working by manually verifying that the scores correspond to genuinely positive/negative reviews. Otherwise, you may be relying on numbers that are not accurate.

print(review)
My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I've ever had. Anyway, I can't wait to go back!
# Polarity ranges from -1 (most negative) to 1 (most positive).
review.sentiment.polarity
0.40246913580246907
# Understanding the apply method
yelp['length'] = yelp.text.apply(len)
yelp.head(1)

# Define a function that accepts text and returns the polarity.
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity

# Create a new DataFrame column for sentiment (Warning: SLOW!).
yelp['sentiment'] = yelp.text.apply(detect_sentiment)

# Box plot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')
<matplotlib.axes._subplots.AxesSubplot at 0x10c49f978>
Image in a Jupyter notebook
# Reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()
254 Our server Gary was awesome. Food was amazing.... 347 3 syllables for this place. \nA-MAZ-ING!\n\nTh... 420 LOVE the food!!!! 459 Love it!!! Wish we still lived in Arizona as C... 679 Excellent burger Name: text, dtype: object
# Reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()
773 This was absolutely horrible. I got the suprem... 1517 Nasty workers and over priced trash 3266 Absolutely awful... these guys have NO idea wh... 4766 Very bad food! 5812 I wouldn't send my worst enemy to this place. Name: text, dtype: object
# Widen the column display.
pd.set_option('max_colwidth', 500)

# Negative sentiment in a 5-star review
yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head(1)

# Positive sentiment in a 1-star review
yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head(1)

# Reset the column display width.
pd.reset_option('max_colwidth')

Bonus: Adding Features to a Document-Term Matrix

Here, we will add additional features to our CountVectorizer()-generated feature set to hopefully improve our model.

To make the best models, you will want to supplement the auto-generated features with new features you think might be important. After all, CountVectorizer() typically lowercases text and removes all associations between words. Or, you may have metadata to add in addition to just the text.

Remember: Although you may have hundreds of thousands of features, each data point is extremely sparse. So, if you add in a new feature, e.g., one that detects if the text is all capital letters, this new feature can still have a huge effect on the model outcome!
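For example, a hypothetical text-derived feature such as the fraction of uppercase characters per review (the upper_frac column below is invented for this sketch) could be computed with apply and then hstacked onto the document-term matrix exactly like the metadata columns in the cells that follow:

# Hypothetical extra feature: fraction of uppercase characters in each review.
yelp['upper_frac'] = yelp.text.apply(
    lambda t: sum(c.isupper() for c in t) / max(len(t), 1)
)
yelp[['text', 'upper_frac']].head(1)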

# Create a DataFrame that only contains the 5-star and 1-star reviews.
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# Define X and y.
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp_best_worst[feature_cols]
y = yelp_best_worst.stars

# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Use CountVectorizer with the text column only.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)
print(X_train_dtm.shape)
print(X_test_dtm.shape)
(3064, 16825) (1022, 16825)
# Shape of the other four feature columns
X_train.drop('text', axis=1).shape
(3064, 4)
# Cast the other feature columns to float and convert to a sparse matrix.
extra = sp.sparse.csr_matrix(X_train.drop('text', axis=1).astype(float))
extra.shape
(3064, 4)
# Combine the sparse matrices.
X_train_dtm_extra = sp.sparse.hstack((X_train_dtm, extra))
X_train_dtm_extra.shape
(3064, 16829)
# Repeat for the testing set.
extra = sp.sparse.csr_matrix(X_test.drop('text', axis=1).astype(float))
X_test_dtm_extra = sp.sparse.hstack((X_test_dtm, extra))
X_test_dtm_extra.shape
(1022, 16829)
# Use logistic regression with the text column only.
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))
0.9178082191780822
# Use logistic regression with all features.
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm_extra, y_train)
y_pred_class = logreg.predict(X_test_dtm_extra)
print(metrics.accuracy_score(y_test, y_pred_class))
0.9227005870841487

Bonus: Fun TextBlob Features

# Spelling correction
TextBlob('15 minuets late').correct()
TextBlob("15 minutes late")
# Spellcheck
Word('parot').spellcheck()
[('part', 0.9929478138222849), ('parrot', 0.007052186177715092)]
# Definitions
Word('bank').define('v')
['tip laterally', 'enclose with a bank', 'do business with a bank or keep an account at a bank', 'act as the banker in a game or in gambling', 'be in the banking business', 'put into a bank account', 'cover with ashes so to control the rate of burning', 'have confidence or faith in']
# Language identification
TextBlob('Hola amigos').detect_language()
'es'

Appendix: Intro to Naive Bayes and Text Classification

Later in the course, we will explore in depth how to use the Naive Bayes classifier with text. Naive Bayes is a very popular classifier because it has minimal storage requirements, is fast, can be updated easily with more data, and has proven very useful for text classification. For example, Paul Graham originally proposed using Naive Bayes to detect spam in his essay "A Plan for Spam".

Earlier we experimented with text classification using a Naive Bayes model. What exactly are Naive Bayes classifiers?

What is Bayes? Bayes, or Bayes' Theorem, is a different way to assess probability. It considers prior information in order to more accurately assess the situation.

Example: You are playing roulette.

As you approach the table, you see that the last number the ball landed on was Red-3. With a frequentist mindset, you know that the ball is just as likely to land on Red-3 again given that every slot on the wheel has an equal opportunity of 1 in 37.

Because you started out believing that the ball lands in each slot with equal likelihood, and you have only seen one spin so far, you rationally believe there is no difference between betting on red again or betting on black -- in principle, they should come up with the same likelihood!

However, as you sit and watch the roulette table, you begin to notice something strange. The ball is always landing on red. Every single time the ball is thrown, it lands in a red slot. Even though your past beliefs stated that red and black were equally likely, every time it lands in red, you change those beliefs a little more towards a biased roulette table.

This is what Bayes is all about — adjusting probabilities as more data is gathered!

Below is the equation for Bayes.

P(A \ | \ B) = \frac{P(B \ | \ A) \times P(A)}{P(B)}

  • P(A \ | \ B): Probability of Event A occurring, given that Event B has occurred.

  • P(B \ | \ A): Probability of Event B occurring, given that Event A has occurred.

  • P(A): Probability of Event A occurring.

  • P(B): Probability of Event B occurring.

Applying Naive Bayes Classification to Spam Filtering

Let's pretend we have an email with three words: "Send money now." We'll use Naive Bayes to classify it as ham or spam. ("Ham" just means not spam. It can include emails that look like spam but that you opt into!)

P(spam \ | \ \text{send money now}) = \frac{P(\text{send money now} \ | \ spam) \times P(spam)}{P(\text{send money now})}

By assuming that the features (the words) are conditionally independent, we can simplify the likelihood function:

P(spam \ | \ \text{send money now}) \approx \frac{P(\text{send} \ | \ spam) \times P(\text{money} \ | \ spam) \times P(\text{now} \ | \ spam) \times P(spam)}{P(\text{send money now})}

Note that each conditional probability in the numerator is easily calculated directly from the training data!

So, we can calculate all of the values in the numerator by examining a corpus of spam email:

P(spam \ | \ \text{send money now}) \approx \frac{0.2 \times 0.1 \times 0.1 \times 0.9}{P(\text{send money now})} = \frac{0.0018}{P(\text{send money now})}

We would repeat this process with a corpus of ham email:

P(ham \ | \ \text{send money now}) \approx \frac{0.05 \times 0.01 \times 0.1 \times 0.1}{P(\text{send money now})} = \frac{0.000005}{P(\text{send money now})}

All we care about is whether spam or ham has the higher probability, and so we predict that the email is spam.
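As a quick sanity check of the arithmetic (ignoring the shared denominator, which is the same for both classes), the comparison boils down to a couple of multiplications; the probabilities here are the made-up values from the example, not estimates from real data.

# Numerators only: the P(word | class) terms times the class prior (toy values from the example).
p_spam = 0.2 * 0.1 * 0.1 * 0.9    # P(send|spam) * P(money|spam) * P(now|spam) * P(spam)
p_ham = 0.05 * 0.01 * 0.1 * 0.1   # P(send|ham)  * P(money|ham)  * P(now|ham)  * P(ham)

print(p_spam, p_ham)                        # roughly 0.0018 vs. 0.000005
print('spam' if p_spam > p_ham else 'ham')  # 'spam'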

Key Takeaways

  • The "naive" assumption of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.

  • The normalization constant (the denominator) can be ignored since it's the same for all classes.

  • The prior probability is much less relevant once you have a lot of features.

Comparing Naive Bayes With Other Models

Advantages of Naive Bayes:

  • Model training and prediction are very fast.

  • It's somewhat interpretable.

  • No tuning is required.

  • Features don't need scaling.

  • It's insensitive to irrelevant features (with enough observations).

  • It performs better than logistic regression when the training set is very small.

Disadvantages of Naive Bayes:

  • If "spam" is dependent on non-independent combinations of individual words, it may not work well.

  • Predicted probabilities are not well calibrated.

  • Correlated features can be problematic (due to the independence assumption).

  • Multinomial Naive Bayes can't handle negative feature values.

  • It has a higher "asymptotic error" than logistic regression.
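As an optional, hedged comparison using the matrices built in the bonus section above (X_train_dtm, X_test_dtm, y_train, y_test), a Multinomial Naive Bayes baseline takes only a few lines; it also shows the poorly calibrated probabilities mentioned above, which tend to pile up near 0 and 1.

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# Accuracy on the held-out reviews.
print(metrics.accuracy_score(y_test, nb.predict(X_test_dtm)))

# Predicted probabilities are often extreme (very close to 0 or 1).
print(nb.predict_proba(X_test_dtm)[:5])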


Conclusion

  • NLP is a gigantic field.

  • Understanding the basics broadens the types of data you can work with.

  • Simple techniques go a long way.

  • Use scikit-learn for NLP whenever possible.

While we used scikit-learn and TextBlob today, another popular Python NLP library is spaCy.