
Natural Language Processing (NLP) Review Lab

Author: Joseph Nelson (DC)


Note: This lab is intended to be done as a walkthrough with the instructor.

Introduction

Adapted from NLP Crash Course by Charlie Greenbacker, Introduction to NLP by Dan Jurafsky, and Kevin Markham's Data School curriculum

What is NLP?

  • Using computers to process (analyze, understand, generate) natural human languages

  • Most knowledge created by humans is unstructured text, and we need a way to make sense of it

  • Build probabilistic models of a language using data

What are some of the higher-level task areas?

What are some of the lower-level components?

  • Tokenization: breaking text into tokens (words, sentences, n-grams)

  • Stopword removal: a/an/the

  • Stemming and lemmatization: root word

  • TF-IDF: word importance

  • Part-of-speech tagging: noun/verb/adjective

  • Named entity recognition: person/organization/location

  • Spelling correction: "New Yrok City"

  • Word sense disambiguation: "buy a mouse"

  • Segmentation: "New York City subway"

  • Language detection: "translate this page"

  • Machine learning
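Several of these components can be previewed in a few lines. A minimal sketch using TextBlob (assuming TextBlob and its corpora are installed; the example sentence is invented for illustration):

# preview a few lower-level components with TextBlob
from textblob import TextBlob

blob = TextBlob("The cats are running in New York City.")
print(blob.words)                 # tokenization into words
print(blob.tags)                  # part-of-speech tagging, e.g. ('cats', 'NNS')
print(blob.words[1].lemmatize())  # lemmatization: 'cats' -> 'cat'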

Why is NLP hard?

  • Ambiguity:

    • Hospitals are Sued by 7 Foot Doctors

    • Juvenile Court to Try Shooting Defendant

    • Local High School Dropouts Cut in Half

  • Non-standard English: text messages

  • Idioms: "throw in the towel"

  • Newly coined words: "retweet"

  • Tricky entity names: "Where is A Bug's Life playing?"

  • World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"

NLP requires an understanding of the language and the world.

Part 1: Reading in the Yelp Reviews

  • "corpus" = collection of documents

  • "corpora" = plural form of corpus

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
Note: if this cell fails with ModuleNotFoundError: No module named 'textblob', install TextBlob and download its corpora first (pip install textblob, then python -m textblob.download_corpora), and restart the kernel.
csv_file = '../../data/yelp.csv'
yelp = pd.read_csv(csv_file)
yelp.head(3)

1.1 Subset the reviews to best and worst.

  • Select only 5-star and 1-star reviews.

  • The text will be the features, the stars will be the target.

  • Create a train-test split.

# read yelp.csv into a DataFrame
yelp = pd.read_csv(csv_file)

# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# inspect one review (label-based indexing; .ix is deprecated, use .loc)
yelp_best_worst.loc[1].text
'I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\n\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we\'ll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small "Here\'s The Beef" pizza so we can both try them. The calzone was huge and we got the smallest one (personal) and got the small 11" pizza. Both were awesome! My friend liked the pizza better and I liked the calzone better. The calzone does have a sweetish sauce but that\'s how I like my sauce!\n\nWe had to box part of the pizza to take it home and we were out the door by 6:42. So, everything was great and not like these bad reviewers. That goes to show you that you have to try these things yourself because all these bad reviewers have some serious issues.'
print(len(X_train))
print(len(X_test))
3064
1022

Part 2: Tokenization

  • What: Separate text into units such as sentences or words

  • Why: Gives structure to previously unstructured text

  • Notes: Relatively easy with English-language text, much harder for some other languages
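By default, scikit-learn's CountVectorizer tokenizes with the regular expression (?u)\b\w\w+\b, which (after lowercasing) drops punctuation and keeps only tokens of two or more word characters. A quick way to inspect that behavior (a sketch; the sample string is invented):

# build_analyzer() returns the preprocessing + tokenization callable
analyzer = CountVectorizer().build_analyzer()
print(analyzer("Don't stop believin'!"))   # ['don', 'stop', 'believin']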

2.1 Use CountVectorizer to convert the training and testing text data.

CountVectorizer documentation

  • lowercase: boolean, True by default

    • Convert all characters to lowercase before tokenizing.

  • ngram_range: tuple (min_n, max_n)

    • The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
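As a toy illustration of the effect of ngram_range (the single document here is invented):

# ngram_range=(1, 2) extracts unigrams plus bigrams
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(['the quick brown fox'])
print(vect.get_feature_names())
# ['brown', 'brown fox', 'fox', 'quick', 'quick brown', 'the', 'the quick']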

# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
# rows are documents, columns are terms (aka "tokens" or "features")
X_train_dtm.shape
(3064, 16825)
# last 50 features
print(vect.get_feature_names()[-50:])
[u'yyyyy', u'z11', u'za', u'zabba', u'zach', u'zam', u'zanella', u'zankou', u'zappos', u'zatsiki', u'zen', u'zero', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zihuatenejo', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zippers', u'zipps', u'ziti', u'zoe', u'zombi', u'zombies', u'zone', u'zones', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel', u'zzed', u'\xe9clairs', u'\xe9cole', u'\xe9m']
# show vectorizer options
vect
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
(3064, 20838)
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape
(3064, 169847)
# last 50 features
print(vect.get_feature_names()[-50:])
[u'zone out', u'zone when', u'zones', u'zones dolls', u'zoning', u'zoning issues', u'zoo', u'zoo and', u'zoo is', u'zoo not', u'zoo the', u'zoo ve', u'zoyo', u'zoyo for', u'zucca', u'zucca appetizer', u'zucchini', u'zucchini and', u'zucchini bread', u'zucchini broccoli', u'zucchini carrots', u'zucchini fries', u'zucchini pieces', u'zucchini strips', u'zucchini veal', u'zucchini very', u'zucchini with', u'zuchinni', u'zuchinni again', u'zuchinni the', u'zumba', u'zumba class', u'zumba or', u'zumba yogalates', u'zupa', u'zupa flavors', u'zuzu', u'zuzu in', u'zuzu is', u'zuzu the', u'zwiebel', u'zwiebel kr\xe4uter', u'zzed', u'zzed in', u'\xe9clairs', u'\xe9clairs napoleons', u'\xe9cole', u'\xe9cole len\xf4tre', u'\xe9m', u'\xe9m all']

2.2 Predict the star rating with the new features from CountVectorizer.

Validate on the test set.

# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))
0.918786692759
# calculate null accuracy
y_test_binary = np.where(y_test==5, 1, 0)
max(y_test_binary.mean(), 1 - y_test_binary.mean())
0.81996086105675148
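Null accuracy is the accuracy achieved by always predicting the dominant class, so it is the baseline any model must beat; the 0.92 above comfortably clears this 0.82 baseline. An equivalent pandas one-liner (a sketch):

# null accuracy is just the proportion of the majority class
y_test.value_counts(normalize=True).max()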
# define a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)
Features: 169847
Accuracy: 0.854207436399

Part 3: Stopword Removal

  • What: Remove common words that will likely appear in any text

  • Why: They don't tell you much about your text

3.1 Recreate your features with CountVectorizer removing stopwords.

  • stop_words: string {'english'}, list, or None (default)

    • If 'english', a built-in stop word list for English is used.

    • If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.

    • If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.

# show vectorizer options
vect
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 2), preprocessor=None, stop_words=None, strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
# remove English stop words
vect = CountVectorizer(stop_words='english')

3.2 Validate your model using the features with stopwords removed.

tokenize_test(vect)
Features: 16528
Accuracy: 0.915851272016
# set of stop words
print(vect.get_stop_words())
frozenset(['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'whom', 'seeming', 'under', 'ours', 'has', 'might', 'thereafter', 'latterly', 'do', 'them', 'his', 'around', 'than', 'get', 'very', 'de', 'none', 'cannot', 'every', 'whether', 'they', 'front', 'during', 'thus', 'now', 'him', 'nor', 'name', 'several', 'hereafter', 'always', 'who', 'cry', 'whither', 'this', 'someone', 'either', 'each', 'become', 'thereupon', 'sometime', 'side', 'two', 'therein', 'twelve', 'because', 'often', 'ten', 'our', 'eg', 'some', 'back', 'up', 'namely', 'towards', 'are', 'further', 'beyond', 'ourselves', 'yet', 'out', 'even', 'will', 'what', 'still', 'for', 'bottom', 'mine', 'since', 'please', 'forty', 'per', 'its', 'everything', 'behind', 'un', 'above', 'between', 'it', 'neither', 'seemed', 'ever', 'across', 'she', 'somehow', 'be', 'we', 'full', 'never', 'sixty', 'however', 'here', 'otherwise', 'were', 'whereupon', 'nowhere', 'although', 'found', 'alone', 're', 'along', 'fifteen', 'by', 'both', 'about', 'last', 'would', 'anything', 'via', 'many', 'could', 'thence', 'put', 'against', 'keep', 'etc', 'amount', 'became', 'ltd', 'hence', 'onto', 'or', 'con', 'among', 'already', 'co', 'afterwards', 'formerly', 'within', 'seems', 'into', 'others', 'while', 'whatever', 'except', 'down', 'hers', 'everyone', 'done', 'least', 'another', 'whoever', 'moreover', 'couldnt', 'throughout', 'anyhow', 'yourself', 'three', 'from', 'her', 'few', 'together', 'top', 'there', 'due', 'been', 'next', 'anyone', 'eleven', 'much', 'call', 'therefore', 'interest', 'then', 'thru', 'themselves', 'hundred', 'was', 'sincere', 'empty', 'more', 'himself', 'elsewhere', 'mostly', 'on', 'fire', 'am', 'becoming', 'hereby', 'amongst', 'else', 'part', 'everywhere', 'too', 'herself', 'former', 'those', 'he', 'me', 'myself', 'made', 'twenty', 'these', 'bill', 'cant', 'us', 'until', 'besides', 'nevertheless', 'below', 'anywhere', 'nine', 'can', 'of', 'your', 'toward', 'my', 'something', 'and', 'whereafter', 'whenever', 'give', 'almost', 'wherever', 'is', 'describe', 'beforehand', 'herein', 'an', 'as', 'itself', 'at', 'have', 'in', 'seem', 'whence', 'ie', 'any', 'fill', 'again', 'hasnt', 'inc', 'thereby', 'thin', 'no', 'perhaps', 'latter', 'meanwhile', 'when', 'detail', 'same', 'wherein', 'beside', 'also', 'that', 'other', 'take', 'which', 'becomes', 'you', 'if', 'nobody', 'see', 'though', 'may', 'after', 'upon', 'most', 'hereupon', 'eight', 'but', 'serious', 'nothing', 'such', 'why', 'a', 'off', 'whereby', 'third', 'i', 'whole', 'noone', 'sometimes', 'well', 'amoungst', 'yours', 'their', 'rather', 'without', 'so', 'five', 'the', 'first', 'whereas', 'once'])
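stop_words also accepts a custom list, which isn't demonstrated above; a sketch extending the built-in English list with hypothetical domain-specific words:

# extend the built-in English list with (hypothetical) domain-specific stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom_stop_words = list(ENGLISH_STOP_WORDS) + ['food', 'place']
vect = CountVectorizer(stop_words=custom_stop_words)
tokenize_test(vect)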

Part 4: Other CountVectorizer Options

4.1 Shrink the maximum number of features and re-test the model.

  • max_features: int or None, default=None

    • If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)
Features: 100
Accuracy: 0.869863013699
# all 100 features
print(vect.get_feature_names())
[u'amazing', u'area', u'atmosphere', u'awesome', u'bad', u'bar', u'best', u'better', u'big', u'came', u'cheese', u'chicken', u'clean', u'coffee', u'come', u'day', u'definitely', u'delicious', u'did', u'didn', u'dinner', u'don', u'eat', u'excellent', u'experience', u'favorite', u'feel', u'food', u'free', u'fresh', u'friendly', u'friends', u'going', u'good', u'got', u'great', u'happy', u'home', u'hot', u'hour', u'just', u'know', u'like', u'little', u'll', u'location', u'long', u'looking', u'lot', u'love', u'lunch', u'make', u'meal', u'menu', u'minutes', u'need', u'new', u'nice', u'night', u'order', u'ordered', u'people', u'perfect', u'phoenix', u'pizza', u'place', u'pretty', u'prices', u'really', u'recommend', u'restaurant', u'right', u'said', u'salad', u'sandwich', u'sauce', u'say', u'service', u'staff', u'store', u'sure', u'table', u'thing', u'things', u'think', u'time', u'times', u'took', u'town', u'tried', u'try', u've', u'wait', u'want', u'way', u'went', u'wine', u'work', u'worth', u'years']
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)
Features: 100000
Accuracy: 0.885518590998

4.2 Change the minimum document frequency for terms and test the model's performance.

  • min_df: float in range [0.0, 1.0] or int, default=1

    • When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute count. (The float form is sketched after the example below.)

# include 1-grams and 2-grams, and only include terms that appear at least 2 times
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)
Features: 43957
Accuracy: 0.932485322896
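As noted above, min_df can also be given as a float; a sketch keeping only terms that appear in at least 1% of the training documents:

# min_df as a proportion: ignore terms appearing in fewer than 1% of documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=0.01)
tokenize_test(vect)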

Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

5.1 Use TextBlob to convert the text in the first review in the dataset.

# print the first review
print(yelp_best_worst.text[0])
My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I've ever had. Anyway, I can't wait to go back!
# save it as a TextBlob object
review = TextBlob(yelp_best_worst.text[0])

5.2 List the words in the TextBlob object.

# list the words
review.words
WordList(['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent', 'The', 'weather', 'was', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure', 'Our', 'waitress', 'was', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', 'fills', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', 'was', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', 'was', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', 'looks', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'and', 'it', 'was', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', 'pieces', 'of', 'their', 'griddled', 'bread', 'with', 'was', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', 'was', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back'])

5.3 List the sentences in the TextBlob object.

# list the sentences
review.sentences
[Sentence("My wife took me here on my birthday for breakfast and it was excellent."), Sentence("The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure."), Sentence("Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning."), Sentence("It looked like the place fills up pretty quickly so the earlier you get here the better."), Sentence("Do yourself a favor and get their Bloody Mary."), Sentence("It was phenomenal and simply the best I've ever had."), Sentence("I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it."), Sentence("It was amazing."), Sentence("While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious."), Sentence("It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete."), Sentence("It was the best "toast" I've ever had."), Sentence("Anyway, I can't wait to go back!")]
# some string methods are available
review.lower()
TextBlob("my wife took me here on my birthday for breakfast and it was excellent. the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. our waitress was excellent and our food arrived quickly on the semi-busy saturday morning. it looked like the place fills up pretty quickly so the earlier you get here the better. do yourself a favor and get their bloody mary. it was phenomenal and simply the best i've ever had. i'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. it was amazing. while everything on the menu looks excellent, i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. it was the best "toast" i've ever had. anyway, i can't wait to go back!")

Part 6: Stemming and Lemmatization

Stemming:

  • What: Reduce a word to its base/stem/root form

  • Why: Often makes sense to treat related words the same way

  • Notes:

    • Uses a "simple" and fast rule-based approach

    • Stemmed words are usually not shown to users (used for analysis/indexing)

    • Some search engines treat words with the same stem as synonyms

6.1 Initialize the SnowballStemmer and stem the words in the first review.

# initialize stemmer
stemmer = SnowballStemmer('english')

# stem each word
print([stemmer.stem(word) for word in review.words])
[u'my', u'wife', u'took', u'me', u'here', u'on', u'my', u'birthday', u'for', u'breakfast', u'and', u'it', u'was', u'excel', u'the', u'weather', u'was', u'perfect', u'which', u'made', u'sit', u'outsid', u'overlook', u'their', u'ground', u'an', u'absolut', u'pleasur', u'our', u'waitress', u'was', u'excel', u'and', u'our', u'food', u'arriv', u'quick', u'on', u'the', u'semi-busi', u'saturday', u'morn', u'it', u'look', u'like', u'the', u'place', u'fill', u'up', u'pretti', u'quick', u'so', u'the', u'earlier', u'you', u'get', u'here', u'the', u'better', u'do', u'yourself', u'a', u'favor', u'and', u'get', u'their', u'bloodi', u'mari', u'it', u'was', u'phenomen', u'and', u'simpli', u'the', u'best', u'i', u've', u'ever', u'had', u'i', u"'m", u'pretti', u'sure', u'they', u'onli', u'use', u'ingredi', u'from', u'their', u'garden', u'and', u'blend', u'them', u'fresh', u'when', u'you', u'order', u'it', u'it', u'was', u'amaz', u'while', u'everyth', u'on', u'the', u'menu', u'look', u'excel', u'i', u'had', u'the', u'white', u'truffl', u'scrambl', u'egg', u'veget', u'skillet', u'and', u'it', u'was', u'tasti', u'and', u'delici', u'it', u'came', u'with', u'2', u'piec', u'of', u'their', u'griddl', u'bread', u'with', u'was', u'amaz', u'and', u'it', u'absolut', u'made', u'the', u'meal', u'complet', u'it', u'was', u'the', u'best', u'toast', u'i', u've', u'ever', u'had', u'anyway', u'i', u'ca', u"n't", u'wait', u'to', u'go', u'back']

6.2 Use the built-in lemmatize function on the words of the first review (parsed by TextBlob)

Lemmatization

  • What: Derive the canonical form ('lemma') of a word

  • Why: Can be better than stemming

  • Notes: Uses a dictionary-based approach (slower than stemming)

# assume every word is a noun
print([word.lemmatize() for word in review.words])
['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'wa', 'excellent', 'The', 'weather', u'wa', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'wa', 'excellent', 'and', 'our', 'food', 'arrived', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', 'looked', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'wa', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', 'had', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', u'ingredient', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'wa', 'amazing', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', 'had', 'the', 'white', 'truffle', 'scrambled', u'egg', 'vegetable', 'skillet', 'and', 'it', u'wa', 'tasty', 'and', 'delicious', 'It', 'came', 'with', '2', u'piece', 'of', 'their', 'griddled', 'bread', 'with', u'wa', 'amazing', 'and', 'it', 'absolutely', 'made', 'the', 'meal', 'complete', 'It', u'wa', 'the', 'best', 'toast', 'I', "'ve", 'ever', 'had', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']
# assume every word is a verb
print([word.lemmatize(pos='v') for word in review.words])
['My', 'wife', u'take', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', u'be', 'excellent', 'The', 'weather', u'be', 'perfect', 'which', u'make', u'sit', 'outside', u'overlook', 'their', u'ground', 'an', 'absolute', 'pleasure', 'Our', 'waitress', u'be', 'excellent', 'and', 'our', 'food', u'arrive', 'quickly', 'on', 'the', 'semi-busy', 'Saturday', 'morning', 'It', u'look', 'like', 'the', 'place', u'fill', 'up', 'pretty', 'quickly', 'so', 'the', 'earlier', 'you', 'get', 'here', 'the', 'better', 'Do', 'yourself', 'a', 'favor', 'and', 'get', 'their', 'Bloody', 'Mary', 'It', u'be', 'phenomenal', 'and', 'simply', 'the', 'best', 'I', "'ve", 'ever', u'have', 'I', "'m", 'pretty', 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'their', 'garden', 'and', 'blend', 'them', 'fresh', 'when', 'you', 'order', 'it', 'It', u'be', u'amaze', 'While', 'EVERYTHING', 'on', 'the', 'menu', u'look', 'excellent', 'I', u'have', 'the', 'white', 'truffle', u'scramble', u'egg', 'vegetable', 'skillet', 'and', 'it', u'be', 'tasty', 'and', 'delicious', 'It', u'come', 'with', '2', u'piece', 'of', 'their', u'griddle', 'bread', 'with', u'be', u'amaze', 'and', 'it', 'absolutely', u'make', 'the', 'meal', 'complete', 'It', u'be', 'the', 'best', 'toast', 'I', "'ve", 'ever', u'have', 'Anyway', 'I', 'ca', "n't", 'wait', 'to', 'go', 'back']

6.3 Write a function that uses TextBlob and lemmatize to lemmatize text.

# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    text = text.lower()
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]

6.4 Provide your function to CountVectorizer as the analyzer and test the performance of your model.

# use split_into_lemmas as the feature extraction function (WARNING: SLOW!)
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)
Features: 16452
Accuracy: 0.920743639922
# last 50 features
print(vect.get_feature_names()[-50:])
[u'yuyuyummy', u'yuzu', u'z', u'z-grill', u'z11', u'zach', u'zam', u'zanella', u'zankou', u'zappos', u'zatsiki', u'zen', u'zen-like', u'zero', u'zero-star', u'zest', u'zexperience', u'zha', u'zhou', u'zia', u'zilch', u'zin', u'zinburger', u'zinburgergeist', u'zinc', u'zinfandel', u'zing', u'zip', u'zipcar', u'zipper', u'zipps', u'ziti', u'zoe', u'zombi', u'zombie', u'zone', u'zoning', u'zoo', u'zoyo', u'zucca', u'zucchini', u'zuchinni', u'zumba', u'zupa', u'zuzu', u'zwiebel-kr\xe4uter', u'zzed', u'\xe9clairs', u'\xe9cole', u'\xe9m']

Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

  • What: Computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents

  • Why: More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)

  • Notes: Used for search engine scoring, text summarization, document clustering

7.1 Build a simple TF-IDF using CountVectorizer

  • Term Frequency can be calculated with the default CountVectorizer.

  • Document Frequency can be calculated with CountVectorizer and the argument binary=True (summing each column gives the number of documents containing the term); dividing TF by DF then yields a simple TF-IDF.

# example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
# Term Frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
tf
# Document Frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names())
# Term Frequency-Inverse Document Frequency (simple version)
tf/df
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())
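The TfidfVectorizer numbers differ from the simple tf/df above because scikit-learn, under its defaults (smooth_idf=True, sublinear_tf=False, norm='l2'), uses a smoothed, log-scaled idf and then L2-normalizes each row. A sketch reproducing its output from the tf and df computed above:

# reproduce TfidfVectorizer's defaults: idf = ln((1 + n) / (1 + df)) + 1, then L2-normalize rows
n_docs = len(simple_train)
idf = np.log((1 + n_docs) / (1 + df)) + 1
tfidf = tf.values * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
pd.DataFrame(tfidf, columns=tf.columns)  # should match the TfidfVectorizer output above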

Part 8: Using TF-IDF to Summarize a Yelp Review

Note: Reddit's autotldr uses the SMMRY algorithm, which is based on TF-IDF!

8.1 Build a TF-IDF predictor matrix excluding stopwords with TfidfVectorizer

# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape
(10000, 28880)

8.2 Write a function to pull out the top 5 words by TF-IDF score from a review

def summarize():

    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = yelp.text[review_id]
        review_length = len(review_text)

    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]

    # print words with the top 5 TF-IDF scores
    print('TOP SCORING WORDS:')
    top_scores = sorted(list(word_scores.items()), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print(word)

    # print 5 random words
    print('\n' + 'RANDOM WORDS:')
    random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)
    for word in random_words:
        print(word)

    # print the review
    print('\n' + review_text)
summarize()
TOP SCORING WORDS:
philly
casella
33rd
family
celebrating

RANDOM WORDS:
fool
lbs
immediately
deli
oven

I was inspired to visit Casella's after a coworker told me about their authentically Italian goodness. As my tummy growled today, I was in the neighborhood so I knew just where to go. I walked into a simple yet clean deli. Immediately, I was welcomed by the entire staff (I believe they are all related or at least family through friendship). They asked me if it was my first time in and I said yes, so right away they started giving me a history of the shop and the tasty cuisine. Celebrating their 33rd anniversary (congrats!!), I learned that everything is homemade -- from the meatballs to the chicken salad. The owner said he goes through 40 lbs of chicken a week. He gave me a sample and the chicken was chunky and delicious. I'm not surprised he has to prepare so much chicken every week! The meatballs and freshly cut meat for Philly Cheese-steak sandwiches looked great too! The owner mentioned he is from Philly so better be ready for some authentic Philly and Italian style cookin'! While I was there, the owner was slicing some fresh provolone, so I added that to the warmed turkey sandwich I ordered. The turkey was tasty, the provolone was yummy and the sourdough roll tasted like it was fresh from the oven. It was $8 for a sandwich and soda but the sandwich was hefty enough that I saved half for later. As I walked out, everyone said goodbye to me and I felt like I was leaving a family dinner. In summary, here's what made Casella's stand out: - The very friendly staff - It's a local family-owned neighborhood deli celebrating their 33rd anniversary. Gotta support the local businesses! - Clean decor - Tasty, homemade food that's on display for visitors to see - A good family-style atmosphere that greets you the second you walk inside. I'm not sure how it's taken my so long to find this local treat, but now that I've found it, I plan on becoming a regular. Directions: Note that this is hidden in the Basha's strip mall (on the side of Granite Reef) but don't let that fool you!

Part 9: Sentiment Analysis

9.1 Extract sentiment from a review parsed with TextBlob

Sentiment polarity ranges from -1 (most negative) to 1 (most positive). A parsed TextBlob object has a sentiment property whose polarity can be accessed with:

review.sentiment.polarity
print(review)
My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better. Do yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing. While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best "toast" I've ever had. Anyway, I can't wait to go back!
# polarity ranges from -1 (most negative) to 1 (most positive)
review.sentiment.polarity
0.40246913580246907
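The sentiment property is a namedtuple that also carries a subjectivity score alongside polarity:

# subjectivity ranges from 0.0 (very objective) to 1.0 (very subjective)
review.sentiment.subjectivity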

9.2 Calculate the sentiment for every review in the full Yelp dataset as a new column.

# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity
# create a new DataFrame column for sentiment (WARNING: SLOW!)
yelp['sentiment'] = yelp.text.apply(detect_sentiment)
yelp.columns
Index([u'business_id', u'date', u'review_id', u'stars', u'text', u'type', u'user_id', u'cool', u'useful', u'funny', u'sentiment'], dtype='object')
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

9.3 Create a boxplot of sentiment by star rating

# box plot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')
<matplotlib.axes._subplots.AxesSubplot at 0x11a4254d0>
[Figure: boxplot of sentiment polarity grouped by star rating]

9.4 Print reviews with the highest and lowest sentiment.

# reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()
254    Our server Gary was awesome. Food was amazing....
347    3 syllables for this place. \nA-MAZ-ING!\n\nTh...
420    LOVE the food!!!!
459    Love it!!! Wish we still lived in Arizona as C...
679    Excellent burger
Name: text, dtype: object
# reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()
773     This was absolutely horrible. I got the suprem...
1517    Nasty workers and over priced trash
3266    Absolutely awful... these guys have NO idea wh...
4766    Very bad food!
5812    I wouldn't send my worst enemy to this place.
Name: text, dtype: object

Part 10 [Bonus]: Explore fun TextBlob features

10.1 Correct spelling with .correct()

# spelling correction
TextBlob('15 minuets late').correct()
TextBlob("15 minutes late")

10.2 Perform spellchecking with .spellcheck()

# spellcheck
Word('parot').spellcheck()
[('part', 0.9929478138222849), (u'parrot', 0.007052186177715092)]

10.3 Extract definitions with .define()

# definitions
Word('bank').define('n')
[u'sloping land (especially the slope beside a body of water)', u'a financial institution that accepts deposits and channels the money into lending activities', u'a long ridge or pile', u'an arrangement of similar objects in a row or in tiers', u'a supply or stock held in reserve for future use (especially in emergencies)', u'the funds held by a gambling house or the dealer in some gambling games', u'a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force', u'a container (usually with a slot in the top) for keeping money at home', u'a building in which the business of banking transacted', u'a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)']

Conclusion

  • NLP is a gigantic field

  • Understanding the basics broadens the types of data you can work with

  • Simple techniques go a long way

  • Use scikit-learn for NLP whenever possible