GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_13/practice/nlp_review-lab.ipynb
Kernel: Python 2

Natural Language Processing (NLP) Review Lab

Author: Joseph Nelson (DC)


Note: This lab is intended to be done as a walkthrough with the instructor.

Introduction

Adapted from NLP Crash Course by Charlie Greenbacker, Introduction to NLP by Dan Jurafsky, and Kevin Markham's Data School curriculum

What is NLP?

  • Using computers to process (analyze, understand, generate) natural human languages

  • Most knowledge created by humans is unstructured text, and we need a way to make sense of it

  • Build probabilistic models using data about a language

What are some of the higher level task areas?

What are some of the lower level components?

  • Tokenization: breaking text into tokens (words, sentences, n-grams)

  • Stopword removal: a/an/the

  • Stemming and lemmatization: root word

  • TF-IDF: word importance

  • Part-of-speech tagging: noun/verb/adjective

  • Named entity recognition: person/organization/location

  • Spelling correction: "New Yrok City"

  • Word sense disambiguation: "buy a mouse"

  • Segmentation: "New York City subway"

  • Language detection: "translate this page"

  • Machine learning

Why is NLP hard?

  • Ambiguity:

    • Hospitals are Sued by 7 Foot Doctors

    • Juvenile Court to Try Shooting Defendant

    • Local High School Dropouts Cut in Half

  • Non-standard English: text messages

  • Idioms: "throw in the towel"

  • Newly coined words: "retweet"

  • Tricky entity names: "Where is A Bug's Life playing?"

  • World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"

NLP requires an understanding of the language and the world.

Part 1: Reading in the Yelp Reviews

  • "corpus" = collection of documents

  • "corpora" = plural form of corpus

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

%matplotlib inline
csv_file = '../data/yelp.csv'
# A:
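
A minimal loading sketch, assuming yelp.csv has the usual text and stars columns:

# Load the reviews into a DataFrame
yelp = pd.read_csv(csv_file)
yelp.head()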

1.1 Subset the reviews to best and worst.

  • Select only 5-star and 1-star reviews.

  • The text will be the features, the stars will be the target.

  • Create a train-test split.

# A:
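
One possible approach, assuming the yelp DataFrame from Part 1:

# Keep only the best (5-star) and worst (1-star) reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# The review text is the feature; the star rating is the target
X = yelp_best_worst.text
y = yelp_best_worst.stars

# Hold out a test set for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)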

Part 2: Tokenization

  • What: Separate text into units such as sentences or words

  • Why: Gives structure to previously unstructured text

  • Notes: Relatively easy with English-language text, harder with some other languages

2.1 Use CountVectorizer to convert the training and testing text data.

CountVectorizer documentation

  • lowercase: boolean, True by default

    • Convert all characters to lowercase before tokenizing.

  • ngram_range: tuple (min_n, max_n)

    • The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

# A:
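
A minimal sketch, assuming the X_train/X_test split from 1.1:

# Learn the training vocabulary, then build document-term matrices
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)   # fit on training text only
X_test_dtm = vect.transform(X_test)         # reuse the training vocabulary
X_train_dtm.shape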

2.2 Predict the star rating with the new features from CountVectorizer.

Validate on the test set.

# A:
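
One way to do this, using Multinomial Naive Bayes (a common baseline for count features):

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)

# Classification accuracy on the held-out reviews
metrics.accuracy_score(y_test, y_pred)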

Part 3: Stopword Removal

  • What: Remove common words that will likely appear in any text

  • Why: They don't tell you much about your text

3.1 Recreate your features with CountVectorizer removing stopwords.

  • stop_words: string {'english'}, list, or None (default)

  • If 'english', a built-in stop word list for English is used.

  • If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.

  • If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

# A:
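
A sketch of the re-vectorization, assuming the same train/test split:

# Drop English stopwords while building the vocabulary
vect = CountVectorizer(stop_words='english')
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)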

3.2 Validate your model using the features with stopwords removed.

# A:
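
The validation step can repeat the pattern from 2.2 on the new matrices:

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))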

Part 4: Other CountVectorizer Options

4.1 Shrink the maximum number of features and re-test the model.

  • max_features: int or None, default=None

  • If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

# A:
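
A sketch with an arbitrary example value for max_features:

# Cap the vocabulary at the 1,000 most frequent terms in the corpus
vect = CountVectorizer(max_features=1000)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB().fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))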

4.2 Change the minimum document frequency for terms and test the model's performance.

  • min_df: float in range [0.0, 1.0] or int, default=1

  • When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute counts.

# A:
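
A sketch with an example cut-off of 2 documents:

# Ignore terms that appear in fewer than 2 training documents
vect = CountVectorizer(min_df=2)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB().fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))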

Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

5.1 Use TextBlob to convert the text in the first review in the dataset.

# A:
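
A minimal sketch, assuming the full yelp DataFrame from Part 1:

# Wrap the text of the first review in a TextBlob object
review = TextBlob(yelp.text[0])
review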

5.2 List the words in the TextBlob object.

# A:

5.3 List the sentences in the TextBlob object.

# A:
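
Sketches for 5.2 and 5.3, reusing the review TextBlob from 5.1:

review.words       # a WordList of tokenized words
review.sentences   # a list of Sentence objects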

Part 6: Stemming and Lemmatization

Stemming:

  • What: Reduce a word to its base/stem/root form

  • Why: Often makes sense to treat related words the same way

  • Notes:

    • Uses a "simple" and fast rule-based approach

    • Stemmed words are usually not shown to users (used for analysis/indexing)

    • Some search engines treat words with the same stem as synonyms

6.1 Initialize the SnowballStemmer and stem the words in the first review.

# A:
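
A minimal sketch, reusing the review TextBlob from Part 5:

# Stem each word of the first review with the English Snowball stemmer
stemmer = SnowballStemmer('english')
[stemmer.stem(word) for word in review.words]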

6.2 Use the built-in lemmatize function on the words of the first review (parsed by TextBlob)

Lemmatization

  • What: Derive the canonical form ('lemma') of a word

  • Why: Can be better than stemming

  • Notes: Uses a dictionary-based approach (slower than stemming)

# A:
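
A minimal sketch; review.words yields Word objects, each with a lemmatize method (WordNet-based, defaulting to the noun part of speech):

[word.lemmatize() for word in review.words]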

6.3 Write a function that uses TextBlob and lemmatize to lemmatize text.

# A:
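
One possible helper (the function name is just illustrative):

def split_into_lemmas(text):
    # Lowercase, tokenize with TextBlob, and return each word's lemma
    words = TextBlob(text.lower()).words
    return [word.lemmatize() for word in words]

split_into_lemmas(yelp.text[0])[:10]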

6.4 Provide your function to CountVectorizer as the analyzer and test the performance of your model.

# A:
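
A sketch reusing split_into_lemmas from 6.3 (this re-tokenizes every document, so it runs noticeably slower):

# Build the vocabulary from lemmas instead of raw tokens
vect = CountVectorizer(analyzer=split_into_lemmas)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB().fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))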

Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

  • What: Computes "relative frequency" that a word appears in a document compared to its frequency across all documents

  • Why: More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)

  • Notes: Used for search engine scoring, text summarization, document clustering

7.1 Build a simple TF-IDF using CountVectorizer

  • Term frequency can be calculated with the default CountVectorizer.

  • Document frequency can be calculated with CountVectorizer and the argument binary=True; inverting it gives the IDF.

# A:
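
One simple way to build it, shown on a toy corpus (the simple_train list is only an illustration; the same idea applies to the Yelp reviews):

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# Term frequency: raw counts per document
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())

# Document frequency: number of documents containing each term
vect_binary = CountVectorizer(binary=True)
df = pd.DataFrame(vect_binary.fit_transform(simple_train).toarray(), columns=vect_binary.get_feature_names()).sum()

# A simple TF-IDF: term frequency divided by document frequency
tf / df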

Part 8: Using TF-IDF to Summarize a Yelp Review

Note: Reddit's autotldr uses the SMMRY algorithm, which is based on TF-IDF!

8.1 Build a TF-IDF predictor matrix excluding stopwords with TfidfVectorizer

# A:
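
A sketch, assuming X is the Series of best/worst review text created in 1.1:

# TF-IDF document-term matrix, ignoring English stopwords
tfidf_vect = TfidfVectorizer(stop_words='english')
dtm = tfidf_vect.fit_transform(X)
features = tfidf_vect.get_feature_names()
dtm.shape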

8.2 Write a function to pull out the top 5 words by TF-IDF score from a review

# A:
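
One possible helper (top_tfidf_words is an illustrative name), reusing dtm and features from 8.1:

def top_tfidf_words(review_id, n=5):
    # Scores for one review, then the indices of the n largest scores
    row = dtm[review_id].toarray().flatten()
    top_indices = row.argsort()[::-1][:n]
    return [features[i] for i in top_indices]

top_tfidf_words(0)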

Part 9: Sentiment Analysis

9.1 Extract sentiment from a review parsed with TextBlob

Sentiment polarity ranges from -1 (the most negative) to 1 (the most positive). A parsed TextBlob object has a sentiment attribute whose polarity can be accessed with:

review.sentiment.polarity
# A:
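
A minimal sketch for the first review in the dataset:

review = TextBlob(yelp.text[0])
review.sentiment.polarity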

9.2 Calculate the sentiment for every review in the full Yelp dataset as a new column.

# A:
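
One approach with apply (depending on the text encoding on a Python 2 kernel, a unicode decode may be needed first):

# Polarity of every review, stored as a new column
yelp['sentiment'] = yelp.text.apply(lambda text: TextBlob(text).sentiment.polarity)
yelp.head()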

9.3 Create a boxplot of sentiment by star rating

# A:
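
A one-line sketch using the pandas boxplot helper:

# Distribution of sentiment within each star rating
yelp.boxplot(column='sentiment', by='stars')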

9.4 Print reviews with the highest and lowest sentiment.

# A:
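
One way to inspect the extremes, assuming the sentiment column from 9.2:

yelp.sort_values('sentiment').text.head()   # most negative reviews
yelp.sort_values('sentiment').text.tail()   # most positive reviews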

10. [Bonus] Explore fun TextBlob features

10.1 Correct spelling with .correct()

# A:
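
A small example with a deliberately misspelled phrase:

TextBlob('15 minuets late').correct()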

10.2 Perform spellchecking with .spellcheck()

# A:
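
spellcheck operates on a single Word and returns (candidate, confidence) pairs:

Word('parot').spellcheck()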

10.3 Extract definitions with .define()

# A:
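
define looks up WordNet definitions for a single Word:

Word('bank').define()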

Conclusion

  • NLP is a gigantic field

  • Understanding the basics broadens the types of data you can work with

  • Simple techniques go a long way

  • Use scikit-learn for NLP whenever possible