GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_13/practice/nlp_review-lab.ipynb
Kernel: Python 2

Natural Language Processing (NLP) Review Lab

Author: Joseph Nelson (DC)


Note: This lab is intended to be done as a walkthrough with the instructor.

Introduction

Adapted from NLP Crash Course by Charlie Greenbacker, Introduction to NLP by Dan Jurafsky, and Kevin Markham's Data School curriculum

What is NLP?

  • Using computers to process (analyze, understand, generate) natural human languages

  • Most knowledge created by humans is unstructured text, and we need a way to make sense of it

  • Build probabilistic models using data about a language

What are some of the higher level task areas?

What are some of the lower level components?

  • Tokenization: breaking text into tokens (words, sentences, n-grams)

  • Stopword removal: a/an/the

  • Stemming and lemmatization: root word

  • TF-IDF: word importance

  • Part-of-speech tagging: noun/verb/adjective

  • Named entity recognition: person/organization/location

  • Spelling correction: "New Yrok City"

  • Word sense disambiguation: "buy a mouse"

  • Segmentation: "New York City subway"

  • Language detection: "translate this page"

  • Machine learning

Why is NLP hard?

  • Ambiguity:

    • Hospitals are Sued by 7 Foot Doctors

    • Juvenile Court to Try Shooting Defendant

    • Local High School Dropouts Cut in Half

  • Non-standard English: text messages

  • Idioms: "throw in the towel"

  • Newly coined words: "retweet"

  • Tricky entity names: "Where is A Bug's Life playing?"

  • World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"

NLP requires an understanding of the language and the world.

Part 1: Reading in the Yelp Reviews

  • "corpus" = collection of documents

  • "corpora" = plural form of corpus

import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

%matplotlib inline
csv_file = '../data/yelp.csv'
# A:
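
A minimal loading sketch, assuming yelp.csv has the usual text and stars columns:

# Load the reviews into a DataFrame
yelp = pd.read_csv(csv_file)
yelp.head()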

1.1 Subset the reviews to best and worst.

  • Select only 5-star and 1-star reviews.

  • The text will be the features, the stars will be the target.

  • Create a train-test split.

# A:
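
One possible approach, assuming the yelp DataFrame from Part 1:

# Keep only the best (5-star) and worst (1-star) reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# The review text is the feature; the star rating is the target
X = yelp_best_worst.text
y = yelp_best_worst.stars

# Hold out a test set for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)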

Part 2: Tokenization

  • What: Separate text into units such as sentences or words

  • Why: Gives structure to previously unstructured text

  • Notes: Relatively easy with English-language text, harder with some other languages

2.1 Use CountVectorizer to convert the training and testing text data.

CountVectorizer documentation

  • lowercase: boolean, True by default

    • Convert all characters to lowercase before tokenizing.

  • ngram_range: tuple (min_n, max_n)

    • The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

# A:
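
A minimal sketch, assuming the X_train/X_test split from 1.1:

# Learn the training vocabulary, then build document-term matrices
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)   # fit on training text only
X_test_dtm = vect.transform(X_test)         # reuse the training vocabulary
X_train_dtm.shape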

2.2 Predict the star rating with the new features from CountVectorizer.

Validate on the test set.

# A:
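
One way to do this, using Multinomial Naive Bayes (a common baseline for count features):

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred = nb.predict(X_test_dtm)

# Classification accuracy on the held-out reviews
metrics.accuracy_score(y_test, y_pred)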

Part 3: Stopword Removal

  • What: Remove common words that will likely appear in any text

  • Why: They don't tell you much about your text

3.1 Recreate your features with CountVectorizer removing stopwords.

  • stop_words: string {'english'}, list, or None (default)

  • If 'english', a built-in stop word list for English is used.

  • If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.

  • If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

# A:
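
A sketch of the re-vectorization, assuming the same train/test split:

# Drop English stopwords while building the vocabulary
vect = CountVectorizer(stop_words='english')
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)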

3.2 Validate your model using the features with stopwords removed.

# A:
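
The validation step can repeat the pattern from 2.2 on the new matrices:

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))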

Part 4: Other CountVectorizer Options

4.1 Shrink the maximum number of features and re-test the model.

  • max_features: int or None, default=None

  • If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

# A:
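
A sketch with an arbitrary example value for max_features:

# Cap the vocabulary at the 1,000 most frequent terms in the corpus
vect = CountVectorizer(max_features=1000)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB().fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))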

4.2 Change the minimum document frequency for terms and test the model's performance.

  • min_df: float in range [0.0, 1.0] or int, default=1

  • When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute counts.

# A:
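
A sketch with an example cut-off of 2 documents:

# Ignore terms that appear in fewer than 2 training documents
vect = CountVectorizer(min_df=2)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB().fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))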

Part 5: Introduction to TextBlob

TextBlob: "Simplified Text Processing"

5.1 Use TextBlob to convert the text in the first review in the dataset.

# A:
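
A minimal sketch, assuming the full yelp DataFrame from Part 1:

# Wrap the text of the first review in a TextBlob object
review = TextBlob(yelp.text[0])
review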

5.2 List the words in the TextBlob object.

# A:

5.3 List the sentences in the TextBlob object.

# A:
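
Sketches for 5.2 and 5.3, reusing the review TextBlob from 5.1:

review.words       # a WordList of tokenized words
review.sentences   # a list of Sentence objects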

Part 6: Stemming and Lemmatization

Stemming:

  • What: Reduce a word to its base/stem/root form

  • Why: Often makes sense to treat related words the same way

  • Notes:

    • Uses a "simple" and fast rule-based approach

    • Stemmed words are usually not shown to users (used for analysis/indexing)

    • Some search engines treat words with the same stem as synonyms

6.1 Initialize the SnowballStemmer and stem the words in the first review.

# A:
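
A minimal sketch, reusing the review TextBlob from Part 5:

# Stem each word of the first review with the English Snowball stemmer
stemmer = SnowballStemmer('english')
[stemmer.stem(word) for word in review.words]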

6.2 Use the built-in lemmatize function on the words of the first review (parsed by TextBlob)

Lemmatization

  • What: Derive the canonical form ('lemma') of a word

  • Why: Can be better than stemming

  • Notes: Uses a dictionary-based approach (slower than stemming)

# A:
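
A minimal sketch; review.words yields Word objects, each with a lemmatize method (WordNet-based, defaulting to the noun part of speech):

[word.lemmatize() for word in review.words]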

6.3 Write a function that uses TextBlob and lemmatize to lemmatize text.

# A:
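
One possible helper (the function name is just illustrative):

def split_into_lemmas(text):
    # Lowercase, tokenize with TextBlob, and return each word's lemma
    words = TextBlob(text.lower()).words
    return [word.lemmatize() for word in words]

split_into_lemmas(yelp.text[0])[:10]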

6.4 Provide your function to CountVectorizer as the analyzer and test the performance of your model.

# A:
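
A sketch reusing split_into_lemmas from 6.3 (this re-tokenizes every document, so it runs noticeably slower):

# Build the vocabulary from lemmas instead of raw tokens
vect = CountVectorizer(analyzer=split_into_lemmas)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB().fit(X_train_dtm, y_train)
metrics.accuracy_score(y_test, nb.predict(X_test_dtm))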

Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

  • What: Computes "relative frequency" that a word appears in a document compared to its frequency across all documents

  • Why: More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)

  • Notes: Used for search engine scoring, text summarization, document clustering

7.1 Build a simple TF-IDF using CountVectorizer

  • Term frequency can be calculated with the default CountVectorizer.

  • Document frequency can be calculated with CountVectorizer and the argument binary=True; inverting it gives the IDF.

# A:
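
One simple way to build it, shown on a toy corpus (the simple_train list is only an illustration; the same idea applies to the Yelp reviews):

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# Term frequency: raw counts per document
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names())

# Document frequency: number of documents containing each term
vect_binary = CountVectorizer(binary=True)
df = pd.DataFrame(vect_binary.fit_transform(simple_train).toarray(), columns=vect_binary.get_feature_names()).sum()

# A simple TF-IDF: term frequency divided by document frequency
tf / df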

Part 8: Using TF-IDF to Summarize a Yelp Review

Note: Reddit's autotldr uses the SMMRY algorithm, which is based on TF-IDF!

8.1 Build a TF-IDF predictor matrix excluding stopwords with TfidfVectorizer

# A:
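
A sketch, assuming X is the Series of best/worst review text created in 1.1:

# TF-IDF document-term matrix, ignoring English stopwords
tfidf_vect = TfidfVectorizer(stop_words='english')
dtm = tfidf_vect.fit_transform(X)
features = tfidf_vect.get_feature_names()
dtm.shape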

8.2 Write a function to pull out the top 5 words by TF-IDF score from a review

# A:
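
One possible helper (top_tfidf_words is an illustrative name), reusing dtm and features from 8.1:

def top_tfidf_words(review_id, n=5):
    # Scores for one review, then the indices of the n largest scores
    row = dtm[review_id].toarray().flatten()
    top_indices = row.argsort()[::-1][:n]
    return [features[i] for i in top_indices]

top_tfidf_words(0)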

Part 9: Sentiment Analysis

9.1 Extract sentiment from a review parsed with TextBlob

Sentiment polarity ranges from -1 (the most negative) to 1 (the most positive). A parsed TextBlob object has a sentiment attribute whose polarity can be accessed with:

review.sentiment.polarity
# A:
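
A minimal sketch for the first review in the dataset:

review = TextBlob(yelp.text[0])
review.sentiment.polarity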

9.2 Calculate the sentiment for every review in the full Yelp dataset as a new column.

# A:
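
One approach with apply (depending on the text encoding on a Python 2 kernel, a unicode decode may be needed first):

# Polarity of every review, stored as a new column
yelp['sentiment'] = yelp.text.apply(lambda text: TextBlob(text).sentiment.polarity)
yelp.head()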

9.3 Create a boxplot of sentiment by star rating

# A:
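
A one-line sketch using the pandas boxplot helper:

# Distribution of sentiment within each star rating
yelp.boxplot(column='sentiment', by='stars')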

9.4 Print reviews with the highest and lowest sentiment.

# A:
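
One way to inspect the extremes, assuming the sentiment column from 9.2:

yelp.sort_values('sentiment').text.head()   # most negative reviews
yelp.sort_values('sentiment').text.tail()   # most positive reviews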

10. [Bonus] Explore fun TextBlob features

10.1 Correct spelling with .correct()

# A:
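
A small example with a deliberately misspelled phrase:

TextBlob('15 minuets late').correct()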

10.2 Perform spellchecking with .spellcheck()

# A:
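
spellcheck operates on a single Word and returns (candidate, confidence) pairs:

Word('parot').spellcheck()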

10.3 Extract definitions with .define()

# A:
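
define looks up WordNet definitions for a single Word:

Word('bank').define()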

Conclusion

  • NLP is a gigantic field

  • Understanding the basics broadens the types of data you can work with

  • Simple techniques go a long way

  • Use scikit-learn for NLP whenever possible