Path: blob/master/lessons/lesson_13/practice/solution-code/nlp_review-lab-solutions.ipynb
Natural Language Processing (NLP) Review Lab
Author: Joseph Nelson (DC)
Note: This lab is intended to be done as a walkthrough with the instructor.
Introduction
Adapted from NLP Crash Course by Charlie Greenbacker, Introduction to NLP by Dan Jurafsky, Kevin Markham's Data School Curriculum
What is NLP?
Using computers to process (analyze, understand, generate) natural human languages
Most knowledge created by humans is unstructured text, and we need a way to make sense of it
Build probabilistic model using data about a language
What are some of the higher level task areas?
Information retrieval: Find relevant results and similar results
Information extraction: Structured information from unstructured documents
Machine translation: One language to another
Text simplification: Preserve the meaning of text, but simplify the grammar and vocabulary
Predictive text input: Faster or easier typing
Sentiment analysis: Attitude of speaker
Automatic summarization: Extractive or abstractive summarization
Natural Language Generation: Generate text from data
Speech recognition and generation: Speech-to-text, text-to-speech
Question answering: Determine the intent of the question, match query with knowledge base, evaluate hypotheses
What are some of the lower level components?
Tokenization: breaking text into tokens (words, sentences, n-grams)
Stopword removal: a/an/the
Stemming and lemmatization: root word
TF-IDF: word importance
Part-of-speech tagging: noun/verb/adjective
Named entity recognition: person/organization/location
Spelling correction: "New Yrok City"
Word sense disambiguation: "buy a mouse"
Segmentation: "New York City subway"
Language detection: "translate this page"
Machine learning
Why is NLP hard?
Ambiguity:
Hospitals are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half
Non-standard English: text messages
Idioms: "throw in the towel"
Newly coined words: "retweet"
Tricky entity names: "Where is A Bug's Life playing?"
World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"
NLP requires an understanding of the language and the world.
Part 1: Reading in the Yelp Reviews
"corpus" = collection of documents
"corpora" = plural form of corpus
Note: running the import cell raised a ModuleNotFoundError because the textblob package was not installed; install it (for example with pip install textblob) before running the notebook. The imports shown in that cell were:

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
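A minimal sketch of reading in the reviews, assuming the data is a CSV named yelp.csv with text and stars columns (the file name and column names are assumptions, not confirmed by the original notebook):

import pandas as pd

# read the Yelp reviews into a DataFrame
yelp = pd.read_csv('yelp.csv')
yelp.head()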
1.1 Subset the reviews to best and worst.
Select only 5-star and 1-star reviews.
The text will be the features, the stars will be the target.
Create a train-test split.
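One possible solution sketch, assuming the yelp DataFrame and column names from Part 1:

from sklearn.model_selection import train_test_split

# keep only the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars == 5) | (yelp.stars == 1)]

# define X (the review text) and y (the star rating)
X = yelp_best_worst.text
y = yelp_best_worst.stars

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)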
Part 2: Tokenization
What: Separate text into units such as sentences or words
Why: Gives structure to previously unstructured text
Notes: Relatively easy with English language text, not easy with some languages
2.1 Use CountVectorizer to convert the training and testing text data.
lowercase: boolean, True by default
Convert all characters to lowercase before tokenizing.
ngram_range: tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
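One way to do this, as a sketch that reuses the X_train/X_test split from Part 1 and keeps the default lowercase and ngram_range settings:

from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer with default options (lowercase=True, ngram_range=(1, 1))
vect = CountVectorizer()

# learn the training vocabulary and build document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)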
2.2 Predict the star rating with the new features from CountVectorizer.
Validate on the test set.
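A minimal sketch using logistic regression (imported earlier) on the document-term matrices from 2.1:

# fit a logistic regression model on the document-term matrix
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

# predict on the test set and check accuracy
y_pred = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred))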
Part 3: Stopword Removal
What: Remove common words that will likely appear in any text
Why: They don't tell you much about your text
3.1 Recreate your features with CountVectorizer removing stopwords.
stop_words: string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
3.2 Validate your model using the features with stopwords removed.
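A sketch covering 3.1 and 3.2 together, reusing the objects from Part 2:

# remove English stop words while building the document-term matrix
vect = CountVectorizer(stop_words='english')
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# refit the model and validate on the test set
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))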
Part 4: Other CountVectorizer Options
4.1 Shrink the maximum number of features and re-test the model.
max_features: int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
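For example (the value 300 below is an arbitrary illustration, not taken from the original solution):

# limit the vocabulary to the 300 most frequent training terms
vect = CountVectorizer(max_features=300)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# refit and score the model as in Part 2.2
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))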
4.2 Change the minimum document frequency for terms and test the model's performance.
min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the value represents a proportion of documents; if an integer, an absolute count.
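A sketch with min_df=2, i.e. ignore terms that appear in only one training document (the threshold is an assumption for illustration):

# ignore terms that appear in fewer than 2 training documents
vect = CountVectorizer(min_df=2)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))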
Part 5: Introduction to TextBlob
TextBlob: "Simplified Text Processing"
5.1 Use TextBlob to convert the text in the first review in the dataset.
5.2 List the words in the TextBlob object.
5.3 List the sentences in the TextBlob object.
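A sketch for 5.1 through 5.3, assuming the yelp_best_worst DataFrame and text column from Part 1:

# 5.1: create a TextBlob object from the first review
review = TextBlob(yelp_best_worst.text.iloc[0])

# 5.2: the individual word tokens
print(review.words)

# 5.3: the review split into sentences
print(review.sentences)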
Part 6: Stemming and Lemmatization
Stemming:
What: Reduce a word to its base/stem/root form
Why: Often makes sense to treat related words the same way
Notes:
Uses a "simple" and fast rule-based approach
Stemmed words are usually not shown to users (used for analysis/indexing)
Some search engines treat words with the same stem as synonyms
6.1 Initialize the SnowballStemmer and stem the words in the first review.
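A sketch using NLTK's SnowballStemmer (imported earlier) on the review TextBlob from Part 5:

# initialize the stemmer for English
stemmer = SnowballStemmer('english')

# stem each word of the first review
print([stemmer.stem(word) for word in review.words])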
6.2 Use the built-in lemmatize function on the words of the first review (parsed by TextBlob).
Lemmatization
What: Derive the canonical form ('lemma') of a word
Why: Can be better than stemming
Notes: Uses a dictionary-based approach (slower than stemming)
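A sketch for 6.2, again assuming the review TextBlob from Part 5; Word.lemmatize() requires the NLTK WordNet corpus to be downloaded:

# lemmatize each word of the first review (treats words as nouns by default)
print([word.lemmatize() for word in review.words])

# lemmatizing as verbs can give different results, e.g. 'went' -> 'go'
print([word.lemmatize('v') for word in review.words])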
6.3 Write a function that uses TextBlob and lemmatize to lemmatize text.
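One possible helper, as a sketch (the function name split_into_lemmas is an assumption, not the original solution's name):

def split_into_lemmas(text):
    """Tokenize a document with TextBlob and return the lemma of each word."""
    words = TextBlob(str(text).lower()).words
    return [word.lemmatize() for word in words]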
6.4 Provide your function to CountVectorizer as the analyzer and test the performance of your model.
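Passing the helper above as the analyzer and refitting the model, as a sketch:

# use the custom lemmatizing function instead of the default analyzer
vect = CountVectorizer(analyzer=split_into_lemmas)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)
print(metrics.accuracy_score(y_test, logreg.predict(X_test_dtm)))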
Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)
What: Computes "relative frequency" that a word appears in a document compared to its frequency across all documents
Why: More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
Notes: Used for search engine scoring, text summarization, document clustering
7.1 Build a simple TF-IDF using CountVectorizer
Term Frequency can be calculated with the default CountVectorizer.
Inverse Document Frequency can be calculated with CountVectorizer and the argument binary=True.
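A sketch of the idea on a tiny made-up corpus (the three example strings are illustrative only; get_feature_names_out assumes scikit-learn 1.0 or later, older versions used get_feature_names):

# a toy corpus for illustration
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# term frequency: raw counts of each term per document
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(),
                  columns=vect.get_feature_names_out())

# document frequency: binary=True counts each term at most once per document
vect = CountVectorizer(binary=True)
df = pd.Series(vect.fit_transform(simple_train).toarray().sum(axis=0),
               index=vect.get_feature_names_out())

# a simple TF-IDF-like score: term frequency divided by document frequency
print(tf / df)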
More details: TF-IDF is about what matters
Part 8: Using TF-IDF to Summarize a Yelp Review
Note: Reddit's autotldr uses the SMMRY algorithm, which is based on TF-IDF!
8.1 Build a TF-IDF predictor matrix excluding stopwords with TfidfVectorizer
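A sketch building the TF-IDF matrix over the full set of review texts (the DataFrame and column names follow the assumptions from Part 1):

from sklearn.feature_extraction.text import TfidfVectorizer

# build a document-term matrix of TF-IDF scores, excluding English stop words
tfidf_vect = TfidfVectorizer(stop_words='english')
dtm = tfidf_vect.fit_transform(yelp.text)
features = tfidf_vect.get_feature_names_out()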
8.2 Write a function to pull out the top 5 words by TF-IDF score from a review
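One way to write the summarizer, sketched on top of the dtm and features objects above (the helper name summarize_review is an assumption):

def summarize_review(row_index):
    """Print the 5 terms with the highest TF-IDF scores for one review."""
    # TF-IDF scores for this review, as a Series indexed by term
    scores = pd.Series(dtm[row_index, :].toarray().ravel(), index=features)
    top_terms = scores.sort_values(ascending=False).head(5)
    print(yelp.text.iloc[row_index][:300])   # preview the review text
    print(top_terms)

summarize_review(0)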
Part 9: Sentiment Analysis
9.1 Extract sentiment from a review parsed with TextBlob
Sentiment polarity ranges from -1 (most negative) to 1 (most positive). A parsed TextBlob object has a sentiment attribute, which can be accessed as shown below:
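For example, a sketch using the review TextBlob from Part 5:

# sentiment is a namedtuple of (polarity, subjectivity)
print(review.sentiment)
print(review.sentiment.polarity)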
9.2 Calculate the sentiment for every review in the full Yelp dataset as a new column.
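A sketch applying TextBlob to every row of the full DataFrame (the yelp DataFrame and text column are the assumptions from Part 1):

def detect_sentiment(text):
    """Return the polarity of a piece of text, from -1 (negative) to 1 (positive)."""
    return TextBlob(str(text)).sentiment.polarity

# add the polarity of each review as a new column
yelp['sentiment'] = yelp.text.apply(detect_sentiment)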
9.3 Create a boxplot of sentiment by star rating
9.4 Print reviews with the highest and lowest sentiment.
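Sketches for 9.3 and 9.4, assuming the sentiment column created above:

# 9.3: boxplot of sentiment grouped by star rating
yelp.boxplot(column='sentiment', by='stars')

# 9.4: reviews with the most positive and most negative sentiment
print(yelp.loc[yelp.sentiment.idxmax(), 'text'])
print(yelp.loc[yelp.sentiment.idxmin(), 'text'])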
Part 10 [Bonus]: Explore fun TextBlob features
10.1 Correct spelling with .correct()
10.2 Perform spellchecking with .spellcheck()
10.3 Extract definitions with .define()
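Quick sketches of the three bonus features; the example strings are illustrative only, and .spellcheck() and .define() rely on NLTK corpora being downloaded:

# 10.1: correct the spelling of an entire blob
print(TextBlob('15 minuets late').correct())

# 10.2: spellcheck a single Word (returns candidate spellings with confidences)
print(Word('parot').spellcheck())

# 10.3: WordNet definitions of a Word
print(Word('bank').define())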
Conclusion
NLP is a gigantic field
Understanding the basics broadens the types of data you can work with
Simple techniques go a long way
Use scikit-learn for NLP whenever possible