Natural Language Processing
Unit 4: Required
Materials We Provide
Topic | Description | Link |
---|---|---|
Lesson | Natural Language Processing | Here |
Practice | Four sample NLP activities | Here |
Data | Yelp Review and Tweet Datasets | Here |
Slides | Sample slide deck for this topic (PPTX, deprecated) | Here |
Extra Materials | Optional materials on Bayes Theorem and Naive Bayes | Here |
The Yelp dataset was chosen because of its rich and colloquial text attributes, in addition to how well it lends itself to sentiment analysis.
Note: This lesson also uses the Naive Bayes model MultinomialNB, which is often used for NLP applications, such as spam detection. An appendix is included at the end of the lesson for interested students. Supplemental materials are also offered if you want to explore Bayes-related topics.
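As a rough illustration of the kind of model mentioned in the note above, here is a minimal sketch of spam detection with MultinomialNB. The messages and labels are made up for this example (they are not from the Yelp or tweet datasets), and pairing the model with CountVectorizer in a pipeline is an assumption about how one might set it up, not necessarily how the lesson does it.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, invented messages used only for illustration.
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "lunch meeting moved to noon",
    "see you at the meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each message into word counts;
# MultinomialNB models those counts per class.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["free prize offer", "meeting at noon"])
print(predictions)
```

With only four training messages this is a toy; real use would involve a held-out test set and an evaluation metric.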
Learning Objectives
By the end of this lesson, students should be able to:
Discuss the major tasks involved with natural language processing
Discuss, on a low level, the components of natural language processing
Identify why natural language processing is difficult
Demonstrate text classification
Demonstrate common text preprocessing techniques
Student Requirements
Before this lesson, students should already be able to:
Use Anaconda for package management
Use scikit-learn's train_test_split to create training and test sets of features and target values
Read data into a Pandas DataFrame
Build and evaluate predictive models using scikit-learn
Lesson Guide
Installation Notes
To proceed through the lesson, first install TextBlob as explained below. We generally prefer Anaconda-based installations, since those packages tend to be tested with our other Anaconda packages. However, TextBlob is not available through Anaconda on some platforms (e.g. Win64). To install TextBlob:
conda install -c https://conda.anaconda.org/sloria textblob
Or:
pip install textblob
python -m textblob.download_corpora lite
Additional Resources
For more information, we recommend the following resources:
Check out this Yelp blog post on how they completed a classification task (with over 1000 response variables!) using restaurant review text
Always check documentation: CountVectorizer, HashingVectorizer, TF-IDF
Wikipedia's articles on feature hashing and hash functions are a great place to turn for more info on hashing
Charlie Greenbacker's Intro to NLP
Wikipedia includes a walkthrough of TF-IDF
Google's ngram tool
An experiment using NLP and Eigenfaces (eigenvectors used for face recognition) for Tinder
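To make the vectorizers named in the documentation item above a little more concrete, the sketch below runs three scikit-learn vectorizers over the same tiny corpus. The sentences are invented, `n_features=16` is an arbitrarily small value chosen for readability, and using TfidfVectorizer for the "TF-IDF" entry is an assumption (the linked documentation may instead refer to TfidfTransformer).

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Invented review-like sentences used only for illustration.
docs = [
    "the pizza was great",
    "the service was slow",
    "great pizza, slow service",
]

# CountVectorizer learns a vocabulary and stores raw term counts.
count_matrix = CountVectorizer().fit_transform(docs)

# TfidfVectorizer learns the same vocabulary but reweights counts
# so that terms appearing in many documents count for less.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# HashingVectorizer is stateless: it hashes tokens into a fixed
# number of columns, so no vocabulary is learned or stored.
hashed_matrix = HashingVectorizer(n_features=16).transform(docs)

print(count_matrix.shape)
print(tfidf_matrix.shape)
print(hashed_matrix.shape)
```

The first two matrices have one column per distinct word in the corpus, while the hashed matrix always has `n_features` columns regardless of vocabulary size, at the cost of possible hash collisions.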