Natural Language Processing
Authors: Kiefer Katovich (San Francisco), Joseph Nelson (Washington, D.C.)
Learning Objectives
Discuss the major tasks involved with natural language processing.
Discuss, on a low level, the components of natural language processing.
Identify why natural language processing is difficult.
Demonstrate text classification.
Demonstrate common text preprocessing techniques.
How Do We Use NLP in Data Science?
In data science, we are often asked to analyze unstructured text or make a predictive model using it. Unfortunately, most data science techniques require numeric data. NLP libraries provide a tool set of methods to convert unstructured text into meaningful numeric data.
Analysis: NLP techniques provide tools to allow us to understand and analyze large amounts of text. For example:
Analyze the positivity/negativity of comments on different websites.
Extract key words from meeting notes and visualize how meeting topics change over time.
Vectorizing for machine learning: When building a machine learning model, we typically must transform our data into numeric features. This process of transforming non-numeric data such as natural language into numeric features is called vectorization. For example:
Understanding related words. Using stemming, NLP lets us know that "swim", "swims", and "swimming" all refer to the same base word. This allows us to reduce the number of features used in our model.
Identifying important and unique words. Using TF-IDF (term frequency-inverse document frequency), we can identify which words are most likely to be meaningful in a document.
Install TextBlob
The TextBlob Python library provides a simplified interface for exploring common NLP tasks including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
To proceed with the lesson, first install TextBlob, as explained below. We prefer Anaconda-based installations, since they are tested against our other Anaconda packages.
To install TextBlob, run:
conda install -c conda-forge textblob
Or:
pip install textblob
python -m textblob.download_corpora lite
Lesson Guide
Introduction
Adapted from NLP Crash Course by Charlie Greenbacker and Introduction to NLP by Dan Jurafsky
Introduction
What Is Natural Language Processing (NLP)?
Using computers to process (analyze, understand, generate) natural human languages.
Making sense of human knowledge stored as unstructured text.
Building probabilistic models using data about a language.
What Are Some of the Higher-Level Task Areas?
Objective: Discuss the major tasks involved with natural language processing.
We often hope that computers can solve many high-level problems involving natural language. Unfortunately, due to the difficulty of understanding human language, many of these problems are still not well solved. That said, existing solutions to these problems all involve utilizing the lower-level components of NLP discussed in the next section. Some higher-level tasks include:
Chatbots: Understand natural language from the user and return intelligent responses.
Information retrieval: Find relevant results and similar results.
Information extraction: Structured information from unstructured documents.
Machine translation: One language to another.
Text simplification: Preserve the meaning of text, but simplify the grammar and vocabulary.
Predictive text input: Faster or easier typing.
Sentiment analysis: Attitude of speaker.
Automatic summarization: Extractive or abstractive summarization.
Natural language generation: Generate text from data.
Speech recognition and generation: Speech-to-text, text-to-speech.
Question answering: Determine the intent of the question, match query with knowledge base, evaluate hypotheses.
What Are Some of the Lower-Level Components?
Objective: Discuss, on a low level, the components of natural language processing.
Unfortunately, the NLP programming libraries typically do not provide direct solutions for the high-level tasks above. Instead, they provide low-level building blocks that enable us to craft our own solutions. These include:
Tokenization: Breaking text into tokens (words, sentences, n-grams)
Stop-word removal: a/an/the
Stemming and lemmatization: root word
TF-IDF: word importance
Part-of-speech tagging: noun/verb/adjective
Named entity recognition: person/organization/location
Spelling correction: "New Yrok City"
Word sense disambiguation: "buy a mouse"
Segmentation: "New York City subway"
Language detection: "translate this page"
Machine learning: specialized models that work well with text
Why Is NLP Hard?
Objective: Identify why natural language processing is difficult.
Natural language processing requires an understanding of the language and the world. Several limitations of NLP are:
Ambiguity:
Hospitals Are Sued by 7 Foot Doctors
Juvenile Court to Try Shooting Defendant
Local High School Dropouts Cut in Half
Non-standard English: text messages
Idioms: "throw in the towel"
Newly coined words: "retweet"
Tricky entity names: "Where is A Bug's Life playing?"
World knowledge: "Mary and Sue are sisters", "Mary and Sue are mothers"
Throughout this lesson, we will use Yelp reviews to practice and discover common low-level NLP techniques.
You should be familiar with these terms, as they are frequently used in NLP:
corpus: a collection of documents (derived from the Latin word for "body")
corpora: plural form of corpus
Throughout this lesson, we will use a model that is very popular for text classification called Naive Bayes (the "NB" in BernoulliNB and MultinomialNB below). If you are unfamiliar with it, know that it works the same way as all other models in scikit-learn! We will look extensively at the mechanics behind Naive Bayes later in the course. However, see the appendix at the end of this notebook for a quick introduction.
As you proceed through this section, note that text classification is done in the same way as all other classification models. First, the text is vectorized into a set of numeric features. Then, a standard machine learning classifier is applied. NLP libraries often include vectorizers and ML models that work particularly well with text.
We will refer to each piece of text we are trying to classify as a document.
For example, a document could refer to an email, book chapter, tweet, article, or text message.
Text classification is the task of predicting which category or topic a text sample is from.
We may want to identify:
Is an article a sports or business story?
Does an email have positive or negative sentiment?
Is the rating of a recipe 1, 2, 3, 4, or 5 stars?
Predictions are often made by using the words as features and the label as the target output.
Starting out, we will make each unique word (across all documents) a single feature. In any given corpus, we may have hundreds of thousands of unique words, so we may have hundreds of thousands of features!
For a given document, the numeric value of each feature could be the number of times the word appears in the document.
So, most features will have a value of zero, resulting in a sparse matrix of features.
This technique for vectorizing text is referred to as a bag-of-words model.
It is called bag of words because the document's structure is lost — as if the words are all jumbled up in a bag.
The first step to creating a bag-of-words model is to create a vocabulary of all possible words in the corpus.
Alternatively, we could make each column an indicator column, which is 1 if the word is present in the document (no matter how many times) and 0 if not. This vectorization could be used to reduce the importance of repeated words. For example, a website search engine would be susceptible to spammers who load websites with repeated words. So, the search engine might use indicator columns as features rather than word counts.
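Below is a minimal sketch of both flavors of bag-of-words vectorization using scikit-learn's CountVectorizer; the three toy documents are made up for illustration.

```python
# Bag-of-words vectorization: word counts vs. indicator (binary) columns.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The food was great, great service",
    "The service was slow",
    "Great food!",
]

# Count vectorization: each column is a word, each value is how many times it appears.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)      # sparse document-term matrix
print(sorted(count_vect.vocabulary_))        # the learned vocabulary (column order)
print(counts.toarray())

# Indicator vectorization: 1 if the word appears at all, 0 otherwise.
binary_vect = CountVectorizer(binary=True)
print(binary_vect.fit_transform(docs).toarray())
```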
We need to consider several things to decide if bag-of-words is appropriate.
Does order of words matter?
Does punctuation matter?
Does upper or lower case matter?
Demo: Text Processing in scikit-learn
Objective: Demonstrate text classification.
One common method of reducing the number of features is converting all text to lowercase before generating features. Note that to a computer, "aPPle" is a different token/"word" than "apple". Converting both to lowercase ensures that fewer features will be generated. It might be useful not to convert to lowercase if capitalization carries meaning.
Our model achieved ~92% accuracy, an improvement over the 82% baseline accuracy we would get if we always predicted 5 stars.
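A minimal sketch of the kind of count-vectorizer plus Naive Bayes workflow that produces numbers like these; the file name yelp.csv and the column names text and stars are assumptions for illustration, not the notebook's exact code.

```python
# Text classification sketch: vectorize the text, then fit a standard classifier.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

yelp = pd.read_csv("yelp.csv")                 # hypothetical file name
X = yelp["text"]                               # assumed review-text column
y = yelp["stars"]                              # assumed star-rating column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

vect = CountVectorizer(lowercase=True)         # lowercasing reduces the feature count
X_train_dtm = vect.fit_transform(X_train)      # learn the vocabulary on training data only
X_test_dtm = vect.transform(X_test)            # reuse that vocabulary on the test data

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
print(accuracy_score(y_test, nb.predict(X_test_dtm)))
```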
Let's look more into how the vectorizer works.
N-grams are features that consist of N consecutive words. They are useful because, in a bag-of-words model, treating "data scientist" as a single feature carries more meaning than having two independent features "data" and "scientist"!
Example:
ngram_range: tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
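For instance, here is a quick sketch of how ngram_range changes the vocabulary on a single made-up sentence.

```python
# Comparing unigram-only features with unigram + bigram features.
from sklearn.feature_extraction.text import CountVectorizer

doc = ["the data scientist trained the model"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(doc)
print(sorted(unigrams.vocabulary_))    # single words only

uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(doc)
print(sorted(uni_and_bi.vocabulary_))  # now also includes pairs such as 'data scientist'
```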
We can start to see how supplementing our features with n-grams can lead to more feature columns. When we produce n-grams of length n from a document containing m words, we add at most m - n + 1 additional features. That said, be careful — when we compute n-grams from an entire corpus, the number of unique n-grams could be vastly higher than the number of unique unigrams! This could cause an undesired feature explosion.
Although we sometimes add important new features that have meaning such as data scientist
, many of the new features will just be noise. So, particularly if we do not have much data, adding n-grams can actually decrease model performance. This is because if each n-gram is only present once or twice in the training set, we are effectively adding mostly noisy features to the mix.
stop_words: string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
If None, no stop words will be used.
max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms. (If max_df = 0.7, then any word that appears in more than 70% of documents will not be included in the feature set!)
Stop-Word Removal
What: This process is used to remove common words that will likely appear in any text.
Why: Because common words exist in most documents, they likely only add noise to your model and should be removed.
What are stop words? Stop words are some of the most common words in a language. They are used so that a sentence makes sense grammatically, such as prepositions and determiners, e.g., "to," "the," "and." However, they are so commonly used that they are generally worthless for predicting the class of a document. Since "a" appears in spam and non-spam emails, for example, it would only contribute noise to our model.
Example:
Original sentence: "The dog jumped over the fence"
After stop-word removal: "dog jumped over fence"
The fact that there is a fence and a dog jumped over it can be derived with or without stop words.
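Here is a minimal sketch of both stop-word strategies (the built-in English list and max_df); the two toy documents are made up.

```python
# Removing stop words with the built-in list vs. detecting them with max_df.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The dog jumped over the fence",
    "The cat sat on the mat",
]

# Built-in English stop-word list: words like 'the', 'over', and 'on' are dropped.
vect = CountVectorizer(stop_words="english")
print(sorted(vect.fit(docs).vocabulary_))

# max_df=0.7: drop any term that appears in more than 70% of documents,
# a corpus-specific way to detect stop words.
vect = CountVectorizer(max_df=0.7)
print(sorted(vect.fit(docs).vocabulary_))
```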
max_features: int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus. This allows us to keep the more common words and n-grams and remove ones that may appear only once. If we include words that occur only once, those features can become highly associated with a single class and cause overfitting.
Just like with all other models, more features does not mean a better model. So, we must tune our feature generator to remove features whose predictive capability is none or very low.
In this case, there is roughly a 1.6% increase in accuracy when we double the n-gram size and increase our max features by 1,000-fold. Note that if we restrict it to only unigrams, then the accuracy increases even more! So, bigrams were very likely adding more noise than signal.
In the end, by only using 16,000 unigram features we came away with a much smaller, simpler, and easier-to-think-about model which also resulted in higher accuracy.
min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, absolute counts.
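A short sketch of both of these parameters on made-up documents:

```python
# Limiting the vocabulary with max_features and min_df.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great food great service",
    "great food, terrible service",
    "the service was terrible",
]

# Keep only the 3 most frequent terms across the whole corpus.
top_terms = CountVectorizer(max_features=3)
print(sorted(top_terms.fit(docs).vocabulary_))

# Ignore terms that appear in fewer than 2 documents (rare, likely noisy terms).
frequent_terms = CountVectorizer(min_df=2)
print(sorted(frequent_terms.fit(docs).vocabulary_))
```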
Introduction to TextBlob
You should already have installed TextBlob, a Python library used to explore common NLP tasks. If you haven't, please return to the installation step above for instructions. We'll be using it to organize our corpora for analysis.
As mentioned earlier, you can read more on the TextBlob website.
Stemming and Lemmatization
Stemming is a crude process of removing common endings from words, such as "s", "es", "ly", "ing", and "ed".
What: Reduce a word to its base/stem/root form.
Why: This intelligently reduces the number of features by grouping together (hopefully) related words.
Notes:
Stemming uses a simple and fast rule-based approach.
Stemmed words are usually not shown to users (used for analysis/indexing).
Some search engines treat words with the same stem as synonyms.
Some examples you can see are "excellent" stemmed to "excel" and "amazing" stemmed to "amaz".
Lemmatization is a more refined process that uses specific language and grammar rules to derive the root of a word.
This is useful for words that do not share an obvious root such as "better" and "best".
What: Lemmatization derives the canonical form ("lemma") of a word.
Why: It can be better than stemming.
Notes: Uses a dictionary-based approach (slower than stemming).
Some examples you can see are "filled" lemmatized to "fill" and "was" lemmatized to "wa".
Some examples you can see are "was" lemmatized to "be" and "arrived" lemmatized to "arrive".
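A minimal sketch of both techniques, assuming the TextBlob corpora (which include WordNet) were downloaded during installation; PorterStemmer comes from NLTK, which is installed alongside TextBlob.

```python
# Stemming (crude, rule-based) vs. lemmatization (dictionary- and grammar-based).
from nltk.stem import PorterStemmer
from textblob import Word

stemmer = PorterStemmer()
print(stemmer.stem("swimming"))       # 'swim'  -- suffix stripped
print(stemmer.stem("amazing"))        # 'amaz'  -- stems are not always real words

print(Word("was").lemmatize("v"))     # 'be'    -- lemmatized as a verb
print(Word("better").lemmatize("a"))  # 'good'  -- lemmatized as an adjective
```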
More Lemmatization and Stemming Examples
Lemmatization | Stemming |
---|---|
shouted → shout | badly → bad |
best → good | computing → comput |
better → good | computed → comput |
good → good | wipes → wip |
wiping → wipe | wiped → wip |
hidden → hide | wiping → wip |
Activity: Knowledge Check
What other words or phrases might cause problems with stemming? Why?
What other words or phrases might cause problems with lemmatization? Why?
With all the available options for CountVectorizer(), you may wonder how to decide which to use! It's true that you can sometimes reason about which preprocessing techniques might work best. However, you will often not know for sure without trying out many different combinations and comparing their accuracies.
Keep in mind that you should constantly be thinking about the result of each preprocessing step instead of blindly applying them. Does each type of preprocessing "make sense" with the input data you are using? Is it likely to keep the signal intact and remove noise?
Term Frequency–Inverse Document Frequency (TF–IDF)
While a Count Vectorizer simply totals up the number of times a "word" appears in a document, the more complex TF-IDF Vectorizer analyzes the uniqueness of words between documents to find distinguishing characteristics.
What: Term frequency–inverse document frequency (TF–IDF) computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents.
Why: It's more useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents).
Notes: It's used for search-engine scoring, text summarization, and document clustering.
The higher the TF–IDF value, the more "important" the word is to that specific document. Here, "cab" is the most important and unique word in document 1, while "please" is the most important and unique word in document 2. TF–IDF is often used for training as a replacement for word count.
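A minimal sketch of TF-IDF in scikit-learn, using two made-up documents rather than the Yelp data:

```python
# TF-IDF weighting: shared words get low weights, distinctive words get high weights.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the pizza was great",
    "the pizza arrived late",
]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(docs)

# Words unique to one document ('great', 'late') score higher than
# words shared by both documents ('the', 'pizza').
print(pd.DataFrame(scores.toarray(), columns=sorted(tfidf.vocabulary_)))
```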
More details: TF–IDF is about what matters
Using TF–IDF to Summarize a Yelp Review
Reddit's autotldr uses the SMMRY algorithm, which is based on TF–IDF.
Sentiment Analysis
Understanding how positive or negative a review is. There are many ways in practice to compute a sentiment value. For example:
Have a list of "positive" words and a list of "negative" words and count how many occur in a document.
Train a classifier given many examples of "positive" documents and "negative" documents.
Note that training a classifier is often just an automated way to derive the first approach (e.g., using bag-of-words with logistic regression, a coefficient is assigned to each word!).
For the most accurate sentiment analysis, you will want to train a custom sentiment model based on documents that are particular to your application. Generic models (such as the one we are about to use!) often do not work as well as hoped.
As we will do below, always make sure you double-check that the algorithm is working by manually verifying that scores correctly correspond to positive/negative reviews! Otherwise, you may be using numbers that are not accurate.
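Here is a minimal sketch of TextBlob's generic, pattern-based sentiment scorer on two made-up reviews; it returns both a polarity and a subjectivity score.

```python
# TextBlob's built-in sentiment analyzer (a generic, pre-trained model).
from textblob import TextBlob

print(TextBlob("The food was absolutely wonderful!").sentiment)
print(TextBlob("Terrible service, I will never come back.").sentiment)

# Each result is a namedtuple: polarity ranges from -1.0 (negative) to 1.0 (positive),
# and subjectivity ranges from 0.0 (objective) to 1.0 (subjective).
```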
Here, we will add additional features to our CountVectorizer()-generated feature set to hopefully improve our model.
To make the best models, you will want to supplement the auto-generated features with new features you think might be important. After all, CountVectorizer() typically lowercases text and removes all associations between words. You may also have metadata to add in addition to the text itself.
Remember: Although you may have hundreds of thousands of features, each data point is extremely sparse. So, if you add in a new feature, e.g., one that detects if the text is all capital letters, this new feature can still have a huge effect on the model outcome!
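A minimal sketch of appending one hand-made feature (an "all caps" flag, chosen purely for illustration) to a sparse CountVectorizer matrix:

```python
# Combining auto-generated bag-of-words features with a custom feature column.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = ["GREAT FOOD", "the food was fine", "WORST PLACE EVER"]

vect = CountVectorizer()
X_counts = vect.fit_transform(docs)          # sparse bag-of-words features

# Hypothetical extra feature: 1.0 if the document is written entirely in capitals.
all_caps = np.array([[1.0 if d.isupper() else 0.0] for d in docs])

# hstack keeps everything sparse while appending the new column.
X_combined = hstack([X_counts, csr_matrix(all_caps)])
print(X_combined.shape)                      # (3, number_of_vocabulary_words + 1)
```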
Appendix: Intro to Naive Bayes and Text Classification
Later in the course, we will explore in-depth how to use the Naive Bayes classifier with text. Naive Bayes is a very popular classifier because it has minimal storage requirements, is fast, can be tuned easily with more data, and has found very useful applications in text classification. For example, Paul Graham originally proposed using Naive Bayes to detect spam in his Plan for Spam.
Earlier we experimented with text classification using a Naive Bayes model. What exactly are Naive Bayes classifiers?
What is Bayes? Bayes, or Bayes' Theorem, is a different way to assess probability. It considers prior information in order to more accurately assess the situation.
Example: You are playing roulette.
As you approach the table, you see that the last number the ball landed on was Red-3. With a frequentist mindset, you know that the ball is just as likely to land on Red-3 again given that every slot on the wheel has an equal opportunity of 1 in 37.
Given that you started believing that the ball can land in each slot with an equal likelihood and that you have only seen one throw previously, you rationally believe that there would be no difference between picking Red a second time now or picking Black -- ideally they would happen with the same likelihood!
However, as you sit and watch the roulette table, you begin to notice something strange. The ball is always landing on red. Every single time the ball is thrown, it lands in a red slot. Even though your past beliefs stated that red and black were equally likely, every time it lands in red, you change those beliefs a little more towards a biased roulette table.
This is what Bayes is all about — adjusting probabilities as more data is gathered!
Below is the equation for Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

$P(A \mid B)$: Probability of Event A occurring given Event B has occurred.
$P(B \mid A)$: Probability of Event B occurring given Event A has occurred.
$P(A)$: Probability of Event A occurring.
$P(B)$: Probability of Event B occurring.
Applying Naive Bayes Classification to Spam Filtering
Let's pretend we have an email with three words: "Send money now." We'll use Naive Bayes to classify it as ham or spam. ("Ham" just means not spam. It can include emails that look like spam but that you opt into!)
By assuming that the features (the words) are conditionally independent, we can simplify the likelihood function:

$$P(\text{spam} \mid \text{send money now}) = \frac{P(\text{send} \mid \text{spam})\,P(\text{money} \mid \text{spam})\,P(\text{now} \mid \text{spam})\,P(\text{spam})}{P(\text{send money now})}$$
Note that each conditional probability in the numerator is easily calculated directly from the training data!
So, we can calculate all of the values in the numerator by examining a corpus of spam email. For example:

$$P(\text{send} \mid \text{spam}) = \frac{\text{number of spam emails containing "send"}}{\text{total number of spam emails}}$$

We would repeat this process with a corpus of ham email.
All we care about is whether spam or ham has the higher probability, and so we predict that the email is spam.
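To make the comparison concrete, here is a toy version of that calculation in Python; all of the probabilities below are made-up illustrative numbers, not estimates from a real corpus.

```python
# A worked toy example of the Naive Bayes comparison above (hypothetical numbers).
p_spam = 0.3                             # prior: P(spam)
p_words_given_spam = 0.2 * 0.4 * 0.1     # P(send|spam) * P(money|spam) * P(now|spam)
spam_score = p_words_given_spam * p_spam # numerator for the spam class

p_ham = 0.7                              # prior: P(ham)
p_words_given_ham = 0.05 * 0.01 * 0.1    # P(send|ham) * P(money|ham) * P(now|ham)
ham_score = p_words_given_ham * p_ham    # numerator for the ham class

# The shared denominator P("send money now") can be ignored:
# we only need to know which unnormalized score is larger.
print("spam" if spam_score > ham_score else "ham")   # -> 'spam'
```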
Key Takeaways
The "naive" assumption of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.
The normalization constant (the denominator) can be ignored since it's the same for all classes.
The prior probability is much less relevant once you have a lot of features.
Comparing Naive Bayes With Other Models
Advantages of Naive Bayes:
Model training and prediction are very fast.
It's somewhat interpretable.
No tuning is required.
Features don't need scaling.
It's insensitive to irrelevant features (with enough observations).
It performs better than logistic regression when the training set is very small.
Disadvantages of Naive Bayes:
If "spam" is dependent on non-independent combinations of individual words, it may not work well.
Predicted probabilities are not well calibrated.
Correlated features can be problematic (due to the independence assumption).
It can't handle negative features (with Multinomial Naive Bayes).
It has a higher "asymptotic error" than logistic regression.
Conclusion
NLP is a gigantic field.
Understanding the basics broadens the types of data you can work with.
Simple techniques go a long way.
Use scikit-learn for NLP whenever possible.
While we used scikit-learn and TextBlob today, another popular Python NLP library is spaCy.