Path: blob/master/text_classification/basics/basics.ipynb
Table of Contents
- 1 Text Machine Learning with scikit-learn
- 1.1 Part 1: Model building in scikit-learn (refresher)
- 1.2 Part 2: Representing text as numerical data
- 1.3 Part 3: Reading a text-based dataset into pandas and vectorizing
- 1.4 Part 4: Building and evaluating a model
- 1.5 Part 5: Building and evaluating another model
- 1.6 Part 6: Examining a model for further insight
- 1.7 Part 7: Tuning the vectorizer
- 1.8 Putting it all together
- 2 Reference
Text Machine Learning with scikit-learn
Part 1: Model building in scikit-learn (refresher)
If you're already familiar with model-building in different packages, here's a quick refresher on how to train a simple classification model with scikit-learn.
"Features" are also known as predictors, inputs, or attributes. The "response" is also known as the target, label, or output.
"Observations" are also known as samples, instances, or records.
In order to build a model, the features must be numeric, and every observation must have the same features in the same order.
In order to make a prediction, the new observation must have the same features in the same order as the training observations, both in number and meaning.
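As a quick illustration, here is a minimal sketch of that workflow, using the built-in iris dataset and a K-nearest neighbors classifier purely as placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# load a built-in dataset: X holds the numeric features, y the response
X, y = load_iris(return_X_y=True)

# import the class, instantiate the model, fit it with the data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# predict the class of a new observation (same 4 features, same order)
print(knn.predict([[3, 5, 4, 2]]))
```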
Part 2: Representing text as numerical data
From the scikit-learn documentation:
Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.
Thus, when working with text, we will use CountVectorizer to "convert text into a matrix of token counts":
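A minimal sketch with a made-up three-document corpus is shown below; note that get_feature_names_out assumes a recent scikit-learn release (older versions expose get_feature_names instead):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# a toy corpus of three "documents"
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

# learn the vocabulary and build the document-term matrix in one step
vect = CountVectorizer()
simple_train_dtm = vect.fit_transform(simple_train)

# examine the learned vocabulary and the token counts per document
print(vect.get_feature_names_out())
print(pd.DataFrame(simple_train_dtm.toarray(),
                   columns=vect.get_feature_names_out()))
```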
Notice that, with the default parameters:
- Single characters such as a are removed.
- Punctuation is removed.
- Words are converted to lowercase.
- The vocabulary contains no duplicates.
From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
Each individual token occurrence frequency (normalized or not) is treated as a feature.
The vector of all the token frequencies for a given document is considered a multivariate sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
Side Note On Sparse Matrices
From the scikit-learn documentation:
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
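As a quick check, the sketch below (reusing the same kind of toy corpus) confirms that fit_transform returns a scipy sparse matrix rather than a dense array:

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
dtm = CountVectorizer().fit_transform(simple_train)

# fit_transform returns a compressed sparse row (CSR) matrix:
# only the non-zero counts and their (row, column) coordinates are stored
print(type(dtm))
print(dtm)

# convert to a dense numpy array only for small matrices / inspection
print(dtm.toarray())
```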
In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning.
Summary:
- vect.fit(train) learns the vocabulary of the training data.
- vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data. vect.fit_transform(train) combines the two steps into one.
- vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data. Note that it ignores tokens it hasn't seen before; this is reasonable because those words do not exist in the training data, so the model knows nothing about the relationship between them and the output.
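A small sketch of this fit/transform split, with a made-up training and testing corpus, is shown below; the unseen token in the test message simply drops out of the document-term matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
simple_test = ["please don't call me"]

vect = CountVectorizer()

# learn the vocabulary from the training data, then build its document-term matrix
vect.fit(simple_train)
train_dtm = vect.transform(simple_train)   # same as vect.fit_transform(simple_train)

# transform the testing data using the *fitted* vocabulary:
# the token "don" was never seen during fit, so it is simply ignored
test_dtm = vect.transform(simple_test)
print(vect.get_feature_names_out())
print(test_dtm.toarray())
```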
Part 3: Reading a text-based dataset into pandas and vectorizing
This dataset consists of text messages labeled as spam or ham (non-spam); our goal is to see whether we can correctly predict the label from the message text.
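A sketch of this step is shown below; the file name sms.tsv and the column names are assumptions, so adjust them to match your copy of the dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# read the tab-separated file into pandas (path and column names are assumptions)
sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])

# convert the label to a numeric response: ham -> 0, spam -> 1
sms['label_num'] = sms['label'].map({'ham': 0, 'spam': 1})

# define X (the raw text messages) and y (the labels), then split into train/test
X = sms['message']
y = sms['label_num']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# learn the vocabulary on the training text only, then transform both splits
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
```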
Part 4: Building and evaluating a model
Here, the algorithms are treated as black boxes; their inner workings are explained in other notebooks.
We will use multinomial Naive Bayes. Naive Bayes algorithms are extremely fast and are usually the go-to method for classification on text data:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
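A sketch of fitting and evaluating the model is shown below, assuming X_train_dtm, X_test_dtm, y_train and y_test from the vectorization step in Part 3:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# train the model on the training document-term matrix
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# make class predictions for the test set and evaluate them
y_pred_class = nb.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.confusion_matrix(y_test, y_pred_class))

# predicted probabilities of the positive class, for computing AUC
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
print(metrics.roc_auc_score(y_test, y_pred_prob))
```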
Part 5: Building and evaluating another model
We will compare multinomial Naive Bayes with logistic regression:
Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
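A comparable sketch with logistic regression, reusing the same document-term matrices (the solver choice is only an illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# fit logistic regression on the same training document-term matrix
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train_dtm, y_train)

# class predictions and predicted probabilities for the test set
y_pred_class = logreg.predict(X_test_dtm)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]

print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.roc_auc_score(y_test, y_pred_prob))
```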
Part 6: Examining a model for further insight
After building the model, it's good practice to look at examples where your model got it wrong and to think about what features you could add that might improve performance.
After looking at these incorrectly classified messages, you might wonder whether a message's length has something to do with it being spam or ham.
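A sketch of both checks, assuming X_test, y_test, y_pred_class and the sms DataFrame from the earlier steps:

```python
# false positives: ham messages incorrectly classified as spam
print(X_test[y_pred_class > y_test].head())

# false negatives: spam messages incorrectly classified as ham
print(X_test[y_pred_class < y_test].head())

# one candidate feature: does message length differ between ham and spam?
sms['length'] = sms['message'].str.len()
print(sms.groupby('label')['length'].describe())
```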
Next, we will examine our trained Naive Bayes model to calculate the approximate "spamminess" of each token and see which words appear more often in spam messages.
Before we can calculate the relative "spamminess" of each token, we need to avoid dividing by zero and account for the class imbalance.
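One way to do this (a sketch, not the only approach) is to add one to every token count and divide by the number of observations in each class, using the fitted model's feature_count_ and class_count_ attributes:

```python
import pandas as pd

# nb is the fitted MultinomialNB model and vect the fitted CountVectorizer

# number of times each token appears in each class (row 0: ham, row 1: spam)
tokens = vect.get_feature_names_out()
token_counts = pd.DataFrame({'token': tokens,
                             'ham': nb.feature_count_[0, :],
                             'spam': nb.feature_count_[1, :]}).set_index('token')

# add 1 to avoid dividing by zero, then convert counts to per-class frequencies
# (dividing by the number of observations in each class handles the imbalance)
token_counts['ham'] = (token_counts['ham'] + 1) / nb.class_count_[0]
token_counts['spam'] = (token_counts['spam'] + 1) / nb.class_count_[1]

# ratio of spam frequency to ham frequency: the approximate "spamminess"
token_counts['spam_ratio'] = token_counts['spam'] / token_counts['ham']
print(token_counts.sort_values('spam_ratio', ascending=False).head())
```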
For example, from the looks of it, the word claim appears far more often in spam messages than in ham messages.
Part 7: Tuning the vectorizer
Thus far, we have been using the default parameters of CountVectorizer:
However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:
stop_words: string {'english'}, list, or None (default)
If 'english', a built-in stop word list for English is used.
If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
If None, no stop words will be used.
Removing common and uncommon words is extremely useful in text analytics. The rationale is that common words such as 'the', 'a', or 'and' appear so frequently in English that they tell us almost nothing about how similar or dissimilar two documents are; in other words, they carry little semantic weight. On the other hand, some words appear only once or twice in the entire corpus, and we simply don't have enough data about them to learn anything meaningful.
ngram_range: tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different n-grams to be extracted.
All values of n such that min_n <= n <= max_n will be used.
max_df: float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
If float, the parameter represents a proportion of documents.
If integer, the parameter represents an absolute count.
min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
If float, the parameter represents a proportion of documents.
If integer, the parameter represents an absolute count.
Guidelines for tuning CountVectorizer:
Use your knowledge of the problem and the text, and your understanding of the tuning parameters, to help you decide what parameters to tune and how to tune them.
Experiment, and let the data tell you the best approach!
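For example, a tuned vectorizer might look like the sketch below; the specific values are illustrative rather than recommendations, and X_train/X_test come from Part 3:

```python
from sklearn.feature_extraction.text import CountVectorizer

# an example of a tuned vectorizer (the specific values are illustrative):
# - remove English stop words
# - include unigrams and bigrams
# - ignore terms that appear in more than 50% of the documents
# - ignore terms that appear in fewer than 2 documents
vect = CountVectorizer(stop_words='english',
                       ngram_range=(1, 2),
                       max_df=0.5,
                       min_df=2)

X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
print(X_train_dtm.shape)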
Putting it all together
In the cell below, we put the basic text classification workflow into a single cell for convenient future reference.
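As before, the file path and column names in this sketch are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# read the labeled SMS data (path and column names are assumptions)
sms = pd.read_table('sms.tsv', header=None, names=['label', 'message'])
sms['label_num'] = sms['label'].map({'ham': 0, 'spam': 1})

# split the raw text and labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    sms['message'], sms['label_num'], random_state=1)

# vectorize the text, train the model, and evaluate on the test set
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))
```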
In the example above, we performed the bag-of-words transformation and fit the model as separate steps, i.e. calling CountVectorizer().fit_transform and then model.fit. As you can imagine, this becomes tedious when stringing together multiple preprocessing steps. To streamline this kind of preprocessing and modeling workflow, scikit-learn provides a Pipeline object. To learn more about this topic, scikit-learn's documentation, Pipeline and FeatureUnion: combining estimators, has a nice introduction to the subject.
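A minimal sketch of such a pipeline, cross-validated on the raw text (reusing the sms DataFrame from the cell above):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# chain the vectorizer and classifier into a single estimator; the pipeline
# calls fit_transform/transform on the raw text for us at the right times
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

# the pipeline can now be cross-validated on the raw text directly
scores = cross_val_score(pipeline, sms['message'], sms['label_num'],
                         cv=10, scoring='accuracy')
print(scores.mean())
```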