Path: blob/master/text_classification/naive_bayes/naive_bayes.ipynb
Naive Bayes
Naive Bayes classifiers are based on Bayes' theorem, and the adjective "naive" comes from the assumption that the features in a dataset are mutually independent. In practice, the independence assumption is often violated, but Naive Bayes still tends to perform very well in the field of text/document classification. Common applications include spam filtering (categorizing a text message as spam or not spam) and sentiment analysis (categorizing a text message as a positive or negative review). More importantly, the simplicity of the method means that it takes orders of magnitude less time to train when compared to more complex models like support vector machines.
Text/Document Representations
Text classifiers often don't use any kind of deep representation of language: oftentimes a document is represented as a bag of words. (A bag is like a set that allows repeating elements.) This is an extremely simple representation: it throws away the word order and only keeps which words are included in the document and how many times each word occurs.
We shall look at two probabilistic models of documents, both of which represent documents as a bag of words, using the Naive Bayes assumption. Both models represent documents using feature vectors whose components correspond to word types. If we have a vocabulary $V$ containing $|V|$ distinct words, then the dimension of the feature vector is $d = |V|$.
Bernoulli document model: a document is represented by a feature vector with binary elements taking value 1 if the corresponding word is present in the document and 0 if the word is not present.
Multinomial document model: a document is represented by a feature vector with integer elements whose value is the frequency of that word in the document.
Example: Consider the vocabulary $V$ = {blue, red, dog, cat, biscuit, apple}. In this case $|V| = d = 6$. Now consider the (short) document "the blue dog ate a blue biscuit". If $\mathbf{b}$ is the Bernoulli feature vector for this document, and $\mathbf{m}$ is the multinomial feature vector, then we would have:

$$\mathbf{b} = (1, 0, 1, 0, 1, 0) \qquad \mathbf{m} = (2, 0, 1, 0, 1, 0)$$
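As a small illustrative sketch (not part of the original notebook), both feature vectors can be built with plain Python:

```python
# illustrative sketch: build the Bernoulli and multinomial feature vectors by hand
vocab = ['blue', 'red', 'dog', 'cat', 'biscuit', 'apple']
document = 'the blue dog ate a blue biscuit'.split()

# multinomial vector: raw count of each vocabulary word in the document
# (words outside the vocabulary, like "the" or "ate", are simply ignored)
multinomial = [document.count(word) for word in vocab]

# Bernoulli vector: 1 if the word appears at all, 0 otherwise
bernoulli = [int(count > 0) for count in multinomial]

print(multinomial)  # [2, 0, 1, 0, 1, 0]
print(bernoulli)    # [1, 0, 1, 0, 1, 0]
```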
Bernoulli Model
Consider a corpus of documents (training data) where the class of each document $D$ is given by $C(D) = C_k$, $k \in \{1, \dots, K\}$. Using Naive Bayes (no matter if it's the Bernoulli model or the multinomial model, which we'll see later), we classify a document $D$ as the class $C_k$ which has the highest posterior probability $P(C_k \mid D)$, which can be re-expressed using Bayes' Theorem:

$$P(C_k \mid D) = \frac{P(C_k) \, P(D \mid C_k)}{P(D)} \propto P(C_k) \, P(D \mid C_k)$$

Where:
$\propto$ means "is proportional to".
$P(C_k)$ is class $k$'s prior probability.
$P(D \mid C_k)$ is the likelihood of the document given class $k$.
$P(D)$ is the normalizing factor, which we don't have to compute since it does not depend on the class $C_k$; i.e. this factor will be the same across all classes, thus the numerator alone is enough to determine which $P(C_k \mid D)$ is the largest.
Starting with the likelihood $P(D \mid C_k)$: the spirit of Naive Bayes is that it assumes each of the features it uses is conditionally independent of the others given some class. More formally, if we wish to calculate the probability of observing features $x_1$ through $x_d$, given some class $C_k$, the assumption lets us write:

$$P(x_1, x_2, \dots, x_d \mid C_k) = \prod_{i=1}^{d} P(x_i \mid C_k)$$
Suppose we have a vocabulary (feature set) $V$ containing $|V|$ words, and the $i$-th dimension of a document's feature vector corresponds to word $w_i$ in the vocabulary. Following the Naive Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words, we can then re-write the document $D$'s likelihood as:

$$P(D \mid C_k) = \prod_{i=1}^{|V|} \left[ b_i \, P(w_i \mid C_k) + (1 - b_i) \left( 1 - P(w_i \mid C_k) \right) \right]$$

Where:
$P(w_i \mid C_k)$ is the probability of word $w_i$ occurring in a document of class $C_k$.
$1 - P(w_i \mid C_k)$ is the probability of $w_i$ not occurring in a document of class $C_k$.
$b_i$ is either 0 or 1, representing the absence or presence of word $w_i$ in the document.
This product goes over all words in the vocabulary. If word $w_i$ is present, then $b_i = 1$ and the associated term is $P(w_i \mid C_k)$; if word $w_i$ is not present, then $b_i = 0$ and the associated term becomes $1 - P(w_i \mid C_k)$.
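To make this concrete, here is a minimal sketch (not from the original notebook) that evaluates the Bernoulli likelihood for the binary vector from the earlier example, using made-up word probabilities for a single class:

```python
import numpy as np

# hypothetical per-word probabilities P(w_i | C_k) for one class (made-up numbers)
word_prob = np.array([0.6, 0.1, 0.4, 0.2, 0.5, 0.3])

# binary (Bernoulli) document vector from the "blue dog" example
b = np.array([1, 0, 1, 0, 1, 0])

# P(D | C_k) = prod_i [ b_i * P(w_i | C_k) + (1 - b_i) * (1 - P(w_i | C_k)) ]
likelihood = np.prod(b * word_prob + (1 - b) * (1 - word_prob))
print(likelihood)  # 0.6 * 0.9 * 0.4 * 0.8 * 0.5 * 0.7
```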
As for the word likelihood $P(w_i \mid C_k)$, we can learn (estimate) these parameters from a training set of documents labelled with class $C_k$:

$$\hat{P}(w_i \mid C_k) = \frac{n_k(w_i)}{N_k}$$

Where:
$n_k(w_i)$ is the number of class $C_k$'s documents in which the word $w_i$ is observed.
$N_k$ is the number of documents that belong to class $C_k$.
Last, calculating the prior $P(C_k)$ is relatively simple: if there are $N$ documents in total in the training set, then the prior probability of class $C_k$ may be estimated as the relative frequency of documents of class $C_k$:

$$\hat{P}(C_k) = \frac{N_k}{N}$$

Where $N$ is the total number of documents in the training set.
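A small sketch of these two estimates, assuming a hypothetical binary document-term matrix `X` and label vector `y` (names and numbers are made up for illustration):

```python
import numpy as np

# hypothetical binary document-term matrix (documents x vocabulary) and class labels
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

k = 1                              # the class whose parameters we're estimating
N = X.shape[0]                     # total number of documents
N_k = np.sum(y == k)               # number of documents belonging to class k
prior_k = N_k / N                  # P(C_k) = N_k / N

# P(w_i | C_k) = (number of class-k documents containing word w_i) / N_k
word_prob_k = X[y == k].sum(axis=0) / N_k

print(prior_k)      # 0.5
print(word_prob_k)  # [1.  0.5 0.5]
```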
Bernoulli Model Implementation
Consider a set of documents, each of which is related either to Class 1 or to Class 0. Given a training set of 11 documents, we would like to train a Naive Bayes classifier, using the Bernoulli document model, to classify unlabelled documents as Class 1 or 0. We define a vocabulary of eight words.
Thus the training data is presented below as an 11 × 8 matrix, in which each row represents an 8-dimensional document vector, and the accompanying label vector represents the class of each document. Then we would like to classify the two test documents.
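The notebook's training matrix, labels, and implementation live in its code cells and aren't reproduced here. As a hedged, standalone sketch of what a from-scratch Bernoulli Naive Bayes fit/predict can look like (the function names are my own):

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """Estimate class priors and per-class word probabilities from a binary matrix.

    X : binary document-term matrix of shape (n_documents, n_words)
    y : class label of each document
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    # P(w_i | C_k): fraction of class-k documents in which word i appears
    word_probs = np.array([X[y == k].mean(axis=0) for k in classes])
    return classes, priors, word_probs

def predict_bernoulli_nb(X_new, classes, priors, word_probs):
    """Assign each new binary document vector to the class with the highest posterior."""
    posteriors = []
    for prior, probs in zip(priors, word_probs):
        # P(C_k) * prod_i [ b_i * P(w_i | C_k) + (1 - b_i) * (1 - P(w_i | C_k)) ]
        likelihood = np.prod(X_new * probs + (1 - X_new) * (1 - probs), axis=1)
        posteriors.append(prior * likelihood)
    return classes[np.argmax(np.column_stack(posteriors), axis=1)]

# usage sketch: classes, priors, word_probs = train_bernoulli_nb(X_train, y_train)
#               predict_bernoulli_nb(X_test, classes, priors, word_probs)
```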
Multinomial Distribution
Before discussing the multinomial document model, it is important to be familiar with the multinomial distribution. The multinomial distribution can be used to compute the probabilities in situations in which there are more than two possible outcomes. For example, suppose that two chess players had played numerous games and it was determined that the probability that Player A would win is 0.40, the probability that Player B would win is 0.35, and the probability that the game would end in a draw is 0.25. The multinomial distribution can be used to answer questions such as: "If these two chess players played 12 games, what is the probability that Player A would win 7 games, Player B would win 2 games, and the remaining 3 games would be drawn?" The following generalized formula gives the probability of obtaining a specific set of outcomes when there are $d$ possible outcomes for each event:

$$P(x_1, x_2, \dots, x_d) = \frac{n!}{x_1! \, x_2! \cdots x_d!} \; p_1^{x_1} p_2^{x_2} \cdots p_d^{x_d}$$

$n$ is the total number of events.
$x_1, \dots, x_d$ are the number of times outcomes 1 to $d$ occurred.
$p_1, \dots, p_d$ are the probabilities of outcomes 1 to $d$ occurring.
Or more compactly written as:

$$P(x_1, x_2, \dots, x_d) = \frac{n!}{\prod_{i=1}^{d} x_i!} \prod_{i=1}^{d} p_i^{x_i}$$
If all of that is still unclear, refer to the following link for a worked example. Youtube: Introduction to the Multinomial Distribution.
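For a quick sanity check of the chess example, one option is to evaluate the formula directly and compare it against `scipy.stats.multinomial` (the use of scipy here is an illustrative choice, not part of the original notebook):

```python
from math import factorial
from scipy.stats import multinomial

n = 12                    # total number of games played
x = [7, 2, 3]             # wins for Player A, wins for Player B, draws
p = [0.40, 0.35, 0.25]    # probability of each outcome

# direct application of the formula: n! / (x1! x2! x3!) * p1^x1 * p2^x2 * p3^x3
coef = factorial(n) / (factorial(x[0]) * factorial(x[1]) * factorial(x[2]))
prob_by_hand = coef * p[0] ** x[0] * p[1] ** x[1] * p[2] ** x[2]

# the same computation through scipy's multinomial distribution
prob_scipy = multinomial.pmf(x, n=n, p=p)

print(prob_by_hand, prob_scipy)  # both roughly 0.0248
```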
Multinomial Model
Recall that for Naive Bayes, $P(C_k \mid D) \propto P(C_k) \, P(D \mid C_k)$ is the objective function that we're trying to solve for. In the multinomial case, the likelihood $P(D \mid C_k)$ of a document $D$ becomes a multinomial distribution over its word counts:

$$P(D \mid C_k) = \frac{n!}{\prod_{i=1}^{|V|} x_i!} \prod_{i=1}^{|V|} P(w_i \mid C_k)^{x_i} \propto \prod_{i=1}^{|V|} P(w_i \mid C_k)^{x_i}$$

Where:
$x_i$ is the count of the number of times word $w_i$ occurs in document $D$.
$n = \sum_i x_i$ is the total number of words in document $D$.
Oftentimes, we don't need the normalization term $\frac{n!}{\prod_i x_i!}$, because it does not depend on the class $C_k$.
$P(w_i \mid C_k)$ is the probability of word $w_i$ occurring in a document of class $C_k$. This time it is estimated using the word frequency information from the documents' feature vectors. More specifically, this is: $\hat{P}(w_i \mid C_k) = \frac{\sum_{D \in C_k} x_i^{(D)}}{\sum_{j=1}^{|V|} \sum_{D \in C_k} x_j^{(D)}}$, i.e. the count of word $w_i$ across all class $C_k$ documents divided by the total word count in class $C_k$.
$P(D \mid C_k)$ can thus be interpreted as the product of the word likelihoods for each word in the document.
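As a concrete sketch of this frequency-based estimate (the count matrix `X` and labels `y` are made-up names and numbers for illustration):

```python
import numpy as np

# hypothetical count (multinomial) document-term matrix and class labels
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]])
y = np.array([1, 1, 0])

k = 1
class_counts = X[y == k].sum(axis=0)             # total count of each word within class k
word_prob_k = class_counts / class_counts.sum()  # P(w_i | C_k), no smoothing yet

print(class_counts)  # [3 1 1]
print(word_prob_k)   # [0.6 0.2 0.2]
```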
Laplace Smoothing
One drawback of the equation for the multinomial model is that the likelihood involves taking a product of probabilities $P(w_i \mid C_k)$. Hence if any one of the terms of the product is zero, then the whole product becomes zero. This means that the probability of the document belonging to the class in question is zero (impossible). Intuitively, just because a word does not occur in a document class in the training data does not mean that it cannot occur in any document of that class.
Therefore, one way to alleviate the problem is Laplace Smoothing or add-one smoothing, where we add a count of one to each word type, and the denominator is increased by $|V|$, the size of the vocabulary (the number of features), to ensure that the probabilities are still normalized. More formally, our $\hat{P}(w_i \mid C_k)$ becomes:

$$\hat{P}(w_i \mid C_k) = \frac{\left( \sum_{D \in C_k} x_i^{(D)} \right) + 1}{\left( \sum_{j=1}^{|V|} \sum_{D \in C_k} x_j^{(D)} \right) + |V|}$$

In sum, by performing Laplace Smoothing, we ensure that our $\hat{P}(w_i \mid C_k)$ will never equal 0.
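A tiny sketch of the effect of add-one smoothing on hypothetical class-level word counts:

```python
import numpy as np

# word counts aggregated over all documents of one class (hypothetical numbers);
# note the zero count for the third word
class_counts = np.array([3, 1, 0, 5])
vocab_size = class_counts.size

unsmoothed = class_counts / class_counts.sum()                     # contains a zero entry
smoothed = (class_counts + 1) / (class_counts.sum() + vocab_size)  # strictly positive

print(unsmoothed)  # [0.333... 0.111... 0.       0.555...]
print(smoothed)    # [0.307... 0.153... 0.076... 0.461...]
```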
Log-Transformation
Our original formula for classifying a document into a class using multinomial Naive Bayes was:

$$\hat{C} = \underset{k}{\operatorname{argmax}} \; P(C_k) \prod_{i=1}^{|V|} P(w_i \mid C_k)^{x_i}$$

In practice, when we have a lot of unique words, we create very small values by computing the product of many terms. On a computer, the values may become so small that they "underflow" (fall below the smallest magnitude the floating point representation can store, and thus get rounded to zero). To prevent this, we can simply throw a logarithm around everything:

$$\hat{C} = \underset{k}{\operatorname{argmax}} \; \log \left( P(C_k) \prod_{i=1}^{|V|} P(w_i \mid C_k)^{x_i} \right)$$

Using the property that $\log(ab) = \log(a) + \log(b)$, the formula above then becomes:

$$\hat{C} = \underset{k}{\operatorname{argmax}} \left[ \log P(C_k) + \sum_{i=1}^{|V|} x_i \log P(w_i \mid C_k) \right]$$
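The log-space version is what one would actually compute. Here is a minimal sketch with made-up parameters, showing that the whole scoring step reduces to a log, a matrix product, and an argmax:

```python
import numpy as np

# hypothetical smoothed parameters for two classes over a four-word vocabulary
log_prior = np.log(np.array([0.5, 0.5]))        # log P(C_k)
word_prob = np.array([[0.4, 0.4, 0.1, 0.1],     # P(w_i | C_0)
                      [0.1, 0.1, 0.4, 0.4]])    # P(w_i | C_1)

x = np.array([3, 0, 1, 1])                      # word counts of the new document

# log P(C_k) + sum_i x_i * log P(w_i | C_k), computed for every class at once
log_posterior = log_prior + np.log(word_prob) @ x
predicted_class = np.argmax(log_posterior)

print(log_posterior)    # class 0 scores higher: the document is dominated by word 0
print(predicted_class)  # 0
```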
Multinomial Model Implementation
Given the four documents and their corresponding class (label), which class does the document with the message "Chinese Chinese Chinese Tokyo Japan" more likely belong to?
The implementation in the following code chunk is a very crude one, while the one two code chunks below is a more efficient and robust implementation that leverages sparse matrices and matrix multiplication.
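Those notebook code cells aren't reproduced here. As a hedged sketch of the same kind of pipeline using scikit-learn's `CountVectorizer` and `MultinomialNB` (the four training documents below follow the classic textbook version of this exercise and are an assumption on my part, not necessarily the notebook's exact data):

```python
# hedged sketch: assumed training data, classic "Chinese / Tokyo Japan" worked example
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ['Chinese Beijing Chinese',
              'Chinese Chinese Shanghai',
              'Chinese Macao',
              'Tokyo Japan Chinese']
train_labels = ['china', 'china', 'china', 'not china']

# bag-of-words counts -> multinomial Naive Bayes with Laplace smoothing (alpha=1)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # sparse document-term matrix
model = MultinomialNB(alpha=1.0)
model.fit(X_train, train_labels)

X_test = vectorizer.transform(['Chinese Chinese Chinese Tokyo Japan'])
print(model.predict(X_test))        # the class the test message is assigned to
print(model.predict_proba(X_test))  # class probabilities (ordering follows model.classes_)
```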
Pros and Cons of Naive Bayes
We'll end this notebook with this algorithm's pros and cons.
Pros:
Extremely fast to train/apply and is reliably a high bias/low variance classifier (less likely to overfit).
Handles extraneous features well, meaning it's robust to irrelevant features.
Famously good at text classification, e.g. spam filtering, and at domains where you have many equally important features, which tends to be a problem for other kinds of classifiers, in particular tree-based algorithms.
No parameter tuning is required.
Cons:
Conditional independence is not always a valid assumption, thus can be outperformed by other methods.
Predicted probabilities are not well-calibrated.