Chi-Square Feature Selection
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested. The benefits of performing feature selection before modeling your data are:
Avoid Overfitting: Less redundant data gives a performance boost to the model and leaves less opportunity to make decisions based on noise
Reduces Training Time: Less data means that algorithms train faster
One common feature selection method used with text data is Chi-Square feature selection. The $\chi^2$ test is used in statistics to test the independence of two events. More specifically, in feature selection we use it to test whether the occurrence of a specific term and the occurrence of a specific class are independent. More formally, given a document $D$, we estimate the following quantity for each term $t$ and class $c$ and rank the terms by their score:

$$\chi^2(D, t, c) = \sum_{e_t \in \{0, 1\}} \sum_{e_c \in \{0, 1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

Where

$N_{e_t e_c}$ is the observed frequency in $D$ and $E_{e_t e_c}$ the expected frequency
$e_t$ takes the value 1 if the document contains term $t$ and 0 otherwise
$e_c$ takes the value 1 if the document is in class $c$ and 0 otherwise
For each feature (term), a high $\chi^2$ score indicates that the null hypothesis of independence (meaning the document's class has no influence over the term's frequency) should be rejected, i.e. the occurrence of the term and the class are dependent. In that case, we should select the feature for text classification.
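As a quick illustration with made-up numbers: suppose a collection of 100 documents, 30 of which belong to class $c$, and a term $t$ that occurs in 20 documents, 15 of them in class $c$. The observed counts are then $N_{11} = 15$, $N_{10} = 5$, $N_{01} = 15$, $N_{00} = 65$, and the expected counts under independence are $E_{11} = \frac{20 \cdot 30}{100} = 6$, $E_{10} = 14$, $E_{01} = 24$, $E_{00} = 56$, giving

$$\chi^2 = \frac{(15 - 6)^2}{6} + \frac{(5 - 14)^2}{14} + \frac{(15 - 24)^2}{24} + \frac{(65 - 56)^2}{56} \approx 24.1$$

a large value, so for this term we would reject independence.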
Implementation
We first compute the observed count for each class. This is done by building a contingency table from the input X (feature values) and y (class labels). Each entry observed[i][j] corresponds to some class i and some feature j, and holds the sum of the j-th feature's values across all samples belonging to class i.
Note that although the feature values here are represented as frequencies, this method also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
E.g. the second row of the observed array holds the total count of each term across the documents that belong to class 1. Then we compute the expected frequency of each term for each class, i.e. the count we would see if the term and the class were independent, given the class proportions and the total term counts.
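A minimal NumPy sketch of these two steps (the document-term counts X and labels y below are made up purely for illustration):

```python
import numpy as np

# Toy document-term matrix (4 documents x 3 terms) and class labels;
# the values are made up purely for illustration.
X = np.array([[1, 0, 3],
              [0, 1, 2],
              [2, 0, 0],
              [0, 3, 1]])
y = np.array([0, 1, 0, 1])

# One-hot encode the labels: Y[i, c] = 1 if document i belongs to class c.
classes = np.unique(y)
Y = (y[:, None] == classes).astype(float)      # shape (n_samples, n_classes)

# Observed counts: row c holds each term's total count over class c's documents.
observed = Y.T @ X                             # shape (n_classes, n_features)

# Expected counts under independence: class proportion times total term count.
class_prob = Y.mean(axis=0, keepdims=True)     # shape (1, n_classes)
feature_count = X.sum(axis=0, keepdims=True)   # shape (1, n_features)
expected = class_prob.T @ feature_count        # shape (n_classes, n_features)

# Chi-square score per term: sum over classes of (observed - expected)^2 / expected.
chi2_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
print(observed)
print(expected)
print(chi2_scores)
```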
We can confirm our result with the scikit-learn library using the chi2 function. The following code chunk computes the chi-square value for each feature. The function returns a tuple: the first element holds the chi-square scores (higher is better), and the second holds the p-values (lower is better).
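A sketch of that confirmation, with the same toy X and y redefined here so the cell runs on its own:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Same toy document-term counts and labels as above (made up for illustration).
X = np.array([[1, 0, 3],
              [0, 1, 2],
              [2, 0, 0],
              [0, 3, 1]])
y = np.array([0, 1, 0, 1])

# chi2 returns a tuple of arrays, one entry per feature:
# chi-square scores (higher is better) and p-values (lower is better).
chi2_scores, p_values = chi2(X, y)
print(chi2_scores)
print(p_values)
```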
Scikit-learn also provides a SelectKBest class that can be used with a suite of different statistical tests. It will rank the features using the statistical test we specify and select the top k performing ones (meaning these terms are considered more relevant to the task at hand than the others), where k is a number we can tweak.
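A minimal sketch of this, again using the toy X and y from above:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Same toy document-term counts and labels as above (made up for illustration).
X = np.array([[1, 0, 3],
              [0, 1, 2],
              [2, 0, 0],
              [0, 3, 1]])
y = np.array([0, 1, 0, 1])

# Keep the k terms with the highest chi-square scores; k is a tunable parameter.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # column indices of the selected terms
print(X_selected.shape)                    # (n_samples, k)
```

In a real text pipeline, X would typically come from a vectorizer such as CountVectorizer or TfidfVectorizer rather than being written out by hand.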
For the Chi-Square feature selection we should expect that, out of the total selected features, a small portion are still independent of the class. In text classification, however, it rarely matters when a few additional terms are included in the final feature set. All is well as long as the feature selection is used to rank features by their usefulness and not to make statements about the statistical dependence or independence of variables.