Multicollinearity
Many machine learning models have either some inherent internal ranking of features, or it is not terribly complicated to generate/access the ranking from the structure of the model. This document discusses multicollinearity: how it can affect the feature ranking, and potential methods that can be used to address it. As we'll be looking at the coefficients of a linear regression model for selecting and interpreting features, the next section contains an introduction to this method/algorithm.
Linear Regression
Unlike a lot of other tutorials, we'll introduce linear regression from a maximum likelihood perspective. The principle of maximum likelihood is at the heart of machine learning. It guides us to find the best model in a search space of all models. In simple terms, Maximum Likelihood Estimation (MLE) lets us choose a model (parameters) that explains the data (training set) better than all other models.
Maximum Likelihood - Primer
The process of sampling from a normal distribution is expressed as $y \sim \mathcal{N}(\mu, \sigma^2)$. $y$ is a random variable sampled, or generated, or simulated from the gaussian distribution. As we sample from this distribution, most samples will fall around the center, near the mean, because of the higher probability density in the center.
Let's consider 3 data points, $y_1, y_2, y_3$, which are independent and drawn from a gaussian with unknown mean $\mu$ and constant variance of 1. Suppose we now have two choices for $\mu$: {1, 2.5}. Which one should we choose? Which model would explain the data better? In general, any data point $y_i$ drawn from a gaussian with mean $\mu$ and variance 1 can be written as:

$$y_i = \mu + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$

This can be read as: the mean $\mu$ shifts the center of the standard normal distribution ($\mu = 0$ and $\sigma = 1$). The likelihood of the data $(y_1, y_2, y_3)$ having been drawn from $\mathcal{N}(\mu, 1)$ can be expressed as:

$$L(\mu) = P(y_1, y_2, y_3 \mid \mu) = P(y_1 \mid \mu) \, P(y_2 \mid \mu) \, P(y_3 \mid \mu)$$
as the data points are assumed to be independent of one another.
Now, we have two normal distributions defined by $\mu = 1$ and $\mu = 2.5$. Let us draw both and plot the data points. In the figure below, notice the dotted lines that connect the bell curve to the data points. Consider the point $y_1$ in the first distribution ($\mu = 1$). The length of the dotted line gives the probability of $y_1$ being drawn from $\mathcal{N}(1, 1)$. And the same goes for the second distribution $\mathcal{N}(2.5, 1)$.
Knowing that the likelihood of the data $(y_1, y_2, y_3)$ having been drawn from $\mathcal{N}(\mu, 1)$ is given by:

$$L(\mu) = \prod_{i=1}^{3} P(y_i \mid \mu)$$
The individual probabilities in the equation above are equal to the heights of the corresponding dotted lines in the figure. We see that the likelihood, computed as the product of the individual probabilities of the data points given the model, is essentially the product of the lengths of the dotted lines. In this toy example, the likelihood of one of the models is visibly higher than the other's, so that's the model we'll be going with.
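To make the comparison concrete, here is a minimal sketch of the likelihood computation. The three data point values below are hypothetical, chosen only for illustration, since the exact values used in the figure are not reproduced here.

```python
import numpy as np
from scipy.stats import norm

# hypothetical data points, assumed only for illustration
y = np.array([1.5, 2.0, 2.6])

# likelihood of the data under each candidate mean (variance fixed at 1)
for mu in (1.0, 2.5):
    likelihood = np.prod(norm.pdf(y, loc=mu, scale=1.0))
    print(f'mu = {mu}: likelihood = {likelihood:.4f}')
```

Whichever candidate mean yields the larger product is the maximum likelihood choice between the two.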
Maximum Likelihood - Linear Regression
For linear regression we assume the relationship between our input variable and our output label can be modeled by a linear function.
The model assumes the label for each observation, $y_i$, is gaussian distributed with mean $x_i^T w$ and variance $\sigma^2$, which can be written as:

$$y_i \sim \mathcal{N}(x_i^T w, \sigma^2), \quad \text{i.e.} \quad y_i = x_i^T w + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

The mean $x_i^T w$ represents the best fitted line, with all data points varying around that line, and the term $\epsilon_i$ captures this variation, whose spread is given by the variance $\sigma^2$.
Now, recall that the formula for the gaussian/normal distribution is:

$$\mathcal{N}(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2 \sigma^2} \right)$$
Given that linear regression assumes each point to be gaussian distributed, the process of learning becomes the process of maximizing the product of the individual probabilities:

$$L(w) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid x_i^T w, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(y_i - x_i^T w)^2}{2 \sigma^2} \right)$$
Next, we rewrite the equation in vector form and take its log: the original maximization problem is equivalent to maximizing the log likelihood (log is a monotonic transformation and thus does not affect the learned parameters), and taking the log makes the derivation later easier.

$$\log L(w) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} (y - Xw)^T (y - Xw)$$

Our current interest is to solve for the unknown parameter $w$ (a similar idea can be used to solve for the unknown $\sigma^2$), so we can further simplify the equation above by removing all the terms that are not relevant to $w$. And since in machine learning problems we're often interested in minimizing an objective function, we negate the sign and turn the maximization problem into a minimization one:

$$\min_w \; (y - Xw)^T (y - Xw)$$
When introducing linear regression, an alternative way of viewing it is from a least squares perspective. The objective of least squares is to minimize the squared distance between the prediction and the ground truth, i.e. to minimize the mean squared error $\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i^T w)^2$. We now see that we can arrive at the same objective from two different perspectives. The following section lists out the derivation for solving $w$.
We'll first expand this equation:

$$(y - Xw)^T (y - Xw) = y^T y - 2 w^T X^T y + w^T X^T X w$$
Using the standard rule of minimization in calculus, if we wish to solve for the weight $w$, we take the derivative w.r.t. $w$ and set it to zero:

$$\frac{\partial}{\partial w} \left( y^T y - 2 w^T X^T y + w^T X^T X w \right) = -2 X^T y + 2 X^T X w = 0$$
In the steps above, $y^T y$ vanished as it has no $w$ dependence, and $w^T X^T X w$ became $2 X^T X w$, as it is analogous to differentiating $w^2$. As the final step, we perform some rearrangement of the formula:

$$X^T X w = X^T y \implies w = (X^T X)^{-1} X^T y$$
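To make the closed-form solution concrete, here is a small sketch that solves the normal equation directly and compares the result against scikit-learn (the data-generating process and variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# simulate a small regression problem (assumed setup, for illustration only)
rng = np.random.RandomState(42)
X = rng.randn(100, 3)
true_w = np.array([1.0, 2.0, -0.5])
y = X @ true_w + rng.randn(100) * 0.1

# normal equation: w = (X^T X)^{-1} X^T y (np.linalg.solve is preferred over an explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ y)

# compare against scikit-learn (no intercept, to match the derivation above)
lr = LinearRegression(fit_intercept=False).fit(X, y)
print(w)
print(lr.coef_)
```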
Matrix calculus can feel a bit handwavy at times. If you're not convinced by the derivation above, the following link walks through each individual step in much more detail. Blog: The Normal Equation and matrix calculus
After solving for the coefficients of the regression model, we can use them for selecting and interpreting features. If all features are on the same scale, the most important features should have the highest coefficients in the model, while features uncorrelated with the output variable should have coefficient values close to zero. This approach can work well when the data is not very noisy (or there is a lot of data compared to the number of features) and the features are (relatively) independent:
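The original example's exact data-generating process is not reproduced here; the following sketch assumes three independent, same-scale features where X2 carries the largest true coefficient and the output is corrupted by fairly heavy noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# independent, same-scale features with known coefficients (assumed setup)
rng = np.random.RandomState(0)
size = 1000
X = rng.normal(size=(size, 3))  # columns X0, X1, X2
y = 1.0 * X[:, 0] + 2.0 * X[:, 1] + 5.0 * X[:, 2] + rng.normal(scale=2.0, size=size)

lr = LinearRegression().fit(X, y)
for name, coef in zip(['X0', 'X1', 'X2'], lr.coef_):
    print(f'{name}: {coef:.3f}')  # the estimated coefficients land close to 1, 2 and 5
```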
As we can see in this example, the model indeed recovers the underlying structure of the data very well, despite quite significant noise in the data. Given that the predictors are on the same scale, we can compare the coefficients directly to determine variable importance; here we see that, when using linear regression, X2 is the most important predictor for this dataset. To be explicit, standardized coefficients represent the mean change in the response given a one standard deviation change in the predictor.
R-squared
After fitting our predictive model, we would most likely wish to evaluate its performance, and R-squared is a statistic that is often used to evaluate a regression model's performance. It takes a value ranging from 0 to 1 and is usually interpreted as summarizing the percent of variation in the response that the regression model is capable of explaining. So an R-squared of 0.65 means the model explains about 65% of the variation in our dependent variable. Given this logic, we prefer our regression models to have a high R-squared, since we want the model we've trained to capture as much of the output's variance as possible. One way to compute R-squared is the sum of squared fitted-value deviations divided by the sum of squared original-value deviations:

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$
$y$: original response variable.
$\hat{y}$: predicted value for the response variable.
$\bar{y}$: the average of the response variable (pronounced "y bar").
An alternative form is:

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
RSS: Stands for Residual Sum of Squares, also referred to as the sum of squared errors (the measurement that the linear model tries to minimize). This value captures the variance that is left between our predictions and the actual values of the output.
TSS: Stands for Total Sum of Squares, which measures the total variance in the output variable.
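Both forms can be computed directly. The sketch below fits an ordinary least squares model on assumed synthetic data and checks the two formulas against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# assumed synthetic data that satisfies the simple linear regression assumptions
rng = np.random.RandomState(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 1.5 * x + rng.normal(scale=2.0, size=200)

y_pred = LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
rss = np.sum((y - y_pred) ** 2)         # residual sum of squares
ess = np.sum((y_pred - y.mean()) ** 2)  # explained (fitted-value) sum of squares

print(ess / tss)            # first form: explained variation / total variation
print(1 - rss / tss)        # alternative form: 1 - RSS / TSS
print(r2_score(y, y_pred))  # scikit-learn agrees with both for a least squares fit
```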
Though widely used, this is actually a measurement that requires some context for it to be a valid evaluation metric. We'll give some examples of why:
R-squared can be arbitrarily close to 1 when the model is totally wrong.
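The exact nonlinear data used in the original example is not shown here; the sketch below assumes a simple quadratic trend, which is enough to reproduce the effect of a clearly misspecified linear model scoring a high R-squared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# assumed nonlinear (quadratic) data, for illustration only
rng = np.random.RandomState(2)
x = np.linspace(0, 10, 100)
y = x ** 2 + rng.normal(scale=5.0, size=x.size)

lr = LinearRegression().fit(x.reshape(-1, 1), y)
print(lr.score(x.reshape(-1, 1), y))  # R-squared is high even though a line is the wrong model
```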
When checking the R-squared value for this model, it's very high at about 0.90, but the model is completely wrong, as this data follows a nonlinear trend. Using R-squared to justify the "goodness" of our model in this instance would be a mistake. Hopefully one would plot the data first and recognize that a simple linear regression in this case would be inappropriate.
We're better off using Mean Squared Error (MSE) or another error-based metric as a measure of prediction error, since R-squared can be anywhere between 0 and 1 just by changing the range of X.
Let’s demonstrate this statement by first generating data that meets all simple linear regression assumptions and then regressing y on x to assess both R-squared and MSE.
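The sketch below assumes a specific data-generating process (intercept 2, slope 3, gaussian noise), since the original code is not reproduced here; a small helper is used so the experiment can be repeated with a different range of x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def fit_and_report(x_low, x_high, n=500, seed=3):
    """Generate y = 2 + 3x + N(0, 3^2) over [x_low, x_high], fit OLS, report R^2 and MSE."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(x_low, x_high, size=n)
    y = 2.0 + 3.0 * x + rng.normal(scale=3.0, size=n)
    y_pred = LinearRegression().fit(x.reshape(-1, 1), y).predict(x.reshape(-1, 1))
    print(f'x in [{x_low}, {x_high}]: '
          f'R^2 = {r2_score(y, y_pred):.2f}, MSE = {mean_squared_error(y, y_pred):.2f}')

# wide range of x: R-squared comes out high
fit_and_report(0, 10)
```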
We repeat the above code, but this time with a different range of x, leaving everything else the same:
R-squared falls from around 0.9 to around 0.2, but the MSE remains roughly the same. In other words, the predictive ability is the same for both data sets, but the R-squared would lead you to believe that the first example somehow had a model with more predictive power.
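Reusing the hypothetical fit_and_report helper from the sketch above, only the range of x changes:

```python
# same data-generating process, narrower range of x:
# R-squared drops sharply while the MSE stays roughly the same
fit_and_report(0, 2)
```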
The problem we just tackled was particularly well suited for a linear model: a purely linear relationship between the features and the response variable, and no correlations between features. The issue arises when there are multiple (linearly) correlated features (as is the case with many real-life datasets); the model then becomes unstable, meaning small changes in the data can cause large changes in the model (i.e. the coefficient values), making model interpretation very difficult.
For example, assume we have a dataset where the "true" model for the data is $Y = X_1 + X_2$, while we observe $\hat{Y} = X_1 + X_2 + \epsilon$, with $\epsilon$ being the error term. On top of that, let's say $X_1$ and $X_2$ are linearly correlated such that $X_1 \approx X_2$. Ideally the learned model will be $Y = X_1 + X_2$. But depending on the amount of noise $\epsilon$, the amount of data at hand and the correlation between $X_1$ and $X_2$, it could also end up as $Y = 2X_1$ (i.e. using only $X_1$ as the predictor), or as some combination with shifted coefficients such as $Y = 3X_2 - X_1$ (shifting of the coefficients might happen to give a better fit in the noisy training set), etc.
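The fitted model discussed next is not reproduced verbatim; the sketch below assumes the setup it describes, with three features that are essentially noisy copies of one another, each contributing +1 to the output:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# three strongly correlated features, each with a true coefficient of 1 (assumed setup)
rng = np.random.RandomState(5)
size = 100
z = rng.normal(size=size)
X = np.column_stack([z + rng.normal(scale=0.1, size=size) for _ in range(3)])
y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(scale=1.0, size=size)

lr = LinearRegression().fit(X, y)
print(lr.coef_)        # individual coefficients can stray far from 1, sometimes even negative
print(lr.coef_.sum())  # but they typically sum to roughly 3
```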
The coefficients of our fitted linear model sum up to ~3, so we can expect it to perform well. On the other hand, if we were to interpret the coefficients at face value, then according to the model one of the features has a strong positive impact on the output variable while another has a negative one, when in fact all the features are correlated and should have roughly equal effects on the output variable. This multicollinearity issue also applies to other methods/algorithms and should be addressed before feeding our data to a machine learning method/algorithm.
Variance Inflation Factor
One of the most widely used statistical measures for detecting multicollinearity amongst numerical variables is the Variance Inflation Factor (VIF). The VIF may be calculated for each predictor by performing a linear regression of that predictor on all the other predictors, i.e. if we wish to calculate the VIF for predictor $x_i$, then we would use that column as the response variable and all other columns excluding $x_i$ as the input. After fitting the linear regression, we would then obtain the R-squared value, $R_i^2$, which tells us how much of the variance in our predictor $x_i$ can be explained by all the other predictors. Lastly, the VIF can be computed using:

$$VIF_i = \frac{1}{1 - R_i^2}$$
It’s called the variance inflation factor because it estimates how much the variance of a coefficient is "inflated" because of linear dependence with other predictors. Thus, a VIF of 1.8 tells us that the variance (the square of the standard error) of a particular coefficient is 80% larger than it would be if that predictor was completely uncorrelated with all the other predictors.
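A sketch of the computation described above, regressing each predictor on the rest and converting the resulting R-squared into a VIF (the toy data is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def compute_vif(X):
    """VIF for each column of X: regress it on the remaining columns and use 1 / (1 - R^2)."""
    X = np.asarray(X)
    vifs = []
    for i in range(X.shape[1]):
        mask = np.arange(X.shape[1]) != i
        r_squared = LinearRegression().fit(X[:, mask], X[:, i]).score(X[:, mask], X[:, i])
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# toy data: the first two columns are nearly identical, the third is independent
rng = np.random.RandomState(6)
z = rng.normal(size=200)
X = np.column_stack([z, z + rng.normal(scale=0.1, size=200), rng.normal(size=200)])
print(compute_vif(X))  # the first two VIFs come out large, the third close to 1
```

statsmodels also provides a ready-made variance_inflation_factor in statsmodels.stats.outliers_influence if you prefer a library routine.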
Cramer's V
Now that we've discussed a method for detecting collinearity amongst numerical variables, we will shift gears towards categorical variables. Cramer's V is a statistic measuring the strength of association or dependency between two (nominal) categorical variables.
Suppose $X$ and $Y$ are two categorical variables that are to be analyzed in some experimental or observational data, with the following information:
$X$ has $r$ distinct categories or classes, labeled $x_1, \ldots, x_r$.
$Y$ has $c$ distinct categories, labeled $y_1, \ldots, y_c$.
Form an $r \times c$ contingency table such that cell $(i, j)$ contains the count $n_{ij}$ of occurrences of category $x_i$ in $X$ and category $y_j$ in $Y$. This would give us $n$ total pairs of observations.
We start off with the null hypothesis that $X$ and $Y$ are independent random variables; then, based on the table and the null hypothesis, the chi-squared statistic $\chi^2$ can be computed. After that, Cramer's V is defined to be:

$$V = \sqrt{\frac{\chi^2 / n}{\min(r - 1, c - 1)}}$$
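A sketch of the computation, building the contingency table with pandas and the chi-squared statistic with scipy (the toy data below is made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramer's V for two arrays of categorical labels."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.to_numpy().sum()
    r, c = confusion.shape
    return np.sqrt((chi2 / n) / min(r - 1, c - 1))

# made-up categorical variables: y agrees with x about 70% of the time
rng = np.random.RandomState(7)
x = rng.choice(['a', 'b', 'c'], size=500)
y = np.where(rng.uniform(size=500) < 0.7, x, rng.choice(['a', 'b', 'c'], size=500))
print(cramers_v(x, y))  # a value closer to 1 indicates a stronger association
```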
Remarks:
$0 \le V \le 1$. The closer $V$ is to 0, the smaller the association between the categorical variables $X$ and $Y$. On the other hand, $V$ being close to 1 is an indication of a strong association between $X$ and $Y$. If $X = Y$, then $V = 1$.
In order for $V$ to make sense, each categorical variable must have at least 2 categories.
If one of the categorical variables is dichotomous, i.e. either $r = 2$ or $c = 2$, Cramer's V is equal to the phi statistic ($\phi$), which is defined to be $\phi = \sqrt{\chi^2 / n}$.
Cramer's V is a chi-square based measure of association. The chi-square value depends on both the strength of the relationship and the sample size, while $V$ eliminates the sample size dependency by dividing chi-square by $n$, the sample size, and taking the square root.