Path: blob/master/text_classification/logistic.ipynb
1470 views
Logistic Regression
Logistic regression is an excellent tool to know for classification problems, which are problems where the output value that we wish to predict only takes on only a small number of discrete values. Here we'll focus on the binary classification problem, where the output can take on only two distinct classes. To make our examples more concrete, we will consider the Glass dataset.
Our task is to predict the household
column using the al
column. Let's visualize the relationship between the input and output and also train the logsitic regression to see the outcome that it produces.
As we can see, logistic regression can output the probabilities of observation belonging to a specific class and these probabilities can be converted into class predictions by choosing a cutoff value (e.g. probability higher than 0.5 is classified as class 1).
Logistic Function
In Logistic Regression, the log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:
Where:
is the intercept term, and to represents the parameters for all the other features (a total of j features).
By convention of we can assume that , so that we can re-write the whole thing using the matrix notation .
This is called the logit function. The equation can be re-arranged into the logistic function:
Or in the more commonly seen form:
Let's take a look at the plot of the function:
The logistic function has some nice properties. The y-value represents the probability and it is always bounded between 0 and 1, which is want we wanted for probabilities. For an x value of 0 you get a 0.5 probability. Also as you get more positive x value you get a higher probability, on the other hand, a more negative x value results in a lower probability.
Toy sample code of how to predict the probability given the data and the weight is provided below.
Interpreting the Intercept
We can check logistic regression's coefficient does in fact generate the log-odds.
Interpretation: 1 unit increase in al
is associated with a 4.18 unit increase in the log-odds of the observation being classified as household 1
. We can confirm that again by doing the calculation ourselves.
Defining The Cost Function
When utilizing logistic regression, we are trying to learn the values in order to maximize the probability of correctly classifying our glasses. Let's say someone did give us some values of the logistic regression model, how would we determine if they were good values or not? What we would hope is that for the household of class 1, the probability values are close to 1 and for the household of class 0 the probability is close to 0.
But we don't care about getting the correct probability for just one observation, we want to correctly classify all our observations. If we assume our data are independent and identically distributed (think of it as all of them are treated equally), we can just take the product of all our individually calculated probabilities and that becomes the objective function we want to maximize. So in math:
The symbol means take the product of the for the observations that are classified as that class. You will notice that for observations that are labeled as class 0, we are taking 1 minus the logistic function. That is because we are trying to find a value to maximize, and since observations that are labeled as class 0 should have a probability close to zero, 1 minus the probability should be close to 1. This procedure is also known as the maximum likelihood estimation and the following link contains a nice discussion of maximum likelihood using linear regression as an example. Blog: The Principle of Maximum Likelihood
Next we will re-write the original cost function as:
Where:
We define to be 1 when the observation is labeled class 1 and 0 when labeled as class 0, then we only compute for observations that are labeled class 1 and for observations that are labeled class 0, which is still the same idea as the original function.
Next we'll transform the original by taking the log. As we'll later see this logarithm transformation will make our cost function more convenient to work with, and because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself. When we take the log, our product across all data points, it becomes a sum. See log rules for more details (Hint: log(ab) = log(a) + log(b)).
The simply represents the total number of the data.
Often times you'll also see the notation above be simplified in the form of a maximum likelihood estimator:
The equation above simply denotes the idea that , represents the parameters we would like to estimate the parameters by maximizing conditional probability of given .
Now by definition of probability in the logistic regression model: and . By substituting these expressions into our equation and simplifying it further we can obtain a simpler expression.
We'll use the formula above to compute the log likelihood for the entire dataset, which is used to assess the convergence of the algorithm. Toy code provided below.
Note: We made one tiny modification to the log likelihood function We added a term which averages the log likelihood across all data points. The term will make it easier for us to compare stochastic gradient ascent with batch gradient ascent later.
Gradient
Now that we obtain the formula to assess our algorithm, we'll dive into the meat of the algorithm, which is to derive the gradient for the formula (the derivative of the formula with respect to each coefficient):
And it turns out the derivative of log likelihood with respect to to a single coefficient is as follows (the form is the same for all coefficients):
To compute it, you simply need the following two terms:
is the vector containing the difference between the predicted probability and the original label.
is the vector containing the feature's value.
For a step by step derivation, consider going through the following link. Blog: Maximum likelihood and gradient descent demonstration, it uses a slightly different notation, but the walkthrough should still be pretty clear.
Stochastic/Mini-batch Gradient
The problem with computing the gradient (or so called batched gradient) is the term . This means that we must sum the contributions over all the data points to calculate the gradient, and this can be problematic if the dataset we're studying is extremely large. Thus, in stochastic gradient, we can use a single point as an approximation to the gradient:
Note1: Because the Stochastic Gradient algorithm uses each row of data in turn to update the gradient, if our data has some sort of implicit ordering, this will negatively affect the convergence of the algorithm. At an extreme, what if we had the data sorted so that all positive reviews came before negative reviews? In that case, even if most reviews are negative, we might converge on an answer of +1 because we never get to see the other data. To avoid this, one practical trick is to shuffle the data before we begin so the rows are in random order.
Note2: Stochastic gradient compute the gradient using only 1 data point to update the the parameters, while batch gradient uses all data points. An alternative to these two extremes is a simple change that allows us to use a mini-batch of data points to calculate the gradient. This simple approach is faster than batch gradient but less noisy than stochastic gradient that uses only 1 data point. Given a mini-batch (or a set of data points) , the gradient function for this mini-batch of data points is given by:
Here, the means that we are normalizing the gradient update rule by the batch size . In other words, we update the coefficients using the average gradient over data points (instead of using a pure summation). By using the average gradient, we ensure that the magnitude of the gradient is approximately the same for all batch sizes. This way, we can more easily compare various batch sizes and study the effect it has on the algorithm.
Implementation
Recall our task is to find the optimal value for each individual weight to lower the cost. This requires taking the partial derivative of the cost/error function with respect to a single weight, and then running gradient descent for each individual weight to update them. Thus, for any individual weight , we'll compute the following:
Where:
denotes the the learning rate or so called step size, in other places you'll see it denoted as .
denotes the weight of the feature at iteration .
And we'll do this iteratively for each weight, many times, until the whole network's cost function is minimized.
Comparing Result and Convergence Behavior
We'll use the logistic regression code that we've implemented and compare the predicted auc score with scikit-learn's implementation. This only serves to check that the predicted results are similar and that our toy code is correctly implemented. Then we'll also explore the convergence difference between batch gradient descent and stochastic gradient descent.
Based on the convergence plot above, we can see that the it's a good idea to use mini-batch gradient descent since it strikes a good balance between batch gradient, which convergences steadily but can be computationly too expensive when the dataset is too large, and stochastic gradient, which is faster to train, but the result can be too noisy.
Pros and Cons of Logistic Regression
We'll end this notebook listing out some pros and cons of this method.
Pros:
Highly interpretable (if you remember how).
Model training and prediction are fast. Thus can be desirable in large-scale applications when we're dealing with millions of parameters.
Almost no parameter tuning is required (excluding regularization).
Outputs well-calibrated predicted probabilities.
Cons:
Presumes a linear relationship between the features
Performance is (generally) not competitive with the best supervised learning methods.
Can't automatically learn feature interactions.