Logistic Regression
Authors: Multiple
Instructor Note: Several portions of this lab are half filled in. You can use these as independent activities or as a refresher walkthrough.
Learning Objectives
Recall how to perform linear regression in scikit-learn.
Demonstrate why logistic regression is a better alternative for classification than linear regression.
Understand the concepts of probability, odds, e, log, and log-odds in relation to machine learning.
Explain how logistic regression works.
Interpret logistic regression coefficients.
Use logistic regression with categorical features.
Compare logistic regression with other models.
Utilize different metrics for evaluating classifier models.
Construct a confusion matrix based on predicted classes.
Introduction
In this lesson we learn about Logistic Regression, or what is sometimes referred to as Logistic Classification.
"How can a model be both a Regression and a Classification?" you may ask.
Discussion
Have you ever had to sort objects when not everything fit perfectly into groups?
Example:
Movies/Books
Socks
Phone apps
Logistic Regression/Classification uses elements from both the Linear Regression and the K Nearest Neighbors algorithms.
Refresher: Fitting and Visualizing a Linear Regression Using scikit-learn
Use Pandas to load in the glass attribute data from the UCI machine learning website. The columns are different measurements of properties of glass that can be used to identify the glass type. For detailed information on the columns in this data set, please see the included .names file.
Data Dictionary
Id: number, 1 to 214
RI: refractive index
Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
Mg: Magnesium
Al: Aluminum
Si: Silicon
K: Potassium
Ca: Calcium
Ba: Barium
Fe: Iron
Type: type of glass (types 1-7; see the included .names file for class descriptions)
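A minimal sketch of loading the data, assuming the UCI URL below (substitute a local path if the lesson ships with its own copy) and lowercase column names chosen to match the references to ri and al later in the lesson:

```python
import pandas as pd

# Column names from the data dictionary above, lowercased to match the lesson
cols = ['id', 'ri', 'na', 'mg', 'al', 'si', 'k', 'ca', 'ba', 'fe', 'glass_type']

# Assumed UCI location of the headerless glass.data file
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
glass = pd.read_csv(url, names=cols, index_col='id')

glass.head()
```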
Pretend we want to predict ri, and our only feature is al. How could we do it using machine learning?
How would we visualize this model?
How can we draw this plot (just the points — don't worry about the regression line) without using Seaborn?
To build a linear regression model to predict ri using scikit-learn, we will need to import LinearRegression from sklearn.linear_model.
Using LinearRegression, fit a model predicting ri from al (and an intercept).
Using the LinearRegression object we have fit, create a variable containing our predictions of ri for each row's al in the data set.
Plot this regression line with the scatter points on the same chart.
Print out the intercept and coefficient values from our fit LinearRegression object.
Manually compute the predicted value of ri when al=2.0 using the regression equation.
Confirm that this is the same value we would get when using the built-in .predict() method of the LinearRegression object.
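A sketch of these steps, assuming the glass DataFrame loaded above:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear regression predicting ri from al
# (scikit-learn includes an intercept by default)
X = glass[['al']]
y = glass['ri']
linreg = LinearRegression()
linreg.fit(X, y)

# Predictions of ri for each row's al value
glass['ri_pred'] = linreg.predict(X)

# Scatter points plus the fitted regression line on one chart
plt.scatter(glass['al'], glass['ri'])
plt.plot(glass['al'], glass['ri_pred'], color='red')
plt.xlabel('al')
plt.ylabel('ri')
plt.show()

# Intercept and coefficient
print(linreg.intercept_, linreg.coef_[0])

# Manual prediction at al = 2.0, confirmed against .predict()
manual = linreg.intercept_ + linreg.coef_[0] * 2.0
print(manual)
print(linreg.predict(pd.DataFrame({'al': [2.0]}))[0])
```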
Coefficient interpretation: A 1-unit increase in al is associated with a ~0.0025-unit decrease in ri.
Intercept interpretation: When al = 0, the estimated value of ri is 1.52194533024.
Predicting a Single Categorical Response
Linear regression is appropriate when we want to predict the value of a continuous target/response variable, but what about when we want to predict membership in a class or category?
Examine the glass type column in the data set. What are the counts in each category?
Say these types are subdivisions of broader glass types:
Window glass: types 1, 2, and 3
Household glass: types 5, 6, and 7
Create a new household column that indicates whether or not a row is household glass, coded as 1 or 0, respectively.
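One way to build the indicator column, assuming the glass_type column name from the loading sketch above:

```python
# Types 5, 6, and 7 are household glass; types 1, 2, and 3 are window glass
glass['household'] = glass['glass_type'].isin([5, 6, 7]).astype(int)
glass['household'].value_counts()
```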
Let's change our task so that we're predicting the household category using al. Let's visualize the relationship to figure out how to do this.
Make a scatter plot comparing al and household.
Fit a new LinearRegression predicting household from al.
Let's draw a regression line like we did before:
If al=3, what class do we predict for household? 1
If al=1.5, what class do we predict for household? 0
We predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.
Therefore, we'll say that if household_pred >= 0.5, we predict a class of 1, else we predict a class of 0.
Using this threshold, create a new column of our predictions for whether a row is household glass.
Plot a line that shows our predictions for class membership in household vs. not.
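A sketch of the fit, the 0.5 cutoff, and the class-membership plot, continuing from the glass DataFrame above:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear regression predicting the binary household column from al
linreg = LinearRegression()
linreg.fit(glass[['al']], glass['household'])
glass['household_pred'] = linreg.predict(glass[['al']])

# Apply the 0.5 cutoff to turn continuous predictions into 0/1 classes
glass['household_pred_class'] = (glass['household_pred'] >= 0.5).astype(int)

# Sort by al so the prediction line draws cleanly from left to right
glass.sort_values('al', inplace=True)
plt.scatter(glass['al'], glass['household'])
plt.plot(glass['al'], glass['household_pred_class'], color='red')
plt.xlabel('al')
plt.ylabel('household')
plt.show()
```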
Using Logistic Regression for Classification
Logistic regression is a more appropriate method for what we just did with a linear regression. The values output from a linear regression cannot be interpreted as probabilities of class membership since their values can be greater than 1 and less than 0. Logistic regression, on the other hand, ensures that the values output as predictions can be interpreted as probabilities of class membership.
Import the LogisticRegression class from sklearn.linear_model below and fit the same regression model predicting household from al.
Plot the predicted class using the logistic regression as we did for the linear regression predictions above.
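A sketch of this step, continuing from the sketch above (glass is already sorted by al):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(glass[['al']], glass['household'])

# Hard 0/1 class predictions for every row
glass['household_pred_class_log'] = logreg.predict(glass[['al']])

plt.scatter(glass['al'], glass['household'])
plt.plot(glass['al'], glass['household_pred_class_log'], color='red')
plt.xlabel('al')
plt.ylabel('household')
plt.show()
```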
As you can see, the class predictions are the same.
What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?
Using the built-in .predict_proba() function, examine the predicted probabilities for the first handful of rows of X.
Sklearn orders the columns according to our class labels. The two-column output of predict_proba returns a column for each class of our household variable: the first column is the probability of household=0 for a given row, and the second column is the probability of household=1.
Store the predicted probabilities of class=1 in its own column in the data set.
Plot the predicted probabilities as a line on our plot (probability of household=1 as al changes).
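A sketch covering all three steps, using the logreg model fit above:

```python
import matplotlib.pyplot as plt

# Two columns, ordered like logreg.classes_: [P(household=0), P(household=1)]
probs = logreg.predict_proba(glass[['al']])
print(probs[:5])

# Keep the probability of class 1 in its own column and plot it against al
glass['household_pred_prob'] = probs[:, 1]

plt.scatter(glass['al'], glass['household'])
plt.plot(glass['al'], glass['household_pred_prob'], color='red')
plt.xlabel('al')
plt.ylabel('P(household = 1)')
plt.show()
```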
We can also use statsmodels to get the standard errors of the coefficients.
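A minimal statsmodels sketch; its summary includes coefficients, standard errors, and p-values:

```python
import statsmodels.api as sm

# statsmodels requires an explicit intercept column
X_sm = sm.add_constant(glass['al'])
sm_logit = sm.Logit(glass['household'], X_sm).fit()
print(sm_logit.summary())
```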
Scikit-learn's metrics module also provides a confusion matrix and classification report for evaluating the classifier.
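A sketch using the logreg model from above:

```python
from sklearn import metrics

y_true = glass['household']
y_pred = logreg.predict(glass[['al']])

print(metrics.confusion_matrix(y_true, y_pred))
print(metrics.classification_report(y_true, y_pred))
```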
Exercise 1:
Build and train a logistic regression model.
Select 2 features for your X.
y will remain the same: glass.household.
Evaluate the model with model.score on the testing data.
Probability, e, Log, and Log Odds
To understand how logistic regression predicts the probability of class membership, we need to start by understanding the relationship between probability, odds ratios, and log odds ratios. This is because logistic regression predicts log odds, so being able to read log odds is extremely useful for interpreting the model.
It is often useful to think of the numeric odds as a ratio. For example, odds of 5/1 = 5 means "5 to 1": five wins for every one loss (e.g., of six total plays). Odds of 2/3 means "2 to 3": two wins for every three losses (e.g., of five total plays).
Examples:
Dice roll of 1: probability = 1/6, odds = 1/5
Even dice roll: probability = 3/6, odds = 3/3 = 1
Dice roll less than 5: probability = 4/6, odds = 4/2 = 2
As an example we can create a table of probabilities vs. odds, as seen below.
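A sketch of building such a table, using the definition odds = p / (1 - p):

```python
import pandas as pd

# A handful of probabilities and their corresponding odds
table = pd.DataFrame({'probability': [0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table['probability'] / (1 - table['probability'])
print(table)
```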
What is a (natural) log? It gives you the time needed to reach a certain level of continuous growth: if $e^x = y$, then $\ln(y) = x$. It is also the inverse of the exponential function: $\ln(e^x) = x$.
Let's take one of our odds from our table and walk through how it works.
(For more on e, see the e_log_examples notebook in the extra materials.)
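A quick numeric check of the inverse relationship, walking through the "dice roll less than 5" odds of 2 from the examples above:

```python
import numpy as np

# The natural log inverts the exponential: log(e^x) = x
print(np.log(np.exp(5)))     # 5.0

odds = 2.0                   # dice roll less than 5
log_odds = np.log(odds)      # ~0.693
print(np.exp(log_odds))      # back to 2.0
```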
Linear regression: Continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1 x$$

Logistic regression: Log odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

This is called the logit function. Probability is sometimes written as $\pi$ instead of $p$.

The equation can be rearranged into the logistic function:

$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
In other words:
Logistic regression outputs the probabilities of a specific class.
Those probabilities can be converted into class predictions.
The logistic function has some nice properties:
Takes on an "s" shape
Output is bounded by 0 and 1
We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?
The most common solution for classification models is "one-vs-all" (also known as "one-vs-rest"): Decompose the problem into multiple binary classification problems.
Multinomial logistic regression, on the other hand, can solve this as a single problem, but how this works is beyond the scope of this lesson.
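A sketch of one-vs-rest in scikit-learn, fitting one binary logistic regression per glass type; the feature pair and lowercase column names are carried over from the earlier sketches:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = glass[['al', 'na']]
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, glass['glass_type'])
print(len(ovr.estimators_))  # one fitted binary model per class
```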
Interpreting Logistic Regression Coefficients
Logistic regression coefficients are not as immediately interpretable as the coefficients from a linear regression. To interpret the coefficients we need to remember how the formulation for logistic regression differs from linear regression.
First let's plot our logistic regression predicted probability line again.
Remember:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times al$$

That means we'll get out the log odds if we compute the intercept plus the coefficient times a value for al.
Compute the log odds of household when al=2.
Now that we have the log odds, we will need to go through the process of converting these log odds to probability.
Convert the log odds to odds, then the odds to probability.
This finally gives us the predicted probability of household=1 when al=2. You can confirm this is the same value you would get out of the .predict_proba() method of the sklearn object.
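A sketch of the conversion, using the logreg model fit earlier:

```python
import numpy as np
import pandas as pd

# Log odds at al = 2 from the fitted intercept and coefficient
log_odds = logreg.intercept_[0] + logreg.coef_[0][0] * 2

odds = np.exp(log_odds)       # log odds -> odds
prob = odds / (1 + odds)      # odds -> probability

# Should match predict_proba at al = 2
print(prob)
print(logreg.predict_proba(pd.DataFrame({'al': [2]}))[0, 1])
```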
Interpretation: A 1-unit increase in al is associated with a 2.01-unit increase in the log odds of household.
Bottom line: Positive coefficients increase the log odds of the response (and thus increase the probability), and negative coefficients decrease the log odds of the response (and thus decrease the probability).
Intercept interpretation: For an al value of 0, the log odds of household is -4.12790736.
That makes sense from the plot above, because the probability of household=1 should be very low for such a low al value.
Changing the intercept ($\beta_0$) shifts the curve horizontally, whereas changing the coefficient ($\beta_1$) changes the slope of the curve.
Comparing Logistic Regression to Other Models
Advantages of logistic regression:
Highly interpretable (if you remember how).
Model training and prediction are fast.
No tuning is required (excluding regularization).
Features don't need scaling.
Can perform well with a small number of observations.
Outputs well-calibrated predicted probabilities.
Disadvantages of logistic regression:
Presumes a linear relationship between the features and the log odds of the response.
Performance is (generally) not competitive with the best supervised learning methods.
Can't automatically learn feature interactions.
Advanced Classification Metrics
When we evaluate the performance of a logistic regression (or any classifier model), the standard metric to use is accuracy: How many class labels did we guess correctly? However, accuracy is only one of several metrics we could use when evaluating a classification model.
Accuracy alone doesn’t always give us a full picture.
If we know a model is 75% accurate, it doesn’t provide any insight into why the 25% was wrong.
Consider a binary classification problem where we have 165 observations/rows of people who are either smokers or nonsmokers.
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | | | |
| Actual: Yes | | | |
There are 60 observations in class 0 (nonsmokers) and 105 observations in class 1 (smokers).
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | | | 60 |
| Actual: Yes | | | 105 |
We have 55 predictions of class 0 (predicted nonsmokers) and 110 predictions of class 1 (predicted smokers).
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | | | 60 |
| Actual: Yes | | | 105 |
| Total | 55 | 110 | 165 |
True positives (TP): These are cases in which we predicted yes (smokers), and they actually are smokers.
True negatives (TN): We predicted no, and they are nonsmokers.
False positives (FP): We predicted yes, but they were not actually smokers. (This is also known as a "Type I error.")
False negatives (FN): We predicted no, but they are smokers. (This is also known as a "Type II error.")
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
Categorize these as TP, TN, FP, or FN:
Try not to look at the answers above.
We predict nonsmoker, but the person is a smoker.
We predict nonsmoker, and the person is a nonsmoker.
We predict smoker, and the person is a smoker.
We predict smoker, and the person is a nonsmoker.
Accuracy: Overall, how often is the classifier correct?
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
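Plugging in the values from the table above: accuracy = (TP + TN) / total = (100 + 50) / 165 ≈ 0.91.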
True positive rate (TPR) asks, “Out of all of the target class labels, how many were accurately predicted to belong to that class?”
For example, given a medical exam that tests for cancer, how often does it correctly identify patients with cancer?
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
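From the table: TPR = TP / (TP + FN) = 100 / 105 ≈ 0.95. (TPR is also called sensitivity or recall.)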
False positive rate (FPR) asks, “Out of all items not belonging to a class label, how many were predicted as belonging to that target class label?”
For example, given a medical exam that tests for cancer, how often does it trigger a “false alarm” by incorrectly saying a patient has cancer?
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
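From the table: FPR = FP / (FP + TN) = 10 / 60 ≈ 0.17.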
Can you see that we might weigh TPR AND FPR differently depending on the situation?
Give an example when we care about TPR, but not FPR.
Give an example when we care about FPR, but not TPR.
More Trade-Offs
The true positive and false positive rates give us a much clearer picture of where predictions begin to fall apart.
This allows us to adjust our models accordingly.
Below we will load in some data on admissions to college.
We can predict the admit class from gre and use a train-test split to evaluate the performance of our model on a held-out test set.
Recall that our "baseline" accuracy is the proportion of the majority class label.
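A sketch of the workflow; the file name and columns below are assumptions, so adjust them to match the lesson's admissions file:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed local file with columns including admit and gre
admissions = pd.read_csv('admissions.csv')

X = admissions[['gre']]
y = admissions['admit']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Baseline accuracy: always guessing the majority class
print(y_test.value_counts(normalize=True).max())

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))
```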
Create a confusion matrix of predictions on our test set using metrics.confusion_matrix.
Answer the following:
What is our accuracy on the test set?
True positive rate?
False positive rate?
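One way to compute all three, continuing from the assumed admissions sketch above:

```python
from sklearn import metrics

y_pred = logreg.predict(X_test)

# For a binary problem, ravel() unpacks the matrix as tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)   # true positive rate (sensitivity)
fpr = fp / (fp + tn)   # false positive rate

print(accuracy, tpr, fpr)
```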
A good classifier would have a true positive rate approaching 1 and a false positive rate approaching 0.
In our smoking problem, this model would accurately predict all of the smokers as smokers and not accidentally predict any of the nonsmokers as smokers.
Trading True Positives and True Negatives
By default, and with respect to the underlying assumptions of logistic regression, we predict a positive class when the probability of the class is greater than .5 and predict a negative class otherwise.
What if we decide to use .3 as a threshold for picking the positive class? Is that even allowed?
This turns out to be a useful strategy. By setting a lower probability threshold, we will predict more positive classes, which means we will predict more true positives, but fewer true negatives.
Making this trade-off is important in applications that have imbalanced penalties for misclassification.
The most popular example is medical diagnostics, where we want as many true positives as feasible. For example, if we are diagnosing cancer, we prefer false positives (predicting cancer when there is none), which can later be corrected with a more specific test.
We do this in machine learning by setting a low threshold for predicting positives, which increases the number of true positives and false positives but allows us to balance the costs of being correct and incorrect.
We can vary the classification threshold for our model to get different predictions.
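A sketch of thresholding the predicted probabilities ourselves, using the model and test split from above:

```python
# Predict class 1 whenever P(class 1) exceeds a chosen threshold
probs = logreg.predict_proba(X_test)[:, 1]

preds_default = (probs >= 0.5).astype(int)  # sklearn's default behavior
preds_low = (probs >= 0.3).astype(int)      # lower bar for the positive class

# The lower threshold flags more positives (more TPs, but more FPs too)
print(preds_default.sum(), preds_low.sum())
```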
The Accuracy Paradox
Accuracy is a very intuitive metric — it's a lot like an exam score where you get total correct/total attempted. However, accuracy is often a poor metric in application. There are many reasons for this:
Imbalanced problems: a problem with 95% positives in the baseline will have 95% accuracy even with no predictive power.
This is the paradox; pursuing accuracy often means predicting the most common class rather than doing the most useful work.
Applications often have uneven penalties and rewards for true positives and false positives.
Ranking predictions in the correct order may be more important than getting them correct.
In many cases, we need to know the exact probability of positives and negatives:
To calculate an expected return.
To triage observations that are borderline positive.
Some of the most useful metrics for addressing these problems are:
Classification accuracy/error
Classification accuracy is the percentage of correct predictions (higher is better).
Classification error is the percentage of incorrect predictions (lower is better).
Easiest classification metric to understand.
Confusion matrix
Gives you a better understanding of how your classifier is performing.
Allows you to calculate sensitivity, specificity, and many other metrics that might match your business objective better than accuracy.
Precision and recall are good for balancing misclassification costs.
ROC curves and area under a curve (AUC)
Good for ranking and prioritization problems.
Allows you to visualize the performance of your classifier across all possible classification thresholds, thus helping you to choose a threshold that appropriately balances sensitivity and specificity.
Still useful when there is high class imbalance (unlike classification accuracy/error).
Harder to use when there are more than two response classes.
Log loss
Most useful when well-calibrated predicted probabilities are important to your business objective.
Expected value calculations
Triage
The good news is that these are readily available in Python and R, and are usually easy to calculate once you know about them.
OPTIONAL: How Many Samples Are Needed?
We often ask how large our data set should be to achieve a reasonable logistic regression result. Below, a few methods will be introduced for determining how accurate the resulting model will be.
Rule of Thumb
Quick: At least 100 samples total. At least 10 samples per feature.
Formula method:
Find the proportion of positive cases and negative cases. Take the smaller of the two and call it $p$.
Ideally, you want 50/50, for a proportion of $p = 0.5$.
Example: Suppose we are predicting "male" or "female". Our data is 80% male, 20% female.
So, we choose the proportion $p = 0.2$, since it is smaller.
Find the number of independent variables, $k$.
Example: We are predicting gender based on the last letter of the first name, giving us 26 indicator columns for features. So, $k = 26$.
Let the minimum number of cases be $N = 10k/p$. The minimum should always be set to at least 100.
Example: Here, $N = 10 \times 26 / 0.2 = 1300$. So, we would need 1300 names (supposing 80% are male).
Both methods from: Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Statistical Testing
Logistic regression is one of the few machine learning models where we can obtain comprehensive statistics. By performing hypothesis testing, we can understand whether we have sufficient data to make strong conclusions about individual coefficients and the model as a whole. A very popular Python library which gives you these statistics with just a few lines of code is statsmodels.
Power Analysis
As you may suspect, many factors affect how statistically significant the results of a logistic regression are. The art of estimating the sample size to detect an effect of a given size with a given degree of confidence is called power analysis.
Some factors that influence the accuracy of our resulting model are:
Desired statistical significance (p-value)
Magnitude of the effect
It is more difficult to distinguish a small effect from noise. So, more data would be required!
Measurement precision
Sampling error
An effect is more difficult to detect in a smaller sample.
Experimental design
So, many factors, in addition to the number of samples, contribute to the resulting statistical power. Hence, it is difficult to give an absolute number without a more comprehensive analysis. This analysis is out of the scope of this lesson, but it is important to understand some of the factors that affect confidence.
Lesson Review
Logistic regression
What kind of machine learning problems does logistic regression address?
What do the coefficients in a logistic regression represent? How does the interpretation differ from ordinary least squares? How is it similar?
The confusion matrix
How do true positive rate and false positive rate help explain accuracy?
Why might one classification metric be more important to tune than another? Give an example of a business problem or project where this would be the case.