Logistic Regression
Authors: Multiple
Instructor Note: Several portions of this lab are half filled in. You can use these as independent activities or as a refresher walkthrough.
Learning Objectives
Recall how to perform linear regression in scikit-learn.
Demonstrate why logistic regression is a better alternative for classification than linear regression.
Understand the concepts of probability, odds, e, log, and log-odds in relation to machine learning.
Explain how logistic regression works.
Interpret logistic regression coefficients.
Use logistic regression with categorical features.
Compare logistic regression with other models.
Utilize different metrics for evaluating classifier models.
Construct a confusion matrix based on predicted classes.
Introduction
In this lesson we learn about Logistic Regression, or what is sometimes referred to as Logistic Classification.
"How can a model be both a Regression and a Classification?" you may ask.
Discussion
Have you ever had to sort objects when not everything fit perfectly into groups?
Example:
Movies/Books
Socks
Phone apps
Logistic Regression/Classification uses elements from both the Linear Regression and the K Nearest Neighbors algorithms.
Refresher: Fitting and Visualizing a Linear Regression Using scikit-learn
Use Pandas to load in the glass attribute data from the UCI machine learning website. The columns are different measurements of properties of glass that can be used to identify the glass type. For detailed information on the columns in this data set, please see the included .names file.
Data Dictionary
Id: number, 1 to 214
RI: refractive index
Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
Mg: Magnesium
Al: Aluminum
Si: Silicon
K: Potassium
Ca: Calcium
Ba: Barium
Fe: Iron
Type: type of glass (types 1-7; see the included .names file for class descriptions)
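A minimal sketch of loading the data, assuming the UCI URL below (substitute a local path if the lesson ships with its own copy) and lowercase column names chosen to match the references to ri and al later in the lesson:

```python
import pandas as pd

# Column names from the data dictionary above, lowercased to match the lesson
cols = ['id', 'ri', 'na', 'mg', 'al', 'si', 'k', 'ca', 'ba', 'fe', 'glass_type']

# Assumed UCI location of the headerless glass.data file
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
glass = pd.read_csv(url, names=cols, index_col='id')

glass.head()
```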
Pretend we want to predict ri, and our only feature is al. How could we do it using machine learning?
How would we visualize this model?
How can we draw this plot (just the points — don't worry about the regression line) without using Seaborn?
To build a linear regression model to predict ri using scikit-learn, we will need to import LinearRegression from sklearn.linear_model.
Using LinearRegression, fit a model predicting ri from al (and an intercept).
Using the LinearRegression object we have fit, create a variable containing our predictions of ri for each row's al in the data set.
Plot this regression line with the scatter points on the same chart.
Print out the intercept and coefficient values from our fit LinearRegression object.
Manually compute the predicted value of ri when al=2.0 using the regression equation.
Confirm that this is the same value we would get when using the built-in .predict() method of the LinearRegression object.
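A sketch of these steps, assuming the glass DataFrame loaded above:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear regression predicting ri from al
# (scikit-learn includes an intercept by default)
X = glass[['al']]
y = glass['ri']
linreg = LinearRegression()
linreg.fit(X, y)

# Predictions of ri for each row's al value
glass['ri_pred'] = linreg.predict(X)

# Scatter points plus the fitted regression line on one chart
plt.scatter(glass['al'], glass['ri'])
plt.plot(glass['al'], glass['ri_pred'], color='red')
plt.xlabel('al')
plt.ylabel('ri')
plt.show()

# Intercept and coefficient
print(linreg.intercept_, linreg.coef_[0])

# Manual prediction at al = 2.0, confirmed against .predict()
manual = linreg.intercept_ + linreg.coef_[0] * 2.0
print(manual)
print(linreg.predict(pd.DataFrame({'al': [2.0]}))[0])
```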
Coefficient interpretation: A 1-unit increase in al is associated with a ~0.0025-unit decrease in ri.
Intercept interpretation: When al = 0, the estimated value of ri is 1.52194533024.
Predicting a Single Categorical Response
Linear regression is appropriate when we want to predict the value of a continuous target/response variable, but what about when we want to predict membership in a class or category?
Examine the glass type column in the data set. What are the counts in each category?
Say these types are subdivisions of broader glass types:
Window glass: types 1, 2, and 3
Household glass: types 5, 6, and 7
Create a new household column that indicates whether or not a row is household glass, coded as 1 or 0, respectively.
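One way to build the indicator column, assuming the glass_type column name from the loading sketch above:

```python
# Types 5, 6, and 7 are household glass; types 1, 2, and 3 are window glass
glass['household'] = glass['glass_type'].isin([5, 6, 7]).astype(int)
glass['household'].value_counts()
```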
Let's change our task so that we're predicting the household category using al. Let's visualize the relationship to figure out how to do this.
Make a scatter plot comparing al and household.
Fit a new LinearRegression predicting household from al.
Let's draw a regression line like we did before:
If al=3, what class do we predict for household? 1
If al=1.5, what class do we predict for household? 0
We predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.
Therefore, we'll say that if household_pred >= 0.5, we predict a class of 1, else we predict a class of 0.
Using this threshold, create a new column of our predictions for whether a row is household glass.
Plot a line that shows our predictions for class membership in household vs. not.
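A sketch of the fit, the 0.5 cutoff, and the class-membership plot, continuing from the glass DataFrame above:

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Fit a linear regression predicting the binary household column from al
linreg = LinearRegression()
linreg.fit(glass[['al']], glass['household'])
glass['household_pred'] = linreg.predict(glass[['al']])

# Apply the 0.5 cutoff to turn continuous predictions into 0/1 classes
glass['household_pred_class'] = (glass['household_pred'] >= 0.5).astype(int)

# Sort by al so the prediction line draws cleanly from left to right
glass.sort_values('al', inplace=True)
plt.scatter(glass['al'], glass['household'])
plt.plot(glass['al'], glass['household_pred_class'], color='red')
plt.xlabel('al')
plt.ylabel('household')
plt.show()
```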
Using Logistic Regression for Classification
Logistic regression is a more appropriate method for what we just did with a linear regression. The values output from a linear regression cannot be interpreted as probabilities of class membership since their values can be greater than 1 and less than 0. Logistic regression, on the other hand, ensures that the values output as predictions can be interpreted as probabilities of class membership.
Import the LogisticRegression class from sklearn.linear_model below and fit the same regression model predicting household from al.
Plot the predicted class using the logistic regression as we did for the linear regression predictions above.
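A sketch of this step, continuing from the sketch above (glass is already sorted by al):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(glass[['al']], glass['household'])

# Hard 0/1 class predictions for every row
glass['household_pred_class_log'] = logreg.predict(glass[['al']])

plt.scatter(glass['al'], glass['household'])
plt.plot(glass['al'], glass['household_pred_class_log'], color='red')
plt.xlabel('al')
plt.ylabel('household')
plt.show()
```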
As you can see, the class predictions are the same.
What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?
Using the built-in .predict_proba() function, examine the predicted probabilities for the first handful of rows of X.
Sklearn orders the columns according to our class labels. The two-column output of predict_proba returns a column for each class of our household variable: the first column is the probability of household=0 for a given row, and the second column is the probability of household=1.
Store the predicted probabilities of class=1 in its own column in the data set.
Plot the predicted probabilities as a line on our plot (probability of household=1 as al changes).
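A sketch covering all three steps, using the logreg model fit above:

```python
import matplotlib.pyplot as plt

# Two columns, ordered like logreg.classes_: [P(household=0), P(household=1)]
probs = logreg.predict_proba(glass[['al']])
print(probs[:5])

# Keep the probability of class 1 in its own column and plot it against al
glass['household_pred_prob'] = probs[:, 1]

plt.scatter(glass['al'], glass['household'])
plt.plot(glass['al'], glass['household_pred_prob'], color='red')
plt.xlabel('al')
plt.ylabel('P(household = 1)')
plt.show()
```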
We can also use statsmodels to get the standard errors of the coefficients.
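A minimal statsmodels sketch; its summary includes coefficients, standard errors, and p-values:

```python
import statsmodels.api as sm

# statsmodels requires an explicit intercept column
X_sm = sm.add_constant(glass['al'])
sm_logit = sm.Logit(glass['household'], X_sm).fit()
print(sm_logit.summary())
```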
Scikit-learn's metrics module also provides a confusion matrix and classification report for evaluating the classifier.
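A sketch using the logreg model from above:

```python
from sklearn import metrics

y_true = glass['household']
y_pred = logreg.predict(glass[['al']])

print(metrics.confusion_matrix(y_true, y_pred))
print(metrics.classification_report(y_true, y_pred))
```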
Exercise 1:
Build and train a logistic regression model.
Select 2 features for your X.
y will remain the same: glass.household.
Evaluate the model with model.score on the testing data.
Probability, e, Log, and Log Odds
To understand how logistic regression predicts the probability of class membership, we need to start by understanding the relationship between probability, odds ratios, and log odds ratios. This is because logistic regression predicts log odds, so being able to read log odds is extremely useful for interpreting the model.
It is often useful to think of the numeric odds as a ratio. For example, odds of 5/1 = 5 means "5 to 1": five wins for every one loss (e.g., of six total plays). Odds of 2/3 means "2 to 3": two wins for every three losses (e.g., of five total plays).
Examples:
Dice roll of 1: probability = 1/6, odds = 1/5
Even dice roll: probability = 3/6, odds = 3/3 = 1
Dice roll less than 5: probability = 4/6, odds = 4/2 = 2
As an example we can create a table of probabilities vs. odds, as seen below.
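A sketch of building such a table, using the definition odds = p / (1 - p):

```python
import pandas as pd

# A handful of probabilities and their corresponding odds
table = pd.DataFrame({'probability': [0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table['probability'] / (1 - table['probability'])
print(table)
```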
What is a (natural) log? It gives you the time needed to reach a certain level of continuous growth: if $e^x = y$, then $\ln(y) = x$. It is also the inverse of the exponential function: $\ln(e^x) = x$.
Let's take one of our odds from our table and walk through how it works.
(For more on e, see the e_log_examples notebook in the extra materials.)
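A quick numeric check of the inverse relationship, walking through the "dice roll less than 5" odds of 2 from the examples above:

```python
import numpy as np

# The natural log inverts the exponential: log(e^x) = x
print(np.log(np.exp(5)))     # 5.0

odds = 2.0                   # dice roll less than 5
log_odds = np.log(odds)      # ~0.693
print(np.exp(log_odds))      # back to 2.0
```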
Linear regression: Continuous response is modeled as a linear combination of the features:

$$y = \beta_0 + \beta_1 x$$

Logistic regression: Log odds of a categorical response being "true" (1) is modeled as a linear combination of the features:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$$

This is called the logit function. Probability is sometimes written as $\pi$ instead of $p$.

The equation can be rearranged into the logistic function:

$$p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$$
In other words:
Logistic regression outputs the probabilities of a specific class.
Those probabilities can be converted into class predictions.
The logistic function has some nice properties:
Takes on an "s" shape
Output is bounded by 0 and 1
We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?
The most common solution for classification models is "one-vs-all" (also known as "one-vs-rest"): Decompose the problem into multiple binary classification problems.
Multinomial logistic regression, on the other hand, can solve this as a single problem, but how this works is beyond the scope of this lesson.
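A sketch of one-vs-rest in scikit-learn, fitting one binary logistic regression per glass type; the feature pair and lowercase column names are carried over from the earlier sketches:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = glass[['al', 'na']]
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, glass['glass_type'])
print(len(ovr.estimators_))  # one fitted binary model per class
```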
Interpreting Logistic Regression Coefficients
Logistic regression coefficients are not as immediately interpretable as the coefficients from a linear regression. To interpret the coefficients we need to remember how the formulation for logistic regression differs from linear regression.
First let's plot our logistic regression predicted probability line again.
Remember:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times al$$

That means we'll get out the log odds if we compute the intercept plus the coefficient times a value for al.
Compute the log odds of household when al=2.
Now that we have the log odds, we will need to go through the process of converting these log odds to probability.
Convert the log odds to odds, then the odds to probability.
This finally gives us the predicted probability of household=1 when al=2. You can confirm this is the same value you would get out of the .predict_proba() method of the sklearn object.
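A sketch of the conversion, using the logreg model fit earlier:

```python
import numpy as np
import pandas as pd

# Log odds at al = 2 from the fitted intercept and coefficient
log_odds = logreg.intercept_[0] + logreg.coef_[0][0] * 2

odds = np.exp(log_odds)       # log odds -> odds
prob = odds / (1 + odds)      # odds -> probability

# Should match predict_proba at al = 2
print(prob)
print(logreg.predict_proba(pd.DataFrame({'al': [2]}))[0, 1])
```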
Interpretation: A 1-unit increase in al is associated with a 2.01-unit increase in the log odds of household.
Bottom line: Positive coefficients increase the log odds of the response (and thus increase the probability), and negative coefficients decrease the log odds of the response (and thus decrease the probability).
Intercept interpretation: For an al value of 0, the log odds of household is -4.12790736.
That makes sense from the plot above, because the probability of household=1 should be very low for such a low al value.
Changing the intercept ($\beta_0$) shifts the curve horizontally, whereas changing the coefficient ($\beta_1$) changes the slope of the curve.
Comparing Logistic Regression to Other Models
Advantages of logistic regression:
Highly interpretable (if you remember how).
Model training and prediction are fast.
No tuning is required (excluding regularization).
Features don't need scaling.
Can perform well with a small number of observations.
Outputs well-calibrated predicted probabilities.
Disadvantages of logistic regression:
Presumes a linear relationship between the features and the log odds of the response.
Performance is (generally) not competitive with the best supervised learning methods.
Can't automatically learn feature interactions.
Advanced Classification Metrics
When we evaluate the performance of a logistic regression (or any classifier model), the standard metric to use is accuracy: How many class labels did we guess correctly? However, accuracy is only one of several metrics we could use when evaluating a classification model.
Accuracy alone doesn’t always give us a full picture.
If we know a model is 75% accurate, it doesn’t provide any insight into why the 25% was wrong.
Consider a binary classification problem where we have 165 observations/rows of people who are either smokers or nonsmokers.
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | | | |
| Actual: Yes | | | |
There are 60 observations in class 0 (nonsmokers) and 105 observations in class 1 (smokers).
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | | | 60 |
| Actual: Yes | | | 105 |
We have 55 predictions of class 0 (predicted nonsmokers) and 110 predictions of class 1 (predicted smokers).
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | | | 60 |
| Actual: Yes | | | 105 |
| Total | 55 | 110 | 165 |
True positives (TP): These are cases in which we predicted yes (smokers), and they actually are smokers.
True negatives (TN): We predicted no, and they are nonsmokers.
False positives (FP): We predicted yes, but they were not actually smokers. (This is also known as a "Type I error.")
False negatives (FN): We predicted no, but they are smokers. (This is also known as a "Type II error.")
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
Categorize these as TP, TN, FP, or FN:
Try not to look at the answers above.
We predict nonsmoker, but the person is a smoker.
We predict nonsmoker, and the person is a nonsmoker.
We predict smoker, and the person is a smoker.
We predict smoker, and the person is a nonsmoker.
Accuracy: Overall, how often is the classifier correct?
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
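Plugging in the values from the table above: accuracy = (TP + TN) / total = (100 + 50) / 165 ≈ 0.91.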
True positive rate (TPR) asks, “Out of all of the target class labels, how many were accurately predicted to belong to that class?”
For example, given a medical exam that tests for cancer, how often does it correctly identify patients with cancer?
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
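From the table: TPR = TP / (TP + FN) = 100 / 105 ≈ 0.95. (TPR is also called sensitivity or recall.)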
False positive rate (FPR) asks, “Out of all items not belonging to a class label, how many were predicted as belonging to that target class label?”
For example, given a medical exam that tests for cancer, how often does it trigger a “false alarm” by incorrectly saying a patient has cancer?
| n = 165 | Predicted: No | Predicted: Yes | Total |
|---|---|---|---|
| Actual: No | TN = 50 | FP = 10 | 60 |
| Actual: Yes | FN = 5 | TP = 100 | 105 |
| Total | 55 | 110 | 165 |
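From the table: FPR = FP / (FP + TN) = 10 / 60 ≈ 0.17.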
Can you see that we might weigh TPR AND FPR differently depending on the situation?
Give an example when we care about TPR, but not FPR.
Give an example when we care about FPR, but not TPR.
More Trade-Offs
The true positive and false positive rates give us a much clearer picture of where predictions begin to fall apart.
This allows us to adjust our models accordingly.
Below we will load in some data on admissions to college.
We can predict the admit class from gre and use a train-test split to evaluate the performance of our model on a held-out test set.
Recall that our "baseline" accuracy is the proportion of the majority class label.
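A sketch of the workflow; the file name and columns below are assumptions, so adjust them to match the lesson's admissions file:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed local file with columns including admit and gre
admissions = pd.read_csv('admissions.csv')

X = admissions[['gre']]
y = admissions['admit']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Baseline accuracy: always guessing the majority class
print(y_test.value_counts(normalize=True).max())

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print(logreg.score(X_test, y_test))
```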
Create a confusion matrix of predictions on our test set using metrics.confusion_matrix.
Answer the following:
What is our accuracy on the test set?
True positive rate?
False positive rate?
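One way to compute all three, continuing from the assumed admissions sketch above:

```python
from sklearn import metrics

y_pred = logreg.predict(X_test)

# For a binary problem, ravel() unpacks the matrix as tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)   # true positive rate (sensitivity)
fpr = fp / (fp + tn)   # false positive rate

print(accuracy, tpr, fpr)
```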
A good classifier would have a true positive rate approaching 1 and a false positive rate approaching 0.
In our smoking problem, this model would accurately predict all of the smokers as smokers and not accidentally predict any of the nonsmokers as smokers.
Trading True Positives and True Negatives
By default, and with respect to the underlying assumptions of logistic regression, we predict a positive class when the probability of the class is greater than .5 and predict a negative class otherwise.
What if we decide to use .3 as a threshold for picking the positive class? Is that even allowed?
This turns out to be a useful strategy. By setting a lower probability threshold, we will predict more positive classes, which means we will predict more true positives, but fewer true negatives.
Making this trade-off is important in applications that have imbalanced penalties for misclassification.
The most popular example is medical diagnostics, where we want as many true positives as feasible. For example, if we are diagnosing cancer, we prefer false positives (predicting cancer when there is none), which can later be corrected with a more specific test.
We do this in machine learning by setting a low threshold for predicting positives, which increases the number of true positives and false positives but allows us to balance the costs of being correct and incorrect.
We can vary the classification threshold for our model to get different predictions.
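A sketch of thresholding the predicted probabilities ourselves, using the model and test split from above:

```python
# Predict class 1 whenever P(class 1) exceeds a chosen threshold
probs = logreg.predict_proba(X_test)[:, 1]

preds_default = (probs >= 0.5).astype(int)  # sklearn's default behavior
preds_low = (probs >= 0.3).astype(int)      # lower bar for the positive class

# The lower threshold flags more positives (more TPs, but more FPs too)
print(preds_default.sum(), preds_low.sum())
```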
The Accuracy Paradox
Accuracy is a very intuitive metric — it's a lot like an exam score where you get total correct/total attempted. However, accuracy is often a poor metric in application. There are many reasons for this:
Imbalanced problems: a problem with 95% positives in the baseline will have 95% accuracy even with no predictive power.
This is the paradox; pursuing accuracy often means predicting the most common class rather than doing the most useful work.
Applications often have uneven penalties and rewards for true positives and false positives.
Ranking predictions in the correct order may be more important than getting them correct.
In many cases, we need to know the exact probability of positives and negatives:
To calculate an expected return.
To triage observations that are borderline positive.
Some of the most useful metrics for addressing these problems are:
Classification accuracy/error
Classification accuracy is the percentage of correct predictions (higher is better).
Classification error is the percentage of incorrect predictions (lower is better).
Easiest classification metric to understand.
Confusion matrix
Gives you a better understanding of how your classifier is performing.
Allows you to calculate sensitivity, specificity, and many other metrics that might match your business objective better than accuracy.
Precision and recall are good for balancing misclassification costs.
ROC curves and area under a curve (AUC)
Good for ranking and prioritization problems.
Allows you to visualize the performance of your classifier across all possible classification thresholds, thus helping you to choose a threshold that appropriately balances sensitivity and specificity.
Still useful when there is high class imbalance (unlike classification accuracy/error).
Harder to use when there are more than two response classes.
Log loss
Most useful when well-calibrated predicted probabilities are important to your business objective.
Expected value calculations
Triage
The good news is that these are readily available in Python and R, and are usually easy to calculate once you know about them.
OPTIONAL: How Many Samples Are Needed?
We often ask how large our data set should be to achieve a reasonable logistic regression result. Below, a few methods will be introduced for determining how accurate the resulting model will be.
Rule of Thumb
Quick: At least 100 samples total. At least 10 samples per feature.
Formula method:
Find the proportion of positive cases and negative cases. Take the smaller of the two and call it $p$.
Ideally, you want 50/50, for a proportion of $p = 0.5$.
Example: Suppose we are predicting "male" or "female". Our data is 80% male, 20% female.
So, we choose the proportion $p = 0.2$, since it is smaller.
Find the number of independent variables, $k$.
Example: We are predicting gender based on the last letter of the first name, giving us 26 indicator columns for features. So, $k = 26$.
Let the minimum number of cases be $N = 10k/p$. The minimum should always be set to at least 100.
Example: Here, $N = 10 \times 26 / 0.2 = 1300$. So, we would need 1300 names (supposing 80% are male).
Both methods from: Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Statistical Testing
Logistic regression is one of the few machine learning models where we can obtain comprehensive statistics. By performing hypothesis testing, we can understand whether we have sufficient data to make strong conclusions about individual coefficients and the model as a whole. A very popular Python library which gives you these statistics with just a few lines of code is statsmodels.
Power Analysis
As you may suspect, many factors affect how statistically significant the results of a logistic regression are. The art of estimating the sample size to detect an effect of a given size with a given degree of confidence is called power analysis.
Some factors that influence the accuracy of our resulting model are:
Desired statistical significance (p-value)
Magnitude of the effect
It is more difficult to distinguish a small effect from noise. So, more data would be required!
Measurement precision
Sampling error
An effect is more difficult to detect in a smaller sample.
Experimental design
So, many factors, in addition to the number of samples, contribute to the resulting statistical power. Hence, it is difficult to give an absolute number without a more comprehensive analysis. This analysis is out of the scope of this lesson, but it is important to understand some of the factors that affect confidence.
Lesson Review
Logistic regression
What kind of machine learning problems does logistic regression address?
What do the coefficients in a logistic regression represent? How does the interpretation differ from ordinary least squares? How is it similar?
The confusion matrix
How do true positive rate and false positive rate help explain accuracy?
Why might one classification metric be more important to tune than another? Give an example of a business problem or project where this would be the case.