GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_06/code/bias-and-variance - (train-test-split) - (done).ipynb
Kernel: Python 3

Train/Test Split and Bias and Variance

Authors: Joseph Nelson (DC), Kevin Markham (DC)


Learning Objectives

  • Define error due to bias and error due to variance.

  • Identify the bias-variance trade-off.

  • Describe what overfitting and underfitting mean in the context of model building.

  • Explain problems associated with over- and underfitting.

  • Grasp why train/test split is necessary.

  • Explore k-fold cross-validation, LOOCV, and the three-way split.

Bias and Variance Trade-Off


The bias-variance tradeoff is widely used in machine learning as a conceptual way of comparing and contrasting different models. It is one of the few frameworks that can be applied to compare any machine learning models; most other comparison methods are more mathematical and model-specific.

Bias is error stemming from incorrect model assumptions.

  • Example: Assuming data is linear when it has a more complicated structure.

Variance is error stemming from being overly sensitive to changes in the training data.

  • Example: A model that memorizes the training set exactly (e.g., 1-NN) changes completely even if the training set differs only slightly.

As model complexity increases:

  • Bias decreases. (The model can more accurately model complex structure in data.)

  • Variance increases. (The model identifies more complex structures, making it more sensitive to small changes in the training data.)

Bias? Variance?

Conceptual Definitions

  • Bias: How close are predictions to the actual values?

    • Roughly, whether or not our model aims at the target.

    • If the model cannot represent the data's structure, our predictions could be consistent, but will not be accurate.

  • Variance: How variable are our predictions?

    • Roughly, whether or not our model is reliable.

    • We will make slightly different predictions given slightly different training sets.

  • Visually, we are building a model where the bulls-eye is the goal.

  • Each individual hit is one prediction based on our model.

  • Critically, the success of our model (low variance, low bias) depends on the training data present.

Examples:

  • Linear regression: Low variance, High bias.

    • If we train with a different subset of the training set, the model will be about the same. Hence, the model has low variance.

    • The resulting model will predict the training points incorrectly (unless they happen to be perfectly linear). Hence, it has high bias.

  • Nearest neighbor: High variance, Low bias.

    • If we train with a different subset of the training set, the model will make predictions very differently. Hence, the model is highly variable.

    • The resulting model will predict every training point perfectly. Hence, it has low bias.

  • K-Nearest neighbors: Medium-high variance, medium-low bias.

    • The model itself is more robust to outliers, so its predictions change less from one training set to the next. Hence, it has lower variance than 1-NN.

    • The resulting model no longer predicts every point perfectly, since outliers will be mispredicted. So, the bias will be higher than before.

See if you can figure out:

  • High-order polynomial (as compared to linear regression)

Expressing bias and variance mathematically:

It can be helpful to understand these terms by looking at how the total error decomposes into them mathematically. (We will skip the derivations for now! A small simulation sketch below makes the decomposition concrete.)

Let's define the error of our predictor as the expected value of our squared error. Note this error is not based on any particular fitted model, but on the family of potential models given a dataset (i.e. all fitted models made from all possible subsets of data).

$E[(y - \hat{f}(x))^2] = Bias[\hat{f}(x)]^2 + Var[\hat{f}(x)] + \sigma^2$

This states the expected error is based on only three components: bias, variance, and irreducible error.

Breaking the bias and variance down further:

$Bias[\hat{f}(x)] = E[\hat{f}(x) - f(x)]$
  • The bias is the expected difference between our predictor and the true values.

$Var[\hat{f}(x)] = E[\hat{f}(x)^2] - E[\hat{f}(x)]^2$
  • The variance is how much our predictions vary about their mean. ($E[\hat{f}(x)]$ is our predictor's mean prediction.)

  • The irreducible error stems from noise in the problem itself.
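To make the decomposition concrete, here is a minimal simulation sketch (not part of the original notebook). It assumes a known quadratic "true" function, repeatedly samples noisy training sets, fits polynomials of different degrees with np.polyfit, and estimates the squared bias and variance of the predictions at a single test point:

import numpy as np

rng = np.random.RandomState(0)

def true_f(x):
    return 0.5 * x ** 2  # the assumed "true" relationship

def bias_variance_at(degree, x0=2.0, n_datasets=500, n_points=30, noise_sd=1.0):
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(-3, 3, n_points)
        y = true_f(x) + rng.normal(0, noise_sd, n_points)  # noisy training sample
        coefs = np.polyfit(x, y, degree)     # fit a polynomial of the given degree
        preds.append(np.polyval(coefs, x0))  # predict at the fixed test point
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # squared bias at x0
    variance = preds.var()                      # variance of the predictions at x0
    return bias_sq, variance

for degree in [1, 2, 9]:
    b, v = bias_variance_at(degree)
    print('degree {}: bias^2 = {:.3f}, variance = {:.3f}'.format(degree, b, v))

The linear fit (degree 1) should show the largest squared bias, while the degree-9 fit should show the largest variance, mirroring the trade-off described above.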

Some common questions:

From the math above, we can answer a few common questions:

Can a model have high bias given one dataset and low bias for another?

  • Yes. If our data is linearly related, for example, it will have low bias on a linear model! However, in general across all datasets very few are accurately described with a linear model. So, in general we say a linear model has high bias and low variance.

Is the MSE for a fitted linear regression the same thing as the bias?

  • It's close, but bias does not apply to a specific fitted model. Bias is the expected error of a model no matter what subset of the data it is fit on. This way, if we happen to get a lucky MSE fitting a model on a particular subset of our data, this does not mean we will have a low bias overall.

Exploring the Bias-Variance Trade-Off

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Allow plots to appear in the notebook.
%matplotlib inline

Brain and Body Weight Mammal Data Set

This is a data set of the average weight of the body (in kg) and the brain (in g) for 62 mammal species. We'll use this dataset to investigate bias vs. variance. Let's read it into Pandas and take a quick look:

path = 'mammals.txt'
cols = ['brain', 'body']
mammals = pd.read_table(path, sep='\t', names=cols, header=0)
mammals.head()
mammals.describe()

We're going to focus on a smaller subset in which the body weight is less than 200 kg.

# Only keep rows in which the body weight is less than 200 kg.
mammals = mammals[mammals['body'] < 200]
mammals.shape
(51, 2)

We're now going to pretend that there are only 51 mammal species in existence. In other words, we are pretending that this is the entire data set of brain and body weights for every known mammal species.

Let's create a scatterplot (using Seaborn) to visualize the relationship between brain and body weight:

sns.lmplot(x='body', y='brain', data=mammals, ci=None, fit_reg=False);
plt.xlim(-10, 200);
plt.ylim(-10, 250);
Image in a Jupyter notebook

There appears to be a relationship between brain and body weight for mammals.

Making a Prediction

Linear Regression: A Quick Review

Now let's pretend that a new mammal species is discovered. We measure the body weight of every member of this species we can find and calculate an average body weight of 100 kgs. We want to predict the average brain weight of this species (rather than measuring it directly). How might we do this?

sns.lmplot(x='body', y='brain', data=mammals, ci=None);
plt.xlim(-10, 200);
plt.ylim(-10, 250);
Image in a Jupyter notebook

We drew a straight line that appears to best capture the relationship between brain and body weight. So, we might predict that our new species has a brain weight of about 45 g, as that's the approximate y value when x=100.

This is known as a "linear model" or a "linear regression model."

Making a Prediction From a Sample

Earlier, we assumed that this dataset contained every known mammal species. That's very convenient, but in the real world, all you ever have is a sample of data. This may sound like a contentious statement, but the point of machine learning is to generalize from a sample to the population. If you already have data for the entire population, then you have no need for machine learning -- you can apply statistics directly and get optimal answers!

Here, a more realistic situation would be to only have brain and body weights for (let's say) half of the 51 known mammals.

When that new mammal species (with a body weight of 100 kg) is discovered, we still want to make an accurate prediction for its brain weight, but this task might be more difficult, as we don't have all of the data we would ideally like to have.

Let's simulate this situation by assigning each of the 51 observations to either universe 1 or universe 2:

# Set a random seed for reproducibility.
np.random.seed(12345)

# Randomly assign every observation to either universe 1 or universe 2.
mammals['universe'] = np.random.randint(1, 3, len(mammals))
mammals.head()

Important: We only live in one of the two universes. Both universes have 51 known mammal species, but each universe knows the brain and body weight for different species.

We can now tell Seaborn to create two plots in which the left plot only uses the data from universe 1 and the right plot only uses the data from universe 2:

# col='universe' subsets the data by universe and creates two separate plots.
sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe');
plt.xlim(-10, 200);
plt.ylim(-10, 250);
Image in a Jupyter notebook

The line looks pretty similar between the two plots, despite the fact that they used separate samples of data. In both cases, we would predict a brain weight of about 45 g.

It's easier to see the degree of similarity by placing them on the same plot:

# hue='universe' subsets the data by universe and creates a single plot.
sns.lmplot(x='body', y='brain', data=mammals, ci=None, hue='universe');
plt.xlim(-10, 200);
plt.ylim(-10, 250);
Image in a Jupyter notebook

So, what was the point of this exercise? This was a visual demonstration of a high-bias, low-variance model.

  • It's high bias because it doesn't fit the data particularly well.

  • It's low variance because it doesn't change much depending on which observations happen to be available in that universe.

Let's Try Something Completely Different

What would a low bias, high variance model look like? Let's try polynomial regression with an eighth-order polynomial.

sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe', order=8);
plt.xlim(-10, 200);
plt.ylim(-10, 250);
Image in a Jupyter notebook
  • It's low bias because the models match the data effectively.

  • It's high variance because the models are widely different, depending on which observations happen to be available in that universe. (For a body weight of 100 kg, the brain weight prediction would be about 40 g in one universe and about 0 g in the other!)

sns.lmplot(x='body', y='brain', data=mammals, ci=None, hue='universe', order=8);
plt.xlim(-10, 200);
plt.ylim(-10, 250);
Image in a Jupyter notebook

Balancing Bias and Variance

Can we find a middle ground?

Perhaps we can create a model that has less bias than the linear model and less variance than the eighth order polynomial?

Let's try a second order polynomial instead:

sns.lmplot(x='body', y='brain', data=mammals, ci=None, col='universe', order=2);
plt.xlim(-10, 200);
plt.ylim(-10, 250);
Image in a Jupyter notebook

This seems better. In both the left and right plots, it fits the data well, but not too well.

This is the essence of the bias-variance trade-off: You are seeking a model that appropriately balances bias and variance and thus will generalize to new data (known as "out-of-sample" data).

We want a model that best balances bias and variance: it should fit our training data well (moderate bias) yet remain stable enough to generalize to out-of-sample data (moderate variance).

  • Training error as a function of model complexity (see the sketch after this list).

  • Question: Why do we even care about variance if we know we can generate a more accurate model with higher complexity?
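Here is a rough sketch of those two bullets (not from the original notebook), reusing the mammals DataFrame from above and scikit-learn's train_test_split, which is covered formally later in this lab. Training error keeps falling as the polynomial order grows, while testing error typically bottoms out at a moderate order:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hold out part of the mammals data so we can measure error on unseen points.
train, test = train_test_split(mammals, random_state=123)

for order in [1, 2, 4, 8]:
    # Fit a polynomial on the training half only.
    # (High orders may warn about a poorly conditioned fit; that is part of the point.)
    coefs = np.polyfit(train['body'], train['brain'], order)
    train_mse = mean_squared_error(train['brain'], np.polyval(coefs, train['body']))
    test_mse = mean_squared_error(test['brain'], np.polyval(coefs, test['body']))
    print('order {}: train MSE = {:.1f}, test MSE = {:.1f}'.format(order, train_mse, test_mse))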

Can we obtain a zero-bias, zero-variance model?

No! If there is any noise in the data-generating process, then a zero-variance model would not be learning from the data. Additionally, a model only has zero bias if the true relationship between the target and the features is hard-coded into it. If that were the case, you wouldn't be doing machine learning -- it would be similar to trying to predict today's temperature by using today's temperature!


Train-test-split

For the rest of the lab, we will look at three evaluation procedures for predicting model out-of-sample accuracy:

  1. Training on the entire dataset should never be used to estimate model accuracy on out-of-sample data! After all, training error can be made arbitrarily small by increasing model complexity. You might train on the entire dataset as the very last step once a model is chosen, hoping to make the final model as accurate as possible. Or, you could use this to estimate the degree of overfitting.

  2. Train-test-split is useful if cross-validation is not practical (e.g. it takes too long to train). It is also useful for computing a quick confusion matrix. You could also use this as a final step after the model is finalized (often called evaluating the model against a validation set).

  3. Cross-validation is the gold standard for estimating accuracy and comparing accuracy across models.

  4. Three-way split combines cross-validation and the train-test-split. It takes an initial split to be used as a final validation set, then uses cross-validation on the rest.

We run into a problem when powerful models can perfectly fit the data on which they are trained. These models are low bias and high variance. However, we can't observe the variance of a model directly, because we only know how it fits the data we have rather than all potential samples.

Solution: Create a procedure that estimates how well a model is likely to perform on out-of-sample data and use that to choose between models.

  • Until now, we have been splitting the data into a single training group and a single test group.

  • Now, to estimate how well the model is likely to perform on out-of-sample data, we will create many training groups and many test groups and then fit many models (a short sketch follows this list).
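A minimal sketch of that idea uses scikit-learn's ShuffleSplit (hypothetical here: it assumes a feature DataFrame X and a response y, such as the Boston data loaded later in this lab):

import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Ten random train/test partitions, each holding out 25% of the rows for testing.
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

test_mses = []
for train_idx, test_idx in splitter.split(X):
    model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
    test_mses.append(mean_squared_error(y.iloc[test_idx], model.predict(X.iloc[test_idx])))

print(np.mean(test_mses))  # average test MSE across the repeated splits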

Note: These procedures can be used with any machine learning model.

The Holdout Method: Train/Test Split

  • Training set: Used to train the classifier.

  • Testing set: Used to estimate the error rate of the trained classifier.

  • Advantages: Fast, simple, computationally inexpensive.

  • Disadvantages: Removes data from training; a single split may not be representative.

Evaluation Procedure #1: Train and Test on the Entire Data Set (Do Not Do This)

  1. Train the model on the entire data set.

  2. Test the model on the same data set and evaluate how well we did by comparing the predicted response values with the true response values.

Load in the Boston data.

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.DESCR)
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Create X and y variables to store the feature matrix and response from the Boston data.

# Create a DataFrame for both parts of data; don't forget to assign column names.
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=['MEDV'])

Concatenate y and X, then overwrite the Boston variable.

boston = pd.concat([y, X], axis=1)

Perform basic EDA to make sure the data are in order.

boston.isnull().sum()
MEDV 0 CRIM 0 ZN 0 INDUS 0 CHAS 0 NOX 0 RM 0 AGE 0 DIS 0 RAD 0 TAX 0 PTRATIO 0 B 0 LSTAT 0 dtype: int64
boston.dtypes
MEDV float64 CRIM float64 ZN float64 INDUS float64 CHAS float64 NOX float64 RM float64 AGE float64 DIS float64 RAD float64 TAX float64 PTRATIO float64 B float64 LSTAT float64 dtype: object
boston.describe()

Prepare a feature matrix (X) and response (y) for scikit-learn.

# create feature matrix (X)
feature_cols = boston.columns.drop(['MEDV'])
X = boston[feature_cols]

# create response vector (y)
y = boston.MEDV

Import linear regression, instantiate, fit, and preview predictions.

# Import the class.
from sklearn.linear_model import LinearRegression

# Instantiate the model.
lr = LinearRegression()

# Train the model on the entire data set.
lr.fit(X, y)

# Predict the response values for the observations in X ("test the model").
lr.predict(X)
array([30.00821269, 25.0298606 , 30.5702317 , 28.60814055, 27.94288232, 25.25940048, 23.00433994, 19.5347558 , 11.51696539, 18.91981483, 18.9958266 , 21.58970854, 20.90534851, 19.55535931, 19.2837957 , 19.30000174, 20.52889993, 16.9096749 , 16.17067411, 18.40781636, 12.52040454, 17.67104565, 15.82934891, 13.80368317, 15.67708138, 13.3791645 , 15.46258829, 14.69863607, 19.54518512, 20.87309945, 11.44806825, 18.05900412, 8.78841666, 14.27882319, 13.69097132, 23.81755469, 22.34216285, 23.11123204, 22.91494157, 31.35826216, 34.21485385, 28.0207132 , 25.20646572, 24.61192851, 22.94438953, 22.10150945, 20.42467417, 18.03614022, 9.10176198, 17.20856571, 21.28259372, 23.97621248, 27.65853521, 24.0521088 , 15.35989132, 31.14817003, 24.85878746, 33.11017111, 21.77458036, 21.08526739, 17.87203538, 18.50881381, 23.9879809 , 22.54944098, 23.37068403, 30.36557584, 25.53407332, 21.11758504, 17.42468223, 20.7893086 , 25.20349174, 21.74490595, 24.56275612, 24.04479519, 25.5091157 , 23.97076758, 22.94823519, 23.36106095, 21.26432549, 22.4345376 , 28.40699937, 26.99734716, 26.03807246, 25.06152125, 24.7858613 , 27.79291889, 22.16927073, 25.89685664, 30.67771522, 30.83225886, 27.12127354, 27.41597825, 28.9456478 , 29.08668003, 27.04501726, 28.62506705, 24.73038218, 35.78062378, 35.11269515, 32.25115468, 24.57946786, 25.59386215, 19.76439137, 20.31157117, 21.4353635 , 18.53971968, 17.18572611, 20.74934949, 22.64791346, 19.77000977, 20.64745349, 26.52652691, 20.77440554, 20.71546432, 25.17461484, 20.4273652 , 23.37862521, 23.69454145, 20.33202239, 20.79378139, 21.92024414, 22.47432006, 20.55884635, 16.36300764, 20.56342111, 22.48570454, 14.61264839, 15.1802607 , 18.93828443, 14.0574955 , 20.03651959, 19.41306288, 20.06401034, 15.76005772, 13.24771577, 17.26167729, 15.87759672, 19.36145104, 13.81270814, 16.44782934, 13.56511101, 3.98343974, 14.59241207, 12.14503093, 8.72407108, 12.00815659, 15.80308586, 8.50963929, 9.70965512, 14.79848067, 20.83598096, 18.30017013, 20.12575267, 17.27585681, 22.35997992, 20.07985184, 13.59903744, 33.26635221, 29.03938379, 25.56694529, 32.71732164, 36.78111388, 40.56615533, 41.85122271, 24.79875684, 25.3771545 , 37.20662185, 23.08244608, 26.40326834, 26.65647433, 22.55412919, 24.2970948 , 22.98024802, 29.07488389, 26.52620066, 30.72351225, 25.61835359, 29.14203283, 31.43690634, 32.9232938 , 34.72096487, 27.76792733, 33.88992899, 30.99725805, 22.72124288, 24.76567683, 35.88131719, 33.42696242, 32.41513625, 34.51611818, 30.76057666, 30.29169893, 32.92040221, 32.11459912, 31.56133385, 40.84274603, 36.13046343, 32.66639271, 34.70558647, 30.09276228, 30.64139724, 29.29189704, 37.07062623, 42.02879611, 43.18582722, 22.6923888 , 23.68420569, 17.85435295, 23.49543857, 17.00872418, 22.39535066, 17.06152243, 22.74106824, 25.21974252, 11.10601161, 24.51300617, 26.60749026, 28.35802444, 24.91860458, 29.69254951, 33.18492755, 23.77145523, 32.14086508, 29.74802362, 38.36605632, 39.80716458, 37.58362546, 32.39769704, 35.45048257, 31.23446481, 24.48478321, 33.28615723, 38.04368164, 37.15737267, 31.71297469, 25.26658017, 30.101515 , 32.71897655, 28.42735376, 28.42999168, 27.2913215 , 23.74446671, 24.11878941, 27.40241209, 16.32993575, 13.39695213, 20.01655581, 19.86205904, 21.28604604, 24.07796482, 24.20603792, 25.04201534, 24.91709097, 29.93762975, 23.97709054, 21.69931969, 37.51051381, 43.29459357, 36.48121427, 34.99129701, 34.80865729, 37.16296374, 40.9823638 , 34.44211691, 35.83178068, 28.24913647, 31.22022312, 40.83256202, 39.31768808, 25.71099424, 22.30344878, 27.20551341, 28.51386352, 
35.47494122, 36.11110647, 33.80004807, 35.61141951, 34.84311742, 30.35359323, 35.31260262, 38.79684808, 34.33296541, 40.34038636, 44.67339923, 31.5955473 , 27.35994642, 20.09520596, 27.04518524, 27.21674397, 26.91105226, 33.43602979, 34.40228785, 31.83374181, 25.82416035, 24.43687139, 28.46348891, 27.36916176, 19.54441878, 29.11480679, 31.90852699, 30.77325183, 28.9430835 , 28.88108106, 32.79876794, 33.20356949, 30.76568546, 35.55843485, 32.70725436, 28.64759861, 23.59388439, 18.5461558 , 26.88429024, 23.28485442, 25.55002201, 25.48337323, 20.54343769, 17.61406384, 18.37627933, 24.29187594, 21.3257202 , 24.88826131, 24.87143049, 22.87255605, 19.4540234 , 25.11948741, 24.66816374, 23.68209656, 19.33951725, 21.17636041, 24.25306588, 21.59311197, 19.98766667, 23.34079584, 22.13973959, 21.55349196, 20.61808868, 20.1607571 , 19.28455466, 22.16593919, 21.24893735, 21.42985456, 30.32874523, 22.04915396, 27.70610125, 28.54595004, 16.54657063, 14.78278261, 25.27336772, 27.54088054, 22.14633467, 20.46081206, 20.54472332, 16.88194391, 25.40066956, 14.32299547, 16.5927403 , 19.63224597, 22.7117302 , 22.19946949, 19.1989151 , 22.66091019, 18.92059374, 18.22715359, 20.22444386, 37.47946099, 14.29172583, 15.53697148, 10.82825817, 23.81134987, 32.64787163, 34.61163401, 24.94604102, 26.00259724, 6.12085728, 0.78021126, 25.311373 , 17.73465914, 20.22593282, 15.83834861, 16.83742401, 14.43123608, 18.47647773, 13.42427933, 13.05677824, 3.27646485, 8.05936467, 6.13903114, 5.62271213, 6.44935154, 14.20597451, 17.21022671, 17.29035065, 9.89064351, 20.21972222, 17.94511052, 20.30017588, 19.28790318, 16.33300008, 6.56843662, 10.87541577, 11.88704097, 17.81098929, 18.25461066, 12.99282707, 7.39319053, 8.25609561, 8.07899971, 19.98563715, 13.69651744, 19.83511412, 15.2345378 , 16.93112419, 1.69347406, 11.81116263, -4.28300934, 9.55007844, 13.32635521, 6.88351077, 6.16827417, 14.56933235, 19.59292932, 18.1151686 , 18.52011987, 13.13707457, 14.59662601, 9.8923749 , 16.31998048, 14.06750301, 14.22573568, 13.00752251, 18.13277547, 18.66645496, 21.50283795, 17.00039379, 15.93926602, 13.32952716, 14.48949211, 8.78366731, 4.8300317 , 13.06115528, 12.71101472, 17.2887624 , 18.73424906, 18.05271013, 11.49855612, 13.00841512, 17.66975577, 18.12342294, 17.51503231, 17.21307203, 16.48238543, 19.40079737, 18.57392951, 22.47833186, 15.24179836, 15.78327609, 12.64853778, 12.84121049, 17.17173661, 18.50906858, 19.02803874, 20.16441773, 19.76975335, 22.42614937, 20.31750314, 17.87618837, 14.3391341 , 16.93715603, 16.98716629, 18.59431701, 20.16395155, 22.97743546, 22.45110639, 25.5707207 , 16.39091112, 16.09765427, 20.52835689, 11.5429045 , 19.20387482, 21.86820603, 23.47052203, 27.10034494, 28.57064813, 21.0839881 , 19.4490529 , 22.2189221 , 19.65423066, 21.324671 , 11.86231364, 8.22260592, 3.65825168, 13.76275951, 15.93780944, 20.62730097, 20.61035443, 16.88048035, 14.01017244, 19.10825534, 21.29720741, 18.45524217, 20.46764235, 23.53261729, 22.37869798, 27.62934247, 26.12983844, 22.34870269])

Store the predicted response values.

y_pred = lr.predict(X)

To evaluate a model, we also need an evaluation metric:

  • A numeric calculation used to quantify the performance of a model.

  • The appropriate metric depends on the goals of your problem.

The most common choices for regression problems are:

  • R-squared: The percentage of variation explained by the model (a "reward function," as higher is better).

  • Mean squared error: The average squared distance between the prediction and the correct answer (a "loss function," as lower is better).

In this case, we'll use mean squared error because it is more interpretable in a predictive context.
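As a quick sanity check (not in the original notebook), both metrics can be computed by hand on a made-up target/prediction pair and compared against scikit-learn's implementations:

import numpy as np
from sklearn import metrics

y_true = np.array([3.0, 5.0, 7.0])  # made-up actual values
y_hat = np.array([2.5, 5.0, 8.0])   # made-up predictions

mse_manual = np.mean((y_true - y_hat) ** 2)  # average squared error
r2_manual = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse_manual, metrics.mean_squared_error(y_true, y_hat))
print(r2_manual, metrics.r2_score(y_true, y_hat))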

Compute mean squared error using a function from metrics.

from sklearn import metrics

print(metrics.mean_squared_error(y, y_pred))
21.897779217687496

This is known as the training mean squared error because we are evaluating the model based on the same data we used to train the model.

Problems With Training and Testing on the Same Data

  • Our goal is to estimate likely performance of a model on out-of-sample data.

  • But, minimizing the training mean squared error rewards overly complex models that won't necessarily generalize.

  • Unnecessarily complex models overfit the training data.

Thus, the training MSE is not a good estimate of the out-of-sample MSE.

Evaluation procedure #2: Train/Test Split

  1. Split the data set into two pieces: a training set and a testing set.

  2. Train the model on the training set.

  3. Test the model on the testing set and evaluate how well we did.

Often a good rule-of-thumb is 70% training/30% test, but this can vary based on the size of your dataset. For example, with a small dataset you would need to use as much training data as possible (in return, your test accuracy will be more variable).
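For example, a 70/30 split can be requested explicitly (a sketch using the X and y created above; the test_size value and random_state are arbitrary choices):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # roughly 70% / 30% of the rows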

What does this accomplish?

  • Models can be trained and tested on different data (We treat testing data like out-of-sample data).

  • Response values are known for the testing set and thus predictions can be evaluated.

This is known as the testing mean squared error because we are evaluating the model on an independent "test set" that was not used during model training.

The testing MSE is a better estimate of out-of-sample performance than the training MSE.

Before We Dive Into Train/Test Split, Let's Understand "Unpacking" Syntax

Unpacking allows us to break down the contents of an object and assign its elements to several variables simultaneously.

Let's create a packed object using zip, then unpack it using a for loop.

# Let's start with two lists that are related in some manner.
package = ['package_1', 'package_2', 'package_3', 'package_4']
directions = ['directions_1', 'directions_2', 'directions_3', 'directions_4']

# We'll zip them together to form the associated combos.
# We can then use `for obj_1, obj_2 in ...` to isolate the values we need.
for p, d in zip(package, directions):
    print('Shipment: {} | Shipment Contents: {}'.format(p, d))
Shipment: package_1 | Shipment Contents: directions_1 Shipment: package_2 | Shipment Contents: directions_2 Shipment: package_3 | Shipment Contents: directions_3 Shipment: package_4 | Shipment Contents: directions_4

Rather than using a for loop to unpack an output, we can simply assign the results, assuming we know exactly how many results need to be assigned. We can think of the result of zip as comprising four subcomponents; we can use a for loop to help us break the subcomponents out OR use the unpacking method.

box1, box2, box3, box4 = zip(package, directions)
print(box1)
print(box3)
('package_1', 'directions_1') ('package_3', 'directions_3')

In the case of train/test split, we add an unpacking assignment to the return value of a function, as exemplified by the code below:

# Create a function that takes an argument to act upon.
def min_max(nums):
    smallest = min(nums)
    largest = max(nums)
    # The function returns a list in the order below.
    return [smallest, largest, 5]
# We can assign the returned list to a single variable,
min_and_max = min_max([1, 2, 3])
print(min_and_max)
print(type(min_and_max))
[1, 3, 5] <class 'list'>
# OR, because we know the list is composed of three elements,
# assign each element to its own variable.
the_min, the_max, five = min_max([1, 2, 3])
print(the_max)
print(the_min)
print(five)
3 1 5

Understanding the train_test_split Function

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
# Before splitting
print(X.shape)

# After splitting (train_test_split holds out 25% for testing by default)
print(X_train.shape)
print(X_test.shape)
(506, 13) (379, 13) (127, 13)
# Recall that (1,) is a tuple.
# The trailing comma distinguishes it as being a tuple, not an integer.

# Before splitting
print(y.shape)

# After splitting
print(y_train.shape)
print(y_test.shape)
(506,) (379,) (127,)

train_test_split

Understanding the random_state Parameter

The random_state parameter seeds the pseudo-random number generator so we can reproduce our results every time we run the code. However, it does not tell us in advance what those exact results will be if we choose a new random_state.

random_state is very useful for testing that your model was made correctly since it provides you with the same split each time. However, make sure you remove it if you are testing for model variability!

# WITHOUT a random_state parameter:
# (If you run this code several times, you get different results!)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Print the first element of each object.
print(X_train.head(1))
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \ 313 0.26938 0.0 9.9 0.0 0.544 6.266 82.8 3.2628 4.0 304.0 PTRATIO B LSTAT 313 18.4 393.39 7.9
# WITH a random_state parameter:
# (Same split every time! Note you can change the random state to any integer.)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Print the first element of each object.
print(X_train.head(1))
print(X_test.head(1))
print(y_train.head(1))
print(y_test.head(1))
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \ 502 0.04527 0.0 11.93 0.0 0.573 6.12 76.7 2.2875 1.0 273.0 PTRATIO B LSTAT 502 21.0 396.9 9.08 CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX \ 307 0.04932 33.0 2.18 0.0 0.472 6.849 70.3 3.1827 7.0 222.0 PTRATIO B LSTAT 307 18.4 396.9 7.53 502 20.6 Name: MEDV, dtype: float64 307 28.2 Name: MEDV, dtype: float64

Introduce Patsy

We will make one more modification. Patsy is a library that allows you to quickly perform simple data transformations in a style similar to R.

Rather than manually creating X and y, we will use the .dmatrices() function from Patsy to create the matrices and explore the effect of changing features on training and testing error.

import patsy

Step 1: Split X and y into training and testing sets (using random_state for reproducibility).

y, X = patsy.dmatrices("MEDV ~ AGE + RM", data=boston, return_type="dataframe")
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

Step 2: Train the model on the training set.

lr = LinearRegression()
lr.fit(X_train, y_train)  # .fit finds the coefficients
lr.coef_
array([[ 0. , -0.07483512, 8.2721415 ]])

Step 3: Test the model on the testing set and check the accuracy.

y_pred = lr.predict(X_test)

from sklearn.metrics import r2_score

#print(metrics.mean_squared_error(y_train, lr.predict(X_train)))
#print(metrics.mean_squared_error(y_test, y_pred))
print(r2_score(y_train, lr.predict(X_train)))
print(r2_score(y_test, y_pred))
0.5225720430412025 0.5420794678685888

Bias-variance tradeoff

Go back to Step 1 and try adding new variables and transformations.

  • Training error: Decreases as model complexity increases (e.g., more features or polynomial terms, or a lower value of k in KNN).

  • Testing error: Is minimized at the optimum model complexity.

Comparing Test Performance With a Null Baseline

When interpreting the predictive power of a model, it's best to compare it against a baseline built from a dummy model (sometimes called a ZeroR or null model). A dummy model simply predicts the mean, median, or most common value for every observation. This provides a benchmark to compare your model against and becomes especially important in classification, where the null accuracy might already be 95 percent.

For example, suppose your dataset is imbalanced -- it contains 99% one class and 1% the other class. Then, your baseline accuracy (always guessing the first class) would be 99%. So, if your model is less than 99% accurate, you know it is worse than the baseline. Imbalanced datasets generally must be trained differently (with less of a focus on accuracy) because of this.

You can alternatively use simple models to achieve baseline results, for example nearest neighbors or a basic unigram bag of words for text data.
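One way to build such a baseline is with scikit-learn's dummy estimators; this is a sketch (not part of the original notebook) using the train/test split created above:

from sklearn.dummy import DummyRegressor

dummy = DummyRegressor(strategy='mean')  # always predicts the mean of y_train
dummy.fit(X_train, y_train)
print(metrics.mean_squared_error(y_test, dummy.predict(X_test)))

# For classification, DummyClassifier(strategy='most_frequent') plays the same
# role and reports the "null accuracy" described above.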

Examine the baseline mean squared error using a null model.

How does this compare to what we achieved with linear regression? Is our model making an actual improvement?

# Use the mean of the test targets as the prediction for every observation (null model).
y_null = np.full(y_test.shape, y_test.values.mean())
print(metrics.mean_squared_error(y_test, y_null))
91.72279744559489

K-Folds Cross-Validation


Train/test split provides us with a helpful tool, but it's a shame that we are tossing out a large chunk of our data for testing purposes.

How can we use the maximum amount of our data points while still ensuring model integrity?

  1. Split our data into k different pieces (folds).

  2. Hold out one fold for testing and train on the remaining k-1 folds.

  3. Evaluate the trained model on the held-out fold.

  4. Repeat steps 2–3 so that each fold serves as the test set exactly once.

  5. Average all k test accuracies to get the estimated out-of-sample accuracy.

Although this may sound complicated, we are just training the model on k separate train-test-splits, then taking the average of the resulting test accuracies!

Leave-One-Out Cross-Validation

A special case of k-fold cross-validation is leave-one-out cross-validation (LOOCV). Rather than using 5–10 folds, we use n folds: each model is trained on n-1 observations and tested on the single observation that was left out.

Typically, 5–10 fold cross-validation is recommended.
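Here is a sketch of LOOCV with scikit-learn (reusing the X, y, and lr defined above). Because each test fold contains a single observation, R-squared is not defined per fold, so mean squared error is the natural metric; note that fitting one model per observation can be slow on large datasets:

from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()
mse_per_obs = -cross_val_score(lr, X, y, cv=loo, scoring='neg_mean_squared_error')
print(mse_per_obs.mean())  # average squared error over all held-out observations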

Intro to Cross-Validation With the Boston Data

Create a cross-validation with five folds.

from sklearn import model_selection
kf = model_selection.KFold(n_splits=5, shuffle=True)
mse_values = []
scores = []
n = 0

print("~~~~ CROSS VALIDATION each fold ~~~~")
for train_index, test_index in kf.split(X, y):
    lr = LinearRegression()
    lr.fit(X.iloc[train_index], y.iloc[train_index])

    mse_values.append(metrics.mean_squared_error(y.iloc[test_index], lr.predict(X.iloc[test_index])))
    scores.append(lr.score(X, y))  # R^2 is computed on the full data set here, not just the test fold

    n += 1
    print('Model {}'.format(n))
    print('MSE: {}'.format(mse_values[n-1]))
    print('R2: {}\n'.format(scores[n-1]))

print("~~~~ SUMMARY OF CROSS VALIDATION ~~~~")
print('Mean of MSE for all folds: {}'.format(np.mean(mse_values)))
print('Mean of R2 for all folds: {}'.format(np.mean(scores)))
~~~~ CROSS VALIDATION each fold ~~~~ Model 1 MSE: 31.85278190542879 R2: 0.5296503177147146 Model 2 MSE: 49.28656832449435 R2: 0.529736665484839 Model 3 MSE: 34.7083355581299 R2: 0.529608811175335 Model 4 MSE: 33.74819013789556 R2: 0.5299810728091623 Model 5 MSE: 52.073056248929845 R2: 0.5280340474683578 ~~~~ SUMMARY OF CROSS VALIDATION ~~~~ Mean of MSE for all folds: 40.33378643497569 Mean of R2 for all folds: 0.5294021829304818
from sklearn.model_selection import cross_val_score

# Note the results will vary each run since we take a different
# subset of the data each time (since shuffle=True).
kf = model_selection.KFold(n_splits=5, shuffle=True)
print(np.mean(-cross_val_score(lr, X, y, cv=kf, scoring='neg_mean_squared_error')))
print(np.mean(cross_val_score(lr, X, y, cv=kf)))
40.34914319956572 0.48796063579837357

While the cross-validated approach here generated more overall error, which of the two approaches would predict new data more accurately — the single model or the cross-validated, averaged one? Why?

Answer:

....

Three-Way Data Split


The most common workflow is actually a combination of train/test split and cross-validation. We take a train/test split of our data right away and try not to spend much time using the testing set. Instead, we take our training data and tune our models using cross-validation. When we think we are done, we do one last test on the testing data to make sure we haven't accidentally overfit to our training data.

If you tune hyperparameters via cross-validation, you should never use cross-validation on the same dataset to estimate out-of-sample accuracy! When cross-validation is used this way, the entire dataset has been used to tune hyperparameters, which invalidates the condition above: we assumed the test set is a stand-in for "out-of-sample" data that was never used to train or tune our model. We would therefore expect the accuracy on such a test set to be artificially inflated compared to truly out-of-sample data. (A code sketch of this workflow follows the procedure list below.)

Even with good evaluation procedures, it is incredibly easy to overfit our models by including features that will not be available in production or that leak information about our testing data in other ways.

  • If model selection and true error estimates are to be computed simultaneously, three disjoint data sets are best.

    • Training set: A set of examples used for learning, i.e., to fit the parameters of the classifier.

    • Validation set: A set of examples used to tune the parameters of the classifier.

    • Testing set: A set of examples used ONLY to assess the performance of the fully trained classifier.

  • Validation and testing must be separate data sets. Once you have the final model set, you cannot do any additional tuning after testing.

  1. Divide data into training, validation, and testing sets.

  2. Select architecture (model type) and training parameters (k).

  3. Train the model using the training set.

  4. Evaluate the model using the validation set.

  5. Repeat steps 2–4, selecting different architectures (models) and tuning parameters.

  6. Select the best model.

  7. Assess the model with the final testing set.
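Here is a rough sketch of this workflow in scikit-learn (a hypothetical example, not from the original notebook: it tunes a polynomial degree on the X and y used above via cross-validation, then assesses the chosen model once on a held-out test set):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Hold out a final test set immediately and do not touch it while tuning.
X_tune, X_final, y_tune, y_final = train_test_split(X, y, test_size=0.2, random_state=0)

# 2-6. Tune the hyperparameter (here, the polynomial degree) with
#      cross-validation on the tuning portion only.
cv_mse = {}
for degree in [1, 2, 3]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X_tune, y_tune, cv=5, scoring='neg_mean_squared_error')
    cv_mse[degree] = -scores.mean()
best_degree = min(cv_mse, key=cv_mse.get)

# 7. Refit the chosen model on all of the tuning data and assess it once
#    on the held-out test set.
best_model = make_pipeline(PolynomialFeatures(best_degree), LinearRegression())
best_model.fit(X_tune, y_tune)
print(best_degree, mean_squared_error(y_final, best_model.predict(X_final)))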

Additional Resources

Summary

In this lab, we compared four methods of estimating model accuracy on out-of-sample data. Throughout your regular data science work, you will likely use all four at some point:

  1. Train on the entire dataset

  2. Train-test-split

  3. Cross-validation

  4. Three-way split