GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_06/code/practice/train_test-cross_validation- done.ipynb
¹⁹⁰⁴ views

Kernel: Python 3

Train-test Split and Cross-Validation Lab

Authors: Joseph Nelson (DC), Kiefer Katovich (SF)

Review of train/test validation methods

We've discussed overfitting, underfitting, and how to validate the "generalizeability" of your models by testing them on unseen data.

In this lab you'll practice two related validation methods:

train/test split
k-fold cross-validation

Train/test split and k-fold cross-validation both serve two useful purposes:

We prevent overfitting by not using all the data, and
We retain some remaining data to evaluate our model.

In the case of cross-validation, the model fitting and evaluation is performed multiple times on different train/test splits of the data.

Ultimately we can the training and testing validation framework to compare multiple models on the same dataset. This could be comparisons of two linear models, or of completely different models on the same data.

Instructions

For your independent practice, fit three different models on the Boston housing data. For example, you could pick three different subsets of variables, one or more polynomial models, or any other model that you like.

Start with train/test split validation:

Fix a testing/training split of the data
Train each of your models on the training data
Evaluate each of the models on the test data
Rank the models by how well they score on the testing data set.

Then try K-Fold cross-validation:

Perform a k-fold cross validation and use the cross-validation scores to compare your models. Did this change your rankings?
Try a few different K-splits of the data for the same models.

If you're interested, try a variety of response variables. We start with MEDV (the .target attribute from the dataset load method).

In [33]:

from matplotlib import pyplot as plt

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

In [34]:

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()

In [35]:

X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.DataFrame(boston.target, columns=['MEDV'])

1. Clean up any data problems

Load the Boston housing data. Fix any problems, if applicable.

In [36]:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X[feature_cols], y, train_size = 0.7)
lr.fit(X_train, Y_train)

print('R^2 for training data')
print(lr.score(X_train, Y_train))

print('R^2 for testing data')
lr.score(X_test, Y_test)

Out[36]:

R^2 for training data
0.31467878566181484
R^2 for testing data

/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)

0.35076275537144336

2. Select 3-4 variables with your dataset to perform a 50/50 test train split on

Use sklearn.
Score and plot your predictions.

In [37]:

X.columns

Out[37]:

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')

In [38]:

feature_cols = ['CRIM','ZN','INDUS','CHAS']

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X[feature_cols], y, train_size = 0.5)

Out[38]:

/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)

In [39]:

print(X_train.shape, X_test.shape)

Out[39]:

(253, 4) (253, 4)

In [40]:

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, Y_train)

print('R^2 for training data')
print(lr.score(X_train, Y_train))

print('R^2 for testing data')
lr.score(X_test, Y_test)

Out[40]:

R^2 for training data
0.33083918056007233
R^2 for testing data

0.3221501061514699

3. Try 70/30 and 90/10

Score and plot.
How do your metrics change?

In [41]:

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X[feature_cols], y, train_size = 0.7)
lr.fit(X_train, Y_train)

print('R^2 for training data')
print(lr.score(X_train, Y_train))

print('R^2 for testing data')
lr.score(X_test, Y_test)

Out[41]:

R^2 for training data
0.31944655767562985
R^2 for testing data

/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_split.py:2026: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)

0.3459077926753505

4. Try K-Folds cross-validation with K between 5-10 for your regression.

What seems optimal?
How do your scores change?
What the variance of scores like?
Try different folds to get a sense of how this impacts your score.

In [42]:

from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)

mse_values = []
scores = []
n = 0

print("~~~~ CROSS VALIDATION each fold ~~~~")
for train_index, test_index in kf.split(X_train, Y_train):
    lr = LinearRegression()
    lr.fit(X.iloc[train_index], y.iloc[train_index])
    scores.append(lr.score(X.iloc[test_index], y.iloc[test_index]))
    
    n += 1
    
    print('Model {}'.format(n))
    print('R2: {}\n'.format(scores[n-1]))


print("~~~~ SUMMARY OF CROSS VALIDATION ~~~~")
print('Mean of R2 for all folds: {}'.format(np.mean(scores)))

Out[42]:

~~~~ CROSS VALIDATION each fold ~~~~
Model 1
R2: 0.909700888095842

Model 2
R2: 0.8390783708576292

Model 3
R2: 0.8204759675987326

Model 4
R2: 0.8217744175782682

Model 5
R2: 0.8372196623283804

Model 6
R2: 0.8235077308686127

Model 7
R2: 0.8940098370905833

Model 8
R2: 0.8848771949493632

Model 9
R2: 0.9089210933849964

Model 10
R2: 0.8581680867196018

~~~~ SUMMARY OF CROSS VALIDATION ~~~~
Mean of R2 for all folds: 0.8597733249472009

5. [Bonus] optimize the $R^2$ score

Can you optimize your R^2 by selecting the best features and validating the model using either train/test split or K-Folds?

Your code will need to iterate through the different combinations of predictors, cross-validate the current model parameterization, and determine which set of features performed best.

The number of K-folds is up to you.

Hint: the itertools package is useful for combinations and permutations.

In [43]:

# A:

5.1 Can you explain what could be wrong with this approach?

In [44]:

# A:

6. [Bonus] Explore another target variable and practice `patsy` formulas

Can you find another response variable, given a combination of predictors, that can be predicted accurately through the exploration of different predictors in this dataset?

Try out using patsy to construct your target and predictor matrices from formula strings.

Tip: Check out pairplots, coefficients, and pearson scores.

In [45]:

import patsy

# A:

Train-test Split and Cross-Validation Lab

Review of train/test validation methods

Instructions

1. Clean up any data problems

2. Select 3-4 variables with your dataset to perform a 50/50 test train split on

3. Try 70/30 and 90/10

4. Try K-Folds cross-validation with K between 5-10 for your regression.

5. [Bonus] optimize the $R^2$ score

5.1 Can you explain what could be wrong with this approach?

6. [Bonus] Explore another target variable and practice `patsy` formulas

Product

Resources

Company

Train-test Split and Cross-Validation Lab

Review of train/test validation methods

Instructions

1. Clean up any data problems

2. Select 3-4 variables with your dataset to perform a 50/50 test train split on

3. Try 70/30 and 90/10

4. Try K-Folds cross-validation with K between 5-10 for your regression.

5. [Bonus] optimize the R2R^2R2 score

5.1 Can you explain what could be wrong with this approach?

6. [Bonus] Explore another target variable and practice patsy formulas

5. [Bonus] optimize the $R^2$ score

6. [Bonus] Explore another target variable and practice `patsy` formulas