GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_05/code/starter-code/starter-code-6.ipynb

Kernel: Python 3

Lesson 6 - Starter Code

In [1]:

%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
import sklearn.linear_model

# read in the mammal dataset
wd = '../../assets/dataset/msleep/'
mammals = pd.read_csv(wd+'msleep.csv')
mammals = mammals[mammals.brainwt.notnull()].copy()

Explore our mammals dataset

In [2]:

mammals.head()

Out[2]:

Let's check out a scatter plot of body weight and brain weight

In [3]:

# create a matplotlib figure
plt.figure()
# generate a scatterplot inside the figure
plt.plot(mammals.bodywt, mammals.brainwt, '.')
# show the plot
plt.show()

Out[3]:

In [4]:

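# keyword arguments work across seaborn versions; newer releases reject positional x, y, data
sns.lmplot(x='bodywt', y='brainwt', data=mammals)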

Out[4]:

<seaborn.axisgrid.FacetGrid at 0x27039a0b278>

In [5]:

log_columns = ['bodywt', 'brainwt',]
log_mammals = mammals.copy()
log_mammals[log_columns] = log_mammals[log_columns].apply(np.log10)

In [6]:

# same plot on the log10-transformed columns
sns.lmplot(x='bodywt', y='brainwt', data=log_mammals)

Out[6]:

<seaborn.axisgrid.FacetGrid at 0x27039825b00>
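Why the log transform in the cells above helps can be seen on synthetic data. The sketch below assumes a made-up power-law relationship (the constants 0.01 and 0.75 are illustrative, not estimates from the msleep data) and shows that taking log10 of both variables turns it into a straight line whose slope is the exponent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical power law: brain = a * body**b, with multiplicative noise.
a, b = 0.01, 0.75
body = 10 ** rng.uniform(-2, 3, size=200)   # spans five orders of magnitude
brain = a * body ** b * 10 ** rng.normal(0, 0.05, size=200)

# Fit a straight line on the log10 scale.
log_body, log_brain = np.log10(body), np.log10(brain)
log_slope, log_intercept = np.polyfit(log_body, log_brain, 1)

# On the log scale the slope estimates b and the intercept estimates log10(a).
print("log-scale slope:", round(log_slope, 3))
print("log-scale intercept:", round(log_intercept, 3))
print("log-scale correlation:", round(np.corrcoef(log_body, log_brain)[0, 1], 4))
```

If the real data follow something close to a power law, this is one reason the lmplot on log_mammals looks much more linear than the one on mammals.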

Guided Practice: Using Seaborn to generate single variable linear model plots (15 mins)

Update and complete the code below to use lmplot and display correlations between body weight and two dependent variables: sleep_rem and awake.

In [7]:

log_columns = ['bodywt', 'brainwt',]  # any others?
log_mammals = mammals.copy()
log_mammals[log_columns] = log_mammals[log_columns].apply(np.log10)

Complete the code below using sleep_rem and awake as y, with the variables you've already used as x.

In [1]:

x = 
y = 
sns.lmplot(x=x, y=y, data=mammals)
sns.lmplot(x=x, y=y, data=log_mammals)

Out[1]:

  File "<ipython-input-1-9015c725455f>", line 1
    x =
        ^
SyntaxError: invalid syntax

Introduction: Single Regression Analysis in statsmodels & scikit (10 mins)

In [ ]:

# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf

X = mammals[['bodywt']]
y = mammals['brainwt']

# create a fitted model in one line
# formula notation is equivalent to writing out our model as 'outcome = predictor',
# using the following syntax: formula = 'outcome ~ predictor1 + predictor2 ... predictorN'
lm = smf.ols(formula='y ~ X', data=mammals).fit()
#print the full summary
lm.summary()
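As a rough illustration of what a formula such as 'outcome ~ predictor' asks for, the sketch below sets up the same kind of model by hand with NumPy on synthetic data: a column of ones for the intercept plus the predictor, solved by ordinary least squares. The data and the names `outcome` and `predictor` are made up for the example, and this is not a description of how statsmodels is implemented, only the same least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(1)
predictor = rng.uniform(0, 10, size=100)
outcome = 2.0 + 0.5 * predictor + rng.normal(0, 0.1, size=100)

# 'outcome ~ predictor' includes an intercept by default,
# so the design matrix has a column of ones next to the predictor.
design = np.column_stack([np.ones_like(predictor), predictor])
coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
intercept, slope = coef

# A prediction for a new value is just intercept + slope * value.
new_value = 5.0
prediction = intercept + slope * new_value
print(intercept, slope, prediction)
```

The intercept and slope recovered here play the same role as the Intercept and predictor rows in the coefficient table of lm.summary().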

Use Statsmodels to make the prediction

In [ ]:

# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'X': [50]})
X_new.head()

In [ ]:

lm.predict(X_new)

Repeat in Scikit with handy plotting

When modeling with sklearn, you'll use the following basic principles:

All sklearn estimators (modeling classes) share a common base estimator. This lets you easily swap one estimator for another without changing much code.
All estimators take a matrix, X, either sparse or dense.
Many estimators also take a vector, y, when working on a supervised machine learning problem. Regressions are supervised learning because we already have examples of y given X.
All estimators have parameters that can be set. This allows for customization and higher level of detail to the learning process. The parameters are appropriate to each estimator algorithm.
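The principles in this list can be sketched on synthetic data. The example below assumes a made-up one-feature dataset and two estimators chosen only for illustration; it shows that each one accepts the same X matrix and y vector, exposes the same fit and score methods, and carries its own parameters.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(50, 1))                      # feature matrix: 50 rows, 1 column
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.5, size=50)     # target vector

# Every estimator exposes the same fit / predict / score interface.
estimators = [
    linear_model.LinearRegression(),
    linear_model.Ridge(alpha=1.0),   # alpha is a parameter specific to Ridge
]
scores = {}
for est in estimators:
    est.fit(X, y)
    scores[type(est).__name__] = est.score(X, y)   # R-squared for regressors
    print(type(est).__name__, est.get_params())
```

Because the loop body never mentions a specific class, adding another regressor to the list requires no other change.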

In [ ]:

from sklearn import feature_selection, linear_model

def get_linear_model_metrics(X, y, algo):
    # get the p-value of each feature in X against y; ignore the F-statistic for now
    pvals = feature_selection.f_regression(X, y)[1]
    # algo is an unfitted estimator, e.g. linear_model.LinearRegression();
    # .fit() runs the regression on X and y
    algo.fit(X,y)
    residuals = (y-algo.predict(X)).values

    # print the necessary values
    print ('P Values:', pvals)
    print ('Coefficients:', algo.coef_)
    print ('y-intercept:', algo.intercept_)
    print ('R-Squared:', algo.score(X,y))
    
    plt.figure()
    plt.hist(residuals, bins=int(np.ceil(np.sqrt(len(y)))))
    # keep the model
    return algo

X = mammals[['bodywt']]
y = mammals['brainwt']
lm = linear_model.LinearRegression()
lm = get_linear_model_metrics(X, y, lm)

Demo: Significance is Key (20 mins)

What does our output tell us?

Our output tells us that:

The relationship between bodywt and brainwt is unlikely to be due to chance (the p-value approaches 0).
The model explains roughly 87% of the variance in brainwt (the largest errors are for the animals with the largest brains and bodies).
With this model, brainwt is roughly bodywt * 0.00096395, plus the y-intercept.
The residuals, or errors in the prediction, are not normally distributed; there are outliers on the right. A better fit will have errors that are closer to normally distributed.
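The quantities in this list can be computed by hand. The sketch below uses synthetic, right-skewed data (not the mammals dataset) and assumes a simple one-feature least-squares line; it computes R-squared as 1 - SS_res / SS_tot and the sample skewness of the residuals as a rough check on normality.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=300)
# Noise that grows with x and is itself skewed, so residuals are not normal.
y = 0.5 * x + rng.exponential(scale=0.2, size=300) * x

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

ss_res = np.sum(residuals ** 2)           # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y
r_squared = 1 - ss_res / ss_tot

# Skewness near 0 suggests symmetric errors; a large positive value
# means a long right tail in the residual histogram.
skewness = np.mean((residuals - residuals.mean()) ** 3) / residuals.std() ** 3
print("R-squared:", round(r_squared, 3), "skewness:", round(skewness, 2))
```

A high R-squared and badly skewed residuals can coexist, which is why the histogram is worth checking alongside the score.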

Evaluating Fit, Evaluating Sense

Although we know there is a better solution to the model, we should sanity-check a few other things first. For example, given this model, what is an animal's brainwt if its bodywt is 0?

In [ ]:

# prediction at 0?
print (lm.predict([[0]]))

In [ ]:

lm = linear_model.LinearRegression(fit_intercept=False)
lm = get_linear_model_metrics(X, y, lm)
# prediction at 0?
print (lm.predict([[0]]))
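What dropping the intercept does can be sketched with plain least squares. The example below uses synthetic data whose true intercept is 4 (an assumption of the example, unrelated to the mammals data) and compares a fit with a column of ones against a fit without one, which is conceptually what fit_intercept=False requests.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=100)
y = 4.0 + 2.0 * x + rng.normal(0, 0.3, size=100)   # true intercept is 4, not 0

# With an intercept: the design matrix has a column of ones.
with_icpt, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)
# Without an intercept: only the feature column, so the line passes through (0, 0).
no_icpt, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

pred_at_zero_with = with_icpt[0] + with_icpt[1] * 0.0
pred_at_zero_without = no_icpt[0] * 0.0
print("prediction at 0 with intercept:", round(pred_at_zero_with, 2))
print("prediction at 0 without intercept:", pred_at_zero_without)
print("slopes:", round(with_icpt[1], 2), round(no_icpt[0], 2))
```

Forcing the line through the origin guarantees a prediction of 0 at x = 0, but the slope has to absorb the missing intercept, so it is only a good choice when a zero intercept makes sense for the problem.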

Interpretation?

Answer:

Guided Practice: Using the LinearRegression object (15 mins)

We learned earlier that the data in its current state does not allow for the best linear regression fit.

With a partner, generate two more models using the log-transformed data to see how this transform changes the model's performance.

Complete the following code to update X and y to match the log-transformed data.

Complete the loop by setting the list to be one True and one False.

In [ ]:

#starter
X =
y =
loop = []
for boolean in loop:
    print('y-intercept:', boolean)
    lm = linear_model.LinearRegression(fit_intercept=boolean)
    get_linear_model_metrics(X, y, lm)
    print()

Which model performed the best? The worst? Why?

Answer:

Advanced Methods!

We will go over different estimators in detail in the future, but check them out in the docs if you're curious (and finish a little early).

In [ ]:

# loading other sklearn regression estimators
X = log_mammals[['bodywt']]
y = log_mammals['brainwt']

estimators = [
    linear_model.Lasso(),
    linear_model.Ridge(),
    linear_model.ElasticNet(),
]

for est in estimators:
    print(est)
    get_linear_model_metrics(X, y, est)
    print()
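For the curious, here is a minimal sketch of the idea behind one of these estimators. Ridge adds an L2 penalty on the coefficients, which has a closed-form solution; Lasso uses an L1 penalty and ElasticNet mixes both, and neither is shown here. The data are synthetic and the function name `ridge_coefficients` is made up for the example; this is the textbook formula, not sklearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, size=100)

# Center the data so the intercept can be left out of the penalty.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coefficients(alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^-1 X'y on centered data."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# alpha = 0 is ordinary least squares; larger alpha shrinks the coefficients.
norms = [np.linalg.norm(ridge_coefficients(a)) for a in (0.0, 10.0, 1000.0)]
print(norms)
```

The shrinking coefficient norm is the regularization effect: a larger penalty trades some fit on the training data for smaller, more stable coefficients.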

Introduction: Multiple Regression Analysis using citi bike data (10 minutes)

In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables.

For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc.

For a linear regression, we want these variables to be largely independent of each other, but all of them should help explain the Y variable.

We'll work with bikeshare data to showcase what this means and to explain a concept called multicollinearity.

In [ ]:

wd = '../../assets/dataset/bikeshare/'
bike_data = pd.read_csv(wd+'bikeshare.csv')
bike_data.head()

What is Multicollinearity?

With the bike share data, let's compare three variables: actual temperature, "feel" temperature, and guest ridership.

Our data is already normalized between 0 and 1, so we'll start off with the correlations and modeling.

In [ ]:

cmap = sns.diverging_palette(220, 10, as_cmap=True)

correlations = bike_data[['temp', 'atemp', 'casual']].corr()
print (correlations)
print (sns.heatmap(correlations, cmap=cmap))

What does the correlation matrix explain?

Answer:

We can measure this effect in the coefficients:

In [ ]:

y = bike_data['casual']
x_sets = (
    ['temp'],
    ['atemp'],
    ['temp', 'atemp'],
)

for x in x_sets:
    print (', '.join(x))
    get_linear_model_metrics(bike_data[x], y, linear_model.LinearRegression())
    print()
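The effect of two highly correlated predictors can be reproduced on synthetic data. The sketch below assumes a made-up `temp` column, an `atemp` column that is nearly a copy of it, and a hypothetical response that depends only on `temp`; none of the numbers come from the bike share data.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
temp = rng.uniform(0, 1, size=n)
atemp = temp + rng.normal(0, 0.01, size=n)        # nearly a copy of temp
riders = 100 * temp + rng.normal(0, 5, size=n)    # hypothetical response

def ols(features):
    """Least-squares coefficients (intercept first) for the given feature columns."""
    design = np.column_stack([np.ones(n)] + features)
    coef, *_ = np.linalg.lstsq(design, riders, rcond=None)
    return coef

coef_temp = ols([temp])
coef_both = ols([temp, atemp])

print("correlation:", round(np.corrcoef(temp, atemp)[0, 1], 4))
print("temp alone:", coef_temp[1])
print("temp and atemp:", coef_both[1], coef_both[2])
print("sum of the two:", coef_both[1] + coef_both[2])
```

With both predictors in the model, only the combined effect is well determined; how it is split between the two coefficients depends heavily on the noise, so the individual values are hard to interpret.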

Interpretation?

Answer:

What happens if we use a second variable that isn't highly correlated with temperature, like humidity?

In [ ]:

y = bike_data['casual']
x = bike_data[['temp', 'hum']]
get_linear_model_metrics(x, y, linear_model.LinearRegression())

Guided Practice: Multicollinearity with dummy variables (15 mins)

A similar effect occurs when the feature set forms a singular matrix, that is, when some columns are an exact linear combination of others (for example, a full set of dummy variables, where the values in every row sum to 1).
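The singularity can be checked directly with a matrix rank. The sketch below uses a small made-up categorical column with four levels (standing in for a weather code) and compares the rank of an intercept plus all four dummy columns against an intercept plus three of them.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with four levels, like a weather code.
codes = pd.Series([1, 2, 3, 4, 1, 2, 1, 3, 2, 1])
dummies = pd.get_dummies(codes).astype(float)

# Every row has exactly one 1, so the dummy columns add up to a column of ones.
row_sums = dummies.sum(axis=1)

intercept = np.ones((len(codes), 1))
all_levels = np.hstack([intercept, dummies.to_numpy()])            # 5 columns
drop_one = np.hstack([intercept, dummies[[1, 2, 3]].to_numpy()])   # 4 columns

rank_all = np.linalg.matrix_rank(all_levels)
rank_drop = np.linalg.matrix_rank(drop_one)
print("all levels: rank", rank_all, "of", all_levels.shape[1], "columns")
print("one dropped: rank", rank_drop, "of", drop_one.shape[1], "columns")
```

When the rank is lower than the number of columns, infinitely many coefficient vectors give the same fit, so the individual coefficients are not uniquely determined.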

Run through the following code on your own.

What happens to the coefficients when you include all weather situations instead of just including all except one?

In [ ]:

lm = linear_model.LinearRegression()
weather = pd.get_dummies(bike_data.weathersit)

get_linear_model_metrics(weather[[1, 2, 3, 4]], y, lm)
print()
# drop the least significant, weather situation  = 4
get_linear_model_metrics(weather[[1, 2, 3]], y, lm)

Similar in Statsmodels

In [ ]:

# all dummies in the model
lm_stats = smf.ols(formula='y ~ weather[[1, 2, 3, 4]]', data=bike_data).fit()
lm_stats.summary()

In [ ]:

# dropping one
lm_stats = smf.ols(formula='y ~ weather[[1, 2, 3]]', data=bike_data).fit()
lm_stats.summary()

What's the interpretation? Do you want to keep all your dummy variables or drop one? Why?

Answer:

Guided Practice: Combining non-correlated features into a better model (15 mins)

In [ ]:

bike_data.dtypes

With a partner, complete this code together and visualize the correlations of all the numerical features built into the data set.

We want to:

Add the three significant weather situations into our current model.
Find two more features that are not correlated with current features, but could be strong indicators for predicting guest riders.
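One possible way to attach dummy columns to a frame is sketched below on a toy DataFrame. The frame `toy` and its values are made up; the `weather` prefix is chosen so that the resulting column names look like weather_1, weather_2, and so on. Treat it as a hint for the pattern, not as the required solution.

```python
import pandas as pd

# Toy frame standing in for bike_data; the values are made up.
toy = pd.DataFrame({"weathersit": [1, 2, 3, 1, 4], "temp": [0.2, 0.4, 0.3, 0.5, 0.1]})

# prefix="weather" yields columns weather_1, weather_2, ...
weather = pd.get_dummies(toy["weathersit"], prefix="weather")

# Keep three of the four levels and attach them to the original rows by index.
toy_model = toy.join(weather[["weather_1", "weather_2", "weather_3"]])
print(toy_model.columns.tolist())
```

Once the dummies are joined, calling .corr() on the numeric columns of interest gives the matrix to pass to sns.heatmap.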

In [ ]:

#starter 
lm = linear_model.LinearRegression()
bikemodel_data = bike_data.join() # add in the three weather situations

cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = # what are we getting the correlations of?
print(correlations)
print(sns.heatmap(correlations, cmap=cmap))

columns_to_keep = [] #[which_variables?]
final_feature_set = bikemodel_data[columns_to_keep]

get_linear_model_metrics(final_feature_set, y, lm)

In [ ]:

#sklearn
final_feature_set = bikemodel_data[columns_to_keep]

get_linear_model_metrics(final_feature_set, np.log10(y+1), lm)
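The np.log10(y+1) transform used above, and how to undo it, can be sketched on a handful of made-up counts. The +1 is there because rider counts can be zero and log10(0) is undefined.

```python
import numpy as np

counts = np.array([0, 3, 12, 150, 980])   # hypothetical rider counts, including a zero

log_counts = np.log10(counts + 1)         # +1 keeps log10 defined at zero
recovered = 10 ** log_counts - 1          # invert to get back to rider counts

print(log_counts)
print(recovered)
```

A model fit on the transformed target predicts on the log scale, so its predictions need the same inverse (10 ** prediction - 1) before they can be read as rider counts.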

In [ ]:

#Stats models
log_y = np.log10(y+1)
lm = smf.ols(formula=' log_y ~ temp + hum + windspeed + weather_1 + weather_2 + weather_3 + holiday + hour_1 + hour_2 + hour_3 + hour_4 + hour_5 + hour_6 + hour_7 + hour_8 + hour_9 + hour_10 + hour_11 + hour_12 + hour_13 + hour_14 + hour_15 + hour_16 + hour_18 + hour_19 + hour_20 + hour_21 + hour_22 + hour_23', data=bikemodel_data).fit()
#print the full summary
lm.summary()

Independent Practice: Building models for other y variables (25 minutes)

We've completed a model together that explains casual guest riders. Now it's your turn to build another model, using a different y variable: registered riders.

Pay attention to:

the distribution of riders (should we rescale the data?)
checking correlations with variables and registered riders
having a feature space (our matrix) with low multicollinearity
model complexity vs explanation of variance: at what point do features in a model stop improving r-squared?
the linear assumption -- given all feature values being 0, should we have no ridership? negative ridership? positive ridership?
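On the model complexity point, one common companion to R-squared is adjusted R-squared, which penalizes the number of features. The sketch below uses synthetic data with two informative features and ten pure-noise features (all made up) and a hypothetical helper `r2_and_adjusted`; it is one way to look at the trade-off, not the only one.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
signal = rng.normal(size=(n, 2))             # two informative features
noise_features = rng.normal(size=(n, 10))    # ten features unrelated to y
y = signal @ np.array([1.5, -2.0]) + rng.normal(0, 1.0, size=n)

def r2_and_adjusted(features):
    """Return (R-squared, adjusted R-squared) for an OLS fit with intercept."""
    design = np.column_stack([np.ones(n), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    p = features.shape[1]
    adjusted = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted

r2_small, adj_small = r2_and_adjusted(signal)
r2_big, adj_big = r2_and_adjusted(np.column_stack([signal, noise_features]))
print("2 features: ", r2_small, adj_small)
print("12 features:", r2_big, adj_big)
```

Plain R-squared on the training data never goes down when features are added, even useless ones, so a rising score alone does not show that a new feature helps.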

Bonus

Which variables would make sense to dummy (because they are categorical, not continuous)?
What features might explain ridership but aren't included in the data set?
Is there a way to build these using pandas and the features available?
Outcomes

If your model at least improves upon the original model and the explanatory effects (coefficients) make sense, consider this a complete task.

In [ ]:
