Path: blob/master/april_18/lessons/lesson-06-alt/code/solution-code/solution-code-6.ipynb
Lesson 6 - Solution Code
Part 1:
Explore our mammals dataset
Check 1. Distribution
Let's check out a scatter plot of body weight and brain weight.
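A minimal sketch of that scatter plot. The lesson's actual mammals file isn't reproduced here, so this uses a small synthetic frame; the column names `bodywt` and `brainwt` follow the notebook's own text, but the values are invented:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's mammals dataset: body weight spans
# several orders of magnitude, brain weight follows a rough power law.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)                      # kg
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6
                 + rng.normal(0, 0.1, 40))                 # kg
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

ax = mammals.plot.scatter(x="bodywt", y="brainwt")
ax.set_title("Body weight vs. brain weight")
# Most points crowd near the origin while a few heavy animals stretch the
# axes -- exactly the skew that makes the raw scatter hard to read.
```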
Log transformation can help here.
Curious about the math? http://onlinestatbook.com/2/transformations/log.html
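The transformation itself is one line per axis. Sketched below on a synthetic stand-in frame (assumed columns `bodywt`/`brainwt`); `log10` is used so axis values read back easily (0 → 1 kg, 3 → 1000 kg):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mammals dataset.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

# Log-transform both variables, then re-plot.
mammals["log_bodywt"] = np.log10(mammals["bodywt"])
mammals["log_brainwt"] = np.log10(mammals["brainwt"])
ax = mammals.plot.scatter(x="log_bodywt", y="log_brainwt")
# The points now spread evenly along a near-linear band.
```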
Woohoo! This looks much better.
# Part 1 - Student: Update and complete the code below to use lmplot and display correlations between body weight and two dependent variables: sleep_rem and awake.
Complete below for 2 new models:
With body weight as the x and y set as:
sleep_rem
awake
Create lmplots for sleep_rem and awake as a y, with variables you've already used as x.
#### Play around with other outcomes
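One possible shape for the exercise above, sketched on synthetic data. The real dataset's `sleep_rem` and `awake` columns aren't reproduced here, so the slopes and noise levels below are invented purely for illustration:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic frame with the three columns the exercise needs.
rng = np.random.default_rng(1)
n = 40
log_bodywt = rng.uniform(-2, 3, n)
df = pd.DataFrame({
    "bodywt": 10 ** log_bodywt,
    "sleep_rem": np.clip(2.0 - 0.3 * log_bodywt + rng.normal(0, 0.3, n), 0.1, 6),
    "awake": np.clip(12 + 1.5 * log_bodywt + rng.normal(0, 1.0, n), 4, 23),
})

# One lmplot per outcome, with body weight on the x axis.
for outcome in ["sleep_rem", "awake"]:
    sns.lmplot(data=df, x="bodywt", y=outcome)
```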
Decision for Check 1. Distribution
Answer: For this analysis we will log transform our data.
We decided above that we will need a log transformation. Let's take a look at both models to compare.
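The untransformed model can be fit with the statsmodels formula API. This sketch uses a synthetic mammals frame (assumed columns `bodywt`/`brainwt`), so its coefficients will not match the notebook's exact output:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the mammals dataset.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

# Model 1: untransformed linear fit.
raw_fit = smf.ols("brainwt ~ bodywt", data=mammals).fit()
print(raw_fit.params)      # intercept and bodywt slope
print(raw_fit.rsquared)    # share of variance explained
print(raw_fit.pvalues)
```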
Our output tells us that:
The relationship between bodywt and brainwt isn't random (p value approaching 0)
With this current model, brainwt is roughly bodywt * 0.0010
The model explains, roughly, 87% of the variance of the dataset
Student: repeat with the log transformation
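A sketch of the log-transformed fit. Patsy formulas can apply `np.log10` directly to both sides; the frame below is synthetic, so the fitted slope will only roughly resemble the notebook's 0.7652:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the mammals dataset.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

# Model 2: log-log fit, transforming inside the formula.
log_fit = smf.ols("np.log10(brainwt) ~ np.log10(bodywt)", data=mammals).fit()
print(log_fit.summary())
```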
What does our output tell us?
Our output tells us that:
The relationship between bodywt and brainwt isn't random (p value approaching 0)
With this current model, log(brainwt) is roughly log(bodywt) * 0.7652
The model explains, roughly, 93% of the variance of the dataset (the largest errors being in the large brain and body sizes)
Bonus: Use Statsmodels to make the prediction
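One way to approach the bonus, continuing the synthetic log-log sketch: since the formula's left-hand side is `log10(brainwt)`, `predict` returns values on the log scale, which must be un-logged. The 100 kg animal is a made-up example input:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the mammals dataset, then the log-log fit.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})
log_fit = smf.ols("np.log10(brainwt) ~ np.log10(bodywt)", data=mammals).fit()

# Predict brain weight for a hypothetical 100 kg animal.
new_animal = pd.DataFrame({"bodywt": [100.0]})
log_pred = log_fit.predict(new_animal)   # prediction on the log10 scale
brain_pred_kg = 10 ** log_pred           # back to kilograms
print(brain_pred_kg)
```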
Part 2: Multiple Regression Analysis using citi bike data
In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables.
For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc.
For a linear regression, we want these variables to be largely independent of each other, but all of them should help explain the y variable.
We'll work with bikeshare data to showcase what this means and to explain a concept called multicollinearity.
## Check 2. Multicollinearity
What is multicollinearity?
With the bike share data, let's compare three data points: actual temperature, "feel" temperature, and guest ridership.
Our data is already normalized between 0 and 1, so we'll start off with the correlations and modeling.
Students:
Using the code from the demo, create a correlation heat map comparing 'temp', 'atemp', and 'casual'.
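A sketch of that heat map. The real bikeshare file isn't reproduced here, so this builds a small synthetic frame in which `atemp` tracks `temp` closely, mirroring the pattern the lesson describes:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic bikeshare-like frame: temp and atemp are normalized to [0, 1]
# as in the lesson, and atemp is constructed to track temp closely.
rng = np.random.default_rng(2)
n = 200
temp = rng.uniform(0, 1, n)
atemp = np.clip(temp + rng.normal(0, 0.03, n), 0, 1)
casual = 50 + 120 * temp + rng.normal(0, 40, n)   # guest riders
bikes = pd.DataFrame({"temp": temp, "atemp": atemp, "casual": casual})

corr = bikes[["temp", "atemp", "casual"]].corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```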
#### Question: What did we find?
The correlation matrix explains that:
both temperature fields are moderately correlated to guest ridership;
the two temperature fields are highly correlated to each other.
Including both of these fields in a model could introduce multicollinearity, which makes it more difficult for the model to determine which feature is affecting the predicted value.
### Demo: We can measure this effect in the coefficients:
Side note: this is a sneak peek at scikit-learn.
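A sketch of the demo's idea in scikit-learn, on the same kind of synthetic bikeshare-like data (invented coefficients; the notebook's numbers will differ). Fit each temperature field alone, then both together, and watch the coefficients move once the collinear pair is included:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic bikeshare-like data with near-duplicate temperature columns.
rng = np.random.default_rng(2)
n = 200
temp = rng.uniform(0, 1, n)
atemp = np.clip(temp + rng.normal(0, 0.03, n), 0, 1)
casual = 50 + 120 * temp + rng.normal(0, 40, n)
bikes = pd.DataFrame({"temp": temp, "atemp": atemp, "casual": casual})

# Compare coefficients across the three feature sets.
results = {}
for cols in (["temp"], ["atemp"], ["temp", "atemp"]):
    lm = LinearRegression().fit(bikes[cols], bikes["casual"])
    results[tuple(cols)] = (lm.coef_, lm.score(bikes[cols], bikes["casual"]))
    print(cols, lm.coef_.round(1), round(results[tuple(cols)][1], 3))
```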
Interpretation:
Even though the two-variable model temp + atemp explains more variance than either variable on its own, and both variables are considered significant (p-values approaching 0), we can see that together their coefficients are wildly different.
This can introduce error in how we explain models.
What happens if we use a second variable that isn't highly correlated with temperature, like humidity?
Guided Practice: Multicollinearity with dummy variables (15 mins)
There can be a similar effect from a feature set that forms a singular matrix, i.e. when the columns have an exact linear relationship (for example, a set of dummy columns that sums to 1 in every row).
Run through the following code on your own.
What happens to the coefficients when you include all weather situations instead of just including all except one?
Similar in Statsmodels
Students: Now drop one
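The drop-one step can be sketched with `pd.get_dummies`. The `weathersit` values below are a made-up sample; only the column mechanics matter:

```python
import pandas as pd

# Hypothetical weathersit column (1 = clearest ... 4 = worst weather).
weathersit = pd.Series([1, 1, 2, 3, 1, 2, 4, 1], name="weathersit")

# All four dummies: every row sums to 1, so together with an intercept
# the design matrix is singular.
all_dummies = pd.get_dummies(weathersit, prefix="weathersit")
print(all_dummies.sum(axis=1).unique())   # always 1

# Drop one level; the remaining coefficients are then read as differences
# relative to the dropped baseline category.
dropped = pd.get_dummies(weathersit, prefix="weathersit", drop_first=True)
print(dropped.columns.tolist())
```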
Interpretation:
This model makes more sense, because each coefficient is now easy to interpret relative to the level we left out.
For example, this suggests that a clear day (weathersit:1) on average brings in about 38 more riders hourly than a day with heavy snow.
In fact, since the weather situations "degrade" in quality (1 is the nicest day, 4 is the worst), the coefficients now reflect that well.
However, at this point there is still a lot of work to do, because weather on its own fails to explain ridership well.
With a partner, complete this code together and visualize the correlations of all the numerical features built into the data set.
We want to:
Identify categorical variables
Create dummies
Find at least two more features that are not correlated with current features, but could be strong indicators for predicting guest riders.
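One possible skeleton for the partner exercise, on a synthetic frame mimicking the bikeshare columns (`season` and `weathersit` are categorical codes; the continuous columns are normalized). Column names and relationships are assumptions, not the real data:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic bikeshare-like frame.
rng = np.random.default_rng(3)
n = 300
bikes = pd.DataFrame({
    "season": rng.integers(1, 5, n),        # categorical code 1..4
    "weathersit": rng.integers(1, 5, n),    # categorical code 1..4
    "temp": rng.uniform(0, 1, n),
    "hum": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 1, n),
})
bikes["casual"] = 40 + 100 * bikes["temp"] - 30 * bikes["hum"] + rng.normal(0, 20, n)

# 1) Identify the categorical codes and dummy them.
dummies = pd.get_dummies(bikes[["season", "weathersit"]].astype("category"),
                         prefix=["season", "weathersit"], drop_first=True)

# 2) Visualize correlations among the truly numeric features.
corr = bikes[["temp", "hum", "windspeed", "casual"]].corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
```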
Independent Practice: Building model to predict guest ridership (25 minutes)
Pay attention to:
Which variables would make sense to dummy (because they are categorical, not continuous)?
the distribution of riders (should we rescale the data?)
checking correlations with variables and guest riders
having a feature space (our matrix) with low multicollinearity
the linear assumption -- given all feature values being 0, should we have no ridership? negative ridership? positive ridership?
What features might explain ridership but aren't included in the data set?
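The checklist above can be turned into a minimal modeling pipeline. This sketch again uses a synthetic stand-in for the bikeshare frame (the real hourly CSV isn't reproduced here), so the fitted coefficients and r-squared are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the bikeshare frame.
rng = np.random.default_rng(4)
n = 500
bikes = pd.DataFrame({
    "season": rng.integers(1, 5, n),   # categorical -> dummy
    "temp": rng.uniform(0, 1, n),      # continuous, normalized
    "hum": rng.uniform(0, 1, n),       # continuous, low correlation with temp
})
bikes["casual"] = (40 + 100 * bikes["temp"] - 30 * bikes["hum"]
                   + 10 * (bikes["season"] == 3) + rng.normal(0, 20, n))

# Dummy the categorical (dropping one level), keep the continuous features.
X = pd.concat([bikes[["temp", "hum"]],
               pd.get_dummies(bikes["season"], prefix="season", drop_first=True)],
              axis=1)
model = LinearRegression().fit(X, bikes["casual"])
r2 = model.score(X, bikes["casual"])
print(dict(zip(X.columns, model.coef_.round(1))), round(r2, 3))
```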
### Outcomes: If your model has an r-squared above .4, this is a relatively effective model for the data available. Kudos! Move on to the bonus!
1: What's the strongest predictor? It depends on what you included in the model. Note that the largest impact could be in either the positive or negative direction (consider using absolute values to quickly evaluate).
2: How well did your model do? Check out your r-squared value and 95% CIs. We will dive into this topic more in the next class.
3: How can you improve it? Cross-validation for one! Next class...
### Bonus:
We've completed a model that explains casual guest riders. Now it's your turn to build another model, using a different y (outcome) variable: registered riders.
Bonus 1: What's the strongest predictor?
Bonus 2: How well did your model do?
Bonus 3: How can you improve it?