GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-06-alt/code/starter-code/starter-code-6.ipynb
Kernel: Python 2

Lesson 6 - Starter Code

%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
import statsmodels.formula.api as smf

# read in the mammal dataset
wd = '../../assets/dataset/msleep/'
mammals = pd.read_csv(wd+'msleep.csv')
mammals = mammals[mammals.brainwt.notnull()].copy()

Part 1:

Explore our mammals dataset

mammals.head()

Check 1. Distribution

Let's check out a scatter plot of body weight and brain weight.

# create a matplotlib figure
plt.figure()
# generate a scatterplot inside the figure
plt.plot(mammals.bodywt, mammals.brainwt, '.')
# show the plot
plt.show()
Image in a Jupyter notebook
sns.lmplot('bodywt', 'brainwt', mammals)
<seaborn.axisgrid.FacetGrid at 0x104421a90>
Image in a Jupyter notebook

Log transformation can help here.

Curious about the math? http://onlinestatbook.com/2/transformations/log.html

log_columns = ['bodywt', 'brainwt',]
log_mammals = mammals.copy()
log_mammals[log_columns] = log_mammals[log_columns].apply(np.log10)
g = sns.lmplot('bodywt', 'brainwt', log_mammals)
g.set_axis_labels("Log Body Weight", "Log Brain Weight")
<seaborn.axisgrid.FacetGrid at 0x10a038b10>
Image in a Jupyter notebook

Woohoo! This looks much better.

# Part 1 - Student
Update and complete the code below to use lmplot and display correlations between body weight and two dependent variables: sleep_rem and awake.

Complete below for 2 new models:

With body weight as x and y set as:

  1. sleep_rem

  2. awake

#1. add any additional variables that you would like to take the log of
log_columns = ['bodywt', 'brainwt',] # any others?
log_mammals = mammals.copy()
log_mammals[log_columns] = log_mammals[log_columns].apply(np.log10)

Create the lmplots

g = sns.lmplot(X, Y, mammals)
g.set_axis_labels("Body Weight", "REM")
g = sns.lmplot(X, Y, log_mammals)
g.set_axis_labels("Log Body Weight", "Log REM")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-18-5329563f75c7> in <module>()
----> 1 g = sns.lmplot(X, Y, mammals)
      2 g.set_axis_labels( "Body Weight", "REM")
      3 g = sns.lmplot(X, Y, log_mammals)
      4 g.set_axis_labels( "Log Body Weight", "Log REM ")

NameError: name 'X' is not defined
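One possible completion is sketched below; it assumes the same positional sns.lmplot(x, y, data) call style used elsewhere in this notebook, and the axis labels are illustrative.

# one possible completion: body weight vs. REM sleep, raw and log-transformed
x, y = 'bodywt', 'sleep_rem'
g = sns.lmplot(x, y, mammals)
g.set_axis_labels("Body Weight", "REM Sleep")
g = sns.lmplot(x, y, log_mammals)
g.set_axis_labels("Log Body Weight", "REM Sleep")

# repeat with awake as the outcome
y = 'awake'
g = sns.lmplot(x, y, mammals)
g.set_axis_labels("Body Weight", "Awake")
g = sns.lmplot(x, y, log_mammals)
g.set_axis_labels("Log Body Weight", "Awake")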

#### Play around with other outcomes

log_columns = ['bodywt', 'brainwt', 'awake', 'sleep_rem'] # any others?
log_mammals = mammals.copy()
log_mammals[log_columns] = log_mammals[log_columns].apply(np.log10)

# one other example, using brainwt and awake.
x = 'brainwt'
y = 'awake'
sns.lmplot(x, y, mammals)
sns.lmplot(x, y, log_mammals)
<seaborn.axisgrid.FacetGrid at 0x10add0750>
Image in a Jupyter notebookImage in a Jupyter notebook

Decision for Check 1. Distribution

Answer:

We decided above that we will need a log transformation. Let's take a look at both models to compare
# not transformed
X = mammals[['bodywt']]
y = mammals['brainwt']

# create a fitted model in one line
# formula notation is the equivalent to writing out our models such that 'outcome = predictor'
# with the following syntax: formula = 'outcome ~ predictor1 + predictor2 ... predictorN'
lm = smf.ols(formula='y ~ X', data=mammals).fit()

# print the full summary
lm.summary()

Our output tells us that:

  • The relationship between bodywt and brainwt isn't random (p-value approaching 0)

  • With this current (untransformed) model, brainwt is roughly bodywt * 0.0010

  • The model explains, roughly, 87% of the variance of the dataset

Student: repeat with the log transformation

# Log transformed
X =
y =

# create a fitted model in one line
# formula notation is the equivalent to writing out our models such that 'outcome = predictor'
# with the following syntax: formula = 'outcome ~ predictor1 + predictor2 ... predictorN'

# print the full summary
File "<ipython-input-21-4798b1fa002f>", line 2 X = ^ SyntaxError: invalid syntax

What does our output tell us?

Our output tells us that:

Bonus: Use Statsmodels to make the prediction

# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'X': [50]})
X_new.head()

Predict X_new

#prediction
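A minimal sketch of the prediction step, assuming the lm fit above used the formula 'y ~ X' so that the column name in X_new matches the formula term:

# prediction for a mammal with body weight 50 (one possible approach)
lm.predict(X_new)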

Part 2: Multiple Regression Analysis using Citi Bike data

In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables.

For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc.

For a linear regression, we want these variables to be largely independent of each other, but all of them should help explain the y variable.

We'll work with bikeshare data to showcase what this means and to explain a concept called multicollinearity.

wd = '../../assets/dataset/bikeshare/'
bike_data = pd.read_csv(wd+'bikeshare.csv')
bike_data.head()

## Check 2. Multicollinearity
What is multicollinearity?

With the bike share data, let's compare three data points: actual temperature, "feel" temperature, and guest ridership.

Our data is already normalized between 0 and 1, so we'll start off with the correlations and modeling.

Students:

Using the code from the demo, create a correlation heat map comparing 'temp', 'atemp', and 'casual'.

#cmap...
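One possible version of the heat map, modeled on the starter code in Part 3 (the diverging palette choice is an assumption):

# correlation heat map for temp, atemp, and casual
cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = bike_data[['temp', 'atemp', 'casual']].corr()
print correlations
sns.heatmap(correlations, cmap=cmap)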

####Question: What did we find?

The correlation matrix explains that:

### Demo: We can measure this effect in the coefficients

Side note: this is a sneak peek at scikit-learn

from sklearn import feature_selection, linear_model

def get_linear_model_metrics(X, y, algo):
    # get the pvalue of X given y. Ignore f-stat for now.
    pvals = feature_selection.f_regression(X, y)[1]
    # start with an empty linear regression object
    # .fit() runs the linear regression function on X and y
    algo.fit(X, y)
    residuals = (y - algo.predict(X)).values
    # print the necessary values
    print 'P Values:', pvals
    print 'Coefficients:', algo.coef_
    print 'y-intercept:', algo.intercept_
    print 'R-Squared:', algo.score(X, y)
    plt.figure()
    plt.hist(residuals, bins=np.ceil(np.sqrt(len(y))))
    # keep the model
    return algo
y = bike_data['casual']
x_sets = (
    ['temp'],
    ['atemp'],
    ['temp', 'atemp'],
)
for x in x_sets:
    print ', '.join(x)
    get_linear_model_metrics(bike_data[x], y, linear_model.LinearRegression())
    print
temp
P Values: [ 0.]
Coefficients: [ 117.68705779]
y-intercept: -22.812739188
R-Squared: 0.21124654163

atemp
P Values: [ 0.]
Coefficients: [ 130.27875081]
y-intercept: -26.3071675481
R-Squared: 0.206188705733

temp, atemp
P Values: [ 0.  0.]
Coefficients: [ 116.34021588    1.52795677]
y-intercept: -22.8703398286
R-Squared: 0.21124723661
Image in a Jupyter notebookImage in a Jupyter notebookImage in a Jupyter notebook

Interpretation:

What happens if we use a second variable that isn't highly correlated with temperature, like humidity?

y = bike_data['casual']
x = bike_data[['temp', 'hum']]
get_linear_model_metrics(x, y, linear_model.LinearRegression())
P Values: [ 0.  0.]
Coefficients: [ 112.02457031  -80.87301833]
y-intercept: 30.7273338581
R-Squared: 0.310901196913
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Image in a Jupyter notebook

Guided Practice: Multicollinearity with dummy variables (15 mins)

A similar effect can occur when the feature matrix is singular, that is, when there is an exact linear relationship among the columns (for example, a set of dummy columns whose values sum to 1 in every row).

Run through the following code on your own.

What happens to the coefficients when you include all weather situations instead of just including all except one?

lm = linear_model.LinearRegression()
weather = pd.get_dummies(bike_data.weathersit)
get_linear_model_metrics(weather[[1, 2, 3, 4]], y, lm)
print
# Set one weather as the reference (drop it), weather situation = 4
get_linear_model_metrics(weather[[1, 2, 3]], y, lm)
P Values: [  3.75616929e-73   3.43170021e-22   1.57718666e-55   2.46181288e-01]
Coefficients: [  4.05930101e+12   4.05930101e+12   4.05930101e+12   4.05930101e+12]
y-intercept: -4.05930100616e+12
R-Squared: 0.0233497737473

P Values: [  3.75616929e-73   3.43170021e-22   1.57718666e-55]
Coefficients: [ 37.87876398  26.92862383  13.38900634]
y-intercept: 2.66666666652
R-Squared: 0.0233906873841
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
Image in a Jupyter notebookImage in a Jupyter notebook

Similar in Statsmodels

# all dummies in the model
lm_stats = smf.ols(formula='y ~ weather[[1, 2, 3, 4]]', data=bike_data).fit()
lm_stats.summary()

Students: Now drop one

# dropping one
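One possible version, mirroring the all-dummies cell above but leaving out weather situation 4 as the reference category:

# dropping one dummy: weather situation 4 becomes the reference category
lm_stats = smf.ols(formula='y ~ weather[[1, 2, 3]]', data=bike_data).fit()
lm_stats.summary()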

Interpretation:

This model makes more sense because each coefficient can now be interpreted relative to the category we left out.

For example, this suggests that a clear day (weathersit:1) on average brings in about 38 more riders hourly than a day with heavy snow.

In fact, since the weather situations "degrade" in quality (1 is the nicest day, 4 is the worst), the coefficients now reflect that well.

However, at this point there is still a lot of work to do, because weather on its own fails to explain ridership well.

Check out our data again

bike_data.dtypes
instant         int64
dteday         object
season          int64
yr              int64
mnth            int64
hr              int64
holiday         int64
weekday         int64
workingday      int64
weathersit      int64
temp          float64
atemp         float64
hum           float64
windspeed     float64
casual          int64
registered      int64
cnt             int64
dtype: object
bike_data.describe()

# Part 3 - Building a model to predict guest ridership
With a partner, complete this code together and visualize the correlations of all the numerical features built into the data set.

We want to:

  • Identify categorical variables

  • Create dummies (weather situation is done for you in the starter code)

  • Find at least two more features that are not correlated with current features, but could be strong indicators for predicting guest riders.

#starter code (hints!)
#Dummies example:
weather = pd.get_dummies(bike_data.weathersit)
#create new names for our new dummy variables
weather.columns = ['weather_' + str(i) for i in weather.columns]
#join those new variables back into the larger dataset
bikemodel_data = bike_data.join(weather)
print bikemodel_data.columns

#Select columns to keep. Don't forget to set a reference category for your dummies (aka drop one)
columns_to_keep = ['temp', 'weather_1', 'weather_2', 'weather_3'] #[which_variables?]

#checking for collinearity
cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = bikemodel_data[columns_to_keep].corr() # what are we getting the correlations of?
print correlations
print sns.heatmap(correlations, cmap=cmap)
Index([u'instant', u'dteday', u'season', u'yr', u'mnth', u'hr', u'holiday',
       u'weekday', u'workingday', u'weathersit', u'temp', u'atemp', u'hum',
       u'windspeed', u'casual', u'registered', u'cnt', u'weather_1',
       u'weather_2', u'weather_3', u'weather_4'],
      dtype='object')
               temp  weather_1  weather_2  weather_3
temp       1.000000   0.101044  -0.069657  -0.062406
weather_1  0.101044   1.000000  -0.822961  -0.412414
weather_2 -0.069657  -0.822961   1.000000  -0.177417
weather_3 -0.062406  -0.412414  -0.177417   1.000000
Axes(0.125,0.125;0.62x0.775)
Image in a Jupyter notebook
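One way to extend the feature set is sketched below; the choice of hum, windspeed, and season dummies is illustrative, not the required answer.

# illustrative extension: add humidity, windspeed, and season dummies
season = pd.get_dummies(bike_data.season, prefix='season')
bikemodel_ext = bikemodel_data.join(season)

# keep continuous features plus the dummies, dropping one season level as the reference
columns_to_keep = (['temp', 'hum', 'windspeed', 'weather_1', 'weather_2', 'weather_3']
                   + list(season.columns[1:]))

cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = bikemodel_ext[columns_to_keep].corr()
print correlations
sns.heatmap(correlations, cmap=cmap)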

Independent Practice: Building a model to predict guest ridership

Pay attention to:

  • Which variables would make sense to dummy (because they are categorical, not continuous)?

  • The distribution of riders (should we rescale the data?)

  • Checking correlations of variables with guest riders

  • Having a feature space (our matrix) with low multicollinearity

  • The linear assumption -- given all feature values being 0, should we have no ridership? Negative ridership? Positive ridership?

  • What features might explain ridership but aren't included in the data set?

### You're done when:
If your model has an R-squared above .4, this is a relatively effective model for the data available. Kudos! Move on to the bonus!

#your code here...
#and here
#add as many cells as you need :)
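A minimal sketch of one possible model, reusing get_linear_model_metrics from the demo; the feature list (hour dummies plus temp, hum, windspeed, and the weather dummies) is an illustrative assumption, not the reference solution.

# one possible model for casual riders (illustrative feature set)
hour = pd.get_dummies(bike_data.hr, prefix='hr')
model_data = bike_data.join(weather).join(hour)  # 'weather' columns were renamed in the starter cell

features = (['temp', 'hum', 'windspeed', 'weather_1', 'weather_2', 'weather_3']
            + list(hour.columns[1:]))  # drop hr_0 as the reference hour
X = model_data[features]
y = bike_data['casual']
lm = get_linear_model_metrics(X, y, linear_model.LinearRegression())

Compare the printed R-Squared against the .4 target above to judge whether your chosen features are good enough.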

####1: What's the strongest predictor?

Answer:

####2: How well did your model do?

Answer:

####3: How can you improve it?

Answer:

###Bonus:

We've completed a model that explains casual guest riders. Now it's your turn to build another model, using a different y (outcome) variable: registered riders.

Bonus 1: What's the strongest predictor?

Bonus 2: How well did your model do?

Bonus 3: How can you improve it?