Path: blob/master/april_18/lessons/lesson-06-alt/code/solution-code/solution-code-6.ipynb
Lesson 6 - Solution Code
Part 1:
Explore our mammals dataset
Check 1. Distribution
Let's check out a scatter plot of body weight and brain weight.
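A minimal sketch of that scatter plot. The lesson's actual mammals file isn't reproduced here, so this uses a small synthetic frame; the column names `bodywt` and `brainwt` follow the notebook's own text, but the values are invented:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lesson's mammals dataset: body weight spans
# several orders of magnitude, brain weight follows a rough power law.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)                      # kg
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6
                 + rng.normal(0, 0.1, 40))                 # kg
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

ax = mammals.plot.scatter(x="bodywt", y="brainwt")
ax.set_title("Body weight vs. brain weight")
# Most points crowd near the origin while a few heavy animals stretch the
# axes -- exactly the skew that makes the raw scatter hard to read.
```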
Log transformation can help here.
Curious about the math? http://onlinestatbook.com/2/transformations/log.html
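The transformation itself is one line per axis. Sketched below on a synthetic stand-in frame (assumed columns `bodywt`/`brainwt`); `log10` is used so axis values read back easily (0 → 1 kg, 3 → 1000 kg):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mammals dataset.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

# Log-transform both variables, then re-plot.
mammals["log_bodywt"] = np.log10(mammals["bodywt"])
mammals["log_brainwt"] = np.log10(mammals["brainwt"])
ax = mammals.plot.scatter(x="log_bodywt", y="log_brainwt")
# The points now spread evenly along a near-linear band.
```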
Woohoo! This looks much better.
# Part 1 - Student: Update and complete the code below to use lmplot and display correlations between body weight and two dependent variables: sleep_rem and awake.
Complete below for 2 new models:
With body weight as the x and y set as:
sleep_rem
awake
Create lmplots for sleep_rem and awake as a y, with variables you've already used as x.
#### Play around with other outcomes
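One possible shape for the exercise above, sketched on synthetic data. The real dataset's `sleep_rem` and `awake` columns aren't reproduced here, so the slopes and noise levels below are invented purely for illustration:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic frame with the three columns the exercise needs.
rng = np.random.default_rng(1)
n = 40
log_bodywt = rng.uniform(-2, 3, n)
df = pd.DataFrame({
    "bodywt": 10 ** log_bodywt,
    "sleep_rem": np.clip(2.0 - 0.3 * log_bodywt + rng.normal(0, 0.3, n), 0.1, 6),
    "awake": np.clip(12 + 1.5 * log_bodywt + rng.normal(0, 1.0, n), 4, 23),
})

# One lmplot per outcome, with body weight on the x axis.
for outcome in ["sleep_rem", "awake"]:
    sns.lmplot(data=df, x="bodywt", y=outcome)
```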
Decision for Check 1. Distribution
Answer: For this analysis we will log transform our data.
We decided above that we will need a log transformation. Let's take a look at both models to compare.
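The untransformed model can be fit with the statsmodels formula API. This sketch uses a synthetic mammals frame (assumed columns `bodywt`/`brainwt`), so its coefficients will not match the notebook's exact output:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the mammals dataset.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

# Model 1: untransformed linear fit.
raw_fit = smf.ols("brainwt ~ bodywt", data=mammals).fit()
print(raw_fit.params)      # intercept and bodywt slope
print(raw_fit.rsquared)    # share of variance explained
print(raw_fit.pvalues)
```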
Our output tells us that:
The relationship between bodywt and brainwt isn't random (p value approaching 0)
With this current model, brainwt is roughly bodywt * 0.0010
The model explains, roughly, 87% of the variance of the dataset
Student: repeat with the log transformation
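A sketch of the log-transformed fit. Patsy formulas can apply `np.log10` directly to both sides; the frame below is synthetic, so the fitted slope will only roughly resemble the notebook's 0.7652:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the mammals dataset.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})

# Model 2: log-log fit, transforming inside the formula.
log_fit = smf.ols("np.log10(brainwt) ~ np.log10(bodywt)", data=mammals).fit()
print(log_fit.summary())
```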
What does our output tell us?
Our output tells us that:
The relationship between bodywt and brainwt isn't random (p value approaching 0)
With this current model, log(brainwt) is roughly log(bodywt) * 0.7652
The model explains, roughly, 93% of the variance of the dataset (the largest errors being in the large brain and body sizes)
Bonus: Use Statsmodels to make the prediction
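One way to approach the bonus, continuing the synthetic log-log sketch: since the formula's left-hand side is `log10(brainwt)`, `predict` returns values on the log scale, which must be un-logged. The 100 kg animal is a made-up example input:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the mammals dataset, then the log-log fit.
rng = np.random.default_rng(0)
bodywt = 10 ** rng.uniform(-2, 3, 40)
brainwt = 10 ** (0.75 * np.log10(bodywt) - 1.6 + rng.normal(0, 0.1, 40))
mammals = pd.DataFrame({"bodywt": bodywt, "brainwt": brainwt})
log_fit = smf.ols("np.log10(brainwt) ~ np.log10(bodywt)", data=mammals).fit()

# Predict brain weight for a hypothetical 100 kg animal.
new_animal = pd.DataFrame({"bodywt": [100.0]})
log_pred = log_fit.predict(new_animal)   # prediction on the log10 scale
brain_pred_kg = 10 ** log_pred           # back to kilograms
print(brain_pred_kg)
```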
Part 2: Multiple Regression Analysis using citi bike data
In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables.
For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc.
For a linear regression, we want these variables to be largely independent of each other, but all of them should help explain the y variable.
We'll work with bikeshare data to showcase what this means and to explain a concept called multicollinearity.
## Check 2. Multicollinearity
What is multicollinearity?
With the bike share data, let's compare three data points: actual temperature, "feel" temperature, and guest ridership.
Our data is already normalized between 0 and 1, so we'll start off with the correlations and modeling.
Students:
Using the code from the demo, create a correlation heat map comparing 'temp', 'atemp', and 'casual'.
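A sketch of that heat map. The real bikeshare file isn't reproduced here, so this builds a small synthetic frame in which `atemp` tracks `temp` closely, mirroring the pattern the lesson describes:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic bikeshare-like frame: temp and atemp are normalized to [0, 1]
# as in the lesson, and atemp is constructed to track temp closely.
rng = np.random.default_rng(2)
n = 200
temp = rng.uniform(0, 1, n)
atemp = np.clip(temp + rng.normal(0, 0.03, n), 0, 1)
casual = 50 + 120 * temp + rng.normal(0, 40, n)   # guest riders
bikes = pd.DataFrame({"temp": temp, "atemp": atemp, "casual": casual})

corr = bikes[["temp", "atemp", "casual"]].corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```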
#### Question: What did we find?
The correlation matrix explains that:
both temperature fields are moderately correlated to guest ridership;
the two temperature fields are highly correlated to each other.
Including both of these fields in a model could introduce multicollinearity, which makes it more difficult for the model to determine which feature is affecting the predicted value.
### Demo: We can measure this effect in the coefficients:
Side note: this is a sneak peek at scikit-learn.
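A sketch of the demo's idea in scikit-learn, on the same kind of synthetic bikeshare-like data (invented coefficients; the notebook's numbers will differ). Fit each temperature field alone, then both together, and watch the coefficients move once the collinear pair is included:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic bikeshare-like data with near-duplicate temperature columns.
rng = np.random.default_rng(2)
n = 200
temp = rng.uniform(0, 1, n)
atemp = np.clip(temp + rng.normal(0, 0.03, n), 0, 1)
casual = 50 + 120 * temp + rng.normal(0, 40, n)
bikes = pd.DataFrame({"temp": temp, "atemp": atemp, "casual": casual})

# Compare coefficients across the three feature sets.
results = {}
for cols in (["temp"], ["atemp"], ["temp", "atemp"]):
    lm = LinearRegression().fit(bikes[cols], bikes["casual"])
    results[tuple(cols)] = (lm.coef_, lm.score(bikes[cols], bikes["casual"]))
    print(cols, lm.coef_.round(1), round(results[tuple(cols)][1], 3))
```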
Interpretation:
Even though the two-variable model temp + atemp explains more variance than either variable on its own, and both variables are considered significant (p-values approaching 0), we can see that together their coefficients are wildly different.
This can introduce error in how we explain models.
What happens if we use a second variable that isn't highly correlated with temperature, like humidity?
Guided Practice: Multicollinearity with dummy variables (15 mins)
There can be a similar effect from a feature set that forms a singular matrix, i.e. when the columns have an exact linear relationship (for example, a set of dummy columns that sums to 1 in every row).
Run through the following code on your own.
What happens to the coefficients when you include all weather situations instead of just including all except one?
Similar in Statsmodels
Students: Now drop one
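The drop-one step can be sketched with `pd.get_dummies`. The `weathersit` values below are a made-up sample; only the column mechanics matter:

```python
import pandas as pd

# Hypothetical weathersit column (1 = clearest ... 4 = worst weather).
weathersit = pd.Series([1, 1, 2, 3, 1, 2, 4, 1], name="weathersit")

# All four dummies: every row sums to 1, so together with an intercept
# the design matrix is singular.
all_dummies = pd.get_dummies(weathersit, prefix="weathersit")
print(all_dummies.sum(axis=1).unique())   # always 1

# Drop one level; the remaining coefficients are then read as differences
# relative to the dropped baseline category.
dropped = pd.get_dummies(weathersit, prefix="weathersit", drop_first=True)
print(dropped.columns.tolist())
```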
Interpretation:
This model makes more sense, because each coefficient is now easy to interpret relative to the level we left out.
For example, this suggests that a clear day (weathersit:1) on average brings in about 38 more riders hourly than a day with heavy snow.
In fact, since the weather situations "degrade" in quality (1 is the nicest day, 4 is the worst), the coefficients now reflect that well.
However, at this point there is still a lot of work to do, because weather on its own fails to explain ridership well.
With a partner, complete this code together and visualize the correlations of all the numerical features built into the data set.
We want to:
Identify categorical variables
Create dummies
Find at least two more features that are not correlated with current features, but could be strong indicators for predicting guest riders.
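One possible skeleton for the partner exercise, on a synthetic frame mimicking the bikeshare columns (`season` and `weathersit` are categorical codes; the continuous columns are normalized). Column names and relationships are assumptions, not the real data:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic bikeshare-like frame.
rng = np.random.default_rng(3)
n = 300
bikes = pd.DataFrame({
    "season": rng.integers(1, 5, n),        # categorical code 1..4
    "weathersit": rng.integers(1, 5, n),    # categorical code 1..4
    "temp": rng.uniform(0, 1, n),
    "hum": rng.uniform(0, 1, n),
    "windspeed": rng.uniform(0, 1, n),
})
bikes["casual"] = 40 + 100 * bikes["temp"] - 30 * bikes["hum"] + rng.normal(0, 20, n)

# 1) Identify the categorical codes and dummy them.
dummies = pd.get_dummies(bikes[["season", "weathersit"]].astype("category"),
                         prefix=["season", "weathersit"], drop_first=True)

# 2) Visualize correlations among the truly numeric features.
corr = bikes[["temp", "hum", "windspeed", "casual"]].corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1)
```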
Independent Practice: Building model to predict guest ridership (25 minutes)
Pay attention to:
Which variables would make sense to dummy (because they are categorical, not continuous)?
the distribution of riders (should we rescale the data?)
checking correlations with variables and guest riders
having a feature space (our matrix) with low multicollinearity
the linear assumption -- given all feature values being 0, should we have no ridership? negative ridership? positive ridership?
What features might explain ridership but aren't included in the data set?
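The checklist above can be turned into a minimal modeling pipeline. This sketch again uses a synthetic stand-in for the bikeshare frame (the real hourly CSV isn't reproduced here), so the fitted coefficients and r-squared are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the bikeshare frame.
rng = np.random.default_rng(4)
n = 500
bikes = pd.DataFrame({
    "season": rng.integers(1, 5, n),   # categorical -> dummy
    "temp": rng.uniform(0, 1, n),      # continuous, normalized
    "hum": rng.uniform(0, 1, n),       # continuous, low correlation with temp
})
bikes["casual"] = (40 + 100 * bikes["temp"] - 30 * bikes["hum"]
                   + 10 * (bikes["season"] == 3) + rng.normal(0, 20, n))

# Dummy the categorical (dropping one level), keep the continuous features.
X = pd.concat([bikes[["temp", "hum"]],
               pd.get_dummies(bikes["season"], prefix="season", drop_first=True)],
              axis=1)
model = LinearRegression().fit(X, bikes["casual"])
r2 = model.score(X, bikes["casual"])
print(dict(zip(X.columns, model.coef_.round(1))), round(r2, 3))
```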
### Outcomes: If your model has an r-squared above .4, this is a relatively effective model for the data available. Kudos! Move on to the bonus!
1: What's the strongest predictor? It depends on what you included in the model. Note that the largest impact could be in either the positive or negative direction (consider using absolute values to quickly evaluate).
2: How well did your model do? Check out your r-squared value and 95% CIs. We will dive into this topic more in the next class.
3: How can you improve it? Cross-validation for one! Next class...
### Bonus:
We've completed a model that explains casual guest riders. Now it's your turn to build another model, using a different y (outcome) variable: registered riders.
Bonus 1: What's the strongest predictor?
Bonus 2: How well did your model do?
Bonus 3: How can you improve it?