Path: blob/master/lessons/lesson_06/code/solution-code/solution-code-7rev.ipynb
Class 7 - Solution
###Create sample data and fit a model
File "<ipython-input-6-f61f2d8c60eb>", line 3
print metrics.mean_squared_error(df['y'], lm.predict(df[['x']]))
^
SyntaxError: invalid syntax
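A minimal, self-contained sketch of that cell, assuming simple synthetic data and the variable names (`df`, `lm`) shown in the traceback:

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, metrics

# Assumed sample data: a simple noisy linear relationship
np.random.seed(0)
df = pd.DataFrame({'x': np.arange(100)})
df['y'] = 2 * df['x'] + np.random.normal(0, 10, size=100)

# Fit an ordinary least squares model
lm = linear_model.LinearRegression()
lm.fit(df[['x']], df['y'])

# Python 3 print syntax (the original cell used Python 2's print statement)
print(metrics.mean_squared_error(df['y'], lm.predict(df[['x']])))
```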
Cross validation
Intro to cross validation with bike share data from last time. We will be modeling casual ridership.
####Create dummy variables and set outcome (dependent) variable
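A sketch of how this cell might look; the file name and the feature columns (`weathersit`, `temp`, `hum`) are assumptions, but the outcome is `casual` ridership as stated above:

```python
import pandas as pd

# Assumes the bike-share data from last class; the path and column names
# here are illustrative assumptions.
bikeshare = pd.read_csv('bikeshare.csv')

# One-hot encode a categorical feature and join it back to the features
weather = pd.get_dummies(bikeshare['weathersit'], prefix='weather')
modeldata = bikeshare[['temp', 'hum']].join(weather)

# Outcome (dependent) variable: casual ridership
y = bikeshare['casual']
```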
Create a cross validation with 5 folds
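A sketch using the current scikit-learn API (the original notebook may use the older `cross_validation` module); `modeldata` and `y` are assumed from the cell above:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negative MSE; flip the sign to get positive MSE
scores = cross_val_score(lm, modeldata, y, cv=kf,
                         scoring='neg_mean_squared_error')
print("MSE per fold:", -scores)
print("Mean cross-validated MSE:", -np.mean(scores))
```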
Check
While the cross validated approach here generated more overall error, which of the two approaches would predict new data more accurately: the single model or the cross validated, averaged one? Why?
Answer: the error will be lower with the single model in this case, but we're trading off bias error for generalization error; the cross-validated, averaged approach should predict new data more accurately.
###Advanced: There are ways to improve our model with regularization. Let's check out the effects on MSE and R2
Figuring out the alphas can be done by "hand"
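For example, looping over candidate alphas by hand might look like this (Ridge is used here for illustration; the notebook may use Lasso instead):

```python
import numpy as np
from sklearn import linear_model, metrics

# Try a range of alpha values "by hand" and watch how MSE changes
alphas = np.logspace(-10, -1, 10)
for a in alphas:
    ridge = linear_model.Ridge(alpha=a)
    ridge.fit(modeldata, y)
    mse = metrics.mean_squared_error(y, ridge.predict(modeldata))
    print('Alpha:', a, 'MSE:', mse)
```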
Or we can use grid search to make this faster
Best score
The mean squared error here comes back negative (scikit-learn reports it as a score to be maximized), so let's flip the sign to make it positive.
This explains which grid-search setup worked best.
This shows all the grid pairings and their performances.
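A sketch tying those cells together, written against the current scikit-learn API (`GridSearchCV` and `cv_results_`; older versions exposed `grid_search.GridSearchCV` and `grid_scores_`):

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# Grid search the same alphas; scoring is negative MSE, so larger is better
gs = GridSearchCV(
    estimator=linear_model.Ridge(),
    param_grid={'alpha': np.logspace(-10, -1, 10)},
    scoring='neg_mean_squared_error',
    cv=5)
gs.fit(modeldata, y)

# Best score comes back negative; flip the sign to make it a positive MSE
print("Best (positive) MSE:", -gs.best_score_)

# Which grid-search setup worked best
print(gs.best_estimator_)

# All the grid pairings and their performances
for params, mean_score in zip(gs.cv_results_['params'],
                              gs.cv_results_['mean_test_score']):
    print(params, -mean_score)
```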
Gradient Descent
For the DP example below, it might be a great idea for students to take the code and implement a stopping point, similar to what n_iter would do in gradient descent.
There can be a great conversation about the trade-off between stopping early and still getting roughly the right result versus taking longer to solve and producing a more precise model.
That solution is below.
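One way that stopping-point solution could look, as an illustrative sketch; the toy data, step size, and tolerance here are assumptions rather than the notebook's actual demo values:

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.0001, epsilon=1e-6, n_iter=1000):
    """Minimal 1-D gradient descent for y ~ m * x, with two stopping rules:
    stop after n_iter passes, or stop early once the improvement in MSE
    falls below epsilon (the early-stopping trade-off discussed above)."""
    m = 0.0
    prev_mse = np.inf
    for i in range(n_iter):
        preds = m * x
        mse = np.mean((y - preds) ** 2)
        if abs(prev_mse - mse) < epsilon:   # early stop: "good enough"
            break
        prev_mse = mse
        gradient = -2 * np.mean(x * (y - preds))
        m -= learning_rate * gradient
    return m, mse, i

# Assumed toy data, just to show the stopping behaviour
x = np.arange(100, dtype=float)
y = 2 * x + np.random.normal(0, 10, size=100)
print(gradient_descent(x, y))
```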
##Demo: Application of Gradient Descent
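A sketch of the demo using scikit-learn's `SGDRegressor`, left untuned so its R2 and MSE can be compared against the OLS results; `modeldata` and `y` are assumed from earlier cells:

```python
from sklearn import linear_model, metrics

# Untuned stochastic gradient descent regressor on the same data
sgd = linear_model.SGDRegressor()
sgd.fit(modeldata, y)

print("Gradient Descent R2:", sgd.score(modeldata, y))
print("Gradient Descent MSE:",
      metrics.mean_squared_error(y, sgd.predict(modeldata)))
```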
###Check: Untuned, how well did gradient descent perform compared to OLS?
Previous Result (from above):
Answer: similar R2, but MSE is lower for gradient descent.
#Independent Practice: Bike data revisited
There are many ways to approach a regression problem. The regularization techniques appended to ordinary least squares optimize the size of the coefficients to best account for error. Gradient descent also introduces a learning rate (how aggressively do we solve the problem?), epsilon (at what point do we say the error margin is acceptable?), and iterations (when should we stop no matter what?).
For this deliverable, our goals are to:
implement the gradient descent approach to our bike-share modeling problem,
show how gradient descent solves and optimizes the solution,
demonstrate the grid_search module!
While exploring the Gradient Descent regressor object, you'll build a grid search using the stochastic gradient descent estimator for the bike-share data set. Continue with either the model you evaluated last class or the simpler one from today. In particular, be sure to implement the "param_grid" in the grid search to get answers for the following questions:
With a set of alpha values between 10^-10 and 10^-1, how does the mean squared error change?
Based on the data, we know when to use l1 vs. l2 regularization. Using a grid search with l1_ratios between 0 and 1 (in increments of 0.05), does that statement hold true? If not, did gradient descent have enough iterations?
How do these results change when you alter the learning rate (eta0)?
Bonus: Can you see the advantages and disadvantages of using gradient descent after finishing this exercise?
Starter Code
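A possible starter scaffold, assuming an elastic-net penalty on `SGDRegressor` so that alpha, l1_ratio, and eta0 can all be searched; the specific grid values are placeholders keyed to the questions above:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# The full cross product is large; searching one parameter at a time
# (as the solution plots do) is also reasonable.
params = {
    'alpha': np.logspace(-10, -1, 10),          # question 1
    'l1_ratio': np.arange(0.0, 1.05, 0.05),     # question 2
    'eta0': [0.001, 0.01, 0.1, 0.5],            # question 3
}

gs = GridSearchCV(
    estimator=linear_model.SGDRegressor(penalty='elasticnet',
                                        learning_rate='invscaling'),
    param_grid=params,
    scoring='neg_mean_squared_error',
    cv=5)
gs.fit(modeldata, y)

print("Best (positive) MSE:", -gs.best_score_)
print(gs.best_estimator_)
```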
Independent Practice Solution
This code shows the variety of challenges and some common student gotchas; the plots help showcase what should be learned.
With a set of alpha values between 10^-10 and 10^-1, how does the mean squared error change?
We know when to use l1 vs. l2 regularization based on the data. Using a grid search with l1_ratios between 0 and 1 (in increments of 0.05), does that statement hold true?
(If it didn't look like it, did gradient descent have enough iterations?)
How do results change when you alter the initial learning rate (eta0)?
With the alphas available, it looks like the mean squared error stays generally flat at incredibly small alpha values, but as alpha grows toward the top of the range, the error begins to elbow. We probably don't see much of a difference in performance at the other alpha values.
At alpha values of either 0.1 or 1, the l1_ratio works best closer to 1! Interesting. At other values of alpha, students should see similar results, though the graphs aren't as clear.
Here it should be apparent that as the initial learning rate increases, the error also increases; when the initial learning rate is too high, there is a dramatic increase in error. Students should recognize the importance of the learning rate and what values it should be set to: smaller is generally better.
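One way the alpha plot could be produced, assuming `gs` is a fitted `GridSearchCV` whose grid varied only alpha:

```python
import numpy as np
import matplotlib.pyplot as plt

# Pull the searched alphas and their mean cross-validated MSE (made positive)
alphas = [p['alpha'] for p in gs.cv_results_['params']]
mse = -gs.cv_results_['mean_test_score']

plt.plot(np.log10(alphas), mse, marker='o')
plt.xlabel('log10(alpha)')
plt.ylabel('mean squared error')
plt.title('MSE vs. alpha')
plt.show()
```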