GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_06/code/starter-code/starter-code-7.ipynb
Kernel: Python [conda env:Anaconda3]

Class 7 - Starter Code

import numpy as np
import pandas as pd
from sklearn import linear_model, metrics

Create sample data and fit a model

df = pd.DataFrame({'x': range(100), 'y': range(100)})
biased_df = df.copy()
biased_df.loc[:20, 'x'] = 1
biased_df.loc[:20, 'y'] = 1

def append_jitter(series):
    jitter = np.random.random_sample(size=100)
    return series + jitter

df['x'] = append_jitter(df.x)
df['y'] = append_jitter(df.y)
biased_df['x'] = append_jitter(biased_df.x)
biased_df['y'] = append_jitter(biased_df.y)
df.head()
biased_df.head()
## fit
lm = linear_model.LinearRegression().fit(df[['x']], df['y'])
print(metrics.mean_squared_error(df['y'], lm.predict(df[['x']])))
0.126927599748127
## biased fit
lm = linear_model.LinearRegression().fit(biased_df[['x']], biased_df['y'])
print(metrics.mean_squared_error(df['y'], lm.predict(df[['x']])))
0.13087057695577367

Cross validation

Intro to cross validation with bike share data from last time. We will be modeling casual ridership.

from sklearn import cross_validation  # deprecated in 0.18; model_selection is the modern replacement

wd = '../../assets/dataset/'
bikeshare = pd.read_csv(wd + 'bikeshare.csv')
C:\Users\jlandesman\Anaconda3\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20. "This module will be removed in 0.20.", DeprecationWarning)

#### Create dummy variables and set the outcome (dependent) variable

weather = pd.get_dummies(bikeshare.weathersit, prefix='weather')
modeldata = bikeshare[['temp', 'hum']].join(weather[['weather_1', 'weather_2', 'weather_3']])
y = bikeshare.casual

Create a cross validation with 5 folds

kf = cross_validation.KFold(len(modeldata), n_folds=5, shuffle=True)
mse_values = []
scores = []
n = 0

print("~~~~ CROSS VALIDATION each fold ~~~~")
for train_index, test_index in kf:
    lm = linear_model.LinearRegression().fit(modeldata.iloc[train_index], y.iloc[train_index])
    mse_values.append(metrics.mean_squared_error(y.iloc[test_index], lm.predict(modeldata.iloc[test_index])))
    scores.append(lm.score(modeldata, y))  # note: this R2 is computed on the full dataset, not the held-out fold
    n += 1
    print('Model', n)
    print('MSE:', mse_values[n-1])
    print('R2:', scores[n-1])

print("~~~~ SUMMARY OF CROSS VALIDATION ~~~~")
print('Mean of MSE for all folds:', np.mean(mse_values))
print('Mean of R2 for all folds:', np.mean(scores))
~~~~ CROSS VALIDATION each fold ~~~~
Model 1
MSE: 1626.6025972208786
R2: 0.31191910435156234
Model 2
MSE: 1757.7461445641272
R2: 0.31192945997811383
Model 3
MSE: 1717.9364681090635
R2: 0.31193168013105443
Model 4
MSE: 1699.7992285897262
R2: 0.3119164006711529
Model 5
MSE: 1561.925140067944
R2: 0.31192445079699627
~~~~ SUMMARY OF CROSS VALIDATION ~~~~
Mean of MSE for all folds: 1672.801915710348
Mean of R2 for all folds: 0.31192421918577595
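Note: the sklearn.cross_validation module used above was removed in scikit-learn 0.20. Here is a rough sketch of the same fold loop with the newer model_selection API; it is an equivalent, not the code that produced the output above, and reshuffled folds will give slightly different numbers.

# Sketch only: the same 5-fold loop using sklearn.model_selection
# (assumes modeldata and y from the cells above)
from sklearn.model_selection import KFold, cross_val_score

kf_new = KFold(n_splits=5, shuffle=True)
fold_mse = []
for train_index, test_index in kf_new.split(modeldata):
    lm = linear_model.LinearRegression().fit(modeldata.iloc[train_index], y.iloc[train_index])
    fold_mse.append(metrics.mean_squared_error(y.iloc[test_index], lm.predict(modeldata.iloc[test_index])))
print('Mean of MSE for all folds:', np.mean(fold_mse))

# Or, more compactly (scikit-learn negates error scores):
print(-cross_val_score(linear_model.LinearRegression(), modeldata, y,
                       cv=5, scoring='neg_mean_squared_error').mean())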
lm = linear_model.LinearRegression().fit(modeldata, y)
print("~~~~ Single Model ~~~~")
print('MSE of single model:', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('R2:', lm.score(modeldata, y))
~~~~ Single Model ~~~~
MSE of single model: 1672.58110765
R2: 0.311934605989

Check

While the cross-validated approach here generated more overall error, which of the two approaches would predict new data more accurately: the single model or the cross-validated, averaged one? Why?

Answer:

There are ways to improve our model with regularization.
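For reference, ridge and lasso add a penalty on coefficient size to the ordinary least squares objective. These are the standard formulations (roughly how scikit-learn parameterizes them, up to scaling constants), shown here only as background:

$$\text{OLS: } \min_\beta \lVert y - X\beta \rVert_2^2 \qquad \text{Ridge: } \min_\beta \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2 \qquad \text{Lasso: } \min_\beta \tfrac{1}{2n} \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1$$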

Let's check out the effects on MSE and R2

lm = linear_model.LinearRegression().fit(modeldata, y)
print("~~~ OLS ~~~")
print('OLS MSE: ', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('OLS R2:', lm.score(modeldata, y))

lm = linear_model.Lasso().fit(modeldata, y)
print("~~~ Lasso ~~~")
print('Lasso MSE: ', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('Lasso R2:', lm.score(modeldata, y))

lm = linear_model.Ridge().fit(modeldata, y)
print("~~~ Ridge ~~~")
print('Ridge MSE: ', metrics.mean_squared_error(y, lm.predict(modeldata)))
print('Ridge R2:', lm.score(modeldata, y))
~~~ OLS ~~~
OLS MSE:  1672.58110765
OLS R2: 0.311934605989
~~~ Lasso ~~~
Lasso MSE:  1725.41581608
Lasso R2: 0.290199495922
~~~ Ridge ~~~
Ridge MSE:  1672.60490113
Ridge R2: 0.311924817843

Figuring out the alphas can be done by hand:

alphas = np.logspace(-10, 10, 21)
for a in alphas:
    print('Alpha:', a)
    lm = linear_model.Ridge(alpha=a)
    lm.fit(modeldata, y)
    print(lm.coef_)
    print(metrics.mean_squared_error(y, lm.predict(modeldata)))
Alpha: 1e-10 [ 112.68901765 -84.01121684 -24.68489063 -21.00314493 -21.71893628] 1672.58110765
Alpha: 1e-09 [ 112.68901765 -84.01121684 -24.6848906 -21.00314491 -21.71893626] 1672.58110765
Alpha: 1e-08 [ 112.68901765 -84.01121684 -24.6848904 -21.00314471 -21.71893606] 1672.58110765
Alpha: 1e-07 [ 112.68901763 -84.01121682 -24.68488837 -21.00314268 -21.71893403] 1672.58110765
Alpha: 1e-06 [ 112.68901745 -84.01121667 -24.68486804 -21.00312237 -21.71891373] 1672.58110765
Alpha: 1e-05 [ 112.68901562 -84.01121509 -24.68466472 -21.00291929 -21.71871079] 1672.58110765
Alpha: 0.0001 [ 112.68899732 -84.01119938 -24.68263174 -21.00088873 -21.71668161] 1672.58110765
Alpha: 0.001 [ 112.68881437 -84.01104228 -24.66232204 -20.98060316 -21.69640993] 1672.58110774
Alpha: 0.01 [ 112.68698753 -84.00947323 -24.46121539 -20.77973778 -21.49568404] 1672.58111645
Alpha: 0.1 [ 112.66896732 -83.99396383 -22.63109556 -18.95202277 -19.66942371] 1672.58185208
Alpha: 1.0 [ 112.50129738 -83.84805622 -13.38214934 -9.72671278 -10.46162477] 1672.60490113
Alpha: 10.0 [ 110.96062533 -82.49604961 -3.94431741 -0.51765034 -1.45024412] 1672.83347262
Alpha: 100.0 [ 97.69060562 -71.17602377 -0.31585194 1.18284675 -1.33281591] 1686.31830362
Alpha: 1000.0 [ 44.59923075 -30.85843772 5.07876321 0.05369643 -5.107457 ] 1937.81576044
Alpha: 10000.0 [ 7.03007064 -5.07733082 3.29039029 -1.2136063 -2.06842808] 2314.83675678
Alpha: 100000.0 [ 0.75195708 -0.56490872 0.52067881 -0.25075496 -0.26895254] 2415.77806566
Alpha: 1000000.0 [ 0.07576571 -0.05727511 0.05520142 -0.0273591 -0.02774349] 2429.28026459
Alpha: 10000000.0 [ 0.00758239 -0.00573569 0.0055535 -0.00276043 -0.00278317] 2430.68891798
Alpha: 100000000.0 [ 0.0007583 -0.00057365 0.00055569 -0.00027629 -0.00027841] 2430.83041212
Alpha: 1000000000.0 [ 7.58303020e-05 -5.73659720e-05 5.55719458e-05 -2.76314619e-05 -2.78414555e-05] 2430.84456787
Alpha: 10000000000.0 [ 7.58303603e-06 -5.73660542e-06 5.55722818e-06 -2.76317091e-06 -2.78415441e-06] 2430.84598351

Or we can use grid search to make this faster

from sklearn import grid_search  # deprecated; GridSearchCV now lives in sklearn.model_selection

alphas = np.logspace(-10, 10, 21)
gs = grid_search.GridSearchCV(
    estimator=linear_model.Ridge(),
    param_grid={'alpha': alphas},
    scoring='mean_squared_error')
gs.fit(modeldata, y)
GridSearchCV(cv=None, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
                       normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'alpha': array([ 1.00000e-10, 1.00000e-09, 1.00000e-08, 1.00000e-07,
       1.00000e-06, 1.00000e-05, 1.00000e-04, 1.00000e-03, 1.00000e-02, 1.00000e-01,
       1.00000e+00, 1.00000e+01, 1.00000e+02, 1.00000e+03, 1.00000e+04, 1.00000e+05,
       1.00000e+06, 1.00000e+07, 1.00000e+08, 1.00000e+09, 1.00000e+10])},
       pre_dispatch='2*n_jobs', refit=True, scoring='mean_squared_error', verbose=0)
Best score
print(gs.best_score_)
-1814.09369133
The mean squared error comes back negative here (scikit-learn reports it as a score to be maximized), so let's flip the sign to make it positive.
print(-gs.best_score_)
1814.09369133
This shows which grid search setup worked best:
print(gs.best_estimator_)
Ridge(alpha=10.0, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.001)
This shows all the grid pairings and their performance:
print(gs.grid_scores_)
[mean: -1817.58711, std: 542.14315, params: {'alpha': 1e-10}, mean: -1817.58711, std: 542.14315, params: {'alpha': 1.0000000000000001e-09}, mean: -1817.58711, std: 542.14315, params: {'alpha': 1e-08}, mean: -1817.58711, std: 542.14315, params: {'alpha': 9.9999999999999995e-08}, mean: -1817.58711, std: 542.14315, params: {'alpha': 9.9999999999999995e-07}, mean: -1817.58711, std: 542.14317, params: {'alpha': 1.0000000000000001e-05}, mean: -1817.58707, std: 542.14331, params: {'alpha': 0.0001}, mean: -1817.58663, std: 542.14477, params: {'alpha': 0.001}, mean: -1817.58230, std: 542.15933, params: {'alpha': 0.01}, mean: -1817.54318, std: 542.30102, params: {'alpha': 0.10000000000000001}, mean: -1817.20111, std: 543.63587, params: {'alpha': 1.0}, mean: -1814.09369, std: 556.35563, params: {'alpha': 10.0}, mean: -1818.51694, std: 653.68607, params: {'alpha': 100.0}, mean: -2125.58777, std: 872.45270, params: {'alpha': 1000.0}, mean: -2458.08836, std: 951.30428, params: {'alpha': 10000.0}, mean: -2532.21151, std: 962.80083, params: {'alpha': 100000.0}, mean: -2541.38479, std: 963.98339, params: {'alpha': 1000000.0}, mean: -2542.32833, std: 964.10141, params: {'alpha': 10000000.0}, mean: -2542.42296, std: 964.11321, params: {'alpha': 100000000.0}, mean: -2542.43242, std: 964.11439, params: {'alpha': 1000000000.0}, mean: -2542.43337, std: 964.11450, params: {'alpha': 10000000000.0}]
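Note: the grid_search module and the 'mean_squared_error' scorer name were removed in later scikit-learn releases. A rough sketch of the equivalent search with the newer API (not what produced the output above):

# Sketch only: the same ridge alpha search using sklearn.model_selection
# (assumes modeldata, y, and alphas from the cells above)
from sklearn.model_selection import GridSearchCV

gs_new = GridSearchCV(estimator=linear_model.Ridge(),
                      param_grid={'alpha': alphas},
                      scoring='neg_mean_squared_error')  # error scorers are now explicitly negated
gs_new.fit(modeldata, y)
print(-gs_new.best_score_)                    # best (positive) MSE
print(gs_new.best_estimator_)
print(gs_new.cv_results_['mean_test_score'])  # replaces grid_scores_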

Gradient Descent

num_to_approach, start, steps, optimized = 6.2, 0., [-1, 1], False

while not optimized:
    current_distance = num_to_approach - start
    got_better = False
    next_steps = [start + i for i in steps]
    for n in next_steps:
        distance = np.abs(num_to_approach - n)
        if distance < current_distance:
            got_better = True
            print(distance, 'is better than', current_distance)
            current_distance = distance
            start = n
    if got_better:
        print('found better solution! using', current_distance)
    else:
        optimized = True
        print(start, 'is closest to', num_to_approach)
5.2 is better than 6.2
found better solution! using 5.2
4.2 is better than 5.2
found better solution! using 4.2
3.2 is better than 4.2
found better solution! using 3.2
2.2 is better than 3.2
found better solution! using 2.2
1.2 is better than 2.2
found better solution! using 1.2
0.2 is better than 1.2
found better solution! using 0.2
6.0 is closest to 6.2

### Bonus: implement a stopping point, similar to what n_iter would do in gradient descent, for when we've reached "good enough"
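One possible sketch (not the only answer): stop once we are within a small epsilon of the target, or after a hard cap on iterations, similar to what n_iter does. The epsilon and max_iterations values below are illustrative only.

# One possible stopping-point sketch (epsilon and max_iterations chosen for illustration)
num_to_approach, start, steps = 6.2, 0., [-1, 1]
epsilon, max_iterations = 0.5, 100   # "good enough" threshold and a hard cap on iterations

for iteration in range(max_iterations):
    current_distance = np.abs(num_to_approach - start)
    if current_distance <= epsilon:
        break                              # close enough: stop early
    # move to whichever neighboring step is closest, if it improves on where we are
    best = min([start + i for i in steps], key=lambda n: np.abs(num_to_approach - n))
    if np.abs(num_to_approach - best) >= current_distance:
        break                              # no step improves: stop
    start = best

print(start, 'is within', epsilon, 'of', num_to_approach, 'after', iteration, 'iterations')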

## Demo: Application of Gradient Descent

lm = linear_model.SGDRegressor()
lm.fit(modeldata, y)
print("Gradient Descent R2:", lm.score(modeldata, y))
print("Gradient Descent MSE:", metrics.mean_squared_error(y, lm.predict(modeldata)))
Gradient Descent R2: 0.30853517891 Gradient Descent MSE: 1680.84459185

### Check: Untuned, how well did gradient descent perform compared to OLS?

Answer:

# Independent Practice: Bike data revisited

There are tons of ways to approach a regression problem. The regularization techniques added to ordinary least squares penalize the size of the coefficients to better control error. Gradient descent also introduces a learning rate (how aggressively we update the solution), epsilon (at what point we say the error margin is acceptable), and iterations (when should we stop, no matter what?).
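As a rough sketch, here is how those knobs map onto SGDRegressor's constructor; the values are illustrative only, and in the 0.18-era scikit-learn used above the iteration cap is called n_iter (newer releases rename it max_iter and add a tol stopping threshold):

# Illustrative parameter values only -- these are the knobs to tune in the exercise below
sgd = linear_model.SGDRegressor(
    eta0=0.01,                   # learning rate: how aggressively each update moves the coefficients
    learning_rate='invscaling',  # schedule for shrinking the step size over time
    n_iter=50,                   # stop after this many passes no matter what (max_iter in newer versions)
    penalty='elasticnet',        # mix of l1 and l2 regularization...
    alpha=0.0001,                # ...with this overall strength
    l1_ratio=0.15,               # ...and this l1/l2 balance
)
sgd.fit(modeldata, y)
print('R2:', sgd.score(modeldata, y))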

For this deliverable, our goals are to:

  • implement the gradient descent approach to our bike-share modeling problem,

  • show how gradient descent solves and optimizes the solution,

  • demonstrate the grid_search module!

While exploring the Gradient Descent regressor object, you'll build a grid search using the stochastic gradient descent estimator for the bike-share data set. Continue with either the model you evaluated last class or the simpler one from today. In particular, be sure to fill in param_grid in the grid search so you can answer the following questions:

  • With a set of alpha values between 10^-10 and 10^-1, how does the mean squared error change?

  • Based on the data, we know when to properly use l1 vs l2 regularization. Using a grid search with l1_ratio values between 0 and 1 (in steps of 0.05), does that statement hold true? If not, did gradient descent have enough iterations?

  • How do these results change when you alter the learning rate (eta0)?

Bonus: Can you see the advantages and disadvantages of using gradient descent after finishing this exercise?

Starter Code

params = {}  # put your gradient descent parameters here
gs = grid_search.GridSearchCV(
    estimator=linear_model.SGDRegressor(),
    cv=cross_validation.KFold(len(modeldata), n_folds=5, shuffle=True),
    param_grid=params,
    scoring='mean_squared_error',
)
gs.fit(modeldata, y)

print('BEST ESTIMATOR')
print(-gs.best_score_)
print(gs.best_estimator_)
print('ALL ESTIMATORS')
print(gs.grid_scores_)
## go for it!