Regularization
Polynomial regression
We are given the following "ground truth" function and a few sampled data points that we will use for regression. The example is by Mathieu Blondel & Jake Vanderplas (source).
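The code cell that defines the ground truth and the noisy samples is not reproduced here; a minimal sketch of what it might look like (the cosine shape and the names ground_truth, X_sample, y_sample are assumptions, loosely following the Blondel/Vanderplas example):

import numpy as np

def ground_truth(x):
    # Assumed ground-truth function (the Blondel/Vanderplas example uses a cosine)
    return np.cos(1.5 * np.pi * x)

rng = np.random.RandomState(0)
n_samples = 30
x_sample = np.sort(rng.rand(n_samples))                          # a few sample points
y_sample = ground_truth(x_sample) + 0.1 * rng.randn(n_samples)   # add some noise
X_sample = x_sample[:, np.newaxis]                               # scikit-learn expects a 2-D feature matrix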
Obviously, linear regression won't get you very far:
Now try a few polynomial regressions to fit the given sample data points.
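A sketch of such a fit, assuming the X_sample / y_sample names from the sketch above (degree 1 corresponds to the plain linear fit mentioned before):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

for degree in (1, 3, 5):
    # Polynomial regression = polynomial feature expansion + linear regression
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_sample, y_sample)
    print(degree, model.score(X_sample, y_sample))   # R^2 on the training sample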
It's actually a result from algebra that you can fit any finite set of data points exactly with a polynomial.
In fact, for any set of n + 1 data points (with distinct x-values), there exists a polynomial of degree n that goes right through them.
This is great if you want to approximate your data arbitrarily closely.
It's not so great if you're afraid of overfitting your data.
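For example, a degree-4 polynomial passes exactly through 5 points (a small numpy illustration, not from the notebook):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, -2.0, 0.5, 3.0, -1.0])

# With 5 points there is a polynomial of degree 4 through all of them
coeffs = np.polyfit(x, y, deg=len(x) - 1)
print(np.allclose(np.polyval(coeffs, x), y))   # True: the fit is exact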
Overfitting
Suppose you want to find a model behind some data which also contains some arbitrary noise.
Obviously, you could fit this noise with an arbitrarily complex model.
But that is clearly not what we want.
It seems that a second- or third-degree polynomial performs better than a fifth-degree one on unseen data, which makes sense, since that's how we generated the samples.
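One way to make that comparison concrete is to cross-validate each degree on held-out data (a sketch, again assuming X_sample / y_sample from above; cross_val_score stands in for however the notebook splits the data):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

for degree in (1, 2, 3, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_score = cross_val_score(model, X_sample, y_sample, cv=5).mean()
    print(degree, round(cv_score, 3))   # mean R^2 on unseen folds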
Let's compare the different models once more:
Regularization
If your model is very complex (e.g., lots of features, possibly a polynomial fit, etc.), you need to worry more about overfitting.
You'll need regularization when your model is complex relative to the amount of data, which happens when you have few samples or many features.
The example below uses the same dataset as above, but with fewer samples and a relatively high-degree model.
We'll fit the (unregularized) LinearRegression, as well as the (regularized) Ridge and Lasso models.
Lasso regression imposes an L1 prior on the coefficients, causing many coefficients to be exactly zero.
Ridge regression imposes an L2 prior on the coefficients, making large outlier coefficients less likely and keeping the coefficients small across the board.
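A sketch of fitting the three models on a high-degree polynomial expansion of the smaller sample (the reduced X_small / y_small names and the alpha values are assumptions for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A reduced sample, e.g. the first 8 points of the sample above (assumption)
X_small, y_small = X_sample[:8], y_sample[:8]

for estimator in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.01)):
    model = make_pipeline(PolynomialFeatures(degree=9), estimator)
    model.fit(X_small, y_small)
    coefs = model.steps[-1][1].coef_   # coefficients of the final estimator
    print(type(estimator).__name__,
          'train R^2 =', round(model.score(X_small, y_small), 3),
          '| coefficients:', np.round(coefs, 2))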
Indeed, the unregularized LinearRegression leads to a model that is too complex and tries to fit the noise.
Note the differences in the (averaged) mean squared error (MSE), as well as the difference in model complexity visible in the plots.
Note, however, that this metric by itself is not very helpful here.
Increasing complexity
Let's try a few degrees with a regularized model.
test_models = [LinearRegression(), Ridge(alpha=10), Lasso(alpha=10)]

scores = [analyze_performance(my_model) for my_model in test_models]

As written, this cell raised ValueError: Found input variables with inconsistent numbers of samples: [8, 30]. The error comes from inside analyze_performance, where model.fit(X, y_sample) is called with an X of 8 rows but a y_sample of 30 values; the fit and score calls must use a feature matrix and target drawn from the same sample.
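analyze_performance is defined earlier in the notebook and only partially visible in the traceback; a runnable sketch with the mismatch fixed might look like this (cross_val_score stands in for the notebook's loop of 15 random splits; X_sample / y_sample are the assumed names from above):

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def analyze_performance(test_model):
    scores = {'overfit': {}, 'cross_validation': {}}
    for degree in range(0, 30):
        model = make_pipeline(PolynomialFeatures(degree), test_model)
        # Fit and score on the *same* sample so X and y lengths agree
        scores['overfit'][degree] = model.fit(X_sample, y_sample).score(X_sample, y_sample)
        # Average a few held-out R^2 scores instead of the training score
        scores['cross_validation'][degree] = cross_val_score(
            model, X_sample, y_sample, cv=5).mean()
    return scores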
We could try a few different values for alpha as well; a sketch follows below.
We see that Ridge and Lasso keep performing well for higher degrees, because of their regularization.
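A sketch of such an alpha sweep (the alpha values and the degree are illustrative; names assumed from above):

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

for Model in (Ridge, Lasso):
    for alpha in (0.01, 0.1, 1, 10, 100):
        model = make_pipeline(PolynomialFeatures(degree=15), Model(alpha=alpha))
        cv_score = cross_val_score(model, X_sample, y_sample, cv=5).mean()
        print(Model.__name__, alpha, round(cv_score, 3))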
Exercises
(Not verified yet.)
Take a dataset from the previous Linear Regression notebook (e.g., Princeton salaries or Boston house prices) and try to repeat the exercises using regularization.
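A possible starting point (load_diabetes is just a built-in stand-in here; swap in the Princeton salaries or Boston house prices data you used before):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)   # stand-in dataset with several features

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))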