Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_10-sub-Jacob_Koehler/04-Regression-Regularization.ipynb
1904 views
Kernel: Python 3

Regularized Methods

  • Feature Scaling

  • Test/Train split

  • Ridge, LASSO, Elastic Net Regression methods


In a regular linear scenario, we start with a regular linear function.

y^=b+ax0\hat y = b + ax_0

The mean square error of these predictions would be given by:

RSS(a,b)=i=1n(yi(axi+b))2RSS(a, b) = \sum_{i = 1}^n(y_i - (ax_i + b))^2

From this basic MSEMSE formulation, we can introduce some Regularized methods that add a regularization term to the MSEMSE. We will look at three methods that offer slight variations on this term.

Feature Scaling

To use these methods, we want to scale our data. Many Machine Learning algorithms don't do well with data operating on very different scales. Using the MinMaxScaler normalizes the data and brings the values between 0 and 1. The StandardScaler method is less sensitive to wide ranges of values. We will use both on our Ames housing data. To begin, we need to select the numeric columns from the DataFrame so we can transform them only.

%matplotlib notebook import matplotlib.pyplot as plt import numpy as np import pandas as pd
#get our data and select the numer ames = pd.read_csv('data/ames_housing.csv') y = ames['SalePrice'] ames = ames.drop('SalePrice', axis = 1)
ames_numeric = ames.select_dtypes(include = 'int64') ames_numeric.head()

Using the Scaler on a DataFrame

Below, we can compare the results of the two scaling transformations by passing a list of column names to the scaler. Note the practice of initializing the object, fitting it, and transforming.

from sklearn.preprocessing import StandardScaler, MinMaxScaler
std_scaled = StandardScaler() minmax_scaled = MinMaxScaler()
cols = ames_numeric.columns
std_df = std_scaled.fit_transform(ames[[name for name in cols]]) minmax_df = minmax_scaled.fit_transform(ames[[name for name in cols]])
pd.DataFrame(std_df).head()
pd.DataFrame(minmax_df).head()

Fit a Linear Model on Scaled Data

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
y = np.log(y)
ames_numeric_scaled = std_scaled.fit_transform(ames[[name for name in cols]])
lm.fit(ames_numeric_scaled, y)
/anaconda3/lib/python3.6/site-packages/scipy/linalg/basic.py:1226: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver. warnings.warn(mesg, RuntimeWarning)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
from sklearn.metrics import mean_squared_error
predictions = lm.predict(ames_numeric_scaled)
mse = mean_squared_error(y, predictions)
rmse = np.sqrt(mse) score = lm.score(ames_numeric_scaled, predictions)
print('R-squared score: {}'.format(score), '\nRMSE: {:.4f}'.format(rmse))
R-squared score: 1.0 RMSE: 0.1457

Splitting the Data

As we have seen, we will tend to overfit the data if we use the entire dataset to determine the model. To account for this, we will split our datasets into a training set to build our model on, and a test set to evaluate the performance of the model. We have a handy sklearn method for doing this, who by default splits the data into 80% for training and 20% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ames_numeric_scaled, y)
lm.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
pred = lm.predict(X_test)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse) rmse
0.1907947761739675

Regularized Methods Comparison

crime = pd.read_csv('data/crime_data.csv', index_col = 'Unnamed: 0')
crime.head()
y = crime['ViolentCrimesPerPop']
X = crime.drop('ViolentCrimesPerPop', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = MinMaxScaler() X_train_scaled = scaler.fit_transform(X_train) X_test_scaled = scaler.fit_transform(X_test)
lm = LinearRegression() lm.fit(X_train_scaled, y_train) predictions = lm.predict(X_test_scaled) rmse = np.sqrt(mean_squared_error(y_test, predictions)) score = lm.score(X_test_scaled, y_test) print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))
The r2 value is : 0.4786 The RMSE value is 415.6595

Ridge Regression

RSS(w,b)=i=1N(yi(wxi+b))2+αj=1pwj2RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + \alpha \sum_{j = 1}^p w_j^2

Many feature coefficients will be determined with small values. Larger α\alpha means larger penalty, zero is base LinearRegression, and the default for sklearn's implementation is 1.0.

from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha = 1, solver = "cholesky")
ridge_reg.fit(X_train_scaled, y_train)
Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='cholesky', tol=0.001)
rpred = ridge_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, rpred)) score = ridge_reg.score(X_test_scaled, y_test) print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))
The r2 value is : 0.4757 The RMSE value is 416.7998
np.sum(ridge_reg.coef_ != 0)
88
crime.shape
(1994, 89)
ridge_reg = Ridge(alpha = 20, solver = "cholesky") ridge_reg.fit(X_train_scaled, y_train) rpred = ridge_reg.predict(X_test_scaled) rmse = np.sqrt(mean_squared_error(y_test, rpred)) score = ridge_reg.score(X_test_scaled, y_test) print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))
The r2 value is : 0.5295 The RMSE value is 394.8400

Lasso Regression

RSS(w,b)=i=1N(yi(wxi+b))2+αj=1pwjRSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + \alpha \sum_{j = 1}^p |w_j|

Now, we end up in effect setting variables with low influence to a coefficient of zero. Compared to Ridge, we would use Lasso if there are only a few variables with substantial effects.

from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha = 2.0)
lasso_reg.fit(X_train_scaled, y_train)
Lasso(alpha=2.0, copy_X=True, fit_intercept=True, max_iter=1000, normalize=False, positive=False, precompute=False, random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
lpred = lasso_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, lpred)) score = ridge_reg.score(X_test_scaled, y_test) print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))
The r2 value is : 0.5295 The RMSE value is 423.7267
np.sum(lasso_reg.coef_ != 0)
18
for e in sorted (list(zip(list(X), lasso_reg.coef_)), key = lambda e: -abs(e[1])): if e[1] != 0: print('\t{}, {:.3f}'.format(e[0], e[1]))
PctKidsBornNeverMar, 1987.236 PctKids2Par, -842.249 PctVacantBoarded, 343.748 PctHousOccup, -331.982 PctForeignBorn, 328.357 MalePctDivorce, 292.700 NumInShelters, 248.496 MedOwnCostPctIncNoMtg, -192.645 PctWorkMom, -141.906 pctWInvInc, -136.523 pctUrban, 123.742 PctEmplManu, -113.120 MedYrHousBuilt, 82.142 pctWPubAsst, 71.706 agePct12t29, -59.647 RentQrange, 47.841 PctSameCity85, 45.718 OwnOccHiQuart, 15.365

Elastic Net

RSS(w,b)=i=1N(yi(wxi+b))2+rαi=1nwj+1r2αj=1pwj2RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + r\alpha\sum_{i = 1}^n |w_j| + \frac{1-r}{2} \alpha \sum_{j = 1}^p w_j^2
from sklearn.linear_model import ElasticNet
elastic_reg = ElasticNet(alpha = .05, l1_ratio=0.4) elastic_reg.fit(X_train_scaled, y_train) epred = elastic_reg.predict(X_test_scaled) rmse = np.sqrt(mean_squared_error(y_test, epred)) rmse
395.9689402572352
ridge_score = ridge_reg.score(X_test_scaled, y_test) lasso_score = lasso_reg.score(X_test_scaled, y_test) elastic_score = elastic_reg.score(X_test_scaled, y_test)
print("Ridge: {:.4f}".format(ridge_score), "\nLasso: {:.4f}".format(lasso_score), "\nElastic Net: {:.4f}".format(elastic_score))
Ridge: 0.5295 Lasso: 0.4582 Elastic Net: 0.5268
pd.DataFrame(scaled, columns = cols)

PROBLEM

Return to your Ames Data. We have covered a lot of ground today, so let's summarize the things we could do to improve the performance of our original model that compared the Above Ground Living Area to the Logarithm of the Sale Price.

1. Clean data, drop missing values 2. Transform data, code variables using either ordinal values or OneHotEncoder methods 3. Create more features from existing features 4. Split our data into testing and training sets 5. Normalize quantitative features 6. Use Regularized Regression methods and Polynomial regression to improve performance of model
Can you use some or all of these ideas to improve upon your initial model?

Additional Resources

The last two lessons have pulled heavily from these resources. I recommend them all strongly as excellent resources: