GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_10-sub-Jacob_Koehler/04-Regression-Regularization.ipynb
Kernel: Python 3

Regularized Methods

Feature Scaling
Test/Train split
Ridge, LASSO, Elastic Net Regression methods

In an ordinary linear regression, we start with a linear function of a single feature:

\hat y = ax + b

The residual sum of squares (RSS) of these predictions is given by:

RSS(a, b) = \sum_{i = 1}^n(y_i - (ax_i + b))^2

Dividing the RSS by $n$ gives the mean squared error ($MSE$). From this basic formulation, we can introduce regularized methods that add a penalty term to the loss. We will look at three methods that offer slight variations on this term.
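
To make the two quantities concrete, here is a minimal sketch with made-up numbers. The arrays `x` and `y` and the helper names `rss` and `mse` are illustrative assumptions, not part of the lesson's data. The mean squared error is the residual sum of squares divided by the number of observations.

```python
import numpy as np

def rss(a, b, x, y):
    """Residual sum of squares for the line y_hat = a * x + b."""
    residuals = y - (a * x + b)
    return float(np.sum(residuals ** 2))

def mse(a, b, x, y):
    """Mean squared error: the RSS averaged over the n observations."""
    return rss(a, b, x, y) / len(x)

# Toy data (hypothetical): points lying close to the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.5, 4.5, 7.0])

print(rss(2.0, 1.0, x, y))  # residuals are 0, 0.5, -0.5, 0
print(mse(2.0, 1.0, x, y))
```

Because the two differ only by the constant factor 1/n, the same values of a and b minimize both.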

Feature Scaling

To use these methods, we want to scale our data. Many machine learning algorithms perform poorly when features are on very different scales, and the penalty terms below treat all coefficients alike, so they are only meaningful when the features are comparable. The MinMaxScaler rescales each feature linearly to the range 0 to 1. The StandardScaler centers each feature at zero with unit variance, which makes it less sensitive to a few extreme values than the MinMaxScaler. We will use both on our Ames housing data. To begin, we select the numeric columns from the DataFrame so that we transform only those.
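
The two transformations can also be written out by hand. The sketch below uses a made-up column (`area`) and hypothetical helper names (`min_max_scale`, `standardize`). It illustrates the formulas only and is not scikit-learn's implementation.

```python
import numpy as np

def min_max_scale(col):
    """Rescale a 1-D array linearly so its minimum maps to 0 and its maximum to 1."""
    return (col - col.min()) / (col.max() - col.min())

def standardize(col):
    """Center a 1-D array at 0 and divide by its population standard deviation."""
    return (col - col.mean()) / col.std()

# Hypothetical column with one very large value
area = np.array([800.0, 1000.0, 1200.0, 1400.0, 5600.0])

mm = min_max_scale(area)
z = standardize(area)
print(mm)  # the large value is pinned at 1 and the remaining values are squeezed toward 0
print(z)   # mean 0 and standard deviation 1
```

With min-max scaling, a single extreme value determines the whole range, so the other four values end up in the lowest eighth of the interval.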

In [1]:

%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:

# load the data and separate the target (SalePrice) from the features
ames = pd.read_csv('data/ames_housing.csv')
y = ames['SalePrice']
ames = ames.drop('SalePrice', axis = 1)

In [3]:

# keep only the integer-typed columns (float columns are excluded)
ames_numeric = ames.select_dtypes(include = 'int64')
ames_numeric.head()

Out[3]:

Using the Scaler on a DataFrame

Below, we can compare the results of the two scaling transformations by passing the numeric columns of the DataFrame to each scaler. Note the pattern of initializing the object, then fitting and transforming in one step with fit_transform.

In [4]:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [5]:

std_scaled = StandardScaler()
minmax_scaled = MinMaxScaler()

In [6]:

cols = ames_numeric.columns

In [7]:

# fit_transform returns NumPy arrays, not DataFrames
std_df = std_scaled.fit_transform(ames[cols])
minmax_df = minmax_scaled.fit_transform(ames[cols])

In [8]:

pd.DataFrame(std_df).head()

Out[8]:

In [9]:

pd.DataFrame(minmax_df).head()

Out[9]:

Fit a Linear Model on Scaled Data

In [10]:

from sklearn.linear_model import LinearRegression

In [11]:

lm = LinearRegression()

In [12]:

y = np.log(y)

In [13]:

ames_numeric_scaled = std_scaled.fit_transform(ames[cols])

In [14]:

lm.fit(ames_numeric_scaled, y)

Out[14]:

/anaconda3/lib/python3.6/site-packages/scipy/linalg/basic.py:1226: RuntimeWarning: internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
  warnings.warn(mesg, RuntimeWarning)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [15]:

from sklearn.metrics import mean_squared_error

In [16]:

predictions = lm.predict(ames_numeric_scaled)

In [17]:

mse = mean_squared_error(y, predictions)

In [18]:

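rmse = np.sqrt(mse)
# score against the observed targets y; passing `predictions` as the target
# compares the model with itself and always yields 1.0, as in the output recorded below
score = lm.score(ames_numeric_scaled, y)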

In [19]:

print('R-squared score: {}'.format(score), '\nRMSE: {:.4f}'.format(rmse))

Out[19]:

R-squared score: 1.0 
RMSE: 0.1457
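
The R-squared value returned by a regressor's `score(X, y)` is 1 - RSS/TSS, where TSS is the total sum of squares of the targets around their mean. The second argument should therefore be the observed targets; a model scored against its own predictions always gets exactly 1.0. A minimal sketch with made-up numbers (the helper name `r_squared` and both arrays are illustrative assumptions):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - rss / tss

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

print(r_squared(y_true, y_pred))  # below 1: the predictions are imperfect
print(r_squared(y_pred, y_pred))  # exactly 1: predictions compared with themselves
```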

Splitting the Data

As we have seen, we will tend to overfit the data if we use the entire dataset both to fit the model and to evaluate it. To account for this, we split the dataset into a training set used to build the model and a test set used to evaluate its performance. sklearn provides a handy function for this, train_test_split, which by default puts 75% of the rows in the training set and 25% in the test set.
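
Conceptually, a shuffled split only permutes the row indices and cuts them into two disjoint groups. The sketch below shows that idea in plain NumPy. The function name `shuffle_split`, its `test_fraction` and `seed` parameters, and the toy arrays are hypothetical, and this is not how scikit-learn implements its splitter.

```python
import numpy as np

def shuffle_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the row indices, then hold out the first n_test of them for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): 20 rows, 2 features
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20)

X_train, X_test, y_train, y_test = shuffle_split(X, y, test_fraction=0.25, seed=0)
print(X_train.shape, X_test.shape)
```

Fixing the seed makes the split repeatable, which helps when comparing models on the same held-out rows.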

In [20]:

from sklearn.model_selection import train_test_split

In [21]:

X_train, X_test, y_train, y_test = train_test_split(ames_numeric_scaled, y)

In [22]:

lm.fit(X_train, y_train)

Out[22]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [23]:

pred = lm.predict(X_test)

In [24]:

mse = mean_squared_error(y_test, pred)

In [25]:

rmse = np.sqrt(mse)
rmse

Out[25]:

0.1907947761739675

Regularized Methods Comparison

In [26]:

crime = pd.read_csv('data/crime_data.csv', index_col = 'Unnamed: 0')

In [27]:

crime.head()

Out[27]:

In [28]:

y = crime['ViolentCrimesPerPop']

In [29]:

X = crime.drop('ViolentCrimesPerPop', axis = 1)

In [30]:

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [31]:

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
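# reuse the scaling learned from the training data; refitting on the test set would scale it differently
X_test_scaled = scaler.transform(X_test)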
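
A common convention is to learn the scaling parameters from the training rows only and then reuse them on the test rows, so that both sets pass through the same mapping. The sketch below illustrates this with a toy class. The name `SimpleMinMax` and the arrays are hypothetical, and this is not scikit-learn's implementation.

```python
import numpy as np

class SimpleMinMax:
    """Toy min-max scaler: learns column minima and ranges in fit, applies them in transform."""

    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        return (X - self.min_) / self.range_

X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 15.0], [12.0, 30.0]])

scaler = SimpleMinMax().fit(X_train)       # parameters come from the training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # same mapping; values may fall outside [0, 1]

print(X_test_scaled)
```

A test value larger than anything seen in training maps above 1, which is expected: the test set is measured on the training set's scale.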

In [32]:

lm = LinearRegression()
lm.fit(X_train_scaled, y_train)
predictions = lm.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, predictions))
score = lm.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))

Out[32]:

The r2 value is : 0.4786 
The RMSE value is 415.6595

Ridge Regression

RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + \alpha \sum_{j = 1}^p w_j^2

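Ridge regression adds the sum of squared coefficients to the loss, which shrinks the coefficients toward zero without setting them exactly to zero, so many features end up with small coefficients. A larger $\alpha$ means a larger penalty, $\alpha = 0$ reduces to plain LinearRegression, and the default in sklearn's implementation is 1.0.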
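
The effect of the penalty can be seen from the closed-form solution of the ridge objective above. The sketch below is a minimal NumPy version on synthetic data. The function name `ridge_fit` and the data are assumptions for illustration, and the intercept is left unpenalized by centering the data first.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form minimizer of RSS + alpha * sum(w_j ** 2), intercept not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data (hypothetical): three features with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=50)

alphas = (0.0, 1.0, 20.0, 1000.0)
norms = []
for alpha in alphas:
    w, b = ridge_fit(X, y, alpha)
    norms.append(float(np.linalg.norm(w)))
    print(alpha, np.round(w, 3), round(norms[-1], 3))
```

The norm of the coefficient vector shrinks as alpha grows, but the individual coefficients stay non-zero.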

In [33]:

from sklearn.linear_model import Ridge

In [34]:

ridge_reg = Ridge(alpha = 1, solver = "cholesky")

In [35]:

ridge_reg.fit(X_train_scaled, y_train)

Out[35]:

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='cholesky', tol=0.001)

In [36]:

rpred = ridge_reg.predict(X_test_scaled)

In [37]:

rmse = np.sqrt(mean_squared_error(y_test, rpred))
score = ridge_reg.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))

Out[37]:

The r2 value is : 0.4757 
The RMSE value is 416.7998

In [38]:

np.sum(ridge_reg.coef_ != 0)

Out[38]:

88

In [39]:

crime.shape

Out[39]:

(1994, 89)

In [40]:

ridge_reg = Ridge(alpha = 20, solver = "cholesky")
ridge_reg.fit(X_train_scaled, y_train)
rpred = ridge_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, rpred))
score = ridge_reg.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))

Out[40]:

The r2 value is : 0.5295 
The RMSE value is 394.8400

Lasso Regression

RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + \alpha \sum_{j = 1}^p |w_j|

The absolute-value penalty in effect sets the coefficients of variables with low influence to exactly zero, so Lasso also performs a form of feature selection. Compared to Ridge, we would use Lasso when we expect only a few variables to have substantial effects.
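
One way to see where the exact zeros come from is coordinate descent with soft-thresholding. The sketch below minimizes the objective as written above on synthetic data. The function names `soft_threshold` and `lasso_fit`, the data, and the choice of alpha are assumptions for illustration. scikit-learn scales the loss term differently, so its alpha values are not directly comparable with the one used here.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; values with |rho| <= t become exactly 0."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_fit(X, y, alpha, n_iter=200):
    """Coordinate descent for RSS + alpha * sum(|w_j|) on centered data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.zeros(X.shape[1])
    col_sq = np.sum(Xc ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(len(w)):
            # Residual with feature j's current contribution added back
            r_j = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ r_j
            w[j] = soft_threshold(rho, alpha / 2.0) / col_sq[j]
    b = y_mean - x_mean @ w
    return w, b

# Synthetic data (hypothetical): only the first two of five features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

w, b = lasso_fit(X, y, alpha=60.0)
print(np.round(w, 3))
print("non-zero coefficients:", int(np.sum(w != 0)))
```

A feature whose correlation with the current residual stays below the threshold is assigned a coefficient of exactly zero, which is the feature-selection behavior described above.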

In [41]:

from sklearn.linear_model import Lasso

In [42]:

lasso_reg = Lasso(alpha = 2.0)

In [43]:

lasso_reg.fit(X_train_scaled, y_train)

Out[43]:

Lasso(alpha=2.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [44]:

lpred = lasso_reg.predict(X_test_scaled)

In [45]:

rmse = np.sqrt(mean_squared_error(y_test, lpred))
score = lasso_reg.score(X_test_scaled, y_test)
print('The r2 value is : {:.4f}'.format(score), '\nThe RMSE value is {:.4f}'.format(rmse))

Out[45]:

The r2 value is : 0.4582 
The RMSE value is 423.7267

In [46]:

np.sum(lasso_reg.coef_ != 0)

Out[46]:

18

In [47]:

# list the non-zero coefficients, largest magnitude first
coefs = zip(X.columns, lasso_reg.coef_)
for name, coef in sorted(coefs, key = lambda pair: -abs(pair[1])):
    if coef != 0:
        print('\t{}, {:.3f}'.format(name, coef))

Out[47]:

	PctKidsBornNeverMar, 1987.236
	PctKids2Par, -842.249
	PctVacantBoarded, 343.748
	PctHousOccup, -331.982
	PctForeignBorn, 328.357
	MalePctDivorce, 292.700
	NumInShelters, 248.496
	MedOwnCostPctIncNoMtg, -192.645
	PctWorkMom, -141.906
	pctWInvInc, -136.523
	pctUrban, 123.742
	PctEmplManu, -113.120
	MedYrHousBuilt, 82.142
	pctWPubAsst, 71.706
	agePct12t29, -59.647
	RentQrange, 47.841
	PctSameCity85, 45.718
	OwnOccHiQuart, 15.365

Elastic Net

RSS(w, b) = \sum_{i = 1} ^ N (y_i - (wx_i + b))^2 + r\alpha\sum_{j = 1}^p |w_j| + \frac{1-r}{2} \alpha \sum_{j = 1}^p w_j^2
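
The mix ratio $r$ blends the two penalties: $r = 1$ leaves only the Lasso term and $r = 0$ leaves only half of the Ridge term. The sketch below evaluates just the penalty part of the formula for a made-up coefficient vector. The function name `elastic_net_penalty` is hypothetical; `r` plays the role that the `l1_ratio` argument plays in the ElasticNet call in the next cell.

```python
import numpy as np

def elastic_net_penalty(w, alpha, r):
    """Penalty r*alpha*sum|w_j| + (1 - r)/2 * alpha * sum(w_j^2), with r in [0, 1]."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return float(r * alpha * l1 + 0.5 * (1.0 - r) * alpha * l2)

# Hypothetical coefficients: sum|w| = 7 and sum w^2 = 25
w = [3.0, -4.0, 0.0]

print(elastic_net_penalty(w, alpha=2.0, r=1.0))  # pure L1 term: 2 * 7
print(elastic_net_penalty(w, alpha=2.0, r=0.0))  # pure L2 term: 0.5 * 2 * 25
print(elastic_net_penalty(w, alpha=2.0, r=0.4))  # mix: 0.4 * 2 * 7 + 0.3 * 2 * 25
```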

In [48]:

from sklearn.linear_model import ElasticNet

In [49]:

elastic_reg = ElasticNet(alpha = .05, l1_ratio=0.4)
elastic_reg.fit(X_train_scaled, y_train)
epred = elastic_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, epred))
rmse

Out[49]:

395.9689402572352

In [50]:

ridge_score = ridge_reg.score(X_test_scaled, y_test)
lasso_score = lasso_reg.score(X_test_scaled, y_test)
elastic_score = elastic_reg.score(X_test_scaled, y_test)

In [51]:

print("Ridge: {:.4f}".format(ridge_score), "\nLasso: {:.4f}".format(lasso_score),
      "\nElastic Net: {:.4f}".format(elastic_score))

Out[51]:

Ridge: 0.5295 
Lasso: 0.4582 
Elastic Net: 0.5268

In [ ]:

# wrap the scaled Ames array back into a DataFrame with its column names
pd.DataFrame(ames_numeric_scaled, columns = cols)

PROBLEM

Return to your Ames Data. We have covered a lot of ground today, so let's summarize the things we could do to improve the performance of our original model that compared the Above Ground Living Area to the Logarithm of the Sale Price.

1. Clean data, drop missing values
2. Transform data, code variables using either ordinal values or OneHotEncoder methods
3. Create more features from existing features
4. Split our data into testing and training sets
5. Normalize quantitative features
6. Use Regularized Regression methods and Polynomial regression to improve performance of model

Can you use some or all of these ideas to improve upon your initial model?

Additional Resources

The last two lessons have pulled heavily from these resources. I strongly recommend all of them:

scikit-learn documentation on Regression: http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow
James et al., An Introduction to Statistical Learning: With Applications in R
Philipp K. Janert, Data Analysis with Open Source Tools
University of Michigan Coursera class on Machine Learning with scikit-learn: https://www.coursera.org/learn/python-machine-learning
Stanford University course on Machine Learning: https://www.coursera.org/learn/machine-learning
