GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_05/code/linear_regression_sklearn_intro - this is in lesson 6.ipynb
¹⁹⁰⁴ views

Kernel: Python 3

Linear Regression

Authors: Kevin Markham (Washington, D.C.), Ed Podojil (New York City)

Learning Objectives

Define data modeling and simple linear regression.
Build a linear regression model using a data set that meets the linearity assumption using the scikit-learn library.
Understand and identify multicollinearity in a multiple regression.

Lesson Guide

Introduce the Bikeshare Data Set

We'll be working with a data set from Capital Bikeshare that was used in a Kaggle competition (data dictionary).

The objective of the competition is to predict total ridership of Capital Bikeshare in any given hour.

Demand forecasting is a common data science application. If we can predict the quantity of demand, total ridership in a given hour, we can create analytical tools to improve the bikeshare system. Some applications would be:

Find where to site new bikeshare stations and know how large of a station to build.
Calculate the expected wear and tear on bikes and what the replacement costs will be.
Use a slightly different research design to forecast full and empty stations and send a service vehicle to "rebalance" the bikes from one station to another, as sometimes bikeshare stations have no bikes or are completely full and prevent use of the station.

Businesses aren't new to demand forecasting, but older methods suffered from poor predictions at atypical small locations. Modern approaches incorporate clusters and online data from Twitter and Google Trends to improve prediction in these small locations.

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14
plt.style.use("fivethirtyeight")

Read In the Capital Bikeshare Data

In [2]:

# Read the data and set the datetime as the index.
url = '../assets/dataset/bikeshare/bikeshare.csv'
bikes = pd.read_csv(url, parse_dates=True)

Notice that we used index_col to set an index or primary key for our data. In this case, the index of each row will be set to the value of its datetime field.

We also ask Pandas to parse dates (if parse_dates=True, for the index only). So, rather than reading in a string, Pandas converts the index string to a datetime object.

In [3]:

# Preview the first five rows of the DataFrame.
bikes.head()

Out[3]:

What does each observation represent?

In [4]:

# A:

What is the response variable (as defined by Kaggle)?

In [5]:

# A:

How many features are there?

In [6]:

# A:

Variable	Description
datetime	hourly date + timestamp
season	1=winter, 2=spring, 3=summer, 4=fall
holiday	whether the day is considered a holiday
workingday	whether the day is neither a weekend nor holiday
weather	See Below
temp	temperature in Celsius
atemp	"feels like" temperature in Celsius
humidity	relative humidity
windspeed	wind speed
casual	number of non-registered user rentals initiated
registered	number of registered user rentals initiated
count	number of total rentals

Details on Weather Variable

1: Clear, Few clouds, Partly cloudy, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

"count" is a method in Pandas (and a very non-specific name), so it's best to name that column something else

In general, you may want to rename columns if it is not obvious what might be stored in them. Although we will only rename the target column here, a few examples might be to rename:

old name	new name
temp	temp_celcius
windspeed	windspeed_knots
casual	num_casual_users
registered	num_registered_users
season	season_num
holiday	is_holiday
workingday	is_workingday
humidity	humidity_percent

Without having to check, these new names make it obvious what is stored in each column. The downside is slightly longer column names, which could affect table readability in Jupyter. It would be ideal to use very specific names in CSV files to assist others reading them. In your own code, use whatever makes sense for your work -- if you are viewing lots of Pandas tables, you may want to use shorter names. However, readable specific names are preferred in Python code since it prevents mistakes.

In [7]:

# Use the .rename() method to rename count to total
bikes.rename(columns={'cnt':'total_rentals'}, inplace=True)

Visualizing the Data

It is important to have a general feeling for what the data looks like before building a model. Ideally, before creating the model you would have some sense of which variables might matter most to predict the response. This dataset is fairly intuitive (and the purpose of this lesson is not visualization), so we will keep the visualization short.

In [8]:

# Pandas scatterplot
bikes.plot(kind='scatter', x='temp', y='total_rentals', alpha=0.2);

Out[8]:

In [9]:

# Seaborn scatterplot with regression line
sns.lmplot(x='temp', y='total_rentals', data=bikes, aspect=1.5, scatter_kws={'alpha':0.2});

Out[9]:

Linear Regression Basics

Form of Linear Regression

Recall that each model always contains some amount of random irreducible error $\epsilon$ . So, given a prediction $\hat{y}$ , the actual $y = \hat{y} + \epsilon$ . Below, we will assume $y$ is exactly linear.

We are often taught the formula for a line is: $y = mx + b$ .
Note this can alternatively be written: $y = \alpha + \beta X$ .

Here, we will generalize this to $n$ independent variables as follows:

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon$

$y$ is the response.
$\beta_0$ is the intercept.
$\beta_1$ is the coefficient for $x_1$ (the first feature).
$\beta_n$ is the coefficient for $x_n$ (the nth feature).
$\epsilon$ is the error term

A practical example of this applied to our data might be:

$total\_rides = 20 + -2 \cdot temp + -3 \cdot windspeed\ +\ ...\ +\ 0.1 \cdot registered$

This equation is still called linear because the highest degree of the independent variables (e.g. $x_i$ ) is 1. Note that because the $\beta$ values are constants, they will not be independent variables in the final model, as seen above.

The $\beta$ values are called the model coefficients:

These values are estimated (or "learned") during the model fitting process using the least squares criterion.
Specifically, we are trying to find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").
Once we've learned these coefficients, we can use the model to predict the response.

In the diagram above:

The black dots are the observed values of x and y.
The blue line is our least squares line.
The red lines are the residuals, which are the vertical distances between the observed values and the least squares line.

Overview of Supervised Learning

Benefits and Drawbacks of scikit-learn

Benefits:

Consistent interface to machine learning models.
Provides many tuning parameters but with sensible defaults.
Exceptional documentation.
Rich set of functionality for companion tasks.
Active community for development and support.

Potential drawbacks:

Harder (than R) to get started with machine learning.
Less emphasis (than R) on model interpretability.
- scikit-learn tends not to run detailed statistical tests, e.g. ANOVA.
- For more detail on model fit, try the statsmodels library.

Ben Lorica: Six Reasons Why I Recommend scikit-learn

Requirements for Working With Data in scikit-learn

Features and response should be separate objects.
Features and response should be entirely numeric.
Features and response should be NumPy arrays (or easily converted to NumPy arrays).
Features and response should have specific shapes (outlined below).

Building a Linear Regression Model in sklearn

Create a feature matrix called X that holds a `DataFrame` with only the temp variable and a `Series` called y that has the "total_rentals" column.

In [10]:

# Create X and y.
feature_cols = ['temp']
X = bikes[feature_cols]
y = bikes.total_rentals

In [11]:

# Check X's type.
print((type(X)))
print((type(X.values)))

Out[11]:

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>

In [12]:

# Check y's type.
print((type(y)))
print((type(y.values)))

Out[12]:

<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>

In [13]:

# Check X's shape (n = number of observations, p = number of features).
print((X.shape))

Out[13]:

(17379, 1)

In [14]:

# Check y's shape (single dimension with length n).
# The comma indicates the datatype is a tuple.
print((y.shape))

Out[14]:

(17379,)

scikit-learn's Four-Step Modeling Pattern

Step 1: Import the class you plan to use.

In [15]:

from sklearn.linear_model import LinearRegression

Step 2: "Instantiate" the "estimator."

"Estimator" is scikit-learn's term for "model."
"Instantiate" means "make an instance of."

In [16]:

# Make an instance of a LinearRegression object.
lr = LinearRegression()
type(lr)

Out[16]:

sklearn.linear_model.base.LinearRegression

Created an object that "knows" how to do linear regression, and is just waiting for data.
Name of the object does not matter.
All parameters not specified are set to their defaults.
Can specify tuning parameters (aka "hyperparameters") during this step.

To view the possible parameters, either use the help built-in function or evaluate the newly instantiated model, as follows:

In [17]:

# help(lr)
lr

Out[17]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Step 3: Fit the model with data (aka "model training").

Model is "learning" the relationship between X and y in our "training data."
Process through which learning occurs varies by model.
Occurs in-place.

In [18]:

lr.fit(X, y)

Out[18]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Once a model has been fit with data, it's called a "fitted model."

Step 4: Predict the response for a new observation.

New observations are called "out-of-sample" data.
Uses the information it learned during the model training process.

In [19]:

# Per future warning, one-dimensional arrays must be reshaped using the following.
lr.predict(np.array([0]).reshape(1,-1))

Out[19]:

array([-0.03559611])

Let's ask the model to make two predictions, one when the temp is 0 and another when the temp is 10. To do this, our feature matrix is always a 2-D array where each row is a list of features. Since we only have a single feature, the temperature, each row will contain only a single value.

In [20]:

X_new = [[0], [10]]
lr.predict(X_new)

Out[20]:

array([-3.55961126e-02,  3.81291363e+03])

Returns a NumPy array, and we keep track of what the numbers "mean."
Can predict for multiple observations at once.

What we just predicted using our model is, "If the temperature is 0 degrees, the total number of bike rentals will be ~6.046, and if the temperature is 10 degrees the total number of bike rentals will ~97.751."

Build a Linear Regression Model

Let's specifically make a linear regression model and look at the intercept and coefficients.

Instantiate and fit a `LinearRegression` model on X and y from the `linear_model` section of scikit-learn.

In [21]:

# Import, instantiate, fit.
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X, y)

Out[21]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [22]:

# Print the coefficients.
print(linreg.intercept_)
print(linreg.coef_)

Out[22]:

-0.03559611264211071
[381.29492226]

Interpreting the intercept ( $\beta_0$ ):

It is the value of $y$ when all independent variables are 0.
Here, it is the estimated number of rentals when the temperature is 0 degrees Celsius.
Note: It does not always make sense to interpret the intercept. (Why?)

Interpreting the "temp" coefficient ( $\beta_1$ ):

Interpretation: An increase of 1 degree Celcius is associated with increasing the number of total rentals by $\beta_1$ .
Here, a temperature increase of 1 degree Celsius is associated with a rental increase of 9.17 bikes.
This is not a statement of causation.
$\beta_1$ would be negative if an increase in temperature was associated with a decrease in total rentals.
$\beta_1$ would be zero if temperature is not associated with total rentals.

Using the Model for Prediction

While plenty of insight can be found in reading coefficients, the most common uses of data science focus on prediction. In scikit-learn we can make predictions from a fitted model using .predict(), but we will also go through the calculation by hand to understand it.

How many bike rentals would we predict if the temperature was 25 degrees Celsius?

Explore the intercept and coefficients of the linear model.

You can search for "sklearn linear regression" and explore the attributes section of the documentation to learn how to do this.

In [23]:

# Manually calculate the prediction.

In [24]:

# Use the predict method.

Does the Scale of the Features Matter?

Let's say that temperature was measured in Fahrenheit, rather than Celsius. How would that affect the model?

In [25]:

# Create a new column for Fahrenheit temperature.
bikes['temp_F'] = bikes.temp * 1.8 + 32
bikes.head()

Out[25]:

In [26]:

# Seaborn scatterplot with regression line
sns.lmplot(x='temp_F', y='total_rentals', data=bikes, aspect=1.5, scatter_kws={'alpha':0.2});

Out[26]:

Rebuild the `LinearRegression` from above using the `temp_F` features instead.

In [27]:

# Create X and y.
feature_cols = ['temp_F']
X = bikes[feature_cols]
y = bikes.total_rentals

# Instantiate and fit.
linreg = LinearRegression()
linreg.fit(X, y)

# Print the coefficients.
print(linreg.intercept_)
print(linreg.coef_)

Out[27]:

-6778.611991831338
[211.83051237]

Convert 25 degrees Celsius to Fahrenheit.

In [28]:

25 * 1.8 + 32

Out[28]:

77.0

Predict rentals for 77 degrees Fahrenheit.

In [29]:

linreg.predict(77)

Out[29]:

array([9532.33746037])

Conclusion: The scale of the features is irrelevant for linear regression models. When changing the scale, we simply change our interpretation of the coefficients.

In [30]:

# Remove the temp_F column.
bikes.drop('temp_F', axis=1, inplace=True)

Work With Multiple Features

We've demonstrated simple linear regression with one feature to gain an intuition, but the benefit of modeling is the ability to reason about hundreds of features at once. There is no limit to the number of features you can use. However, often a small set of features accounts for most of the variance (assuming there is a linear relationship at all). We will start by using four features.

Visualizing the Data (Part 2)

Explore more features.

In [31]:

# Create feature column variables
feature_cols = ['temp', 'season', 'weather', 'humidity']

Create a subset of scatterplot matrix using Seaborn.

We can use pairplot with the y_vars argument to only show relationships with the total_rentals variable

In [32]:

# multiple scatterplots in Seaborn
sns.pairplot(bikes, x_vars=feature_cols, y_vars='total_rentals', kind='reg');

Out[32]:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'weather'

During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
<ipython-input-32-6103d56d2b6a> in <module>()
      1 # multiple scatterplots in Seaborn
----> 2 sns.pairplot(bikes, x_vars=feature_cols, y_vars='total_rentals', kind='reg');

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, size, aspect, dropna, plot_kws, diag_kws, grid_kws)
   2074     elif kind == "reg":
   2075         from .regression import regplot  # Avoid circular import
-> 2076         plotter(regplot, **plot_kws)
   2077 
   2078     # Add a legend
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py in map(self, func, **kwargs)
   1301 
   1302                     color = self.palette[k] if kw_color is None else kw_color
-> 1303                     func(data_k[x_var], data_k[y_var],
   1304                          label=label_k, color=color, **kwargs)
   1305 
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'weather'

Recreate the same functionality using Pandas.

In [ ]:

# Multiple scatterplots in Pandas
fig, axs = plt.subplots(1, len(feature_cols), sharey=True)
for index, feature in enumerate(feature_cols):
    bikes.plot(kind='scatter', x=feature, y='total_rentals', ax=axs[index], figsize=(16, 3))

In [ ]:

# alternative way in Pandas (might take a while)
# scatter_matrix does a pairplot of *every* column

grr = pd.tools.plotting.scatter_matrix(bikes[['total_rentals'] + feature_cols], figsize=(15, 15), alpha=0.7)

Are you seeing anything you didn't expect?

Explore the season variable using a cross-tab.

In [ ]:

# Cross-tabulation of season and month
pd.crosstab(bikes.season, bikes.index.month)

Explore the season variable using a box plot.

In [ ]:

# Box plot of rentals, grouped by season
bikes.boxplot(column='total_rentals', by='season');

Notably:

A line can't capture a nonlinear relationship.

Look at rentals over time.

In [ ]:

# Line plot of rentals
bikes.total_rentals.plot();

What does this tell us?

There are more rentals in the winter than the spring, but only because the system is experiencing overall growth and the winter months happen to come after the spring months.

Look at the correlation matrix for the bikes `DataFrame`.

In [ ]:

# Correlation matrix (ranges from 1 to -1)
bikes.corr()

Use a heat map to make it easier to read the correlation matrix.

In [ ]:

# Visualize correlation matrix in Seaborn using a heat map.
sns.heatmap(bikes.corr())

What relationships do you notice?

In [ ]:

# A:

Adding More Features to the Model

In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables.

For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc.
For a linear regression, we want these variables to be largely independent of one another, but all of them should help explain the y variable.

We'll work with bikeshare data to showcase what this means and to explain a concept called multicollinearity.

Create another `LinearRegression` instance that is fit using temp, season, weather, and humidity.

In [ ]:

# Create a list of features.
feature_cols = ['temp', 'season', 'weather', 'humidity']

In [ ]:

# Create X and y.
X = bikes[feature_cols]
y = bikes.total_rentals

# Instantiate and fit.
linreg = LinearRegression()
linreg.fit(X, y)

# Print the coefficients.
print(linreg.intercept_)
print(linreg.coef_)

Display the linear regression coefficient along with the feature names.

In [ ]:

# Pair the feature names with the coefficients.
list(zip(feature_cols, linreg.coef_))

Interpreting the coefficients:

Holding all other features fixed, a 1-unit increase in temperature is associated with a rental increase of 7.86 bikes.
Holding all other features fixed, a 1-unit increase in season is associated with a rental increase of 22.5 bikes.
Holding all other features fixed, a 1-unit increase in weather is associated with a rental increase of 6.67 bikes.
Holding all other features fixed, a 1-unit increase in humidity is associated with a rental decrease of 3.12 bikes.

Does anything look incorrect and does not reflect reality?

What Is Multicollinearity?

Multicollinearity happens when two or more features are highly correlated with each other. The problem is that due to the high correlation, it's hard to disambiguate which feature has what kind of effect on the outcome. In other words, the features mask each other.

There is a second related issue called variance inflation where including correlated features increases the variability of our model and p-values by widening the standard errors. This can be measured with the variance inflation factor, which we will not cover here.

With the bikeshare data, let's compare three data points: actual temperature, "feel" temperature, and guest ridership.

In [ ]:

cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = bikes[['temp', 'atemp', 'casual']].corr()
print(correlations)
print(sns.heatmap(correlations, cmap=cmap))

Create a linear model that predicts `total_rentals` using `temp` and `atemp`.

In [ ]:

# Create a list of features.
feature_cols = ['temp', 'atemp']

In [ ]:

# Create X and y.
X = bikes[feature_cols]
y = bikes.total_rentals

# Instantiate and fit.
linreg = LinearRegression()
linreg.fit(X, y)

# Print the coefficients.
print(linreg.intercept_)
print(linreg.coef_)

Go back and remove either `temp` or `atemp` from the feature list. How do the coefficients change?

In [ ]:

# A:

How to Select a Model

We can make linear models now, but how do we select the best model to use for our applications? We will offer a general procedure and a simple metric that works well in many cases. That said, it's important to keep the business context in mind and know that there are alternative metrics that can work better.

Feature Selection

How do we choose which features to include in the model? We're going to use train/test split (and eventually cross-validation).

Why not use p-values or R-squared for feature selection?

Linear models rely upon a lot of assumptions (such as the features being independent), and if those assumptions are violated, p-values and R-squared are less reliable. Train/test split relies on fewer assumptions.
If all of the assumptions of a linear model are met, p-values suggest a coefficient that differs from zero at a level of statistical significance. This does not mean that
1. the feature causes the response
2. the feature strongly predicts the response.
Adding features to your model that are unrelated to the response will always increase the R-squared value, and adjusted R-squared does not sufficiently account for this (although, AIC and BIC do).
p-values and R-squared are proxies for our goal of generalization, whereas train/test split and cross-validation attempt to directly estimate how well the model will generalize to out-of-sample data.

More generally:

There are different methodologies that can be used for solving any given data science problem, and this course follows a machine learning methodology.
This course focuses on general purpose approaches that can be applied to any model, rather than model-specific approaches.

Evaluation Metrics for Regression Problems

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. We need evaluation metrics designed for comparing continuous values.

Here are three common evaluation metrics for regression problems:

Mean absolute error (MAE) is the mean of the absolute value of the errors:

\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|

Mean squared error (MSE) is the mean of the squared errors:

\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2

Root mean squared error (RMSE) is the square root of the mean of the squared errors:

\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}

In [ ]:

# Example true and predicted response values
true = [10, 7, 5, 5]
pred = [8, 6, 5, 10]

Calculate MAE, MSE, and RMSE using imports from sklearn metrics and NumPy.

In [ ]:

# Calculate these metrics by hand!
from sklearn import metrics
import numpy as np
print('MAE:', metrics.mean_absolute_error(true, pred))
print('MSE:', metrics.mean_squared_error(true, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))

Let's compare these metrics:

MAE is the easiest to understand, because it's the average error.
MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world. Also, MSE is continuous and differentiable, making it easier to use than MAE for optimization.
RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

Here's an additional example, to demonstrate how MSE/RMSE punishes larger errors:

In [ ]:

# Same true values as above
true = [10, 7, 5, 5]

# New set of predicted values
pred = [10, 7, 5, 13]

# MAE is the same as before.
print('MAE:', metrics.mean_absolute_error(true, pred))

# MSE and RMSE are larger than before.
print('MSE:', metrics.mean_squared_error(true, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))

Comparing Models With Train/Test Split and RMSE

In [ ]:

from sklearn.model_selection import train_test_split

# Define a function that accepts a list of features and returns testing RMSE.
def train_test_rmse(df, feature_cols):
    X = df[feature_cols]
    y = df.total_rentals
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [ ]:

# Compare different sets of features.
print(train_test_rmse(bikes, ['temp', 'season', 'weather', 'humidity']))
print(train_test_rmse(bikes, ['temp', 'season', 'weather']))
print(train_test_rmse(bikes, ['temp', 'season', 'humidity']))

In [ ]:

# Using these as features is not allowed!
print(train_test_rmse(bikes, ['casual', 'registered']))

Comparing Testing RMSE With Null RMSE

Null RMSE is the RMSE that could be achieved by always predicting the mean response value. It is a benchmark against which you may want to measure your regression model.

In [ ]:

# Split X and y into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)

# Create a NumPy array with the same shape as y_test.
y_null = np.zeros_like(y_test, dtype=float)

# Fill the array with the mean value of y_test.
y_null.fill(y_test.mean())
y_null

In [ ]:

# Compute null RMSE.
np.sqrt(metrics.mean_squared_error(y_test, y_null))

Feature Engineering to Improve Performance

Machine learning models are very powerful, but they cannot automatically handle every aspect of our data. We have to explicitly modify our features to have relationships that our models can understand. In this case, we will need to pull out features to have a linear relationship with our response variable.

Handling Categorical Features

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

Ordered categories: Transform them to sensible numeric values (example: small=1, medium=2, large=3)
Unordered categories: Use dummy encoding (0/1). Here, each possible category would become a separate feature.

What are the categorical features in our data set?

Ordered categories: weather (already encoded with sensible numeric values)
Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)

For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables.

Create dummy variables using `get_dummies` from Pandas.

In [ ]:

season_dummies = pd.get_dummies(bikes.season, prefix='season')

Inspect the `DataFrame` of `dummies`.

In [ ]:

# Print five random rows.
season_dummies.sample(n=5, random_state=1)

However, we actually only need three dummy variables (not four), and thus we'll drop the first dummy variable.

Why? Because three dummies captures all of the "information" about the season feature, and implicitly defines spring (season 1) as the baseline level.

This circles back to the concept multicollinearity, except instead of one feature being highly correlated to another, the information gained from three features is directly correlated to the fourth.

Drop the first column.

In [ ]:

season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

Reinspect the `DataFrame` of `dummies`.

In [ ]:

# Print five random rows.
season_dummies.sample(n=5, random_state=1)

In general, if you have a categorical feature with k possible values, you create k-1 dummy variables.

If that's confusing, think about why we only need one dummy variable for holiday, not two dummy variables (holiday_yes and holiday_no).

We now need to concatenate the two `DataFrames` together.

In [ ]:

# Concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns).
bikes_dummies = pd.concat([bikes, season_dummies], axis=1)

# Print 5 random rows.
bikes_dummies.sample(n=5, random_state=1)

Rerun the linear regression with dummy variables included.

In [ ]:

# Include dummy variables for season in the model.
feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']
X = bikes_dummies[feature_cols]
y = bikes_dummies.total_rentals

linreg = LinearRegression()
linreg.fit(X, y)

list(zip(feature_cols, linreg.coef_))

How do we interpret the season coefficients? They are measured against the baseline (spring):

Holding all other features fixed, summer is associated with a rental decrease of 3.39 bikes compared to the spring.
Holding all other features fixed, fall is associated with a rental decrease of 41.7 bikes compared to the spring.
Holding all other features fixed, winter is associated with a rental increase of 64.4 bikes compared to the spring.

Would it matter if we changed which season was defined as the baseline?

No, it would simply change our interpretation of the coefficients.

In most situations, it is best to have the dummy that is your baseline be the category that has the largest representation.

Important: Dummy encoding is relevant for all machine learning models, not just linear regression models.

In [ ]:

# Compare original season variable with dummy variables.
print(train_test_rmse(bikes_dummies, ['temp', 'season', 'humidity']))
print(train_test_rmse(bikes_dummies, ['temp', 'season_2', 'season_3', 'season_4', 'humidity']))

Feature Engineering

See if you can create the following features:

hour: as a single numeric feature (0 through 23)
hour: as a categorical feature (use 23 dummy variables)
daytime: as a single categorical feature (daytime=1 from 7 a.m. to 8 p.m., and daytime=0 otherwise)

Then, try using each of the three features (on its own) with train_test_rmse to see which one performs the best!

Extract hour of the day to use as a feature.

In [ ]:

Encode `hour` as a categorical feature.

In [ ]:

Generate a `daytime` variable based on hour of the day.

In [ ]:

Test the root mean squared error of our various `hour` encodings.

In [ ]:

Bonus Material: Regularization

Regularization is a method for "constraining" or "regularizing" the size of the coefficients, thus "shrinking" them toward zero.
It reduces model variance and thus minimizes overfitting.
If the model is too complex, it tends to reduce variance more than it increases bias, resulting in a model that is more likely to generalize.

Our goal is to locate the optimum model complexity, and thus regularization is useful when we believe our model is too complex.

How Does Regularization Work?

For a normal linear regression model, we estimate the coefficients using the least squares criterion, which minimizes the residual sum of squares (RSS).

For a regularized linear regression model, we minimize the sum of RSS and a "penalty term" that penalizes coefficient size.

Ridge regression (or "L2 regularization") minimizes: $\text{RSS} + \alpha \sum_{j=1}^p \beta_j^2$

Lasso regression (or "L1 regularization") minimizes: $\text{RSS} + \alpha \sum_{j=1}^p |\beta_j|$

$p$ is the number of features.
$\beta_j$ is a model coefficient.
$\alpha$ is a tuning parameter:
- A tiny $\alpha$ imposes no penalty on the coefficient size, and is equivalent to a normal linear regression model.
- Increasing the $\alpha$ penalizes the coefficients and thus shrinks them.

Lasso and Ridge Path Diagrams

A larger alpha (toward the left of each diagram) results in more regularization:

Lasso regression shrinks coefficients all the way to zero, thus removing them from the model.
Ridge regression shrinks coefficients toward zero, but they rarely reach zero.

Source code for the diagrams: Lasso regression and Ridge regression

Advice for Applying Regularization

Should features be standardized?

Yes, because otherwise, features would be penalized simply because of their scale.
Also, standardizing avoids penalizing the intercept, which wouldn't make intuitive sense.

How should you choose between lasso regression and ridge regression?

Lasso regression is preferred if we believe many features are irrelevant or if we prefer a sparse model.
Ridge can work particularly well if there is a high degree of multicollinearity in your model.
If model performance is your primary concern, it is best to try both.
Elastic net regression is a combination of lasso regression and ridge Regression.

Ridge Regression

Ridge documentation
alpha: must be positive, increase for more regularization
normalize: scales the features (without using StandardScaler)

In [ ]:

# Include dummy variables for season in the model.
feature_cols = ['temp', 'atemp', 'season_2', 'season_3', 'season_4', 'humidity']

X = bikes_dummies[feature_cols]
y = bikes_dummies.total_rentals

In [ ]:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [ ]:

# alpha=0 is equivalent to linear regression.
from sklearn.linear_model import Ridge

# Instantiate the model.
#(Alpha of zero has no regularization strength, essentially a basic linear regression.)
ridgereg = Ridge(alpha=0, normalize=True)

# Fit the model.
ridgereg.fit(X_train, y_train)

# Predict with fitted model.
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [ ]:

# Coefficients for a non-regularized linear regression
list(zip(feature_cols, ridgereg.coef_))

To interpret these coefficients we need to convert them back to original units, which is a reason to do normalization by hand. However, in this form the coefficients have a special meaning. The intercept is now the average of our outcome, and the magnitude of each coefficient in the model is a measure of how important it is in the model. We call this feature importance.

In [ ]:

# Try alpha=0.1.
ridgereg = Ridge(alpha=0.1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [ ]:

# Examine the coefficients.
list(zip(feature_cols, ridgereg.coef_))

While the MSE barely improved, we can see there are significant changes in the weight of our coefficients. Particularly season_2 whose coefficient has greatly decreased toward 0.

Fitting and using a Lasso Regression in scikit-learn is very similar.

In addition to the typical lasso and ridge there is a third type of regression, Elastic Net which combines the penalties of the ridge and lasso methods.

Comparing Linear Regression With Other Models

Advantages of linear regression:

Simple to explain.
Highly interpretable.
Model training and prediction are fast.
No tuning is required (excluding regularization).
Features don't need scaling.
Can perform well with a small number of observations.
Well understood.

Disadvantages of linear regression:

Presumes a linear relationship between the features and the response.
Performance is (generally) not competitive with the best supervised learning methods due to high bias.
Can't automatically learn feature interactions.

In [ ]: