Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_05/code/linear_regression_sklearn_intro - this is in lesson 6.ipynb
1904 views
Kernel: Python 3

Linear Regression

Authors: Kevin Markham (Washington, D.C.), Ed Podojil (New York City)

Learning Objectives

  • Define data modeling and simple linear regression.

  • Build a linear regression model using a data set that meets the linearity assumption using the scikit-learn library.

  • Understand and identify multicollinearity in a multiple regression.

Lesson Guide

Introduce the Bikeshare Data Set


We'll be working with a data set from Capital Bikeshare that was used in a Kaggle competition (data dictionary).

The objective of the competition is to predict total ridership of Capital Bikeshare in any given hour.

Demand forecasting is a common data science application. If we can predict the quantity of demand, total ridership in a given hour, we can create analytical tools to improve the bikeshare system. Some applications would be:

  • Find where to site new bikeshare stations and know how large of a station to build.

  • Calculate the expected wear and tear on bikes and what the replacement costs will be.

  • Use a slightly different research design to forecast full and empty stations and send a service vehicle to "rebalance" the bikes from one station to another, as sometimes bikeshare stations have no bikes or are completely full and prevent use of the station.

Businesses aren't new to demand forecasting, but older methods suffered from poor predictions at atypical small locations. Modern approaches incorporate clusters and online data from Twitter and Google Trends to improve prediction in these small locations.

import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline plt.rcParams['figure.figsize'] = (8, 6) plt.rcParams['font.size'] = 14 plt.style.use("fivethirtyeight")

Read In the Capital Bikeshare Data

# Read the data and set the datetime as the index. url = '../assets/dataset/bikeshare/bikeshare.csv' bikes = pd.read_csv(url, parse_dates=True)

Notice that we used index_col to set an index or primary key for our data. In this case, the index of each row will be set to the value of its datetime field.

We also ask Pandas to parse dates (if parse_dates=True, for the index only). So, rather than reading in a string, Pandas converts the index string to a datetime object.

# Preview the first five rows of the DataFrame. bikes.head()

What does each observation represent?

# A:

What is the response variable (as defined by Kaggle)?

# A:

How many features are there?

# A:
VariableDescription
datetimehourly date + timestamp
season1=winter, 2=spring, 3=summer, 4=fall
holidaywhether the day is considered a holiday
workingdaywhether the day is neither a weekend nor holiday
weatherSee Below
temptemperature in Celsius
atemp"feels like" temperature in Celsius
humidityrelative humidity
windspeedwind speed
casualnumber of non-registered user rentals initiated
registerednumber of registered user rentals initiated
countnumber of total rentals

Details on Weather Variable

1: Clear, Few clouds, Partly cloudy, Partly cloudy

2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog

"count" is a method in Pandas (and a very non-specific name), so it's best to name that column something else

In general, you may want to rename columns if it is not obvious what might be stored in them. Although we will only rename the target column here, a few examples might be to rename:

old namenew name
temptemp_celcius
windspeedwindspeed_knots
casualnum_casual_users
registerednum_registered_users
seasonseason_num
holidayis_holiday
workingdayis_workingday
humidityhumidity_percent

Without having to check, these new names make it obvious what is stored in each column. The downside is slightly longer column names, which could affect table readability in Jupyter. It would be ideal to use very specific names in CSV files to assist others reading them. In your own code, use whatever makes sense for your work -- if you are viewing lots of Pandas tables, you may want to use shorter names. However, readable specific names are preferred in Python code since it prevents mistakes.

# Use the .rename() method to rename count to total bikes.rename(columns={'cnt':'total_rentals'}, inplace=True)

Visualizing the Data

It is important to have a general feeling for what the data looks like before building a model. Ideally, before creating the model you would have some sense of which variables might matter most to predict the response. This dataset is fairly intuitive (and the purpose of this lesson is not visualization), so we will keep the visualization short.

# Pandas scatterplot bikes.plot(kind='scatter', x='temp', y='total_rentals', alpha=0.2);
Image in a Jupyter notebook
# Seaborn scatterplot with regression line sns.lmplot(x='temp', y='total_rentals', data=bikes, aspect=1.5, scatter_kws={'alpha':0.2});
Image in a Jupyter notebook

Linear Regression Basics


Form of Linear Regression

Recall that each model always contains some amount of random irreducible error ϵ\epsilon. So, given a prediction y^\hat{y}, the actual y=y^+ϵy = \hat{y} + \epsilon. Below, we will assume yy is exactly linear.

  • We are often taught the formula for a line is: y=mx+by = mx + b.

  • Note this can alternatively be written: y=α+βXy = \alpha + \beta X.


Here, we will generalize this to nn independent variables as follows:

y=β0+β1x1+β2x2+...+βnxn+ϵy = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n + \epsilon

  • yy is the response.

  • β0\beta_0 is the intercept.

  • β1\beta_1 is the coefficient for x1x_1 (the first feature).

  • βn\beta_n is the coefficient for xnx_n (the nth feature).

  • ϵ\epsilon is the error term

A practical example of this applied to our data might be:

total_rides=20+−2⋅temp+−3⋅windspeed + ... + 0.1⋅registeredtotal\_rides = 20 + -2 \cdot temp + -3 \cdot windspeed\ +\ ...\ +\ 0.1 \cdot registered

This equation is still called linear because the highest degree of the independent variables (e.g. xix_i) is 1. Note that because the β\beta values are constants, they will not be independent variables in the final model, as seen above.


The β\beta values are called the model coefficients:

  • These values are estimated (or "learned") during the model fitting process using the least squares criterion.

  • Specifically, we are trying to find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").

  • Once we've learned these coefficients, we can use the model to predict the response.

Estimating coefficients

In the diagram above:

  • The black dots are the observed values of x and y.

  • The blue line is our least squares line.

  • The red lines are the residuals, which are the vertical distances between the observed values and the least squares line.

Overview of Supervised Learning


Supervised learning diagram

Benefits and Drawbacks of scikit-learn

Benefits:

  • Consistent interface to machine learning models.

  • Provides many tuning parameters but with sensible defaults.

  • Exceptional documentation.

  • Rich set of functionality for companion tasks.

  • Active community for development and support.

Potential drawbacks:

  • Harder (than R) to get started with machine learning.

  • Less emphasis (than R) on model interpretability.

    • scikit-learn tends not to run detailed statistical tests, e.g. ANOVA.

    • For more detail on model fit, try the statsmodels library.

Ben Lorica: Six Reasons Why I Recommend scikit-learn

Requirements for Working With Data in scikit-learn

  1. Features and response should be separate objects.

  2. Features and response should be entirely numeric.

  3. Features and response should be NumPy arrays (or easily converted to NumPy arrays).

  4. Features and response should have specific shapes (outlined below).

Building a Linear Regression Model in sklearn

Create a feature matrix called X that holds a DataFrame with only the temp variable and a Series called y that has the "total_rentals" column.

# Create X and y. feature_cols = ['temp'] X = bikes[feature_cols] y = bikes.total_rentals
# Check X's type. print((type(X))) print((type(X.values)))
<class 'pandas.core.frame.DataFrame'> <class 'numpy.ndarray'>
# Check y's type. print((type(y))) print((type(y.values)))
<class 'pandas.core.series.Series'> <class 'numpy.ndarray'>
# Check X's shape (n = number of observations, p = number of features). print((X.shape))
(17379, 1)
# Check y's shape (single dimension with length n). # The comma indicates the datatype is a tuple. print((y.shape))
(17379,)

scikit-learn's Four-Step Modeling Pattern

Step 1: Import the class you plan to use.

from sklearn.linear_model import LinearRegression

Step 2: "Instantiate" the "estimator."

  • "Estimator" is scikit-learn's term for "model."

  • "Instantiate" means "make an instance of."

# Make an instance of a LinearRegression object. lr = LinearRegression() type(lr)
sklearn.linear_model.base.LinearRegression
  • Created an object that "knows" how to do linear regression, and is just waiting for data.

  • Name of the object does not matter.

  • All parameters not specified are set to their defaults.

  • Can specify tuning parameters (aka "hyperparameters") during this step.

To view the possible parameters, either use the help built-in function or evaluate the newly instantiated model, as follows:

# help(lr) lr
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Step 3: Fit the model with data (aka "model training").

  • Model is "learning" the relationship between X and y in our "training data."

  • Process through which learning occurs varies by model.

  • Occurs in-place.

lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
  • Once a model has been fit with data, it's called a "fitted model."

Step 4: Predict the response for a new observation.

  • New observations are called "out-of-sample" data.

  • Uses the information it learned during the model training process.

# Per future warning, one-dimensional arrays must be reshaped using the following. lr.predict(np.array([0]).reshape(1,-1))
array([-0.03559611])

Let's ask the model to make two predictions, one when the temp is 0 and another when the temp is 10. To do this, our feature matrix is always a 2-D array where each row is a list of features. Since we only have a single feature, the temperature, each row will contain only a single value.

X_new = [[0], [10]] lr.predict(X_new)
array([-3.55961126e-02, 3.81291363e+03])
  • Returns a NumPy array, and we keep track of what the numbers "mean."

  • Can predict for multiple observations at once.

What we just predicted using our model is, "If the temperature is 0 degrees, the total number of bike rentals will be ~6.046, and if the temperature is 10 degrees the total number of bike rentals will ~97.751."

Build a Linear Regression Model


Let's specifically make a linear regression model and look at the intercept and coefficients.

Instantiate and fit a LinearRegression model on X and y from the linear_model section of scikit-learn.

# Import, instantiate, fit. from sklearn.linear_model import LinearRegression linreg = LinearRegression() linreg.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
# Print the coefficients. print(linreg.intercept_) print(linreg.coef_)
-0.03559611264211071 [381.29492226]

Interpreting the intercept (β0\beta_0):

  • It is the value of yy when all independent variables are 0.

  • Here, it is the estimated number of rentals when the temperature is 0 degrees Celsius.

  • Note: It does not always make sense to interpret the intercept. (Why?)

Interpreting the "temp" coefficient (β1\beta_1):

  • Interpretation: An increase of 1 degree Celcius is associated with increasing the number of total rentals by β1\beta_1.

  • Here, a temperature increase of 1 degree Celsius is associated with a rental increase of 9.17 bikes.

  • This is not a statement of causation.

  • β1\beta_1 would be negative if an increase in temperature was associated with a decrease in total rentals.

  • β1\beta_1 would be zero if temperature is not associated with total rentals.

Using the Model for Prediction


While plenty of insight can be found in reading coefficients, the most common uses of data science focus on prediction. In scikit-learn we can make predictions from a fitted model using .predict(), but we will also go through the calculation by hand to understand it.

How many bike rentals would we predict if the temperature was 25 degrees Celsius?

Explore the intercept and coefficients of the linear model.

You can search for "sklearn linear regression" and explore the attributes section of the documentation to learn how to do this.

# Manually calculate the prediction.
# Use the predict method.

Does the Scale of the Features Matter?

Let's say that temperature was measured in Fahrenheit, rather than Celsius. How would that affect the model?

# Create a new column for Fahrenheit temperature. bikes['temp_F'] = bikes.temp * 1.8 + 32 bikes.head()
# Seaborn scatterplot with regression line sns.lmplot(x='temp_F', y='total_rentals', data=bikes, aspect=1.5, scatter_kws={'alpha':0.2});
Image in a Jupyter notebook

Rebuild the LinearRegression from above using the temp_F features instead.

# Create X and y. feature_cols = ['temp_F'] X = bikes[feature_cols] y = bikes.total_rentals # Instantiate and fit. linreg = LinearRegression() linreg.fit(X, y) # Print the coefficients. print(linreg.intercept_) print(linreg.coef_)
-6778.611991831338 [211.83051237]

Convert 25 degrees Celsius to Fahrenheit.

25 * 1.8 + 32
77.0

Predict rentals for 77 degrees Fahrenheit.

linreg.predict(77)
array([9532.33746037])

Conclusion: The scale of the features is irrelevant for linear regression models. When changing the scale, we simply change our interpretation of the coefficients.

# Remove the temp_F column. bikes.drop('temp_F', axis=1, inplace=True)

Work With Multiple Features


We've demonstrated simple linear regression with one feature to gain an intuition, but the benefit of modeling is the ability to reason about hundreds of features at once. There is no limit to the number of features you can use. However, often a small set of features accounts for most of the variance (assuming there is a linear relationship at all). We will start by using four features.

Visualizing the Data (Part 2)

Explore more features.

# Create feature column variables feature_cols = ['temp', 'season', 'weather', 'humidity']

Create a subset of scatterplot matrix using Seaborn.

We can use pairplot with the y_vars argument to only show relationships with the total_rentals variable

# multiple scatterplots in Seaborn sns.pairplot(bikes, x_vars=feature_cols, y_vars='total_rentals', kind='reg');
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 2524 try: -> 2525 return self._engine.get_loc(key) 2526 except KeyError: pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'weather' During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-32-6103d56d2b6a> in <module>() 1 # multiple scatterplots in Seaborn ----> 2 sns.pairplot(bikes, x_vars=feature_cols, y_vars='total_rentals', kind='reg'); ~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, size, aspect, dropna, plot_kws, diag_kws, grid_kws) 2074 elif kind == "reg": 2075 from .regression import regplot # Avoid circular import -> 2076 plotter(regplot, **plot_kws) 2077 2078 # Add a legend ~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py in map(self, func, **kwargs) 1301 1302 color = self.palette[k] if kw_color is None else kw_color -> 1303 func(data_k[x_var], data_k[y_var], 1304 label=label_k, color=color, **kwargs) 1305 ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key) 2137 return self._getitem_multilevel(key) 2138 else: -> 2139 return self._getitem_column(key) 2140 2141 def _getitem_column(self, key): ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key) 2144 # get column 2145 if self.columns.is_unique: -> 2146 return self._get_item_cache(key) 2147 2148 # duplicate columns & possible reduce dimensionality ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item) 1840 res = cache.get(item) 1841 if res is None: -> 1842 values = self._data.get(item) 1843 res = self._box_item_values(item, values) 1844 cache[item] = res ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath) 3841 3842 if not isna(item): -> 3843 loc = self.items.get_loc(item) 3844 else: 3845 indexer = np.arange(len(self.items))[isna(self.items)] ~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 2525 return self._engine.get_loc(key) 2526 except KeyError: -> 2527 return self._engine.get_loc(self._maybe_cast_indexer(key)) 2528 2529 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'weather'
Image in a Jupyter notebook

Recreate the same functionality using Pandas.

# Multiple scatterplots in Pandas fig, axs = plt.subplots(1, len(feature_cols), sharey=True) for index, feature in enumerate(feature_cols): bikes.plot(kind='scatter', x=feature, y='total_rentals', ax=axs[index], figsize=(16, 3))
# alternative way in Pandas (might take a while) # scatter_matrix does a pairplot of *every* column grr = pd.tools.plotting.scatter_matrix(bikes[['total_rentals'] + feature_cols], figsize=(15, 15), alpha=0.7)

Are you seeing anything you didn't expect?

Explore the season variable using a cross-tab.

# Cross-tabulation of season and month pd.crosstab(bikes.season, bikes.index.month)

Explore the season variable using a box plot.

# Box plot of rentals, grouped by season bikes.boxplot(column='total_rentals', by='season');

Notably:

  • A line can't capture a nonlinear relationship.

Look at rentals over time.

# Line plot of rentals bikes.total_rentals.plot();

What does this tell us?

There are more rentals in the winter than the spring, but only because the system is experiencing overall growth and the winter months happen to come after the spring months.

Look at the correlation matrix for the bikes DataFrame.

# Correlation matrix (ranges from 1 to -1) bikes.corr()

Use a heat map to make it easier to read the correlation matrix.

# Visualize correlation matrix in Seaborn using a heat map. sns.heatmap(bikes.corr())

What relationships do you notice?

# A:

Adding More Features to the Model

In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables.

  • For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc.

  • For a linear regression, we want these variables to be largely independent of one another, but all of them should help explain the y variable.

We'll work with bikeshare data to showcase what this means and to explain a concept called multicollinearity.

Create another LinearRegression instance that is fit using temp, season, weather, and humidity.

# Create a list of features. feature_cols = ['temp', 'season', 'weather', 'humidity']
# Create X and y. X = bikes[feature_cols] y = bikes.total_rentals # Instantiate and fit. linreg = LinearRegression() linreg.fit(X, y) # Print the coefficients. print(linreg.intercept_) print(linreg.coef_)

Display the linear regression coefficient along with the feature names.

# Pair the feature names with the coefficients. list(zip(feature_cols, linreg.coef_))

Interpreting the coefficients:

  • Holding all other features fixed, a 1-unit increase in temperature is associated with a rental increase of 7.86 bikes.

  • Holding all other features fixed, a 1-unit increase in season is associated with a rental increase of 22.5 bikes.

  • Holding all other features fixed, a 1-unit increase in weather is associated with a rental increase of 6.67 bikes.

  • Holding all other features fixed, a 1-unit increase in humidity is associated with a rental decrease of 3.12 bikes.

Does anything look incorrect and does not reflect reality?

What Is Multicollinearity?


Multicollinearity happens when two or more features are highly correlated with each other. The problem is that due to the high correlation, it's hard to disambiguate which feature has what kind of effect on the outcome. In other words, the features mask each other.

There is a second related issue called variance inflation where including correlated features increases the variability of our model and p-values by widening the standard errors. This can be measured with the variance inflation factor, which we will not cover here.

With the bikeshare data, let's compare three data points: actual temperature, "feel" temperature, and guest ridership.

cmap = sns.diverging_palette(220, 10, as_cmap=True) correlations = bikes[['temp', 'atemp', 'casual']].corr() print(correlations) print(sns.heatmap(correlations, cmap=cmap))

Create a linear model that predicts total_rentals using temp and atemp.

# Create a list of features. feature_cols = ['temp', 'atemp']
# Create X and y. X = bikes[feature_cols] y = bikes.total_rentals # Instantiate and fit. linreg = LinearRegression() linreg.fit(X, y) # Print the coefficients. print(linreg.intercept_) print(linreg.coef_)

Go back and remove either temp or atemp from the feature list. How do the coefficients change?

# A:

How to Select a Model


We can make linear models now, but how do we select the best model to use for our applications? We will offer a general procedure and a simple metric that works well in many cases. That said, it's important to keep the business context in mind and know that there are alternative metrics that can work better.

Feature Selection

How do we choose which features to include in the model? We're going to use train/test split (and eventually cross-validation).

Why not use p-values or R-squared for feature selection?

  • Linear models rely upon a lot of assumptions (such as the features being independent), and if those assumptions are violated, p-values and R-squared are less reliable. Train/test split relies on fewer assumptions.

  • If all of the assumptions of a linear model are met, p-values suggest a coefficient that differs from zero at a level of statistical significance. This does not mean that

    1. the feature causes the response

    2. the feature strongly predicts the response.

  • Adding features to your model that are unrelated to the response will always increase the R-squared value, and adjusted R-squared does not sufficiently account for this (although, AIC and BIC do).

  • p-values and R-squared are proxies for our goal of generalization, whereas train/test split and cross-validation attempt to directly estimate how well the model will generalize to out-of-sample data.

More generally:

  • There are different methodologies that can be used for solving any given data science problem, and this course follows a machine learning methodology.

  • This course focuses on general purpose approaches that can be applied to any model, rather than model-specific approaches.

Evaluation Metrics for Regression Problems

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. We need evaluation metrics designed for comparing continuous values.

Here are three common evaluation metrics for regression problems:

Mean absolute error (MAE) is the mean of the absolute value of the errors:

1n∑i=1n∣yi−y^i∣\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|

Mean squared error (MSE) is the mean of the squared errors:

1n∑i=1n(yi−y^i)2\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2

Root mean squared error (RMSE) is the square root of the mean of the squared errors:

1n∑i=1n(yi−y^i)2\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}
# Example true and predicted response values true = [10, 7, 5, 5] pred = [8, 6, 5, 10]

Calculate MAE, MSE, and RMSE using imports from sklearn metrics and NumPy.

# Calculate these metrics by hand! from sklearn import metrics import numpy as np print('MAE:', metrics.mean_absolute_error(true, pred)) print('MSE:', metrics.mean_squared_error(true, pred)) print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))

Let's compare these metrics:

  • MAE is the easiest to understand, because it's the average error.

  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world. Also, MSE is continuous and differentiable, making it easier to use than MAE for optimization.

  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

Here's an additional example, to demonstrate how MSE/RMSE punishes larger errors:

# Same true values as above true = [10, 7, 5, 5] # New set of predicted values pred = [10, 7, 5, 13] # MAE is the same as before. print('MAE:', metrics.mean_absolute_error(true, pred)) # MSE and RMSE are larger than before. print('MSE:', metrics.mean_squared_error(true, pred)) print('RMSE:', np.sqrt(metrics.mean_squared_error(true, pred)))

Comparing Models With Train/Test Split and RMSE

from sklearn.model_selection import train_test_split # Define a function that accepts a list of features and returns testing RMSE. def train_test_rmse(df, feature_cols): X = df[feature_cols] y = df.total_rentals X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123) linreg = LinearRegression() linreg.fit(X_train, y_train) y_pred = linreg.predict(X_test) return np.sqrt(metrics.mean_squared_error(y_test, y_pred))
# Compare different sets of features. print(train_test_rmse(bikes, ['temp', 'season', 'weather', 'humidity'])) print(train_test_rmse(bikes, ['temp', 'season', 'weather'])) print(train_test_rmse(bikes, ['temp', 'season', 'humidity']))
# Using these as features is not allowed! print(train_test_rmse(bikes, ['casual', 'registered']))

Comparing Testing RMSE With Null RMSE

Null RMSE is the RMSE that could be achieved by always predicting the mean response value. It is a benchmark against which you may want to measure your regression model.

# Split X and y into training and testing sets. X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123) # Create a NumPy array with the same shape as y_test. y_null = np.zeros_like(y_test, dtype=float) # Fill the array with the mean value of y_test. y_null.fill(y_test.mean()) y_null
# Compute null RMSE. np.sqrt(metrics.mean_squared_error(y_test, y_null))

Feature Engineering to Improve Performance


Machine learning models are very powerful, but they cannot automatically handle every aspect of our data. We have to explicitly modify our features to have relationships that our models can understand. In this case, we will need to pull out features to have a linear relationship with our response variable.

Handling Categorical Features

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

  • Ordered categories: Transform them to sensible numeric values (example: small=1, medium=2, large=3)

  • Unordered categories: Use dummy encoding (0/1). Here, each possible category would become a separate feature.

What are the categorical features in our data set?

  • Ordered categories: weather (already encoded with sensible numeric values)

  • Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)

For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables.

Create dummy variables using get_dummies from Pandas.

season_dummies = pd.get_dummies(bikes.season, prefix='season')

Inspect the DataFrame of dummies.

# Print five random rows. season_dummies.sample(n=5, random_state=1)

However, we actually only need three dummy variables (not four), and thus we'll drop the first dummy variable.

Why? Because three dummies captures all of the "information" about the season feature, and implicitly defines spring (season 1) as the baseline level.

This circles back to the concept multicollinearity, except instead of one feature being highly correlated to another, the information gained from three features is directly correlated to the fourth.

Drop the first column.

season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

Reinspect the DataFrame of dummies.

# Print five random rows. season_dummies.sample(n=5, random_state=1)

In general, if you have a categorical feature with k possible values, you create k-1 dummy variables.

If that's confusing, think about why we only need one dummy variable for holiday, not two dummy variables (holiday_yes and holiday_no).

We now need to concatenate the two DataFrames together.

# Concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns). bikes_dummies = pd.concat([bikes, season_dummies], axis=1) # Print 5 random rows. bikes_dummies.sample(n=5, random_state=1)

Rerun the linear regression with dummy variables included.

# Include dummy variables for season in the model. feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity'] X = bikes_dummies[feature_cols] y = bikes_dummies.total_rentals linreg = LinearRegression() linreg.fit(X, y) list(zip(feature_cols, linreg.coef_))

How do we interpret the season coefficients? They are measured against the baseline (spring):

  • Holding all other features fixed, summer is associated with a rental decrease of 3.39 bikes compared to the spring.

  • Holding all other features fixed, fall is associated with a rental decrease of 41.7 bikes compared to the spring.

  • Holding all other features fixed, winter is associated with a rental increase of 64.4 bikes compared to the spring.

Would it matter if we changed which season was defined as the baseline?

  • No, it would simply change our interpretation of the coefficients.

In most situations, it is best to have the dummy that is your baseline be the category that has the largest representation.

Important: Dummy encoding is relevant for all machine learning models, not just linear regression models.

# Compare original season variable with dummy variables. print(train_test_rmse(bikes_dummies, ['temp', 'season', 'humidity'])) print(train_test_rmse(bikes_dummies, ['temp', 'season_2', 'season_3', 'season_4', 'humidity']))

Feature Engineering

See if you can create the following features:

  • hour: as a single numeric feature (0 through 23)

  • hour: as a categorical feature (use 23 dummy variables)

  • daytime: as a single categorical feature (daytime=1 from 7 a.m. to 8 p.m., and daytime=0 otherwise)

Then, try using each of the three features (on its own) with train_test_rmse to see which one performs the best!

Extract hour of the day to use as a feature.

Encode hour as a categorical feature.

Generate a daytime variable based on hour of the day.

Test the root mean squared error of our various hour encodings.

Bonus Material: Regularization


  • Regularization is a method for "constraining" or "regularizing" the size of the coefficients, thus "shrinking" them toward zero.

  • It reduces model variance and thus minimizes overfitting.

  • If the model is too complex, it tends to reduce variance more than it increases bias, resulting in a model that is more likely to generalize.

Our goal is to locate the optimum model complexity, and thus regularization is useful when we believe our model is too complex.

How Does Regularization Work?

For a normal linear regression model, we estimate the coefficients using the least squares criterion, which minimizes the residual sum of squares (RSS).

For a regularized linear regression model, we minimize the sum of RSS and a "penalty term" that penalizes coefficient size.

Ridge regression (or "L2 regularization") minimizes: RSS+α∑j=1pβj2\text{RSS} + \alpha \sum_{j=1}^p \beta_j^2

Lasso regression (or "L1 regularization") minimizes: RSS+α∑j=1p∣βj∣\text{RSS} + \alpha \sum_{j=1}^p |\beta_j|

  • pp is the number of features.

  • βj\beta_j is a model coefficient.

  • α\alpha is a tuning parameter:

    • A tiny α\alpha imposes no penalty on the coefficient size, and is equivalent to a normal linear regression model.

    • Increasing the α\alpha penalizes the coefficients and thus shrinks them.

Lasso and Ridge Path Diagrams

A larger alpha (toward the left of each diagram) results in more regularization:

  • Lasso regression shrinks coefficients all the way to zero, thus removing them from the model.

  • Ridge regression shrinks coefficients toward zero, but they rarely reach zero.

Source code for the diagrams: Lasso regression and Ridge regression

Lasso and Ridge Coefficient Plots

Advice for Applying Regularization

Should features be standardized?

  • Yes, because otherwise, features would be penalized simply because of their scale.

  • Also, standardizing avoids penalizing the intercept, which wouldn't make intuitive sense.

How should you choose between lasso regression and ridge regression?

  • Lasso regression is preferred if we believe many features are irrelevant or if we prefer a sparse model.

  • Ridge can work particularly well if there is a high degree of multicollinearity in your model.

  • If model performance is your primary concern, it is best to try both.

  • Elastic net regression is a combination of lasso regression and ridge Regression.

Ridge Regression

  • Ridge documentation

  • alpha: must be positive, increase for more regularization

  • normalize: scales the features (without using StandardScaler)

# Include dummy variables for season in the model. feature_cols = ['temp', 'atemp', 'season_2', 'season_3', 'season_4', 'humidity'] X = bikes_dummies[feature_cols] y = bikes_dummies.total_rentals
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# alpha=0 is equivalent to linear regression. from sklearn.linear_model import Ridge # Instantiate the model. #(Alpha of zero has no regularization strength, essentially a basic linear regression.) ridgereg = Ridge(alpha=0, normalize=True) # Fit the model. ridgereg.fit(X_train, y_train) # Predict with fitted model. y_pred = ridgereg.predict(X_test) print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# Coefficients for a non-regularized linear regression list(zip(feature_cols, ridgereg.coef_))

To interpret these coefficients we need to convert them back to original units, which is a reason to do normalization by hand. However, in this form the coefficients have a special meaning. The intercept is now the average of our outcome, and the magnitude of each coefficient in the model is a measure of how important it is in the model. We call this feature importance.

# Try alpha=0.1. ridgereg = Ridge(alpha=0.1, normalize=True) ridgereg.fit(X_train, y_train) y_pred = ridgereg.predict(X_test) print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# Examine the coefficients. list(zip(feature_cols, ridgereg.coef_))

While the MSE barely improved, we can see there are significant changes in the weight of our coefficients. Particularly season_2 whose coefficient has greatly decreased toward 0.

Fitting and using a Lasso Regression in scikit-learn is very similar.

In addition to the typical lasso and ridge there is a third type of regression, Elastic Net which combines the penalties of the ridge and lasso methods.

Comparing Linear Regression With Other Models

Advantages of linear regression:

  • Simple to explain.

  • Highly interpretable.

  • Model training and prediction are fast.

  • No tuning is required (excluding regularization).

  • Features don't need scaling.

  • Can perform well with a small number of observations.

  • Well understood.

Disadvantages of linear regression:

  • Presumes a linear relationship between the features and the response.

  • Performance is (generally) not competitive with the best supervised learning methods due to high bias.

  • Can't automatically learn feature interactions.