---
title: Intro to Regression Analysis
duration: "02:50"
creator:
    name: Ed Podojil
    city: NYC
dataset: animal weights and sleeping habit
dataset: citibike ridership and weather data
---

Intro to Regression Analysis

DS | Lesson 5

LEARNING OBJECTIVES

After this lesson, you will be able to:

  • Define data modeling and simple linear regression

  • Build a linear regression model with the scikit-learn library, using a dataset that meets the linearity assumption

  • Understand and identify multicollinearity in a multiple regression.

STUDENT PRE-WORK

Before this lesson, you should already be able to:

  • Effectively show correlations between an independent variable x and a dependent variable y

  • Be familiar with the get_dummies function in pandas

  • Understand the difference between vectors, matrices, Series, and DataFrames

  • Understand the concepts of outliers and distance.

  • Be able to interpret p values and confidence intervals

INSTRUCTOR PREP

Before this lesson, instructors will need to:

  • Review the notebook of [linear regression examples](./code/Linear%20Regression%20with%20Statsmodels%20and%20Scikit-Learn.ipynb), which uses both statsmodels and scikit-learn and includes quadratic and exponential regressions.

LESSON GUIDE

| TIMING | TYPE | TOPIC |
| --- | --- | --- |
| 5 min | Opening | Where are we in the Data Science Workflow? |
| 10 min | Introduction | Simple Linear Regression |
| 10 min | Demo | Regressing and Normal Distributions |
| 15 min | Guided Practice | Seaborn & Single Variable Linear Model Plots |
| 10 min | Introduction | Single Regression Analysis in sklearn |
| 20 min | Demo | Significance is Key |
| 15 min | Guided Practice | Using the LinearRegression Object |
| 20 min | Independent Practice | Base Linear Regression Classes |
| 10 min | Introduction | Multiple Regression Analysis |
| 15 min | Guided Practice | Multicollinearity with Dummy Variables |
| 15 min | Guided Practice | Combining Non-Correlated Features |
| 25 min | Independent Practice | Building Models for Other Y Variables |
| 5 min | Conclusion | Topic Review |

Opening (5 mins)

Where are we in the data science workflow?

The data we are working with for this lesson has been acquired and parsed. Through today's process we will be refining the data and building models, using some plotting to represent the results.

Introduction: Simple Linear Regression (10 mins)

It starts with a simple correlation

A linear regression explains a continuous variable given a series of independent variables. In its simplest form, a linear regression reminds us of a basic algebraic function: a line of best fit:

y = mx + b

That is: given some value x, its power in explanation m, and a starting point b, explain the value y.

However, the power of a linear regression is that we can use linear algebra to explain multiple x's together in order to explain y:

y = betas * X + alpha (+ error)

Our terminology is now:

Given a matrix X, its relative coefficients beta, and a y-intercept alpha, explain a dependent vector, y.
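
As a quick illustration of the matrix form (a hypothetical example; the values of X, beta, and alpha are made up purely for illustration):

```python
import numpy as np

# Hypothetical data: three observations, two explanatory variables
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
beta = np.array([0.7, -0.2])   # one coefficient per column of X
alpha = 1.5                    # y-intercept

# The linear regression's explanation of y: X times beta, plus alpha
y_hat = X.dot(beta) + alpha
print(y_hat)
```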

A linear regression works best when:

  • The data is normally distributed (but doesn't have to be)

  • The Xs significantly explain y (have low p-values)

  • The Xs are independent of each other (low multicollinearity)

  • The resulting model passes linear assumptions (dependent on the problem)

Check: What is linear regression and when can it be applied?

Demo: Regressing and normal distributions (10 mins)

When working with linear regressions, it helps to have data with normal distributions. Linear regressions have linear solutions, and we want this linear solution to explain the majority, "normal" part of our data; not the outliers! If the data is not normally distributed, the model could introduce bias, a term we will be discussing in more detail later on in the course.

For example, let's look at explaining the relationship between an animal's body weight and its brain weight.

```python
# Import libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Read in the mammal dataset, and drop missing data
mammals = pd.read_csv('../data/msleep.csv')
mammals = mammals[mammals.brainwt.notnull()].copy()

# Create a matplotlib figure
plt.figure()

# Generate a scatterplot inside the figure
plt.plot(mammals.bodywt, mammals.brainwt, '.')

# Show the plot
plt.show()
```

01-mammals-plot

In the plot, it's apparent that there is a relationship between the two values, but as it stands, the relationship is not linear. Using the seaborn library, we can plot the linear regression fit for these two variables:

```python
# Generate a plot of a single variable linear model of brain weight given body weight
sns.lmplot(x='bodywt', y='brainwt', data=mammals)
```

02-mammals-plot

Notice:

  1. The lmplot() function fits a straight line, which is what makes this a linear solution. If we had multiple variables, the solution would be a linear plane.

  2. The linear solution does explain a portion of the data well, but because both bodywt and brainwt follow log-log distributions, outliers affect the fit poorly. We can see this from the wide and inconsistently shaped confidence intervals that seaborn's lmplot generates.

Because both values are a log-log distribution, some math properties allow us to transform them into normal distributions. Then, we can solve for the linear regression!

```python
# Create a new data set that converts all numeric variables into log10
log_mammals = mammals[['bodywt', 'brainwt']].apply(np.log10)
sns.lmplot(x='bodywt', y='brainwt', data=log_mammals)
```

03-mammals-plot

Check: Does this explain the animal's brain weight better or worse than the original data?

Even though we changed the way the data was shaped, this is still a linear result: it's just linear in the log10 of the data, instead of in the data's natural state.
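
To interpret a prediction from the log-space fit on the original scale, invert the transform. A minimal sketch (assuming the log_mammals frame and imports above; the body weight of 50 is an illustrative value):

```python
# Fit a line in log10 space with numpy, then invert the transform
# (10 ** prediction) to get back to the original units.
slope, intercept = np.polyfit(log_mammals['bodywt'], log_mammals['brainwt'], 1)

log_pred = slope * np.log10(50.0) + intercept   # illustrative body weight of 50
print(10 ** log_pred)                           # predicted brain weight, original scale
```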

Guided Practice: Using Seaborn to generate single variable linear model plots (15 mins)

Update and complete the code below to use lmplot and display correlations between body weight and two dependent variables: sleep_rem and awake.

```python
log_columns = ['bodywt', 'brainwt',]  # any others?
log_mammals = mammals.copy()
log_mammals[log_columns] = log_mammals[log_columns].apply(np.log10)

# Complete below for sleep_rem and awake as a y, with variables you've already used as x.
sns.lmplot(x=x, y=y, data=mammals)
sns.lmplot(x=x, y=y, data=log_mammals)
```

Check: Were you successfully able to use the code to determine correlations between body weight and the variables?

Introduction: Single Regression Analysis in sklearn (10 mins)

Defining model objects

When modeling with sklearn, you'll use the following base principles.

  1. All sklearn estimators (modeling classes) share the same base estimator interface. This allows you to easily rotate through estimators without changing much code.

  2. All estimators take a matrix, X, either sparse or dense.

  3. Many estimators also take a vector, y, when working on a supervised machine learning problem. Regressions are supervised learning because we already have examples of y given X.

  4. All estimators have parameters that can be set. This allows for customization and a higher level of control over the learning process. The parameters are appropriate to each estimator algorithm.

```python
# Generate an instance of an estimator class
estimator = base_models.AnySKLearnObject()
# Fit your data
estimator.fit(X, y)
# Score it with the default scoring method (recommended to use the metrics module in the future)
estimator.score(X, y)
# Predict a new set of data
estimator.predict(new_X)
# Transform a new X if changes were made to the original X while fitting
estimator.transform(new_X)
```

For today, our LinearRegression() estimator does not have a transform function... but some estimators do! We will not be using it today.

With this information, we can build a simple process for linear regressions that takes advantage of a feature_selection function and the linear regression estimator, as well as get familiar with how to implement parameters.

Check: Describe some of the base principles for sklearn model objects.

Demo: Significance is Key (20 mins)

With the sklearn library, we can generate an sklearn model object and explore important evaluation values for linear regression.

```python
# Import sklearn modules
from sklearn import feature_selection, linear_model

def get_linear_model_metrics(X, y, algo):
    # Get the p-values of X given y. Ignore the f-statistic for now.
    pvals = feature_selection.f_regression(X, y)[1]
    # Start with an empty linear regression object;
    # .fit() runs the linear regression function on X and y
    algo.fit(X, y)
    residuals = (y - algo.predict(X)).values
    # Print the necessary values
    print('P Values:', pvals)
    print('Coefficients:', algo.coef_)
    print('y-intercept:', algo.intercept_)
    print('R-Squared:', algo.score(X, y))
    # Plot a histogram of the residuals
    plt.figure()
    plt.hist(residuals, bins=int(np.ceil(np.sqrt(len(y)))))
    # Keep the model
    return algo

X = mammals[['bodywt']]
y = mammals['brainwt']
lm = linear_model.LinearRegression()
lm = get_linear_model_metrics(X, y, lm)
```

```
P Values: [  9.15540205e-26]
Coefficients: [ 0.00096395]
y-intercept: 0.0859173102936
R-Squared: 0.871949198087
```

04-mammals-residuals-hist

Check: What does our output tell us?

Our output tells us that:

  • The relationship between bodywt and brainwt isn't random (p value approaching 0)

  • The model explains, roughly, 87% of the variance of the dataset (the largest errors being in the large brain and body sizes)

  • With this current model, brainwt is roughly bodywt * 0.00096395

  • The residuals, or errors in the prediction, are not normally distributed, with outliers on the right. A better fit will have roughly normally distributed errors.

Evaluating Fit, Evaluating Sense

Although we know there is a better solution to the model, we should first check whether the current one makes sense. For example, given this model, what is an animal's brainwt if their bodywt is 0?

```python
print(lm.predict([[0]]))
```

```
array([ 0.08591731])
```

Check: What would we expect an animal's brainwt to be if their bodywt is 0?

In linear modeling we call this part of the linear assumption. Consider it a test of the model. If an animal's body weighs nothing, we expect its brain to be nonexistent. Given that, we can improve the model by telling sklearn's LinearRegression object that we do not want to fit a y-intercept.

```python
lm = linear_model.LinearRegression(fit_intercept=False)
lm = get_linear_model_metrics(X, y, lm)
print(lm.predict([[0]]))
```

```
P Values: [  9.15540205e-26]
Coefficients: [ 0.00098291]
y-intercept: 0.0
R-Squared: 0.864418807451
[ 0.]
```

  • Now, the model fits where brainwt = 0, bodywt = 0.

  • Because we start at 0, the large outliers have a greater effect, so the coefficient has increased.

  • Fitting to this linear assumption also explains slightly less of the variance.

Check: Is this a better or worse model? Why?

Guided Practice: Using the LinearRegression object (15 mins)

We learned earlier that the data in its current state does not allow for the best linear regression fit. With a partner, generate two more models using the log-transformed data to see how this transform changes the model's performance. Complete the following code to update X and y to match the log-transformed data. Complete the loop by setting the list to be one True and one False.

```python
X =
y =
loop = []
for boolean in loop:
    print('y-intercept:', boolean)
    lm = linear_model.LinearRegression(fit_intercept=boolean)
    get_linear_model_metrics(X, y, lm)
    print()
```
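
One possible completion, as an instructor reference (a sketch assuming the log-transformed log_mammals data built earlier):

```python
# One possible completion: model log10(brainwt) from log10(bodywt),
# once with and once without a fitted y-intercept.
X = log_mammals[['bodywt']]
y = log_mammals['brainwt']
loop = [True, False]
for boolean in loop:
    print('y-intercept:', boolean)
    lm = linear_model.LinearRegression(fit_intercept=boolean)
    get_linear_model_metrics(X, y, lm)
    print()
```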

Check: Out of the four, which model performed the best? The worst? Why?

Independent Practice: Base linear regression classes (20 minutes)

Next class we'll go into further detail on other regression techniques, but for now, experiment with the model evaluation function we have (get_linear_model_metrics) and the following sklearn estimator classes to show how easy it is to swap in different estimators (a sketch follows the list):

  • linear_model.Lasso()

  • linear_model.Ridge()

  • linear_model.ElasticNet()
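
A minimal sketch of that experiment (assuming X, y, and get_linear_model_metrics from the earlier demo; each estimator is left at its default parameters):

```python
# Rotate through several linear estimators using the same evaluation helper.
from sklearn import linear_model

estimators = [
    linear_model.Lasso(),
    linear_model.Ridge(),
    linear_model.ElasticNet(),
]

for estimator in estimators:
    print(type(estimator).__name__)
    get_linear_model_metrics(X, y, estimator)
    print()
```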

Check: Did the implementation run without error? What were the r-squared outputs for each estimator?

Introduction: Multiple Regression Analysis (10 minutes)

In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables. For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc. For a linear regression, we want these variables to be largely independent of each other, but all of them should help explain the y variable.

We'll work with bike-share data to showcase what this means and to explain a concept called multicollinearity.

What is Multicollinearity?

With the bike share data, let's compare three fields: actual temperature, "feel" temperature, and guest ridership. Our data is already normalized between 0 and 1, so we'll start off with the correlations and modeling.

```python
bike_data = pd.read_csv('data/bikeshare.csv')
cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = bike_data[['temp', 'atemp', 'casual']].corr()
print(correlations)
print()
sns.heatmap(correlations, cmap=cmap)
```

```
            temp     atemp    casual
temp    1.000000  0.987672  0.459616
atemp   0.987672  1.000000  0.454080
casual  0.459616  0.454080  1.000000
```

05-correlation-matrix

The correlation matrix explains:

  • both temperature fields are moderately correlated to guest ridership;

  • the two temperature fields are highly correlated to each other.

Including both of these fields in a model could introduce a pain point of multicollinearity, where it's more difficult for a model to determine which feature is affecting the predicted value.

We can measure this effect in the coefficients:

```python
y = bike_data['casual']
x_sets = (
    ['temp'],
    ['atemp'],
    ['temp', 'atemp'],
)

for x in x_sets:
    print(', '.join(x))
    get_linear_model_metrics(bike_data[x], y, linear_model.LinearRegression())
    print()
```

```
temp
P Values: [ 0.]
Coefficients: [ 117.68705779]
y-intercept: -22.812739188
R-Squared: 0.21124654163

atemp
P Values: [ 0.]
Coefficients: [ 130.27875081]
y-intercept: -26.3071675481
R-Squared: 0.206188705733

temp, atemp
P Values: [ 0.  0.]
Coefficients: [ 116.34021588    1.52795677]
y-intercept: -22.8703398286
R-Squared: 0.21124723661
```

Even though the 2-variable model temp + atemp explains slightly more variance than either variable on its own, and both variables are considered significant (p-values approaching 0), we can see that their coefficients change wildly when combined. This can introduce error in how we explain models.

What happens if we use a second variable that isn't highly correlated with temperature, like humidity?
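
The code for that model is not shown above; a minimal sketch reusing the same helper (assuming hum is the humidity column in the bikeshare data):

```python
# Fit casual ridership on temperature plus humidity and compare the metrics.
x_cols = ['temp', 'hum']
print(', '.join(x_cols))
get_linear_model_metrics(bike_data[x_cols], y, linear_model.LinearRegression())
```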

```
temp, hum
P Values: [ 0.  0.]
Coefficients: [ 112.02457031  -80.87301833]
y-intercept: 30.7273338581
R-Squared: 0.310901196913
```

Temperature's coefficient stays roughly the same, and the output still makes logical sense: for guest riders we expect a positive relationship with temperature and a negative relationship with humidity, and our model suggests as much.

Check: What is multicollinearity? Why might this cause problems in a model?

Guided Practice: Multicollinearity with dummy variables (15 mins)

There can be a similar effect from a feature set that forms a singular matrix, which happens when there is an exact linear relationship among the columns (for example, a set of dummy columns that sums to 1 in every row).
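
To see why, here is a quick illustrative check (a sketch assuming the bike_data frame loaded earlier): the four weather dummies sum to 1 in every row, so adding an intercept column leaves the matrix rank-deficient.

```python
# Dummy columns that sum to 1 in every row are linearly dependent with an
# intercept column, so the resulting design matrix is rank-deficient (singular).
import numpy as np
import pandas as pd

weather = pd.get_dummies(bike_data.weathersit)                      # columns 1, 2, 3, 4
with_intercept = np.column_stack([np.ones(len(weather)), weather])

print(with_intercept.shape[1])                 # 5 columns
print(np.linalg.matrix_rank(with_intercept))   # but only rank 4
```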

Run through the following code on your own. What happens to the coefficients when you include all weather situations instead of just including all except one?

```python
lm = linear_model.LinearRegression()
weather = pd.get_dummies(bike_data.weathersit)
get_linear_model_metrics(weather[[1, 2, 3, 4]], y, lm)
print()

# Drop the least significant, weather situation = 4
get_linear_model_metrics(weather[[1, 2, 3]], y, lm)
```

```
P Values: [  3.75616929e-73   3.43170021e-22   1.57718666e-55   2.46181288e-01]
Coefficients: [  4.05237297e+12   4.05237297e+12   4.05237297e+12   4.05237297e+12]
y-intercept: -4.05237297302e+12
R-Squared: 0.0233498651216

P Values: [  3.75616929e-73   3.43170021e-22   1.57718666e-55]
Coefficients: [ 37.87876398  26.92862383  13.38900634]
y-intercept: 2.66666666652
R-Squared: 0.0233906873841
```

Check: Are students able to explain how coefficients changed once all the weather situations were included?

This model makes more sense, because each coefficient can be interpreted relative to the category we left out. For example, it suggests that a clear day (weathersit: 1) on average brings in about 38 more riders hourly than a day with heavy snow. In fact, since the weather situations "degrade" in quality (1 is the nicest day, 4 is the worst), the coefficients now reflect that well. However, at this point there is still a lot of work to do, because weather on its own fails to explain ridership well.

Guided Practice: Combining non-correlated features into a better model (15 mins)

With a partner, complete this code together and visualize the correlations of all the numerical features built into the data set.

We want to:

  • Add the three significant weather situations into our current model

  • Find two more features that are not correlated with current features, but could be strong indicators for predicting guest riders.

```python
lm = linear_model.LinearRegression()
bikemodel_data = bike_data.join()  # add in the three weather situations

cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations =  # what are we getting the correlations of?
print(correlations)
print()
sns.heatmap(correlations, cmap=cmap)

columns_to_keep = []  # [which_variables?]
final_feature_set = bikemodel_data[columns_to_keep]
get_linear_model_metrics(final_feature_set, y, lm)
```
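
One possible completion, as an instructor reference (a sketch; hum and windspeed are just one reasonable choice of extra features, assuming those columns exist in the bikeshare data):

```python
# One possible completion: join the three significant weather dummies and add
# humidity and windspeed as additional, weakly correlated features.
lm = linear_model.LinearRegression()
bikemodel_data = bike_data.join(weather[[1, 2, 3]])

cmap = sns.diverging_palette(220, 10, as_cmap=True)
correlations = bikemodel_data[['temp', 'hum', 'windspeed', 1, 2, 3, 'casual']].corr()
print(correlations)
sns.heatmap(correlations, cmap=cmap)

columns_to_keep = ['temp', 'hum', 'windspeed', 1, 2, 3]
final_feature_set = bikemodel_data[columns_to_keep]
get_linear_model_metrics(final_feature_set, y, lm)
```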

Check: Were groups able to add all three conditions into the model? Did they come up with two additional predictive features?

Independent Practice: Building models for other y variables (25 minutes)

We've completed a model together that explains casual guest riders. It's now your turn to build another model using a different y variable: registered riders (a starter sketch follows the checklist below).

Pay attention to:

  • the distribution of riders (should we rescale the data?)

  • checking correlations with variables and registered riders

  • having a feature space (our matrix) with low multicollinearity

  • model complexity vs explanation of variance: at what point do features in a model stop improving r-squared?

  • the linear assumption: given all feature values of 0, should we have no ridership? Negative ridership? Positive ridership?
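
A possible starting point (a sketch, not a solution; it assumes bike_data from earlier and the registered column in the bikeshare data):

```python
# Inspect the target's distribution and its correlations with the other
# numeric columns before choosing features for the registered-riders model.
y_reg = bike_data['registered']

plt.figure()
plt.hist(y_reg, bins=50)   # skewed? consider rescaling or transforming
plt.show()

print(bike_data.select_dtypes(include='number').corr()['registered'].sort_values())
```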

Bonus

  • Which variables would make sense to dummy (because they are categorical, not continuous)?

  • What features might explain ridership but aren't included in the data set?

    • Is there a way to build these using pandas and the features available?

Outcomes: If your model at least improves upon the original model and the explanatory effects (coefficients) make sense, consider this a complete task. If your model has an r-squared above .4, it is a relatively effective model for the data available. Kudos!

Conclusion (5 mins)

  • How do you dummy a categorical variable?

    • How do you avoid a singular matrix?

  • What is a single linear regression?

  • What makes multi-variable regressions more useful?

    • What challenges do they introduce?


BEFORE NEXT CLASS

ADDITIONAL RESOURCES

  • Add your own resources.

  • Go crazy.

  • So much room for bullets!