# Intro to Regression Analysis

DS | Lesson 6
### LEARNING OBJECTIVES
After this lesson, you will be able to:

- Define data modeling and simple linear regression
- Build a linear regression model with the scikit-learn library, using a dataset that meets the linearity assumption
- Understand and identify multicollinearity in a multiple regression
### STUDENT PRE-WORK
Before this lesson, you should already be able to:

- Effectively show correlations between an independent variable `x` and a dependent variable `y`
- Be familiar with the `get_dummies` function in pandas
- Understand the difference between vectors, matrices, Series, and DataFrames
- Understand the concepts of outliers and distance
- Be able to interpret p-values and confidence intervals
### INSTRUCTOR PREP
Before this lesson, instructors will need to:

- Review Final Project pt 1
- Copy and modify the lesson slide deck
- Read through datasets and starter/solution code
- Add to the "Additional Resources" section for this lesson

A notebook of [linear regression examples](./code/Linear Regression with Statsmodels and Scikit-Learn.ipynb) is available, using both statsmodels and scikit-learn, including quadratic and exponential regressions.
### LESSON GUIDE

TIMING | TYPE | TOPIC |
---|---|---|
5 min | Opening | Where are we in the Data Science Workflow? |
10 min | Introduction | Simple Linear Regression |
10 min | Demo | Regressing and Normal Distributions |
15 min | Guided Practice | Seaborn & Single Variable Linear Model Plots |
10 min | Introduction | Single Regression Analysis in sklearn |
20 min | Demo | Significance is Key |
15 min | Guided Practice | Using the LinearRegression Object |
20 min | Independent Practice | Base Linear Regression Classes |
10 min | Introduction | Multiple Regression Analysis |
15 min | Guided Practice | Multicollinearity with Dummy Variables |
15 min | Guided Practice | Combining Non-Correlated Features |
25 min | Independent Practice | Building Models for Other Y Variables |
5 min | Conclusion | Topic Review |
## Opening (5 mins)

#### Where are we in the data science workflow?

The data we are working with for this lesson has already been acquired and parsed. Today we will refine the data and build models, using some plotting to represent the results.
## Introduction: Simple Linear Regression (10 mins)

#### It starts with a simple correlation
A linear regression explains a continuous variable using a series of independent variables. In its simplest form, a linear regression looks like a basic algebraic function: a line of best fit,

`y = mx + b`

That is: given some value x, its explanatory power m, and a starting point b, explain the value y.

However, the power of a linear regression is that we can use linear algebra to combine multiple x's in order to explain y:

`y = betas * X + alpha (+ error)`

Our terminology is now: given a matrix X, its coefficients betas, and a y-intercept alpha, explain a dependent vector, y.
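To make the matrix form concrete, here is a minimal numeric sketch (the numbers are made up for illustration; no fitting is happening yet):

```python
import numpy as np

# Three observations, two explanatory variables
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
betas = np.array([0.4, -1.2])   # one coefficient per column of X
alpha = 2.0                     # the y-intercept

# With no error term, the model's explanation of y is just X·betas + alpha
y = X @ betas + alpha
print(y)   # [0.  2.2 1.4]
```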
A linear regression works best when:

- The data is normally distributed (though it doesn't have to be)
- The X's significantly explain y (have low p-values)
- The X's are independent of each other (low multicollinearity)
- The resulting values pass the linear assumptions (dependent on the problem)
Check: What is linear regression and when can it be applied?
## Demo: Regressing and Normal Distributions (10 mins)
When working with linear regressions, it helps to have data with normal distributions. Linear regressions have linear solutions, and we want this linear solution to explain the majority, "normal" part of our data; not the outliers! If the data is not normally distributed, the model could introduce bias, a term we will be discussing in more detail later on in the course.
For example, let's look at explaining the relationship between an animal's body weight and its brain weight.

In the plot, it's apparent that there is a relationship between the two values, but as it stands, it is not linear. Using the seaborn library, we can plot the linear regression fit of these two variables:
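For reference, a minimal sketch of that plot (the CSV path is hypothetical; it assumes a mammal-sleep dataset with `bodywt` and `brainwt` columns):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical path: any mammal-sleep dataset with 'bodywt' and 'brainwt' columns will do
mammals = pd.read_csv('data/msleep.csv')

# Scatter plot with the fitted regression line and its confidence band
sns.lmplot(x='bodywt', y='brainwt', data=mammals)
plt.show()
```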
Notice:

- The `lmplot()` function returns a straight line. That is why it is a linear solution. If we had multiple variables, the solution would be a linear plane.
- The linear solution does explain a portion of the data well, but because both "bodywt" and "brainwt" are log-log distributions, outliers have an outsized effect on the solution. We can see this from the wide and inconsistently shaped confidence intervals that seaborn's lmplot generates.
Because both values follow a log-log distribution, a simple transformation turns them into approximately normal distributions. Then we can solve for the linear regression!
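A sketch of that transformation, continuing with the `mammals` DataFrame from the plot above:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Log-transform both variables, then fit the regression in log10 space
mammals['log_bodywt'] = np.log10(mammals['bodywt'])
mammals['log_brainwt'] = np.log10(mammals['brainwt'])

sns.lmplot(x='log_bodywt', y='log_brainwt', data=mammals)
plt.show()
```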
Check: Does this explain the animal's brain weight better or worse than the original data?
Even though we changed the way the data was shaped, this is still a linear result: it's just linear in the log10 of the data, instead of in the data's natural state.
## Guided Practice: Using Seaborn to Generate Single-Variable Linear Model Plots (15 mins)

Update and complete the code below to use `lmplot` and display correlations between body weight and two dependent variables: `sleep_rem` and `awake`.
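One possible completion, again assuming the `mammals` DataFrame from the demo (seaborn drops missing values for you):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot body weight against each dependent variable in turn
for y_col in ['sleep_rem', 'awake']:
    sns.lmplot(x='bodywt', y=y_col, data=mammals)
plt.show()
```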
Check: Were you successfully able to use the code to determine correlations between body weight and the variables?
## Introduction: Single Regression Analysis in sklearn (10 mins)

#### Defining model objects
When modeling with sklearn, you'll rely on the following base principles:

- All sklearn estimators (modeling classes) are built on a common base estimator. This allows you to easily swap one estimator for another without changing much code.
- All estimators take a matrix, X, either sparse or dense.
- Many estimators also take a vector, y, when working on a supervised machine learning problem. Regressions are supervised learning problems because we already have examples of y given X.
- All estimators have parameters that can be set. This allows for customization and a higher level of control over the learning process. The parameters are specific to each estimator's algorithm.
- Some estimators also have a `transform` function; today's `LinearRegression()` does not, so we will not be using one.

With this information, we can build a simple process for linear regressions that takes advantage of a `feature_selection` function and the linear regression estimator, and get familiar with how to set parameters.
Check: Describe some of the base principles for sklearn model objects.
## Demo: Significance is Key (20 mins)
With the sklearn library, we can generate an sklearn model object and explore important evaluation values for linear regression.
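A minimal sketch of that demo, assuming the `mammals` DataFrame from earlier; it prints roughly the kind of values the lesson's `get_linear_model_metrics` helper is meant to report (p-values, coefficients, r-squared):

```python
from sklearn import feature_selection, linear_model

# Drop rows with missing values; sklearn needs a 2-D X and a 1-D y
data = mammals[['bodywt', 'brainwt']].dropna()
X = data[['bodywt']]
y = data['brainwt']

lm = linear_model.LinearRegression()
lm.fit(X, y)

print('p-values:   ', feature_selection.f_regression(X, y)[1])
print('coefficient:', lm.coef_)
print('intercept:  ', lm.intercept_)
print('r-squared:  ', lm.score(X, y))
```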
Check: What does our output tell us?

Our output tells us that:

- The relationship between bodywt and brainwt isn't random (p-value approaching 0)
- The model explains roughly 87% of the variance of the dataset (with the largest errors coming from the large brain and body sizes)
- With this current model, `brainwt` is roughly `bodywt * 0.00096395`
- The residuals (the errors in the prediction) are not normally distributed, with outliers on the right. A better fit will have errors that are closer to normally distributed.
#### Evaluating Fit, Evaluating Sense

Although we know there is a better solution to the model, we should first check that the model makes sense. For example, given this model, what is an animal's brainwt if its bodywt is 0?
Check: What would we expect an animal's brainwt to be if their bodywt is 0?
With linear modeling we call this part of the linear assumption. Consider it a sanity test of the model. If an animal's body weighs nothing, we expect its brain to be nonexistent as well. Given that, we can improve the model by telling sklearn's LinearRegression object that we do not want to fit a y-intercept.
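A sketch of that change, reusing X and y from the demo above:

```python
from sklearn import linear_model

# fit_intercept=False forces the regression line through the origin
lm = linear_model.LinearRegression(fit_intercept=False)
lm.fit(X, y)

print('coefficient:', lm.coef_)
print('intercept:  ', lm.intercept_)   # now 0.0 by construction
print('r-squared:  ', lm.score(X, y))
```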
Now the model fits through the point brainwt = 0, bodywt = 0. Because we force the fit through 0, the large outliers have a greater effect, so the coefficient has increased. Fitting to this linear assumption also explains slightly less of the variance.
Check: Is this a better or worse model? Why?
## Guided Practice: Using the LinearRegression Object (15 mins)

We learned earlier that the data in its current state does not allow for the best linear regression fit. With a partner, generate two more models using the log-transformed data to see how this transformation changes the model's performance. Complete the following code to update X and y to match the log-transformed data, and complete the loop by setting the list to one True and one False value.
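One way the completed loop might look, assuming the cleaned `data` frame from the demo above:

```python
import numpy as np
from sklearn import linear_model

# Log-transform the inputs and outputs
X_log = np.log10(data[['bodywt']])
y_log = np.log10(data['brainwt'])

for fit_intercept in [True, False]:
    lm = linear_model.LinearRegression(fit_intercept=fit_intercept)
    lm.fit(X_log, y_log)
    print('fit_intercept={}: coef={}, r-squared={:.3f}'.format(
        fit_intercept, lm.coef_, lm.score(X_log, y_log)))
```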
Check: Out of the four, which model performed the best? The worst? Why?
## Independent Practice: Base Linear Regression Classes (20 minutes)

Next class we'll go into further detail on other regression techniques, but for now, experiment with the model evaluation function we have (`get_linear_model_metrics`) and the following sklearn estimator classes to see how easy it is to swap in different estimators (a loop sketch follows the list):

- `linear_model.Lasso()`
- `linear_model.Ridge()`
- `linear_model.ElasticNet()`
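A minimal version of that loop, reusing X_log and y_log from the previous exercise (substitute `get_linear_model_metrics` for the inline metrics if you have the starter code loaded):

```python
from sklearn import linear_model

estimators = [
    linear_model.Lasso(),
    linear_model.Ridge(),
    linear_model.ElasticNet(),
]

# The shared estimator interface means the same fit/score code works for each class
for estimator in estimators:
    estimator.fit(X_log, y_log)
    print(type(estimator).__name__, 'r-squared:', estimator.score(X_log, y_log))
```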
Check: Did the implementation run without error? What were the r-squared outputs for each estimator?
## Introduction: Multiple Regression Analysis (10 minutes)
In the previous example, one variable explained the variance of another; however, more often than not, we will need multiple variables. For example, a house's price may be best measured by square feet, but a lot of other variables play a vital role: bedrooms, bathrooms, location, appliances, etc. For a linear regression, we want these variables to be largely independent of each other, but all of them should help explain the y variable.
We'll work with bike-share data to showcase what this means and to explain a concept called multicollinearity.
#### What is Multicollinearity?
With the bike share data, let's compare three data points: actual temperature, "feel" temperature, and guest ridership. Our data is already normalized between 0 and 1, so we'll start off with the correlations and modeling.
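A sketch of that first step (the CSV path is hypothetical; it assumes the hourly bike-share data with `temp`, `atemp`, and `casual` columns, where `casual` is guest ridership):

```python
import pandas as pd

bikeshare = pd.read_csv('data/bikeshare.csv')   # hypothetical path

# Pairwise correlations between the two temperature fields and guest ridership
print(bikeshare[['temp', 'atemp', 'casual']].corr())
```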
The correlation matrix shows that:

- both temperature fields are moderately correlated with guest ridership;
- the two temperature fields are highly correlated with each other.

Including both of these fields in a model introduces the pain point of multicollinearity, where it becomes more difficult for a model to determine which feature is affecting the predicted value.
We can measure this effect in the coefficients:
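For example, a sketch that fits each temperature field on its own and then both together, so the coefficients can be compared:

```python
from sklearn import linear_model

y = bikeshare['casual']

for columns in [['temp'], ['atemp'], ['temp', 'atemp']]:
    lm = linear_model.LinearRegression()
    lm.fit(bikeshare[columns], y)
    print(columns, 'coefficients:', lm.coef_,
          'r-squared:', round(lm.score(bikeshare[columns], y), 3))
```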
Even though the two-variable model `temp + atemp` explains more of the variance than either variable on its own, and both variables are considered significant (p-values approaching 0), we can see that together their coefficients are wildly different. This can introduce error in how we explain models.
What happens if we use a second variable that isn't highly correlated with temperature, like humidity?
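A quick sketch of that comparison (assuming the humidity column is named `hum`):

```python
from sklearn import linear_model

# Temperature paired with humidity instead of the 'feel' temperature
lm = linear_model.LinearRegression()
lm.fit(bikeshare[['temp', 'hum']], bikeshare['casual'])
print('coefficients:', lm.coef_,
      'r-squared:', round(lm.score(bikeshare[['temp', 'hum']], bikeshare['casual']), 3))
```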
While temperature's coefficient is higher, the logical output still makes sense: for guest riders we expected a positive relationship with temperature and a negative relationship with humidity, and our model suggests it as well.
Check: What is multicollinearity? Why might this cause problems in a model?
## Guided Practice: Multicollinearity with Dummy Variables (15 mins)

A similar effect can come from a feature set that forms a singular matrix: a matrix whose columns are linearly dependent (for example, a set of dummy columns that sum to 1 in every row).

Run through the following code on your own (a sketch is included below). What happens to the coefficients when you include all of the weather situations, instead of all except one?
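A sketch of that comparison, assuming `weathersit` is coded 1 through 4:

```python
import pandas as pd
from sklearn import linear_model

# Dummy out the weather situation (1 = nicest weather ... 4 = worst)
weather = pd.get_dummies(bikeshare['weathersit'], prefix='weathersit')
y = bikeshare['casual']

# First: leave one category out, which avoids the singular matrix
lm = linear_model.LinearRegression()
lm.fit(weather[['weathersit_1', 'weathersit_2', 'weathersit_3']], y)
print('all but one:', lm.coef_)

# Then: include every dummy column and watch how the coefficients change
lm = linear_model.LinearRegression()
lm.fit(weather, y)
print('all four:   ', lm.coef_)
```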
Check: Are students able to explain how coefficients changed once all the weather situations were included?
This model makes more sense because we can interpret each coefficient relative to the weather situation we left out. For example, it suggests that a clear day (weathersit: 1) brings in, on average, about 38 more riders per hour than a day with heavy snow. In fact, since the weather situations "degrade" in quality (1 is the nicest day, 4 is the worst), the coefficients now reflect that well. However, at this point there is still a lot of work to do, because weather on its own fails to explain ridership well.
## Guided Practice: Combining Non-Correlated Features into a Better Model (15 mins)

With a partner, complete this code together and visualize the correlations of all the numerical features built into the data set (a starting sketch follows the list below).

We want to:

- Add the three significant weather situations into our current model
- Find two more features that are not correlated with the current features, but could be strong indicators for predicting guest riders
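A starting sketch for the correlation step (keeping only numeric columns so `.corr()` works across pandas versions):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlations among all numerical features; look for columns that correlate
# with 'casual' but not with features already in the model
correlations = bikeshare.select_dtypes('number').corr()

sns.heatmap(correlations, cmap='coolwarm')
plt.show()

print(correlations['casual'].sort_values(ascending=False))
```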
Check: Were groups able to add all three conditions into the model? Did they come up with two additional predictive features?
## Independent Practice: Building Models for Other Y Variables (25 minutes)

We've completed a model together that explains casual guest riders. It's now your turn to build another model using a different y variable: registered riders. A starter sketch follows the checklist below.

Pay attention to:

- the distribution of riders (should we rescale the data?)
- checking correlations between the variables and registered riders
- having a feature space (our matrix) with low multicollinearity
- model complexity vs. explanation of variance: at what point do features in a model stop improving r-squared?
- the linear assumption: given all feature values being 0, should we have no ridership? Negative ridership? Positive ridership?
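A hypothetical starting point (the feature list is a placeholder; picking better features is the exercise):

```python
from sklearn import linear_model

# See which features correlate with registered ridership
numeric = bikeshare.select_dtypes('number')
print(numeric.corr()['registered'].sort_values(ascending=False))

# Placeholder feature set; swap in your own picks
features = ['temp', 'hum']
lm = linear_model.LinearRegression()
lm.fit(bikeshare[features], bikeshare['registered'])
print('r-squared:', round(lm.score(bikeshare[features], bikeshare['registered']), 3))
```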
#### Bonus

- Which variables would make sense to dummy (because they are categorical, not continuous)?
- What features might explain ridership but aren't included in the data set?
- Is there a way to build these using pandas and the features available?

**Outcomes:** If your model at least improves upon the original model and the explanatory effects (coefficients) make sense, consider this a complete task. If your model has an r-squared above .4, it is a relatively effective model for the data available. Kudos!
## Conclusion (5 mins)

- How do you dummy a categorical variable?
- How do you avoid a singular matrix?
- What is a single linear regression?
- What makes multi-variable regressions more useful?
- What challenges do they introduce?
### BEFORE NEXT CLASS

- UPCOMING PROJECTS: Final Project, Deliverable 1
### ADDITIONAL RESOURCES

- Add your own resources.
- Go crazy.
- So much room for bullets!