Lecture 26 – Residuals and Inference
DSC 10, Fall 2022
Announcements
The Final Project is due tomorrow at 11:59pm.
Questions about slip days? See here.
The Final Exam is this Saturday 12/3 from 11:30am to 2:30pm.
More details coming shortly, but start studying!
There are several study sessions/group office hours this week, which should be helpful as you complete the final project and study for the final exam. Check the calendar for all office hours.
Monday 11/28 from 12-2pm in PCNYH 122.
Tuesday 11/29 from 7-9pm in SDSC Auditorium (with no heat 🥶; dress warmly 🧣).
Wednesday 11/30 from 3-7pm in SDSC Auditorium (with no heat 🥶; dress warmly 🧣).
Friday 12/2 from 5-9pm in WLH 2205.
Lecture section C00 is not meeting today – Suraj is in India 🇮🇳.
C00 will be meeting again this Wednesday and Friday.
Agenda
Residuals.
Inference for regression.
Residuals
Quality of fit
The regression line describes the "best linear fit" for a given dataset.
The formulas for the slope and intercept work no matter what the shape of the data is.
However, the line is only meaningful if the relationship between x and y is roughly linear.
Example: Non-linear data
This line doesn't fit the data at all, despite being the "best" line for the data!
Residuals
Any set of predictions has errors.
When using the regression line to make predictions, the errors are called residuals.
There is one residual corresponding to each data point in the dataset.
Calculating residuals
Example: Predicting a son's height from his mother's height 👵👨 📏
Is the association between 'mom' and 'son' linear?
If there is a linear association, is it strong?
We can answer this using the correlation coefficient.
Close to 0 = weak, close to -1/+1 = strong.
Is "linear" the best description of the association between 'mom' and 'son'? We'll use residuals to answer this question.
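Each residual is the actual value minus the regression line's prediction. A minimal sketch with NumPy, using a small made-up set of mother/son heights (the lecture uses a full dataset with 'mom' and 'son' columns):

```python
import numpy as np

# Made-up mother/son heights in inches (hypothetical values for illustration).
mom = np.array([62.0, 64.0, 66.0, 68.0, 70.0])
son = np.array([66.0, 67.0, 69.0, 70.0, 73.0])

# Slope and intercept of the regression line, from the standard formulas.
r = np.corrcoef(mom, son)[0, 1]
slope = r * np.std(son) / np.std(mom)
intercept = son.mean() - slope * mom.mean()

predicted = slope * mom + intercept
residuals = son - predicted  # actual minus predicted

# One residual per data point; for a least-squares line they sum to zero.
print(residuals)
```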
Residual plots
The residual plot of a regression line is the scatter plot with the x variable on the x-axis and residuals on the y-axis.
Residual plots describe how the error in the regression line's predictions varies.
**Key idea**: If a linear fit is good, the residual plot should look like a patternless "cloud" ☁️.
The residual plot for a non-linear association 🚗
Consider the hybrid cars dataset from earlier.
Let's look at a regression line that uses 'mpg' to predict 'msrp'.
Note that as 'mpg' increases, the residuals go from being mostly large, to being mostly small, to being mostly large again. That's a pattern!
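This kind of pattern is easy to reproduce with a tiny made-up example: fit a line to clearly quadratic data and the residuals come out positive at both extremes and negative in the middle, rather than a patternless cloud. A sketch, with hypothetical data (not the hybrid cars dataset):

```python
import numpy as np

x = np.arange(0, 11).astype(float)  # made-up inputs
y = x ** 2                          # a clearly non-linear relationship

# Best-fit line via the correlation-based formulas.
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = y.mean() - slope * x.mean()
residuals = y - (slope * x + intercept)

# Pattern: positive residuals at the extremes, negative in the middle.
print(residuals)
```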
Issue: Patterns in the residual plot
Patterns in the residual plot imply that the relationship between x and y may not be linear.
While this can be spotted in the original scatter plot, it may be easier to identify in the residual plot.
In such cases, a curve may be a better choice than a line for prediction.
In future courses, you'll learn how to extend linear regression to work for polynomials and other types of mathematical functions.
Another example: 'mpg' and 'acceleration' ⛽
Let's fit a regression line that predicts 'mpg' given 'acceleration'. Let's then look at the resulting residual plot.
Note that the residuals tend to vary more for smaller accelerations than they do for larger accelerations – that is, the vertical spread of the plot is not similar at all points on the x-axis.
Issue: Uneven vertical spread
If the vertical spread in a residual plot is uneven, it implies that the regression line's predictions aren't equally reliable for all inputs.
This doesn't necessarily mean that fitting a non-linear curve would be better; it just impacts how we interpret the regression line's predictions.
For instance, in the previous plot, we see that the regression line's predictions for cars with larger accelerations are more reliable than those for cars with lower accelerations.
The formal term for "uneven spread" is heteroscedasticity.
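Uneven spread can also be checked numerically: split the residuals by input value and compare their spreads. A sketch with made-up heteroscedastic data (the noise shrinks as x grows):

```python
import numpy as np

rng = np.random.default_rng(7)

x = np.linspace(1, 10, 300)
# Made-up data whose noise scale is 10/x, so spread shrinks as x grows.
y = 3 * x + rng.normal(scale=10 / x, size=300)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Compare residual spread on the left vs. right half of the x-axis.
left_sd = np.std(residuals[x < 5.5])
right_sd = np.std(residuals[x >= 5.5])
print(left_sd, right_sd)
```

The regression line's predictions are more reliable where the residual spread is smaller – here, for larger x.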
Example: Anscombe's quartet

All 4 datasets have the same mean of x, mean of y, SD of x, SD of y, and correlation.
Therefore, they have the same regression line because the slope and intercept of the regression line are determined by these 5 quantities.
But they all look very different!
Not all of them are linearly associated.
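The fact that those 5 quantities determine the line can be verified directly: the summary-statistic formulas (slope = r · SD of y / SD of x, intercept = mean of y − slope · mean of x) give the same line as a direct least-squares fit. A sketch with made-up data:

```python
import numpy as np

# Any made-up dataset works here.
x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.0, 3.0, 3.5, 6.0, 8.0])

# Regression line from the five summary statistics.
r = np.corrcoef(x, y)[0, 1]
slope = r * np.std(y) / np.std(x)
intercept = y.mean() - slope * x.mean()

# The same line, from a direct least-squares fit.
slope_ls, intercept_ls = np.polyfit(x, y, 1)

print(slope, intercept)
```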
Example: The Datasaurus Dozen 🦖
Never trust summary statistics alone; always visualize your data!
Inference for regression
Another perspective on regression
What we're really doing:
Collecting a sample of data from a population.
Fitting a regression line to that sample.
Using that regression line to make predictions for inputs that are not in our sample.
What if we used a different sample? 🤔
Concept Check ✅ – Answer at cc.dsc10.com
What strategy will help us assess how different our regression line's predictions would have been if we'd used a different sample?
A. Hypothesis testing
B. Permutation testing
C. Bootstrapping
D. Central Limit Theorem
Prediction intervals
We want to come up with a range of reasonable values for a prediction for a single input x. To do so, we'll:
Bootstrap the sample.
Compute the slope and intercept of the regression line for that sample.
Repeat steps 1 and 2 many times to compute many slopes and many intercepts.
For a given x, use the bootstrapped slopes and intercepts to create bootstrapped predictions, and take the middle 95% of them.
The resulting interval will be called a prediction interval.
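The four steps above can be sketched in code. This uses made-up mother/son heights rather than the real dataset, and the function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(23)

# Hypothetical mother/son heights in inches (the lecture uses a real dataset).
mom = rng.normal(64, 2, size=200)
son = 0.5 * mom + 37 + rng.normal(0, 2, size=200)
n = len(mom)

def fit_line(x, y):
    """Slope and intercept of the regression line for one sample."""
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.std(y) / np.std(x)
    return slope, y.mean() - slope * x.mean()

def prediction_interval(x_new, reps=1000):
    """95% bootstrap prediction interval for the input x_new."""
    preds = np.array([])
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        slope, intercept = fit_line(mom[idx], son[idx])
        preds = np.append(preds, slope * x_new + intercept)
    return np.percentile(preds, 2.5), np.percentile(preds, 97.5)

lo, hi = prediction_interval(68)
print(lo, hi)
```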
Bootstrapping the scatter plot of mother/son heights
Note that each time we run this cell, the resulting line is slightly different!
Bootstrapping predictions: mother/son heights
If a mother is 68 inches tall, how tall do we predict her son to be?
Using the original dataset, and hence the original slope and intercept, we get a single prediction for the input of 68.
Using the bootstrapped slopes and intercepts, we get an interval of predictions for the input of 68.
How different could our prediction have been, for all inputs?
Here, we'll plot several of our bootstrapped lines. What do you notice?
Observations:
All bootstrapped lines pass through the point (mean mom's height, mean son's height).
Predictions seem to vary more for very tall and very short mothers than they do for mothers with an average height.
Prediction interval width vs. mother's height
Note that the closer a mother's height is to the mean mother's height, the narrower the prediction interval for her son's height is!
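This can be demonstrated by comparing interval widths at different inputs: the bootstrapped interval at the mean mother's height is narrower than one far from it. A sketch, again with made-up heights:

```python
import numpy as np

rng = np.random.default_rng(10)

# Hypothetical mother/son heights in inches.
mom = rng.normal(64, 2, size=200)
son = 0.5 * mom + 37 + rng.normal(0, 2, size=200)
n = len(mom)

def interval_width(x_new, reps=500):
    """Width of the 95% bootstrap prediction interval at x_new."""
    preds = []
    for _ in range(reps):
        idx = rng.integers(0, n, size=n)
        m, s = mom[idx], son[idx]
        r = np.corrcoef(m, s)[0, 1]
        slope = r * np.std(s) / np.std(m)
        intercept = s.mean() - slope * m.mean()
        preds.append(slope * x_new + intercept)
    lo, hi = np.percentile(preds, [2.5, 97.5])
    return hi - lo

# Widths at the mean mom height vs. far from it.
w_mean = interval_width(mom.mean())
w_far = interval_width(mom.mean() + 6)
print(w_mean, w_far)
```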
Summary, next time
Summary
Residuals are the errors in the predictions made by the regression line.
Residual plots help us determine whether a line is a good fit to our data.
No pattern in residual plot = good linear fit.
Patterns in residual plot = poor linear fit.
The correlation coefficient, r, doesn't tell the full story! 🦖
To see how our predictions might have been different if we'd had a different sample, bootstrap!
Resample the data points and make a prediction using the regression line for each resample.
Many resamples lead to many predictions. Take the middle 95% of them to get a 95% prediction interval.
Next time
We're done with introducing new material!
We'll review in class on Wednesday and Friday.
The final exam is this Saturday 12/3 from 11:30am to 2:30pm.
