Lab 8: Regression
Due Saturday, November 26th at 11:59PM
Welcome to Lab 8, the final lab of the quarter! In this lab you will get some hands-on practice with linear regression, which is covered in CIT 15.
This lab is due on Saturday, November 26th at 11:59PM.
1. How faithful is Old Faithful?
(The clever title comes from here.)
Old Faithful is a geyser in Yellowstone National Park in the western United States. It's famous for erupting on a fairly regular schedule. You can see a video below.
Some of Old Faithful's eruptions last longer than others. When it has a long eruption, there's generally a longer wait until the next eruption.
If you visit Yellowstone, you might want to predict when the next eruption will happen, so you can see the rest of the park and come to see the geyser when it is erupting. In this lab, we will use a dataset on eruption durations and waiting times to see if we can make such predictions accurately with linear regression.
The dataset has one row for each observed eruption. It includes the following columns:
- 'duration': Eruption duration, in minutes.
- 'wait': Time between this eruption and the next, also in minutes.
Run the next cell to load the dataset.
We would like to use linear regression to make predictions about waiting time given a duration, but that won't work well if the data aren't roughly linearly related. To check that, we should look at the data.
Question 1.1. Make a scatter plot of the data. It's conventional to put the column we will try to predict on the vertical axis and the other column on the horizontal axis.
Question 1.2. Look at the scatter plot. Are the duration of the eruption and the waiting time roughly linearly related? If so, is the relationship between them negative or positive? Assign either 1, 2, or 3 to the variable faith_q2 below.
1. Eruption duration and waiting time are not roughly linearly related.
2. Eruption duration and waiting time are roughly linearly related. The relationship between them is negative.
3. Eruption duration and waiting time are roughly linearly related. The relationship between them is positive.
The scatterplot suggests that it is reasonable to use linear regression to analyze this data.
Next, we'd like to plot the data in standard units. Recall that, if nums is a Series of numbers, then
(nums - nums.mean()) / np.std(nums)
is a Series of those numbers in standard units.
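As a quick check of the formula above, the sketch below converts a small made-up Series to standard units. It assumes pandas and NumPy are imported as pd and np; the values in nums are invented for illustration and do not come from the Old Faithful dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not from the lab's dataset).
nums = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Subtract the mean, then divide by the standard deviation from np.std.
nums_su = (nums - nums.mean()) / np.std(nums)

# In standard units, the mean should be 0 and the standard deviation 1.
print(nums_su.mean(), np.std(nums_su))
```

A value in standard units says how many standard deviations the original value sits above (positive) or below (negative) the mean.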
Question 1.3. Compute the mean and standard deviation of the eruption durations and waiting times. Then create a DataFrame called faithful_su containing the eruption durations and waiting times in standard units. The columns should be named 'duration_su' and 'wait_su', respectively.
Note: You must use np.std to calculate the standard deviation.
Question 1.4. Create a scatter plot of the data again, but this time in standard units.
You'll notice that this plot looks exactly the same as the last one! The data really are different, but the plots look the same because the axes are scaled differently. The method .plot automatically scales the axes so the data fill up the available space. This means it's important to read the ticks on the axes!
Question 1.5. Among the following numbers, which would you guess is closest to the correlation between eruption duration and waiting time in this dataset: -1, 0, or 1? Assign your answer to correlation_guess.
Now, you'll actually compute the correlation between duration and waiting time. To help you do so, we've defined a function standard_units that takes in an array or Series of numbers and returns a copy in which all values are in standard units:
Question 1.6. Complete the implementation of the function correlation, which takes in a DataFrame df and two column names x and y and returns the correlation between the two columns. Then, use your function to find the correlation between duration and waiting time, and assign it to the variable r.
Hints:
- Does it matter if we compute the correlation between 'duration_su' and 'wait_su' or 'duration' and 'wait'?
- CIT 15.1.2 explains how to do this, if you're stuck.
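To illustrate the idea behind these hints, here is a minimal sketch on made-up data. It treats the correlation as the mean of the products of two columns in standard units. The helper names to_standard_units and r_of are hypothetical stand-ins, not the lab's standard_units or correlation, and the toy DataFrame is invented.

```python
import numpy as np
import pandas as pd

def to_standard_units(values):
    """Convert an array or Series to standard units (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / np.std(values)

def r_of(df, x, y):
    """Mean of the products of columns x and y, each in standard units."""
    return np.mean(to_standard_units(df.get(x)) * to_standard_units(df.get(y)))

# Made-up data, not the Old Faithful measurements.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.0, 4.0, 5.0, 4.0, 5.0]})

r_toy = r_of(toy, 'x', 'y')

# Changing the units of a column (here, multiplying x by 60) leaves r unchanged.
toy_rescaled = toy.assign(x=toy.get('x') * 60)
print(r_toy, r_of(toy_rescaled, 'x', 'y'))
```

Because the helper converts to standard units internally, the result should not depend on the units of the input columns.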
2. The regression line
Recall that the correlation is the slope of the regression line when the data are put in standard units.
The next cell plots the regression line in standard units:
$$\text{wait\_su} = r \times \text{duration\_su}$$

Then, it overlays the line on a plot of the original data (in standard units) for comparison. (You don't need to fully understand the code, just run it.)
How would you take a point in standard units and convert it back to original units? We'd have to "stretch" its horizontal position by duration_std and its vertical position by wait_std.
That means the same thing would happen to the slope of the line.
Stretching a line horizontally makes it less steep, so we divide the slope by a horizontal stretching factor. Stretching a line vertically makes it more steep, so we multiply the slope by a vertical stretching factor. (What value describes how spread out the durations are? What value describes how spread out the waiting times are?)
Question 2.1. What is the slope of the regression line in original units?
Hint: If the "stretching" explanation is unintuitive, consult CIT 15.2.5.
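The stretching argument can be checked numerically on made-up data. The sketch below rescales the correlation by the two standard deviations and compares the result with the slope from np.polyfit, which fits a least-squares line. The arrays x and y are invented; nothing here uses the lab's variables.

```python
import numpy as np

# Made-up data, not the Old Faithful measurements.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Correlation: mean of the products in standard units.
x_su = (x - x.mean()) / np.std(x)
y_su = (y - y.mean()) / np.std(y)
r_xy = np.mean(x_su * y_su)

# Multiply by the vertical spread and divide by the horizontal spread.
slope_original = r_xy * np.std(y) / np.std(x)

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
fitted_slope, fitted_intercept = np.polyfit(x, y, 1)
print(slope_original, fitted_slope)
```

The two slopes should agree up to floating-point rounding.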
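We know that the regression line passes through the point $(\text{average eruption duration}, \text{average waiting time})$. So, now we know the slope of the regression line and a point that it passes through. You might recall the point-slope form of a line from high-school algebra. It says the equation for the line is:

$$\text{waiting time} - \text{average waiting time} = \text{slope} \times (\text{eruption duration} - \text{average eruption duration})$$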
Question 2.2. Rearrange the above equation to find the intercept of the line. Then, assign intercept to be the intercept of the regression line.
Hint: Think of the eruption duration as $x$ and the waiting time as $y$. Try to rearrange the above equation into the form $y = mx + b$, where $b$ is the intercept.
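As a generic illustration of this kind of rearrangement, consider any line with slope $m$ that passes through a point $(x_0, y_0)$. These symbols are introduced here for illustration and are not the lab's variable names.

$$y - y_0 = m\,(x - x_0) \quad\Longrightarrow\quad y = m\,x + (y_0 - m\,x_0)$$

Matching this against $y = mx + b$ gives $b = y_0 - m\,x_0$.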
3. Investigating the regression line
The slope and intercept tell you exactly what the regression line looks like.
To predict the waiting time for an eruption, multiply the eruption's duration by slope and then add intercept.
Question 3.1. Compute the predicted waiting time for an eruption that lasts 2 minutes, and for an eruption that lasts 5 minutes.
The next cell plots the line that goes between those two points, which is (a segment of) the regression line.
Question 3.2. Make predictions for the waiting time after each eruption in the faithful DataFrame. Put these numbers into a column in a new DataFrame called faithful_predictions. Its first row should look like this:
| | duration | wait | predicted_wait |
|---|---|---|---|
| 0 | 3.600 | 79.0 | 72.101106 |
Note that we know exactly what the true waiting times were for every eruption in our dataset! We are doing this so we can see how accurate our predictions are.
Hint: There is no need for a for-loop or even .apply; use Series arithmetic instead.
Question 3.3. How good are our predictions? To answer this question, we can compute the residual for each prediction. Residuals are defined as follows (note that there is no absolute value):

$$\text{residual} = \text{actual waiting time} - \text{predicted waiting time}$$
Compute the residual for each eruption in the dataset. Add the residuals to faithful_predictions as a new column called 'residual', naming the resulting DataFrame faithful_residuals.
Hint: Again, there is no need for a for-loop or .apply.
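To show what Series arithmetic looks like for predictions and residuals, here is a small sketch on made-up data. The DataFrame toy, the column names, and the values of toy_slope and toy_intercept are all invented for illustration. A residual is taken to be the actual value minus the predicted value, following the standard definition.

```python
import numpy as np
import pandas as pd

# Made-up data and a made-up line; neither comes from the lab.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.0, 4.0, 5.0, 4.0, 5.0]})
toy_slope, toy_intercept = 0.6, 2.2

# Series arithmetic applies the line to every row at once, with no loop.
predicted = toy.get('x') * toy_slope + toy_intercept

# Residual = actual value minus predicted value (no absolute value).
with_residuals = toy.assign(predicted=predicted,
                            residual=toy.get('y') - predicted)
print(with_residuals)
```

A positive residual means the line under-estimated the actual value, and a negative residual means it over-estimated.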
Here is a plot of the residuals you computed. Each point corresponds to one eruption. It shows how much our prediction over- or under-estimated the waiting time.
If a linear fit is good, the residual plot should look like a patternless "blob". This implies that the accuracy of the line's predictions is roughly the same for all durations. (This is an idea you will study further in future courses; see CIT 15.5 for more details.)
In the residual plot above, there isn't really a pattern in the residuals, which confirms that it was reasonable to try linear regression. It's true that there are two separate clouds; the eruption durations seemed to fall into two distinct clusters. But that's just a pattern in the eruption durations themselves, not a pattern in the relationship between eruption durations and waiting times.
4. How accurate are different predictions?
Earlier, you should have found that the correlation is fairly close to 1, so a line fits the data fairly well. This means that, overall, the residuals are small (close to 0) in comparison to the waiting times.
We can see that visually by plotting the waiting times and residuals together:
However, even though the regression line fits the data well, you should be wary of applying your prediction model to data that are very different from the data in your sample.
Question 4.1. In faithful, no eruption lasted exactly 0, 2.5, or 60 minutes. Using this line, what is the predicted waiting time for an eruption that lasts 0 minutes? 2.5 minutes? 60 minutes (an hour)?
Question 4.2. Do you believe any of these values are reliable predictions? Why or why not? Assign true_predictions to a list of the correct statements.
1. The predicted waiting time for a zero minute duration is reliable.
2. The predicted waiting time for a 2.5 minute duration is reliable.
3. The predicted waiting time for an hour duration is reliable.
4. We have data for all of the durations we predicted waiting times for.
5. We have data surrounding (above and below) all of the durations we predicted waiting times for.
Hint: What does a duration of zero minutes mean?
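One simple way to think about whether a prediction is an extrapolation is to check if the input falls inside the range of values seen in the sample. The sketch below does this on made-up numbers. The function within_observed_range and the Series toy_durations are hypothetical, and passing this check is only a rough sanity test, not a guarantee that a prediction is reliable.

```python
import pandas as pd

def within_observed_range(observed, value):
    """Return True if value lies between the smallest and largest observed values."""
    return bool(observed.min() <= value <= observed.max())

# Made-up durations in minutes, not the lab's dataset.
toy_durations = pd.Series([1.8, 2.1, 3.4, 4.0, 4.7])

for d in [0, 2.5, 60]:
    print(d, within_observed_range(toy_durations, d))
```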
5. Divide and conquer
It appears from the scatter plot that there are two clusters of points: one for durations around 2 and another for durations between 3.5 and 5. A vertical line at 3 divides the two clusters.
The standardize function below returns a DataFrame with all columns converted to standard units. It uses the standard_units function we defined for you right before Question 1.6. Pay attention to the names of the columns in the DataFrame that standardize returns.
Question 5.1. Assign below_3_r to the correlation coefficient of all points with a duration below 3 and above_3_r to the correlation coefficient of all points with a duration above or equal to 3. To do so:
- Create two DataFrames, below_3 and above_3. below_3 should contain all rows in faithful in which the duration is below 3, and above_3 should contain all rows in faithful in which the duration is above or equal to 3.
- Call your correlation function from Question 1.6 on both below_3 and above_3.
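The splitting step can be sketched on made-up data with a boolean query. Everything below is invented for illustration: the DataFrame toy, the cutoff, and the names low and high. It uses np.corrcoef as a stand-in for a correlation function.

```python
import numpy as np
import pandas as pd

# Made-up data with two clusters, not the Old Faithful measurements.
toy = pd.DataFrame({'x': [1.0, 1.5, 2.0, 4.0, 4.5, 5.0],
                    'y': [3.0, 2.0, 4.0, 8.0, 9.0, 8.5]})

# A boolean Series inside the brackets keeps only the rows where it is True.
low = toy[toy.get('x') < 3]
high = toy[toy.get('x') >= 3]

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_all = np.corrcoef(toy.get('x'), toy.get('y'))[0, 1]
r_low = np.corrcoef(low.get('x'), low.get('y'))[0, 1]
r_high = np.corrcoef(high.get('x'), high.get('y'))[0, 1]
print(r_all, r_low, r_high)
```

In this toy example, the correlation within each cluster is weaker than the correlation across the whole dataset, because much of the overall linear trend comes from the gap between the clusters.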
Question 5.2. Below, complete the implementation of the functions slope_of and intercept_of. Both functions should take in a DataFrame df that contains a 'duration' column and a 'wait' column (both of which are in original units, not standard units).
When you're done, the functions wait_below_3 and wait_above_3 should each use a different regression line to predict a wait time for a duration. The first function should use the regression line for all points with duration below 3. The second function should use the regression line for all points with duration above or equal to 3.
The plot below shows two different regression lines, one for each cluster!
Question 5.3. Write a function predict_wait that takes a duration and returns the predicted wait time using the appropriate regression line, depending on whether the duration is below 3 or greater than or equal to 3.
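The general pattern of choosing between two lines based on a cutoff can be sketched as follows. The function make_piecewise_predictor and the slopes and intercepts passed to it are hypothetical and made up; they are not the lab's regression lines or its wait_below_3 and wait_above_3 functions.

```python
def make_piecewise_predictor(cutoff, line_below, line_above):
    """Build a predictor that picks one of two lines based on a cutoff.

    Each line is a (slope, intercept) pair. Inputs below the cutoff use
    line_below; inputs at or above the cutoff use line_above.
    """
    def predict(x):
        slope, intercept = line_below if x < cutoff else line_above
        return slope * x + intercept
    return predict

# Made-up lines for illustration only.
toy_predict = make_piecewise_predictor(3, line_below=(2.0, 1.0), line_above=(0.5, 10.0))

print(toy_predict(2), toy_predict(3), toy_predict(4))
```

Note that an input exactly equal to the cutoff is handled by the second line, matching the "greater than or equal to" wording in the question.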
The predicted wait times for each point appear below.
Question 5.4. Do you think the predictions produced by predict_wait are more or less accurate than the predictions from the original regression line you created in Question 2? How can you tell? To answer this question, let's create another plot of the residuals, this time from new_faithful, and see if they're any different than before.
Add a column called new_residuals to the new_faithful DataFrame. This column should contain the residuals from the predictions made by predict_wait. Then, create a residual plot to show, for each eruption, how much the new prediction over- or under-estimates the actual wait time.
For comparison's sake, here is the residual plot we created with our old predictions in Question 3.
Question 5.5. Now that we have plotted the residuals, can we say that the new set of predictions is more or less accurate than before? Assign either 1, 2, 3, or 4 to the variable new_predict below.
1. The new predictions are more accurate than the old predictions because the new residuals have a lower max value than the old residuals, as well as a lower minimum value than the old residuals, so the new predictions are closer to the true values than the old predictions.
2. The new predictions are more accurate than the old predictions because the new residuals exhibit less spread than the old residuals, so the new predictions are closer to the true values more often than the old predictions.
3. The new predictions are less accurate than the old predictions because they can't predict what will happen after a three minute duration.
4. We cannot tell if the new predictions are more accurate than the old predictions because the new and old residuals look similar.
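Besides comparing residual plots by eye, one common numeric summary of how spread out a set of residuals is around 0 is the root mean squared error. The sketch below computes it for two made-up sets of residuals. The function rmse and both arrays are illustrative and are not part of the lab.

```python
import numpy as np

def rmse(residuals):
    """Root mean squared error: the square root of the mean squared residual."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sqrt(np.mean(residuals ** 2))

# Made-up residuals, not computed from the lab's data.
first_residuals = np.array([-8.0, 6.0, 10.0, -6.0, -2.0])
second_residuals = np.array([-3.0, 2.0, 4.0, -2.0, -1.0])

print(rmse(first_residuals), rmse(second_residuals))
```

A smaller value means that the predictions are, on the whole, closer to the actual values.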
Finish Line 🏁
Congratulations! You've finished the last lab of the quarter!
To submit your assignment:
1. Select Kernel -> Restart & Run All to ensure that you have executed all cells, including the test cells.
2. Read through the notebook to make sure everything is fine and all tests passed.
3. Run the cell below to run all tests, and make sure that they all pass.
4. Download your notebook using File -> Download as -> Notebook (.ipynb), then upload your notebook to Gradescope.