GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2021-summer/materials/worksheet_09/worksheet_09.ipynb
²⁰⁵¹ views

Kernel: R

Worksheet 9 - Regression Continued

Lecture and Tutorial Learning Goals:

By the end of the week, students will be able to:

Perform ordinary least squares regression in R using tidymodels to predict the values for a test dataset.
Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.
In R, overlay the ordinary least squares regression lines from geom_smooth on a single plot.

In [ ]:

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
library(cowplot)
options(repr.matrix.max.rows = 6)
source("tests_worksheet_09.R")
source('cleanup_worksheet_09.R')

Warm-up Questions

Here are some warm-up questions on the topic of multiple regression to get you thinking before we jump into data analysis. The course readings should help you answer these.

Question 1.0 Multiple Choice:
{points: 1}

In multivariate k-nn regression with one outcome/target variable and two predictor variables, the predictions take the form of what shape?

A. a flat plane

B. a wiggly/flexible plane

C. A straight line

D. a wiggly/flexible line

E. a 4D hyperplane

F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.0. Make sure you put quotations around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.0()

Question 1.1 Multiple Choice:
{points: 1}

In simple linear regression with one outcome/target variable and one predictor variable, the predictions take the form of what shape?

A. a flat plane

B. a wiggly/flexible plane

C. A straight line

D. a wiggly/flexible line

E. a 4D hyperplane

F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.1. Make sure you put quotations around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.1()

Question 1.2 Multiple Choice:
{points: 1}

In multiple linear regression with one outcome/target variable and two predictor variables, the predictions take the form of what shape?

A. a flat plane

B. a wiggly/flexible plane

C. A straight line

D. a wiggly/flexible line

E. a 4D hyperplane

F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.2. Make sure you put quotations around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.2()

Understanding Simple Linear Regression

Consider this small and simple dataset:

In [ ]:

simple_data  <- tibble(X = c(1, 2, 3, 6, 7, 7),
               Y = c(1, 1, 3, 5, 7, 6))
options(repr.plot.width = 5, repr.plot.height = 5)
base <- ggplot(simple_data, aes(x = X, y = Y)) +
    geom_point(size = 2) +
    scale_x_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    scale_y_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    theme(text = element_text(size = 20))
base

Now consider these three potential lines we could fit for the same dataset:

In [ ]:

options(repr.plot.height = 3.5, repr.plot.width = 10)
line_a <- base +
    ggtitle("Line A") +
    geom_abline(intercept = -0.897, slope = 0.9834, color = "blue") +
    theme(text = element_text(size = 20))
line_b <- base +
    ggtitle("Line B") +
    geom_abline(intercept = 0.1022, slope = 0.9804, color = "purple") +
    theme(text = element_text(size = 20))
line_c <- base +
    ggtitle("Line C") +
    geom_abline(intercept = -0.2347, slope = 0.9164, color = "green") +
    theme(text = element_text(size = 20))
plot_grid(line_a, line_b, line_c, ncol = 3)

Question 2.0
{points: 1}

Use the graph below titled "Line A" to roughly calculate the average squared vertical distance between the points and the blue line. Read values of the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.0.

We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

In [ ]:

#run this code
options(repr.plot.width = 9, repr.plot.height = 9)
line_a

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer2.0

In [ ]:

test_2.0()

Question 2.1
{points: 1}

Use the graph titled "Line B" to roughly calculate the average squared vertical distance between the points and the purple line. Read values of the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.1.

We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

In [ ]:

options(repr.plot.width = 9, repr.plot.height = 9)
line_b

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer2.1

In [ ]:

test_2.1()

Question 2.2
{points: 1}

Use the graph titled "Line C" to roughly calculate the average squared vertical distance between the points and the green line. Read values of the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.2.

We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

In [ ]:

options(repr.plot.width = 9, repr.plot.height = 9)
line_c

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer2.2

In [ ]:

test_2.2()

Question 2.3
{points: 1}

Based on your calculations above, which line would linear regression by ordinary least squares choose given our small and simple dataset? Line A, B or C? Assign the letter that corresponds the line to a variable named answer2.3. Make sure you put quotations around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_2.3()

Marathon Training Revisited with Linear Regression!

Source: https://media.giphy.com/media/BDagLpxFIm3SM/giphy.gif

Remember our question from last week: what features predict whether athletes will perform better than others? Specifically, we are interested in marathon runners, and looking at how the maximum distance ran per week during training predicts the time it takes a runner to end the race?

This time around, however, we will analyze the data using simple linear regression rather than $k$ -nn regression. In the end, we will compare our results to what we found last week with $k$ -nn regression.

Question 3.0
{points: 1}

Load the marathon data and assign it to an object called marathon.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
marathon

In [ ]:

test_3.0()

Question 3.1
{points: 1}

Similar to what we have done for the last few weeks, we will first split the dataset into the training and testing datasets, using 75% of the original data as the training data. Remember, we will be putting the test dataset away in a 'lock box' that we will comeback to later after we choose our final model. In the strata argument of the initial_split function, place the variable we are trying to predict. Assign your split dataset to an object named marathon_split.

Assign your training dataset to an object named marathon_training and your testing dataset to an object named marathon_testing.

In [ ]:

set.seed(2000) # DO NOT CHANGE THIS
# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_3.1()

Question 3.2
{points: 1}

Using only the observations in the training dataset, create a scatterplot to assess the relationship between race time (time_hrs) and maximum distance ran per week during training (max). Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called marathon_eda. Remember to do whatever is necessary to make this an effective visualization.

In [ ]:

options(repr.plot.height = 8, repr.plot.width = 7)

# your code here
fail() # No Answer - remove if you provide an answer
marathon_eda

In [ ]:

test_3.2()

Question 3.3
{points: 1}

Now that we have our training data, the next step is to build a linear regression model specification. Thankfully, building other model specifications is quite straightforward since we will still go through the same procedure (indicate the function, the engine and the mode).

Instead of using the nearest_neighbor function, we will be using the linear_reg function to let tidymodels know we want to perform a linear regression. In the set_engine function, we have typically set "kknn" there for $k$ -nn. Since we are doing a linear regression here, set "lm" as the engine. Finally, instead of setting "classification" as the mode, set "regression" as the mode.

Assign your answer to an object named lm_spec.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_3.3()

Question 3.3.1
{points: 1}

After we have created our linear regression model specification, the next step is to create a recipe, establish a workflow analysis and fit our simple linear regression model.

First, create a recipe with the variables of interest (race time and max weekly training distance) using the training dataset and assign your answer to an object named lm_recipe.

Then, create a workflow analysis with our model specification and recipe. Remember to fit in the training dataset as well. Assign your answer to an object named lm_fit.

In [ ]:

#... <- recipe(... ~ ..., data = ...)

#... <- workflow() %>%
#       add_recipe(...) %>%
#       add_model(...) %>%
#       fit(...)

# your code here
fail() # No Answer - remove if you provide an answer
lm_fit

In [ ]:

test_3.3.1()

Question 3.4
{points: 1}

Now, let's visualize the model predictions as a straight line overlaid on the training data. Use geom_smooth with method = "lm" and se = FALSE to visualize the predictions as a straight line. Name your plot lm_predictions.

In [ ]:

options(repr.plot.width = 8, repr.plot.height = 7)

# your code here
fail() # No Answer - remove if you provide an answer
lm_predictions

In [ ]:

test_3.4()

Question 3.5
{points: 1}

Great! We can now see the line of best fit on the graph. Now let's calculate the $RMSPE$ using the test data. To get to this point, first, use the lm_fit to make predictions on the test data. Remember to bind the appropriate columns for the test data. Afterwards, collect the metrics and store it in an object called lm_test_results.

From lm_test_results, extract the $RMPSE$ and return a single numerical value. Assign your answer to an object named lm_rmspe.

In [ ]:

#... <- lm_fit %>%
#         predict(...) %>%
#         bind_cols(...) %>%
#         metrics(truth = ..., estimate = ..)


#... <- lm_test_results %>%
#          filter(...) %>%
#          select(...) %>%
#          ...

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmspe

In [ ]:

test_3.5()

Question 3.5.1
{points: 1}

Now, let's visualize the model predictions as a straight line overlaid on the test data. Use geom_smooth with method = "lm" and se = FALSE to visualize the predictions as a straight line. Name your plot lm_predictions_test. Remember to do whatever is necessary to make this an effective visualization.

In [ ]:

options(repr.plot.width = 8, repr.plot.height = 7)


# your code here
fail() # No Answer - remove if you provide an answer
lm_predictions_test

In [ ]:

test_3.5.1()

Question 3.6
{points: 1}

Compare the test RMPSE of k-nn regression (0.606 from last worksheet) to that of simple linear regression, which is greater?

A. Simple linear regression has a greater RMSPE

B. $k$ -nn regression has a greater RMSPE

C. Neither, they are identical

Save the letter of the answer you think is correct to a variable named answer3.6. Make sure you put quotations around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_3.6()

Question 3.7 Multiple Choice:
{points: 1}

Which model does a better job of predicting on the test dataset?

A. Simple linear regression

B. $k$ -nn regression

C. Neither, they are identical

Save the letter of the answer you think is correct to a variable named answer3.7. Make sure you put quotations around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_3.7()

Given that the linear regression model is a straight line, we can write our model as a mathematical equation. We can get the two numbers we need for this from the coefficients, (Intercept) and time_hrs.

In [ ]:

# run this cell
lm_fit

Question 3.8.1 Multiple Choice:
{points: 1}

Which of the following mathematical equations represents the model based on the numbers output in the cell above?

A. $Predicted \ race \ time \ (in \ hours) = 4.86 - 0.02 * max \ (in \ miles)$

B. $Predicted \ race \ time \ (in \ hours) = -0.02 + 4.86 * max \ (in \ miles)$

C. $Predicted \ max \ (in \ miles) = 4.86 - 0.02 * \ race \ time \ (in \ hours)$

D. $Predicted \ max \ (in \ miles) = -0.02 + 4.86 * \ race \ time \ (in \ hours)$

Save the letter of the answer you think is correct to a variable named answer3.8.1. Make sure you put quotations around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_3.8.1()

In [ ]:

source('cleanup_worksheet_09.R')

Worksheet 9 - Regression Continued

Lecture and Tutorial Learning Goals:

Warm-up Questions

Understanding Simple Linear Regression

Marathon Training Revisited with Linear Regression!

Product

Resources

Company