Kernel: R

Worksheet 8 - Regression

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

  • Recognize situations where a simple regression analysis would be appropriate for making predictions.

  • Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.

  • Interpret the output of a k-nn regression.

  • In a dataset with two variables, perform k-nearest neighbour regression in R using caret::knnregTrain() to predict the values for a test dataset.

  • Using R, execute cross-validation to choose the number of neighbours.

  • Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error, RMSPE).

  • In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).

  • Describe advantages and disadvantages of the k-nearest neighbour regression approach.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
source("tests_worksheet_08.R")
source('cleanup_worksheet_08.R')

Question 0.0 Multiple Choice:
{points: 1}

To predict a value for $Y$ for a new observation using k-nn regression, we identify the $k$ nearest neighbours and then:

A. Assign it the median of the $k$ nearest neighbours as the predicted value

B. Assign it the mean of the $k$ nearest neighbours as the predicted value

C. Assign it the mode of the $k$ nearest neighbours as the predicted value

D. Assign it the majority vote of the $k$ nearest neighbours as the predicted value

Save the letter of the answer you think is correct to a variable named answer0.0. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_0.0()
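Whichever summary of the neighbours you settle on, the computation in R is a one-liner. A minimal sketch with four made-up neighbour response values (race times in hours, purely for illustration):

# Four hypothetical neighbour response values (race times in hours)
neighbour_times <- c(2.56, 2.65, 2.99, 3.05)

mean(neighbour_times)   # 2.8125
median(neighbour_times) # 2.82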

Question 0.1 Multiple Choice:
{points: 1}

Of those shown below, which is the correct formula for RMSPE?

A. $RMSPE = \sqrt{\frac{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}{1 - n}}$

B. $RMSPE = \sqrt{\frac{1}{n - 1}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

C. $RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

D. $RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})}$

Save the letter of the answer you think is correct to a variable named answer0.1. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_0.1()
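For concreteness, a root-mean-squared-style error can be computed directly in R. A minimal sketch with made-up observed and predicted values (y and y_hat are hypothetical):

# Made-up observed values and model predictions, for illustration only
y     <- c(2.0, 2.5, 3.0)
y_hat <- c(2.2, 2.4, 3.3)

# Square the errors, average them, then take the square root
sqrt(mean((y - y_hat)^2)) # approximately 0.216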

Question 0.2
{points: 1}

The plot below is a very simple k-nn regression example, where the black dots are the data observations and the blue line shows the predictions from a k-nn regression model created from this data with $k=2$.

Using the formula for RMSE/RMSPE (given in the reading) and the graph below, calculate RMSE for this model by hand (pen and paper, or use R as a calculator). Use one decimal place of precision when reading off the heights of the black dots and the blue line. Save your answer to a variable named answer0.2.

# your code here
fail() # No Answer - remove if you provide an answer
answer0.2
test_0.2()

RMSPE Definition

Question 0.3 Multiple Choice:
{points: 1}

What does RMSPE stand for?

A. root mean squared prediction error

B. root mean squared percentage error

C. root mean squared performance error

D. root mean squared preference error

Save the letter of the answer you think is correct to a variable named answer0.3. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_0.3()

Marathon Training

Source: https://media.giphy.com/media/nUN6InE2CodRm/giphy.gif

What predicts which athletes will perform better than others? Specifically, we are interested in marathon runners, and in how the maximum distance run per week during training predicts the time it takes a runner to finish the race. For this, we will be looking at the marathon.csv file in the data/ folder.

Question 1.0
{points: 1}

Load the data and assign it to an object called marathon.

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon)
test_1.0()

Question 2.0
{points: 1}

We want to predict race time (time_hrs) given a particular value of maximum distance run per week during training (max). Let's take a subset of size 50 of our marathon data and assign it to an object called marathon_50. With this subset, create a scatterplot to assess the relationship between these two variables. Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called answer2. Discuss with your neighbour the relationship between race time and maximum distance run per week during training, based on the scatterplot you create below.

Hint: To take a subset of your data you can use the sample_n() function

set.seed(2000) ### DO NOT CHANGE

# marathon_50 <- ... %>%
#     sample_n(...)

# your code here
fail() # No Answer - remove if you provide an answer
answer2
test_2.0()

Question 3.0
{points: 1}

Suppose we want to predict the race time for someone who ran a maximum distance of 100 miles per week during training. In the plot below we can see that no one ran a maximum distance of exactly 100 miles per week. But if we are interested in prediction, how can we predict with this data? We can use k-nn regression! To do this, we take the $Y$ values (target/response variable) of the $k$ nearest observations and use their average as the prediction (a generic sketch of this pattern follows this question).

For this question, we want to predict race time based on the 4 closest neighbours to 100 miles per week during training.

Fill in the scaffolding below and assign your answer to an object named answer3.

options(repr.plot.height = 3, repr.plot.width = 3)
marathon_50 %>%
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(color = 'dodgerblue', alpha = 0.4) +
    geom_vline(xintercept = 100, linetype = "dotted") +
    xlab("Maximum Distance Ran per \n Week During Training (mi)") +
    ylab("Race Time (hours)") +
    geom_segment(aes(x = 100, y = 2.56, xend = 107, yend = 2.56), col = "orange") +
    geom_segment(aes(x = 100, y = 2.65, xend = 90, yend = 2.65), col = "orange") +
    geom_segment(aes(x = 100, y = 2.99, xend = 86, yend = 2.99), col = "orange") +
    geom_segment(aes(x = 100, y = 3.05, xend = 82, yend = 3.05), col = "orange")
# answer3 <- ... %>%
#     mutate(diff = abs(100 - ...)) %>%
#     ...(diff) %>%
#     head(...) %>% # Controls the k
#     summarise(predicted = ...(...)) %>%
#     unlist()

# your code here
fail() # No Answer - remove if you provide an answer
answer3
test_3.0()
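The scaffolding above follows a general dplyr pattern: compute each observation's distance to the query point, sort, keep the k closest, and average their response. A minimal sketch on a made-up data frame (toy_data, the query point 10, and k = 2 are all hypothetical):

# Toy data: predictor x and response y (values made up for illustration)
toy_data <- data.frame(x = c(6, 8, 11, 15), y = c(1.0, 1.4, 1.6, 2.2))

toy_prediction <- toy_data %>%
    mutate(diff = abs(10 - x)) %>% # distance from each x to the query point 10
    arrange(diff) %>%              # sort so the nearest observations come first
    head(2) %>%                    # keep the k = 2 nearest neighbours
    summarise(predicted = mean(y)) %>%
    unlist()
toy_prediction # 1.5, the mean of y at x = 11 and x = 8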

Question 4.0
{points: 1}

For this question, let's instead predict the race time based on the 2 closest neighbours to 100 miles per week during training.

Assign your answer to an object named answer4.

# your code here
fail() # No Answer - remove if you provide an answer
answer4
test_4.0()

Question 5.0 Multiple Choice:
{points: 1}

Now that you have done k-nearest neighbours predictions manually, which method would you use to choose $k$?

  • A) Choose the $k$ that excludes most outliers

  • B) Choose the $k$ with the lowest training error

  • C) Choose the $k$ with the lowest cross-validation error

  • D) Choose the $k$ that includes the most data points

  • E) Choose the $k$ with the lowest testing error

Assign your answer to an object called answer5.

# Assign your answer to an object called: answer5
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_5.0()

Question 6.0
{points: 1}

We have seen how to do k-nn regression manually; now we will apply it to the whole dataset using the caret package. For this we first need to create training and testing sets. Remember, we won't touch the test dataset until the end.

For this question, create an object called training_rows that contains the indices of the rows we will use for training.

Use 75% of the data as training data.

set.seed(2000) ### DO NOT CHANGE

# ... <- marathon %>%
#     select(max) %>%
#     unlist() %>%
#     createDataPartition(p = ..., list = FALSE)

# your code here
fail() # No Answer - remove if you provide an answer
head(training_rows)
test_6.0()
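If you haven't used it before, caret's createDataPartition returns row indices for a split that is stratified on the supplied vector. A minimal, generic sketch (the vector y here is hypothetical):

# Hypothetical numeric response vector
y <- runif(100)

# Row indices for a 75% training split; list = FALSE returns a one-column matrix
idx <- createDataPartition(y, p = 0.75, list = FALSE)
head(idx)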

Question 7.0
{points: 1}

Create the training and testing datasets by filling in the scaffolding below; the scaffolding for the training dataset is provided.

Assign your answers to objects called X_train, Y_train, X_test, and Y_test, respectively.

Hint: For the test dataset you can use the - sign inside the slice() function.

# X_train <- marathon %>%
#     select(...) %>%
#     slice(training_rows) %>%
#     data.frame()

# Y_train <- marathon %>%
#     select(...) %>%
#     slice(training_rows) %>%
#     unlist()

# your code here
fail() # No Answer - remove if you provide an answer
test_7.0()
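The hint about the - sign works because slice with negated row numbers drops those rows. A tiny generic sketch (df and idx are hypothetical):

df  <- data.frame(a = 1:5)
idx <- c(1, 3)

df %>% slice(idx)  # keeps rows 1 and 3 (e.g., training rows)
df %>% slice(-idx) # keeps all the other rows (e.g., test rows)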

Question 8.0
{points: 1}

Now that we have separated the data into training and testing sets, let's choose the $k$ for our $k$-nearest neighbours algorithm. We can do this using cross-validation, as we've seen before for k-nn classification. In this exercise we will do 3-fold cross-validation, searching for $k$ from 1 to 150 in steps of size 10. For this question, name your model object (the output from train) knn_cv (a sketch of one plausible train call follows this question).

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
knn_cv
test_8.0()
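Here is a minimal sketch of one plausible train call, assuming X_train and Y_train are as constructed in Question 7.0; treat it as a starting point rather than a definitive solution:

# 3-fold cross-validation over k = 1, 11, 21, ..., 141
train_control <- trainControl(method = "cv", number = 3)
k_grid        <- data.frame(k = seq(from = 1, to = 150, by = 10))

# With a numeric response, method = "knn" performs k-nn regression
knn_cv <- train(x = X_train, y = Y_train, method = "knn",
                tuneGrid = k_grid, trControl = train_control)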

Question 8.1
{points: 1}

Plot the results from cross-validation as a line-and-point plot with cross-validation error (as $RMSPE$) on the y-axis and $k$ on the x-axis. Name your plot object choosing_k.

# your code here
fail() # No Answer - remove if you provide an answer
choosing_k
test_8.1()
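caret stores the resampling summary in knn_cv$results, a data frame with one row per candidate k and the cross-validation error in its RMSE column, so one plausible sketch is:

choosing_k <- ggplot(knn_cv$results, aes(x = k, y = RMSE)) +
    geom_point() +
    geom_line() +
    xlab("k (number of neighbours)") +
    ylab("Cross-validation RMSPE")
choosing_k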

Question 8.2
{points: 1}

Report the best $k$ for k-nn regression for this data set. Save your answer as an object named best_k. We provide scaffolding to help you choose the $k$ from the long list that you came up with:

# best_k <- knn_cv$results %>%
#     filter(... == min(...)) %>%
#     select(...) %>%
#     unlist()

# your code here
fail() # No Answer - remove if you provide an answer
best_k
test_8.2()

Question 9.0
{points: 1}

Re-train your k-nn regression model with the best $k$ that you found in Question 8 using the entire training data set. Assign the model to an object called knn_model.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
knn_model
test_9.0()
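To fit a single model at a fixed k, caret accepts a one-row tuneGrid together with trainControl(method = "none"); a sketch under the same X_train/Y_train assumption:

# Fit one k-nn regression model at the chosen k, with no resampling
knn_model <- train(x = X_train, y = Y_train, method = "knn",
                   tuneGrid = data.frame(k = best_k), # the single k from Question 8.2
                   trControl = trainControl(method = "none"))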

Question 10.0
{points: 1}

Using knn_model, predict on the test data and save the result to an object called predictions.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
head(predictions)
test_10.0()

Question 11.0
{points: 1}

Now, with these predictions, calculate the test error as $RMSPE$ (how well the predictions on the test data match the true values in the test data set). Use the defaultSummary function to obtain the test error as $RMSPE$, and name the object it returns test_error.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
test_error
test_11.0()
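defaultSummary expects a data frame with columns named obs and pred, and returns RMSE and R squared (plus MAE in newer caret versions); a sketch assuming predictions is the vector from Question 10.0:

# Pair the true test values with the model's predictions;
# RMSE computed on held-out data plays the role of RMSPE here
test_error <- defaultSummary(data.frame(obs = Y_test, pred = predictions))
test_error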

Question 11.1
{points: 1}

The test error is larger than the cross-validation error for the best $k$: true or false? Save your answer as "true" or "false" and name it answer11.1.

# your code here
fail() # No Answer - remove if you provide an answer
test_11.1()

Question 11.2
{points: 1}

Given that $RMSPE$ is in the units of the target/response variable, the test error $RMSPE$ seems very large (and thus indicates that our predictions are likely not very good). True or false? Save your answer as "true" or "false" and name it answer11.2.

# your code here
fail() # No Answer - remove if you provide an answer
test_11.2()

Question 12.0
{points: 1}

Using the knn_model trained on the entire training set (from Question 9.0), predict across the range of values observed in the training data set. Store the predictions as a column named time_hrs in a data frame named full_predictions. That data frame should also have a column named max that contains the values you predicted across.

  • Use the min and max functions to find the upper and lower limits of predictor/explanatory variable values in the training data set.

  • Use the seq function to create the column called max that contains the values you would like to predict across.

set.seed(2019) # DO NOT CHANGE

# upper <- X_train %>%
#     select(max) %>%
#     max()

# lower <- ... %>%
#     ... %>%
#     ...

# ... <- data.frame(max = seq(from = ..., to = ..., by = 1))

# full_predictions <- ... %>%
#     mutate(... = predict(..., ...))

# your code here
fail() # No Answer - remove if you provide an answer
head(full_predictions)
round(sum(full_predictions$time_hrs))
test_12.0()

Question 13.0
{points: 1}

Plot these predictions as a blue line over the data points from the training set. To do this, you will have to create a single data frame containing the training data; one way is to combine X_train and Y_train using the bind_cols function. Name your plot predict_plot.

# your code here
fail() # No Answer - remove if you provide an answer
predict_plot
test_13.0()
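One plausible way to assemble predict_plot, assuming X_train is a one-column data frame and Y_train a numeric vector as in Question 7.0:

# Recombine the training predictors and response into a single data frame
training_data <- bind_cols(X_train, data.frame(time_hrs = Y_train))

predict_plot <- training_data %>%
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(alpha = 0.4) +
    geom_line(data = full_predictions, color = "blue") + # predictions from Question 12.0
    xlab("Maximum Distance Ran per Week During Training (mi)") +
    ylab("Race Time (hours)")
predict_plot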