Kernel: R

Worksheet 8 - Regression

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

  • Recognize situations where a simple regression analysis would be appropriate for making predictions.

  • Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.

  • Interpret the output of a k-nn regression.

  • In a dataset with two variables, perform k-nearest neighbour regression in R using caret::knnregTrain() to predict the values for a test dataset.

  • Using R, execute cross-validation to choose the number of neighbours.

  • Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error, RMSPE).

  • In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).

  • Describe advantages and disadvantages of the k-nearest neighbour regression approach.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
source("tests_worksheet_08.R")
source('cleanup_worksheet_08.R')

Question 0.0 Multiple Choice:
{points: 1}

To predict a value for $Y$ for a new observation using k-nn regression, we identify the $k$ nearest neighbours and then:

A. Assign it the median of the $k$ nearest neighbours as the predicted value

B. Assign it the mean of the $k$ nearest neighbours as the predicted value

C. Assign it the mode of the $k$ nearest neighbours as the predicted value

D. Assign it the majority vote of the $k$ nearest neighbours as the predicted value

Save the letter of the answer you think is correct to a variable named answer0.0. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_0.0()
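Whichever summary of the neighbours you settle on, the computation in R is a one-liner. A minimal sketch with four made-up neighbour response values (race times in hours, purely for illustration):

# Four hypothetical neighbour response values (race times in hours)
neighbour_times <- c(2.56, 2.65, 2.99, 3.05)

mean(neighbour_times)   # 2.8125
median(neighbour_times) # 2.82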

Question 0.1 Multiple Choice:
{points: 1}

Of those shown below, which is the correct formula for RMSPE?

A. $RMSPE = \sqrt{\frac{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}{1 - n}}$

B. $RMSPE = \sqrt{\frac{1}{n - 1}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

C. $RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

D. $RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})}$

Save the letter of the answer you think is correct to a variable named answer0.1. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_0.1()
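For concreteness, a root-mean-squared-style error can be computed directly in R. A minimal sketch with made-up observed and predicted values (y and y_hat are hypothetical):

# Made-up observed values and model predictions, for illustration only
y     <- c(2.0, 2.5, 3.0)
y_hat <- c(2.2, 2.4, 3.3)

# Square the errors, average them, then take the square root
sqrt(mean((y - y_hat)^2)) # approximately 0.216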

Question 0.2
{points: 1}

The plot below is a very simple k-nn regression example, where the black dots are the data observations and the blue line shows the predictions from a k-nn regression model created from this data with $k=2$.

Using the formula for RMSE/RMSPE (given in the reading) and the graph below, calculate RMSE for this model by hand (pen and paper, or use R as a calculator). Use one decimal place of precision when reading off the heights of the black dots and the blue line. Save your answer to a variable named answer0.2.

# your code here
fail() # No Answer - remove if you provide an answer
answer0.2
test_0.2()

RMSPE Definition

Question 0.3 Multiple Choice:
{points: 1}

What does RMSPE stand for?

A. root mean squared prediction error

B. root mean squared percentage error

C. root mean squared performance error

D. root mean squared preference error

Save the letter of the answer you think is correct to a variable named answer0.3. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_0.3()

Marathon Training

Source: https://media.giphy.com/media/nUN6InE2CodRm/giphy.gif

What predicts which athletes will perform better than others? Specifically, we are interested in marathon runners, and in how the maximum distance run per week during training predicts the time it takes a runner to finish the race. For this, we will be looking at the marathon.csv file in the data/ folder.

Question 1.0
{points: 1}

Load the data and assign it to an object called marathon.

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon)
test_1.0()

Question 2.0
{points: 1}

We want to predict race time (time_hrs) given a particular value of maximum distance run per week during training (max). Let's take a subset of size 50 of our marathon data and assign it to an object called marathon_50. With this subset, create a scatterplot to assess the relationship between these two variables. Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called answer2. Discuss with your neighbour the relationship between race time and maximum distance run per week during training, based on the scatterplot you create below.

Hint: To take a subset of your data you can use the sample_n() function

set.seed(2000) ### DO NOT CHANGE

# marathon_50 <- ... %>%
#     sample_n(...)

# your code here
fail() # No Answer - remove if you provide an answer
answer2
test_2.0()

Question 3.0
{points: 1}

Suppose we want to predict the race time for someone who ran a maximum distance of 100 miles per week during training. In the plot below we can see that no one ran a maximum distance of exactly 100 miles per week. But if we are interested in prediction, how can we predict with this data? We can use k-nn regression! To do this, we take the $Y$ values (target/response variable) of the $k$ nearest observations and use their average as the prediction (a generic sketch of this pattern follows this question).

For this question, we want to predict race time based on the 4 closest neighbours to 100 miles per week during training.

Fill in the scaffolding below and assign your answer to an object named answer3.

options(repr.plot.height = 3, repr.plot.width = 3)
marathon_50 %>%
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(color = 'dodgerblue', alpha = 0.4) +
    geom_vline(xintercept = 100, linetype = "dotted") +
    xlab("Maximum Distance Ran per \n Week During Training (mi)") +
    ylab("Race Time (hours)") +
    geom_segment(aes(x = 100, y = 2.56, xend = 107, yend = 2.56), col = "orange") +
    geom_segment(aes(x = 100, y = 2.65, xend = 90, yend = 2.65), col = "orange") +
    geom_segment(aes(x = 100, y = 2.99, xend = 86, yend = 2.99), col = "orange") +
    geom_segment(aes(x = 100, y = 3.05, xend = 82, yend = 3.05), col = "orange")
# answer3 <- ... %>%
#     mutate(diff = abs(100 - ...)) %>%
#     ...(diff) %>%
#     head(...) %>% # Controls the k
#     summarise(predicted = ...(...)) %>%
#     unlist()

# your code here
fail() # No Answer - remove if you provide an answer
answer3
test_3.0()
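The scaffolding above follows a general dplyr pattern: compute each observation's distance to the query point, sort, keep the k closest, and average their response. A minimal sketch on a made-up data frame (toy_data, the query point 10, and k = 2 are all hypothetical):

# Toy data: predictor x and response y (values made up for illustration)
toy_data <- data.frame(x = c(6, 8, 11, 15), y = c(1.0, 1.4, 1.6, 2.2))

toy_prediction <- toy_data %>%
    mutate(diff = abs(10 - x)) %>% # distance from each x to the query point 10
    arrange(diff) %>%              # sort so the nearest observations come first
    head(2) %>%                    # keep the k = 2 nearest neighbours
    summarise(predicted = mean(y)) %>%
    unlist()
toy_prediction # 1.5, the mean of y at x = 11 and x = 8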

Question 4.0
{points: 1}

For this question, let's instead predict the race time based on the 2 closest neighbours to 100 miles per week during training.

Assign your answer to an object named answer4.

# your code here
fail() # No Answer - remove if you provide an answer
answer4
test_4.0()

Question 5.0 Multiple Choice:
{points: 1}

Now that you have done k-nearest neighbours predictions manually, which method would you use to choose $k$?

  • A) Choose the $k$ that excludes most outliers

  • B) Choose the $k$ with the lowest training error

  • C) Choose the $k$ with the lowest cross-validation error

  • D) Choose the $k$ that includes the most data points

  • E) Choose the $k$ with the lowest testing error

Assign your answer to an object called answer5.

# Assign your answer to an object called: answer5
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_5.0()

Question 6.0
{points: 1}

We have seen how to do k-nn regression manually; now we will apply it to the whole dataset using the caret package. For this we first need to create training and testing sets. Remember, we won't touch the test dataset until the end.

For this question, create an object called training_rows that contains the indices of the rows we will use for training.

Use 75% of the data as training data.

set.seed(2000) ### DO NOT CHANGE

# ... <- marathon %>%
#     select(max) %>%
#     unlist() %>%
#     createDataPartition(p = ..., list = FALSE)

# your code here
fail() # No Answer - remove if you provide an answer
head(training_rows)
test_6.0()
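If you haven't used it before, caret's createDataPartition returns row indices for a split that is stratified on the supplied vector. A minimal, generic sketch (the vector y here is hypothetical):

# Hypothetical numeric response vector
y <- runif(100)

# Row indices for a 75% training split; list = FALSE returns a one-column matrix
idx <- createDataPartition(y, p = 0.75, list = FALSE)
head(idx)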

Question 7.0
{points: 1}

Create the training and testing datasets by filling in the scaffolding below; the scaffolding for the training dataset is provided.

Assign your answers to objects called X_train, Y_train, X_test, and Y_test, respectively.

Hint: For the test dataset you can use the - sign inside the slice() function.

# X_train <- marathon %>%
#     select(...) %>%
#     slice(training_rows) %>%
#     data.frame()

# Y_train <- marathon %>%
#     select(...) %>%
#     slice(training_rows) %>%
#     unlist()

# your code here
fail() # No Answer - remove if you provide an answer
test_7.0()
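The hint about the - sign works because slice with negated row numbers drops those rows. A tiny generic sketch (df and idx are hypothetical):

df  <- data.frame(a = 1:5)
idx <- c(1, 3)

df %>% slice(idx)  # keeps rows 1 and 3 (e.g., training rows)
df %>% slice(-idx) # keeps all the other rows (e.g., test rows)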

Question 8.0
{points: 1}

Now that we have separated the data into training and testing sets, let's choose the $k$ for our $k$-nearest neighbours algorithm. We can do this using cross-validation, as we've seen before for k-nn classification. In this exercise we will do 3-fold cross-validation, searching for $k$ from 1 to 150 in steps of size 10. For this question, name your model object (the output from train) knn_cv (a sketch of one plausible train call follows this question).

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
knn_cv
test_8.0()
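Here is a minimal sketch of one plausible train call, assuming X_train and Y_train are as constructed in Question 7.0; treat it as a starting point rather than a definitive solution:

# 3-fold cross-validation over k = 1, 11, 21, ..., 141
train_control <- trainControl(method = "cv", number = 3)
k_grid        <- data.frame(k = seq(from = 1, to = 150, by = 10))

# With a numeric response, method = "knn" performs k-nn regression
knn_cv <- train(x = X_train, y = Y_train, method = "knn",
                tuneGrid = k_grid, trControl = train_control)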

Question 8.1
{points: 1}

Plot the results from cross-validation as a line-and-point plot with cross-validation error (as $RMSPE$) on the y-axis and $k$ on the x-axis. Name your plot object choosing_k.

# your code here
fail() # No Answer - remove if you provide an answer
choosing_k
test_8.1()
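caret stores the resampling summary in knn_cv$results, a data frame with one row per candidate k and the cross-validation error in its RMSE column, so one plausible sketch is:

choosing_k <- ggplot(knn_cv$results, aes(x = k, y = RMSE)) +
    geom_point() +
    geom_line() +
    xlab("k (number of neighbours)") +
    ylab("Cross-validation RMSPE")
choosing_k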

Question 8.2
{points: 1}

Report the best $k$ for k-nn regression for this data set. Save your answer as an object named best_k. We provide scaffolding to help you choose the $k$ from the long list that you came up with:

# best_k <- knn_cv$results %>%
#     filter(... == min(...)) %>%
#     select(...) %>%
#     unlist()

# your code here
fail() # No Answer - remove if you provide an answer
best_k
test_8.2()

Question 9.0
{points: 1}

Re-train your k-nn regression model with the best $k$ that you found in Question 8 using the entire training data set. Assign the model to an object called knn_model.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
knn_model
test_9.0()
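To fit a single model at a fixed k, caret accepts a one-row tuneGrid together with trainControl(method = "none"); a sketch under the same X_train/Y_train assumption:

# Fit one k-nn regression model at the chosen k, with no resampling
knn_model <- train(x = X_train, y = Y_train, method = "knn",
                   tuneGrid = data.frame(k = best_k), # the single k from Question 8.2
                   trControl = trainControl(method = "none"))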

Question 10.0
{points: 1}

Using knn_model, predict on the test data and save the result to an object called predictions.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
head(predictions)
test_10.0()

Question 11.0
{points: 1}

Now, with these predictions, calculate the test error as $RMSPE$ (how well the predictions on the test data match the true values in the test data set). Use the defaultSummary function to obtain the test error as $RMSPE$, and name the object it returns test_error.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
test_error
test_11.0()
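defaultSummary expects a data frame with columns named obs and pred, and returns RMSE and R squared (plus MAE in newer caret versions); a sketch assuming predictions is the vector from Question 10.0:

# Pair the true test values with the model's predictions;
# RMSE computed on held-out data plays the role of RMSPE here
test_error <- defaultSummary(data.frame(obs = Y_test, pred = predictions))
test_error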

Question 11.1
{points: 1}

The test error is larger than the cross-validation error for the best $k$: true or false? Save your answer as "true" or "false" and name it answer11.1.

# your code here
fail() # No Answer - remove if you provide an answer
test_11.1()

Question 11.2
{points: 1}

Given that $RMSPE$ is in the units of the target/response variable, the test error $RMSPE$ seems very large (and thus indicates that our predictions are likely not very good). True or false? Save your answer as "true" or "false" and name it answer11.2.

# your code here
fail() # No Answer - remove if you provide an answer
test_11.2()

Question 12.0
{points: 1}

Using the knn_model trained on the entire training set (from Question 9.0), predict across the range of values observed in the training data set. Store the predictions as a column named time_hrs in a data frame named full_predictions. That data frame should also have a column named max that contains the values you predicted across.

  • Use the min and max functions to find the upper and lower limits of predictor/explanatory variable values in the training data set.

  • Use the seq function to create the column called max that contains the values you would like to predict across.

set.seed(2019) # DO NOT CHANGE

# upper <- X_train %>%
#     select(max) %>%
#     max()

# lower <- ... %>%
#     ... %>%
#     ...

# ... <- data.frame(max = seq(from = ..., to = ..., by = 1))

# full_predictions <- ... %>%
#     mutate(... = predict(..., ...))

# your code here
fail() # No Answer - remove if you provide an answer
head(full_predictions)
round(sum(full_predictions$time_hrs))
test_12.0()

Question 13.0
{points: 1}

Plot these predictions as a blue line over the data points from the training set. To do this, you will have to create a single data frame containing the training data; one way is to combine X_train and Y_train using the bind_cols function. Name your plot predict_plot.

# your code here
fail() # No Answer - remove if you provide an answer
predict_plot
test_13.0()
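One plausible way to assemble predict_plot, assuming X_train is a one-column data frame and Y_train a numeric vector as in Question 7.0:

# Recombine the training predictors and response into a single data frame
training_data <- bind_cols(X_train, data.frame(time_hrs = Y_train))

predict_plot <- training_data %>%
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(alpha = 0.4) +
    geom_line(data = full_predictions, color = "blue") + # predictions from Question 12.0
    xlab("Maximum Distance Ran per Week During Training (mi)") +
    ylab("Race Time (hours)")
predict_plot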