UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_08/worksheet_08.ipynb
Kernel: R

Worksheet 8 - Regression

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

  • Recognize situations where a simple regression analysis would be appropriate for making predictions.

  • Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.

  • Interpret the output of a k-nn regression.

  • In a dataset with two variables, perform k-nearest neighbour regression in R using caret::knnregTrain() to predict the values for a test dataset.

  • Using R, execute cross-validation to choose the number of neighbours.

  • Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error, RMSE).

  • Describe advantages and disadvantages of the k-nearest neighbour regression approach.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)

Question 0.0

To predict a value of Y for a new observation using k-nn regression, we identify the k nearest neighbours and then:

A. Assign it the median of the k nearest neighbours as the predicted value

B. Assign it the mean of the k nearest neighbours as the predicted value

C. Assign it the mode of the k nearest neighbours as the predicted value

D. Assign it the majority vote of the k nearest neighbours as the predicted value

Save the letter of the answer you think is correct to a variable named answer0.0. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer0.0

test_that('Solution is correct', {
    expect_equal(digest(answer0.0), '3a5505c06543876fe45598b5e5e5195d')
})
print("Success!")

Question 0.1

The plot below is a very simple k-nn regression example, where the black dots are the data observations and the blue line shows the predictions from a k-nn regression model fit to this data with k = 2.

Using the formula for RMSE (given in the reading) and the graph below, calculate the RMSE for this model by hand (with pen and paper, or using R as a calculator). Estimate the values off the graph to one decimal place. Save your answer to a variable named answer0.1
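As a rough illustration of the calculation, the sketch below computes RMSE for a handful of made-up observed and predicted values (they are not read off the worksheet's plot). It assumes the usual definition of RMSE, the square root of the mean squared difference between observed and predicted values; the helper name rmse() is hypothetical.

```r
# Hypothetical observed values and model predictions (made up for illustration).
observed  <- c(1.0, 2.0, 3.0, 4.0)
predicted <- c(1.5, 2.0, 2.5, 4.0)

# RMSE: square the prediction errors, average them, then take the square root.
rmse <- function(obs, pred) {
    sqrt(mean((obs - pred)^2))
}

toy_rmse <- rmse(observed, predicted)
print(toy_rmse)
```

Because the errors are squared before averaging, large misses contribute more than small ones, and the result is in the same units as the response variable.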

# your code here
fail() # No Answer - remove if you provide an answer
answer0.1

test_that('Solution is correct', {
    expect_true(digest(round(answer0.1, 2)) %in% c('651ba44efc6a75d694ff482aae958ccc', '2a1ea47875e195a421d56ae3f6621d32'))
})
print("Success!")

Marathon training

Source: https://media.giphy.com/media/nUN6InE2CodRm/giphy.gif

What predicts which athletes will perform better than others? Specifically, we are interested in marathon runners, and in how well the maximum distance run per week during training predicts the time it takes a runner to finish the race. For this, we will use the marathon.csv file in the data/ folder.

Question 1.0

Load the data and assign it to an object called marathon.

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon)

test_that('Solution is correct', {
    expect_equal(nrow(marathon), 929)
    expect_equal(ncol(marathon), 13)
    expect_that("time_hrs" %in% colnames(marathon), is_true())
    expect_that("max" %in% colnames(marathon), is_true())
})
print("Success!")

Question 2.0

We want to predict race time (time_hrs) given a particular value of maximum distance run per week during training (max). Take a random subset of 50 rows from the marathon data and assign it to an object called marathon_50. With this subset, create a scatterplot to assess the relationship between these two variables. Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called answer2. Discuss with your neighbour the relationship between race time and maximum distance run per week during training based on the scatterplot you create below.

Hint: To take a subset of your data you can use the sample_n() function

set.seed(2000) ### DO NOT CHANGE

#marathon_50 <- ... %>%
#    sample_n(...)

# your code here
fail() # No Answer - remove if you provide an answer
answer2

test_that('Solution is incorrect', {
    expect_true(exists('marathon_50'))
    expect_equal(digest(as.character(rlang::get_expr(answer2$mapping$x))), '60f9c54cbd347e2956e968462f44c536')
    expect_equal(digest(as.character(rlang::get_expr(answer2$mapping$y))), 'b55efe8bd9491c88b50fd4d402fbde92')
    expect_equal(digest(class(rlang::get_expr(answer2$layers[[1]]$geom))[1]), '911e5b9debfb523f25ad2ccc01a4b2dd')
})
print("Success!")

Question 3.0

Suppose we want to predict the race time for someone who ran a maximum distance of 100 miles per week during training. In the plot below we can see that no one has run a maximum distance of 100 miles per week. If we are interested in prediction, how can we predict with this data? We can use k-nn regression: we take the Y values (target/response variable) of the k nearest observations, average them, and use that average as the prediction.
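The averaging step described above can be sketched in base R on a small made-up data set. The data frame toy, its columns x and y, and the helper knn_predict() are all hypothetical and are not part of the worksheet or of caret; this is only a minimal illustration of the idea, not the scaffolded answer.

```r
# Toy training data (hypothetical values, not the marathon data).
toy <- data.frame(
    x = c(10, 20, 30, 40, 50, 60),
    y = c(5.0, 4.5, 4.2, 3.9, 3.5, 3.4)
)

# Predict y at a new x by averaging the y values of the k nearest x values.
knn_predict <- function(data, new_x, k) {
    distances <- abs(data$x - new_x)          # distance of each row from new_x
    nearest <- order(distances)[seq_len(k)]   # row indices of the k closest rows
    mean(data$y[nearest])                     # average their responses
}

toy_prediction <- knn_predict(toy, new_x = 37, k = 2)
print(toy_prediction)
```

Here the two nearest observations to x = 37 are x = 40 and x = 30, so the prediction is the mean of their y values, 4.05.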

For this question we want to predict race time based on the 4 nearest neighbours of 100 miles per week during training.

Fill in the scaffolding below and assign your answer to an object named answer3.

options(repr.plot.height = 3, repr.plot.width = 3)

marathon_50 %>%
    ggplot(aes(x = max, y = time_hrs)) +
    geom_point(color = 'dodgerblue', alpha = 0.4) +
    geom_vline(xintercept = 100, linetype = "dotted") +
    xlab("Maximum Distance Ran per \n Week During Training (mi)") +
    ylab("Race Time (hours)") +
    geom_segment(aes(x = 100, y = 2.57, xend = 80, yend = 2.57), col = "orange") +
    geom_segment(aes(x = 100, y = 2.87, xend = 75, yend = 2.87), col = "orange") +
    geom_segment(aes(x = 100, y = 2.61, xend = 110, yend = 2.61), col = "orange") +
    geom_segment(aes(x = 100, y = 2.93, xend = 105, yend = 2.93), col = "orange")

#answer3 <- ... %>%
#    mutate(diff = abs(100 - ...)) %>%
#    ...(diff) %>%
#    head(...) %>% #Controls the K
#    summarise(predicted = ...(...)) %>%
#    unlist()

# your code here
fail() # No Answer - remove if you provide an answer
answer3

test_that('Solution is correct', {
    expect_true(exists('answer3'))
    expect_equal(digest(as.numeric(answer3)), '485826410541b8deebb40b4fb731ca15')
})
print("Success!")

Question 4.0

For this question, let's instead predict the race time based on the 2 nearest neighbours of 100 miles per week during training.

Assign your answer to an object named answer4.

# your code here
fail() # No Answer - remove if you provide an answer
answer4

test_that('Solution is correct', {
    expect_true(exists('answer4'))
    expect_equal(digest(as.numeric(answer4)), '0d7b5b246b9c97983f281a717db02df2')
})
print("Success!")

Question 5.0 Multiple Choice

Now that you have done k-nearest neighbours predictions manually, which method would you use to choose k?

  • A) Choose the k that excludes most outliers

  • B) Choose the k with the lowest training error

  • C) Choose the k with the lowest cross-validation error

  • D) Choose the k that includes the most data points

  • E) Choose the k with the lowest testing error

Assign your answer to an object called answer5

# Assign your answer to an object called: answer5
# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is correct', {
    expect_that(exists('answer5'), is_true())
    expect_equal(digest(answer5), '475bf9280aab63a82af60791302736f6')
})
print("Success!")

Question 6.0

We have seen how to do k-nn regression manually. Now we will apply it to the whole dataset using the caret package. For this we first need to create training and testing sets. Remember that we won't touch the test dataset until the end.

For this question create an object called training_rows that includes the indexes of the rows we will use.

Use 75% of the data as training data
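To illustrate what a 75/25 split looks like, here is a minimal sketch on a made-up data frame. It deliberately uses base R's sample() as a stand-in rather than caret's createDataPartition(), which the question asks for; createDataPartition() additionally tries to keep the distribution of the outcome similar across the two sets, so its row counts can differ slightly from a plain random split. The names toy, train_idx, toy_train and toy_test are hypothetical.

```r
set.seed(1)  # arbitrary seed so the toy split is reproducible

# Toy data set with 20 rows (hypothetical values).
toy <- data.frame(x = 1:20, y = (1:20) / 2)

# Draw 75% of the row indices at random for training.
n_train <- floor(0.75 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]   # negative indexing keeps the remaining rows

print(dim(toy_train))
print(dim(toy_test))
```

Every row ends up in exactly one of the two sets, which is what lets the test set act as unseen data later on.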

set.seed(2000) ### DO NOT CHANGE

#... <- marathon %>%
#    select(...) %>%
#    unlist() %>%
#    createDataPartition(p = ..., list = FALSE)

# your code here
fail() # No Answer - remove if you provide an answer
head(training_rows)

test_that('Solution is correct', {
    expect_equal(nrow(training_rows), 698)
    expect_equal(ncol(training_rows), 1)
})
print("Success!")

Question 7.0

Create the training and testing datasets. The scaffolding for the training dataset is given below.

Assign your answers to objects called X_train, Y_train, X_test, and Y_test, respectively.

Hint: For the test dataset you can use the - sign inside the slice() function.

#X_train <- marathon %>%
#    select(...) %>%
#    slice(training_rows) %>%
#    data.frame()
#Y_train <- marathon %>%
#    select(...) %>%
#    slice(training_rows) %>%
#    unlist()

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is correct', {
    expect_equal(dim(X_train), c(698, 1))
    expect_equal(class(X_train), 'data.frame')
    expect_equal(dim(X_test), c(231, 1))
    expect_equal(class(X_test), 'data.frame')
    expect_equal(length(Y_train), 698)
    expect_equal(class(Y_train), 'numeric')
    expect_equal(length(Y_test), 231)
    expect_equal(class(Y_test), 'numeric')
})
print("Success!")

Question 8.0

Now that we have separated the data into training and testing sets, let's choose the k for our k-nearest neighbours algorithm. We can do this using cross-validation, as we've seen before for k-nn classification. In this exercise we will do 3-fold cross-validation, searching over values of k from 1 to 250. For this question, name your model object (output from train) knn_cv.
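To make the idea of 3-fold cross-validation concrete, the sketch below carries it out by hand in base R on made-up data, using a hand-rolled k-nn predictor instead of caret. All names (toy, knn_predict(), folds, candidate_k, cv_results, toy_best_k) are hypothetical, and the grid of k values is kept small for the toy data; caret's train() automates a similar procedure, so this is an illustration of the mechanics rather than the expected answer.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example

# Toy data (hypothetical): a noisy decreasing trend.
toy <- data.frame(x = 1:30)
toy$y <- 5 - 0.1 * toy$x + rnorm(30, sd = 0.2)

# Hand-rolled k-nn regression prediction for a vector of new x values.
knn_predict <- function(train, new_x, k) {
    sapply(new_x, function(x0) {
        nearest <- order(abs(train$x - x0))[seq_len(k)]
        mean(train$y[nearest])
    })
}

# Randomly assign each row to one of 3 folds of equal size.
folds <- sample(rep(1:3, length.out = nrow(toy)))
candidate_k <- 1:10

# For each k, hold out each fold in turn, predict it from the other two,
# and average the resulting RMSE values.
cv_rmse <- sapply(candidate_k, function(k) {
    fold_rmse <- sapply(1:3, function(f) {
        train <- toy[folds != f, ]
        held_out <- toy[folds == f, ]
        pred <- knn_predict(train, held_out$x, k)
        sqrt(mean((held_out$y - pred)^2))
    })
    mean(fold_rmse)
})

cv_results <- data.frame(k = candidate_k, RMSE = cv_rmse)
toy_best_k <- cv_results$k[which.min(cv_results$RMSE)]
print(cv_results)
print(toy_best_k)
```

The k with the smallest average held-out RMSE is the one cross-validation would select; note that the test set plays no part in this choice.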

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
knn_cv

test_that('Solution is correct', {
    expect_equal(digest(as.numeric(train_control$number)), 'e5b57f323c7b3719bbaaf9f96b260d39')
    expect_equal(digest(train_control$method), '54c51511b5d01c4f13f8b56316886833')
    expect_equal(digest(as.integer(sum(knn_cv$results$k))), '2c3292a23a8e95227c4d2aaf87d7da65')
})
print("Success!")

Question 8.1

Plot the results from cross-validation as a line and point plot with cross-validation error (as RMSE) on the y-axis and k on the x-axis. Name your plot object choosing_k.

# your code here
fail() # No Answer - remove if you provide an answer
choosing_k

test_that('Solution is incorrect', {
    expect_equal(as.character(rlang::get_expr(choosing_k$mapping$x)), 'k')
    expect_equal(as.character(rlang::get_expr(choosing_k$mapping$y)), 'RMSE')
    expect_true('GeomLine' %in% c(class(rlang::get_expr(choosing_k$layers[[1]]$geom)), class(rlang::get_expr(choosing_k$layers[[2]]$geom))))
    expect_true('GeomPoint' %in% c(class(rlang::get_expr(choosing_k$layers[[1]]$geom)), class(rlang::get_expr(choosing_k$layers[[2]]$geom))))
})
print("Success!")

Question 8.2

Report the best k for k-nn regression for this data set. Save your answer as an object named best_k. We provide scaffolding to help you choose the k from the long list that you came up with:

#best_k <- knn_cv$results %>%
#    filter(... == min(...)) %>%
#    select(..) %>%
#    unlist()

# your code here
fail() # No Answer - remove if you provide an answer
best_k

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(best_k)), 'f67fbc496dfabdb88e8a3761809759ab')
})
print("Success!")

Question 8.3

Our test error for k = 75 is 0.5687047, true or false? Save your answer as "true" or "false" and name it answer8.3

# your code here
fail() # No Answer - remove if you provide an answer
answer8.3

test_that('Solution is incorrect', {
    expect_equal(digest(answer8.3), 'd2a90307aac5ae8d0ef58e2fe730d38b')
})
print("Success!")

Question 9.0

Re-train your k-nn regression model with the best k that you found in Question 8 using the entire training data set. Assign the model to an object called knn_model.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
knn_model

test_that('Solution is correct', {
    expect_that(exists('train_control'), is_true())
    expect_equal(as.integer(knn_model$results$k), 75)
    expect_equal(knn_model$method, 'knn')
    expect_equal(colnames(knn_model$trainingData), c('max', '.outcome'))
    expect_equal(dim(knn_model$trainingData), c(698, 2))
})
print("Success!")

Question 10.0

Using the knn_model, predict the test data and save it to an object called predictions.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
head(predictions)

test_that('Solution is correct', {
    expect_true(class(predictions) == 'numeric')
    expect_equal(length(predictions), 231)
})
print("Success!")

Question 11.0

Now, with these predictions, calculate the test error as RMSE (how well the predictions on the test data match the true values of the test data set). Use the defaultSummary function to obtain the test error as RMSE, and name the object returned from it test_error.

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
test_error

test_that('Solution is correct', {
    expect_equal(digest(as.numeric(round(test_error['RMSE'], 3))), '30dc801401d69f591184aaaae5bfb987')
})
print("Success!")

Question 11.1

The test error (as measured by RMSE) is larger than the cross-validation error for the best k, true or false? Save your answer as "true" or "false" and name it answer11.1

# your code here
fail() # No Answer - remove if you provide an answer
answer11.1

test_that('Solution is correct', {
    expect_equal(digest(answer11.1), '05ca18b596514af73f6880309a21b5dd')
})
print("Success!")

Question 11.2

Given that RMSE is in the units of the target/response variable, the test error RMSE seems very large (and thus indicates that our predictions are likely not very good). True or false? Save your answer as "true" or "false" and name it answer11.2

# your code here
fail() # No Answer - remove if you provide an answer
answer11.2

test_that('Solution is correct', {
    expect_equal(digest(answer11.2), 'd2a90307aac5ae8d0ef58e2fe730d38b')
})
print("Success!")

Question 12.0

Using the knn_model trained on the entire training set (from Question 9.0), predict across the range of values observed in the training data set. Store the predictions as a column named time_hrs in a data frame named full_predictions. That data frame should also have a column named max that contains the values you predicted across.

  • Use the min and max functions to find the lower and upper limits of predictor/explanatory variable values in the training data set.

  • Use the seq function to create the column called max that contains the values you would like to predict across.
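The two steps in the list above (finding the limits, then building a grid with seq) can be sketched on made-up data as follows. This example uses a hand-rolled k-nn predictor in base R rather than the caret model, and the names toy, knn_predict() and grid are hypothetical; it only illustrates the pattern of predicting across an evenly spaced range.

```r
# Toy training data (hypothetical values).
toy <- data.frame(
    x = c(12, 18, 25, 31, 40, 47),
    y = c(4.8, 4.6, 4.1, 3.9, 3.6, 3.3)
)

# Hand-rolled k-nn regression prediction for a vector of new x values.
knn_predict <- function(train, new_x, k) {
    sapply(new_x, function(x0) {
        nearest <- order(abs(train$x - x0))[seq_len(k)]
        mean(train$y[nearest])
    })
}

# Build an evenly spaced grid from the smallest to the largest observed x.
lower <- min(toy$x)
upper <- max(toy$x)
grid <- data.frame(x = seq(from = lower, to = upper, by = 1))

# Predict at every grid value and store the result alongside the grid.
grid$y <- knn_predict(toy, grid$x, k = 2)

print(head(grid))
```

Because each prediction is an average of observed responses, the predicted values never fall outside the range of y seen in the training data.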

#upper <- X_train %>%
#    select(max) %>%
#    max()
#lower <- ... %>%
#    ... %>%
#    ...
#... <- data.frame(max = seq(from = ..., to = ..., by = 1))
#full_predictions <- ... %>%
#    mutate(... = predict(..., ...))

# your code here
fail() # No Answer - remove if you provide an answer
head(full_predictions)

test_that('Solution is correct', {
    expect_true('time_hrs' %in% colnames(full_predictions))
    expect_true('max' %in% colnames(full_predictions))
})
print("Success!")

Question 13.0

Plot these predictions as a blue line over the data points from the training set. You will have to create a single data frame containing the training data set to do this. One way you can do this is by combining X_train and Y_train using the bind_cols function. Name your plot predict_plot.

# your code here
fail() # No Answer - remove if you provide an answer
predict_plot

test_that('Solution is incorrect', {
    expect_equal(as.character(rlang::get_expr(predict_plot$mapping$x)), 'max')
    expect_equal(as.character(rlang::get_expr(predict_plot$mapping$y)), 'time_hrs')
    expect_true('GeomLine' %in% c(class(rlang::get_expr(predict_plot$layers[[1]]$geom)), class(rlang::get_expr(predict_plot$layers[[2]]$geom))))
    expect_true('GeomPoint' %in% c(class(rlang::get_expr(predict_plot$layers[[1]]$geom)), class(rlang::get_expr(predict_plot$layers[[2]]$geom))))
})
print("Success!")