GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_08/worksheet_08.ipynb

Kernel: R

Worksheet 8 - Regression

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

Recognize situations where a simple regression analysis would be appropriate for making predictions.
Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
Interpret the output of a k-nn regression.
In a dataset with two variables, perform k-nearest neighbour regression in R using caret::knnregTrain() to predict the values for a test dataset.
Using R, execute cross-validation to choose the number of neighbours.
Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error, $RMSE$).
Describe advantages and disadvantages of the k-nearest neighbour regression approach.

In [ ]:

### Run this cell before continuing.

library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)

Question 0.0

To predict a value for $Y$ for a new observation using k-nn regression, we identify the $k$-nearest neighbours and then:

A. Assign it the median of the $k$-nearest neighbours as the predicted value

B. Assign it the mean of the $k$-nearest neighbours as the predicted value

C. Assign it the mode of the $k$-nearest neighbours as the predicted value

D. Assign it the majority vote of the $k$-nearest neighbours as the predicted value

Save the letter of the answer you think is correct to a variable named answer0.0. Make sure you put quotation marks around the letter and pay attention to case.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer0.0

In [ ]:

test_that('Solution is correct', {
    expect_equal(digest(answer0.0), '3a5505c06543876fe45598b5e5e5195d')
})
print("Success!")

Question 0.1

The plot below is a very simple k-nn regression example, where the black dots are the data observations and the blue line shows the predictions from a k-nn regression model fit to this data with $k = 2$.

Using the formula for $RMSE$ (given in the reading) and the graph below, calculate $RMSE$ for this model by hand (with pen and paper, or using R as a calculator). Estimate the values off the graph to one decimal place. Save your answer to a variable named answer0.1.
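As a reminder of the arithmetic, here is a minimal sketch assuming the standard definition $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The observed and predicted values are made up for illustration and are not read from the plot in this question.

```r
# Hypothetical observed values and model predictions (NOT taken from the worksheet's plot).
observed  <- c(2.0, 3.5, 4.0, 5.5)
predicted <- c(2.5, 3.0, 4.5, 5.0)

# RMSE: square the residuals, average them, then take the square root.
rmse <- sqrt(mean((observed - predicted)^2))
rmse  # 0.5 for these made-up numbers
```

Every residual here is 0.5 in absolute value, so the mean squared error is 0.25 and its square root is 0.5.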

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer0.1

In [ ]:

test_that('Solution is correct', {
    expect_true(digest(round(answer0.1, 2)) %in% c('651ba44efc6a75d694ff482aae958ccc', '2a1ea47875e195a421d56ae3f6621d32'))
})
print("Success!")

Marathon training

Source: https://media.giphy.com/media/nUN6InE2CodRm/giphy.gif

What predicts which athletes will perform better than others? Specifically, we are interested in marathon runners, and in how well the maximum distance run per week during training predicts the time it takes a runner to finish the race. For this, we will be looking at the marathon.csv file in the data/ folder.

Question 1.0

Load the data and assign it to an object called marathon.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon)

In [ ]:

test_that('Solution is correct', {
    expect_equal(nrow(marathon), 929)
    expect_equal(ncol(marathon), 13)
    expect_true("time_hrs" %in% colnames(marathon))
    expect_true("max" %in% colnames(marathon))
})
print("Success!")

Question 2.0

We want to predict race time (time_hrs) given a particular value of maximum distance run per week during training (max). Let's take a subset of size 50 of our marathon data and assign it to an object called marathon_50. With this subset, create a scatterplot to assess the relationship between these two variables. Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called answer2. Discuss with your neighbour the relationship between race time and maximum distance run per week during training based on the scatterplot you create below.

Hint: To take a subset of your data you can use the sample_n() function.

In [ ]:

set.seed(2000) ### DO NOT CHANGE

#marathon_50 <- ... %>%
#    sample_n(...)

# your code here
fail() # No Answer - remove if you provide an answer
answer2

In [ ]:

test_that('Solution is incorrect', {
    expect_true(exists('marathon_50'))
    expect_equal(digest(as.character(rlang::get_expr(answer2$mapping$x))) , '60f9c54cbd347e2956e968462f44c536')
    expect_equal(digest(as.character(rlang::get_expr(answer2$mapping$y))) , 'b55efe8bd9491c88b50fd4d402fbde92')
    expect_equal(digest(class(rlang::get_expr(answer2$layers[[1]]$geom))[1]), '911e5b9debfb523f25ad2ccc01a4b2dd')
    })
print("Success!")

Question 3.0

Suppose we want to predict the race time for someone who ran a maximum distance of 100 miles per week during training. In the plot below we can see that no one has run a maximum distance of exactly 100 miles per week. So how can we make a prediction from this data? We can use k-nn regression: we find the $k$ observations whose predictor values are nearest to the new value, take the average of their $Y$ values (target/response variable), and use that average as the prediction.

For this question we want to predict race time based on the 4 closest neighbours to a maximum training distance of 100 miles per week.
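The idea can be sketched on a small made-up data set in base R. The data frame `toy` and the helper `knn_predict_1d` are hypothetical names introduced only for this illustration; the worksheet's scaffolding uses dplyr verbs on marathon_50 instead.

```r
# Hypothetical toy data: weekly training distance (miles) and race time (hours).
toy <- data.frame(
  max      = c(60, 75, 80, 105, 110, 130),
  time_hrs = c(3.4, 2.9, 2.6, 2.9, 2.6, 2.5)
)

# Predict the response at `new_x` as the mean response of the k nearest rows.
knn_predict_1d <- function(data, new_x, k) {
  distance <- abs(data$max - new_x)        # distance in the single predictor
  nearest  <- order(distance)[seq_len(k)]  # row indices of the k closest points
  mean(data$time_hrs[nearest])             # average their response values
}

knn_predict_1d(toy, new_x = 90, k = 3)  # about 2.8 for this toy data
```

With `new_x = 90` the three nearest rows have max values 80, 75 and 105, so the prediction is the mean of 2.6, 2.9 and 2.9. Note that `order()` breaks ties by row position, which is one of several reasonable choices.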

Fill in the scaffolding below and assign your answer to an object named answer3.

In [ ]:

options(repr.plot.height = 3, repr.plot.width = 3)

marathon_50 %>%
    ggplot(aes(x = max, y = time_hrs)) + 
        geom_point(color = 'dodgerblue', alpha = 0.4) +
        geom_vline(xintercept = 100, linetype = "dotted") +
        xlab("Maximum Distance Ran per \n Week During Training (mi)") +
        ylab("Race Time (hours)") + 
        geom_segment(aes(x = 100, y = 2.57, xend = 80, yend = 2.57), col = "orange") +
        geom_segment(aes(x = 100, y = 2.87, xend = 75, yend = 2.87), col = "orange") +
        geom_segment(aes(x = 100, y = 2.61, xend = 110, yend = 2.61), col = "orange") +
        geom_segment(aes(x = 100, y = 2.93, xend = 105, yend = 2.93), col = "orange")

In [ ]:

#answer3 <- ... %>% 
#  mutate(diff = abs(100 - ...)) %>% 
#  ...(diff) %>% 
#  head(...) %>%  #Controls the K
#  summarise(predicted = ...(...)) %>%
#  unlist()

# your code here
fail() # No Answer - remove if you provide an answer
answer3

In [ ]:

test_that('Solution is correct', {
    expect_true(exists('answer3'))
    expect_equal(digest(as.numeric(answer3)), '485826410541b8deebb40b4fb731ca15')
})
print("Success!")

Question 4.0

For this question, let's instead predict the race time based on the 2 closest neighbours to a maximum training distance of 100 miles per week.

Assign your answer to an object named answer4.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer4

In [ ]:

test_that('Solution is correct', {
    expect_true(exists('answer4'))
    expect_equal(digest(as.numeric(answer4)), '0d7b5b246b9c97983f281a717db02df2')
})
print("Success!")

Question 5.0 Multiple Choice

Now that you have done k-nearest neighbours predictions manually, which method would you use to choose $k$?

A) Choose the $k$ that excludes most outliers
B) Choose the $k$ with the lowest training error
C) Choose the $k$ with the lowest cross-validation error
D) Choose the $k$ that includes the most data points
E) Choose the $k$ with the lowest testing error

Assign your answer to an object called answer5.

In [ ]:

# Assign your answer to an object called: answer5
# Make sure your answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is correct', {
    expect_true(exists('answer5'))
    expect_equal(digest(answer5), '475bf9280aab63a82af60791302736f6')
})
print("Success!")

Question 6.0

We have seen how to do k-nn regression manually. Now we will apply it to the whole dataset using the caret package. To do this, we first need to create training and testing sets. Remember that we won't touch the test dataset until the end.

For this question, create an object called training_rows that contains the indexes of the rows we will use for training.

Use 75% of the data as training data.
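To show what such a split amounts to, here is a minimal base R sketch on made-up data. It uses plain random sampling with `sample()`, which is a simplification: caret's createDataPartition() additionally balances the split on the outcome, so this is not a substitute for the scaffolding below. The names `toy`, `train_rows`, `toy_train` and `toy_test` are hypothetical.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical data frame standing in for a real data set.
toy <- data.frame(x = 1:20, y = (1:20) * 0.1)

# Draw 75% of the row indices at random (without replacement) for training.
n_train    <- floor(0.75 * nrow(toy))
train_rows <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_rows, ]   # rows used to fit and tune the model
toy_test  <- toy[-train_rows, ]  # held-out rows, untouched until the end

c(train = nrow(toy_train), test = nrow(toy_test))
```

Negative indexing (`-train_rows`) keeps every row that was not selected for training, which is the same idea as using the - sign inside slice().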

In [ ]:

set.seed(2000) ### DO NOT CHANGE

#... <- marathon %>% 
#  select(...) %>% 
#  unlist() %>%
#  createDataPartition(p = ..., list = FALSE)

# your code here
fail() # No Answer - remove if you provide an answer
head(training_rows)

In [ ]:

test_that('Solution is correct', {
    expect_equal(nrow(training_rows), 698)
    expect_equal(ncol(training_rows), 1)
})
print("Success!")

Question 7.0

Create the training and testing datasets. The scaffolding for the training dataset is given below; follow the same pattern for the testing dataset.

Assign your answers to objects called X_train, Y_train, X_test, and Y_test, respectively.

Hint: For the test dataset you can use the - sign inside the slice() function.

In [ ]:

#X_train <- marathon %>% 
#  select(...) %>% 
#  slice(training_rows) %>% 
#  data.frame()

#Y_train <- marathon %>% 
#  select(...) %>% 
#  slice(training_rows) %>% 
#  unlist()

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is correct', {
    expect_equal(dim(X_train), c(698, 1))
    expect_equal(class(X_train), 'data.frame')
    expect_equal(dim(X_test), c(231, 1))
    expect_equal(class(X_test), 'data.frame')
    expect_equal(length(Y_train), 698)
    expect_equal(class(Y_train), 'numeric')
    expect_equal(length(Y_test), 231)
    expect_equal(class(Y_test), 'numeric')
})
print("Success!")

Question 8.0

Now that we have separated the data into training and testing sets, let's choose the $k$ for our $k$-nearest neighbours algorithm. We can do this using cross-validation, as we've seen before for k-nn classification. In this exercise we will do 3-fold cross-validation, searching for a $k$ from 1 to 250. For this question, name your trainControl object train_control (the tests below check for it) and name your model object (the output from train) knn_cv.
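To make the mechanics of cross-validation concrete, here is a hand-rolled base R sketch on simulated data. It is an illustration under stated assumptions, not the caret workflow this question asks for: the data, the candidate values of $k$, and the names `toy`, `knn_predict`, `folds`, `candidate_k` and `cv_results` are all hypothetical.

```r
set.seed(1)  # arbitrary seed for a reproducible sketch

# Hypothetical noisy data with one predictor.
toy <- data.frame(x = runif(90, 0, 10))
toy$y <- sin(toy$x) + rnorm(90, sd = 0.3)

# Mean response of the k nearest training points for each new x value.
knn_predict <- function(train, new_x, k) {
  sapply(new_x, function(x0) {
    nearest <- order(abs(train$x - x0))[seq_len(k)]
    mean(train$y[nearest])
  })
}

# Randomly assign each row to one of 3 equally sized folds.
folds <- sample(rep(1:3, length.out = nrow(toy)))

# For each candidate k, average the validation RMSE over the 3 folds.
candidate_k <- c(1, 5, 15, 30)
cv_rmse <- sapply(candidate_k, function(k) {
  mean(sapply(1:3, function(f) {
    fit_rows <- toy[folds != f, ]  # two folds used to make predictions
    val_rows <- toy[folds == f, ]  # one fold held out for validation
    pred <- knn_predict(fit_rows, val_rows$x, k)
    sqrt(mean((val_rows$y - pred)^2))
  }))
})

cv_results <- data.frame(k = candidate_k, RMSE = cv_rmse)
cv_results[which.min(cv_results$RMSE), ]  # candidate k with the lowest CV error
```

Each observation is used for validation exactly once, and the test set plays no part in choosing $k$.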

In [ ]:

set.seed(2019) # DO NOT CHANGE
# your code here
fail() # No Answer - remove if you provide an answer
knn_cv

In [ ]:

test_that('Solution is correct', {
    expect_equal(digest(as.numeric(train_control$number)), 'e5b57f323c7b3719bbaaf9f96b260d39')
    expect_equal(digest(train_control$method), '54c51511b5d01c4f13f8b56316886833')
    expect_equal(digest(as.integer(sum(knn_cv$results$k))), '2c3292a23a8e95227c4d2aaf87d7da65')
})
print("Success!")

Question 8.1

Plot the results from cross-validation as a line and point plot with cross-validation error (as $RMSE$) on the y-axis and $k$ on the x-axis. Name your plot object choosing_k.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
choosing_k

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(as.character(rlang::get_expr(choosing_k$mapping$x)), 'k')
    expect_equal(as.character(rlang::get_expr(choosing_k$mapping$y)), 'RMSE')
    expect_true('GeomLine' %in% c(class(rlang::get_expr(choosing_k$layers[[1]]$geom)), class(rlang::get_expr(choosing_k$layers[[2]]$geom))))
    expect_true('GeomPoint' %in% c(class(rlang::get_expr(choosing_k$layers[[1]]$geom)), class(rlang::get_expr(choosing_k$layers[[2]]$geom))))
    })
print("Success!")

Question 8.2

Report the best $k$ for k-nn regression for this data set. Save your answer as an object named best_k. We provide scaffolding to help you choose the $k$ from the long list that you came up with:

In [ ]:

#best_k <- knn_cv$results %>%
#    filter(... == min(...)) %>%
#    select(...) %>%
#    unlist()

# your code here
fail() # No Answer - remove if you provide an answer
best_k

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(best_k)), 'f67fbc496dfabdb88e8a3761809759ab')
    })
print("Success!")

Question 8.3

Our test error for $k$ = 75 is 0.5687047, true or false? Save your answer as "true" or "false" and name it answer8.3

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer8.3

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer8.3), 'd2a90307aac5ae8d0ef58e2fe730d38b')
    })
print("Success!")

Question 9.0

Re-train your k-nn regression model with the best $k$ that you found in Question 8 using the entire training data set. Assign the model to an object called knn_model.

In [ ]:

set.seed(2019) # DO NOT CHANGE

# your code here
fail() # No Answer - remove if you provide an answer
knn_model

In [ ]:

test_that('Solution is correct', {
    expect_true(exists('train_control'))
    expect_equal(as.integer(knn_model$results$k), 75)
    expect_equal(knn_model$method, 'knn')
    expect_equal(colnames(knn_model$trainingData), c('max', '.outcome'))
    expect_equal(dim(knn_model$trainingData), c(698, 2))
})
print("Success!")

Question 10.0

Using the knn_model, predict race times for the test data and save them to an object called predictions.

In [ ]:

set.seed(2019) # DO NOT CHANGE
# your code here
fail() # No Answer - remove if you provide an answer
head(predictions)

In [ ]:

test_that('Solution is correct', {
    expect_true(class(predictions) == 'numeric')
    expect_equal(length(predictions), 231)
})
print("Success!")

Question 11.0

Now, with these predictions, calculate the test error as $RMSE$ (how well the predictions on the test data match the true values of the test data set). Use the defaultSummary function to obtain the test error as $RMSE$, and name the object returned from it test_error.

In [ ]:

set.seed(2019) # DO NOT CHANGE
# your code here
fail() # No Answer - remove if you provide an answer
test_error

In [ ]:

test_that('Solution is correct', {
    expect_equal(digest(as.numeric(round(test_error['RMSE'], 3))), '30dc801401d69f591184aaaae5bfb987')
})
print("Success!")

Question 11.1

The test error (as measured by $RMSE$) is larger than the cross-validation error for the best $k$, true or false? Save your answer as "true" or "false" and name it answer11.1.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer11.1

In [ ]:

test_that('Solution is correct', {
    expect_equal(digest(answer11.1), '05ca18b596514af73f6880309a21b5dd')
})
print("Success!")

Question 11.2

Given that $RMSE$ is in the units of the target/response variable, the test error $RMSE$ seems very large (and thus indicates that our predictions are likely not very good). True or false? Save your answer as "true" or "false" and name it answer11.2

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer11.2

In [ ]:

test_that('Solution is correct', {
    expect_equal(digest(answer11.2), 'd2a90307aac5ae8d0ef58e2fe730d38b')
})
print("Success!")

Question 12.0

Using the knn_model trained on the entire training set (from Question 9.0), predict across the range of values observed in the training data set. Store the predictions as a column named time_hrs in a data frame named full_predictions. That data frame should also have a column named max that contains the values you predicted across.

Use the min and max functions to find the lower and upper limits of the predictor/explanatory variable values in the training data set.
Use the seq function to create the column called max that contains the values you would like to predict across.
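The pattern of building a grid with seq and predicting at each grid value can be sketched on made-up data in base R. This sketch computes the k-nn predictions by hand with $k = 2$ rather than calling predict() on a caret model, and the names `toy_train` and `grid` are hypothetical.

```r
# Hypothetical training data with a single predictor.
toy_train <- data.frame(x = c(1, 2, 4, 7, 8, 10),
                        y = c(1.0, 1.5, 2.0, 3.5, 3.0, 4.0))

# Evenly spaced grid spanning the observed range of the predictor.
grid <- data.frame(x = seq(from = min(toy_train$x), to = max(toy_train$x), by = 1))

# k-nn prediction (k = 2) at each grid value: mean y of the 2 nearest points.
grid$y <- sapply(grid$x, function(x0) {
  nearest <- order(abs(toy_train$x - x0))[1:2]
  mean(toy_train$y[nearest])
})

head(grid)
```

Plotting `grid` as a line over the training points shows the step-like shape that k-nn regression predictions typically have.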

In [ ]:

#upper <- X_train %>% 
#    select(max) %>% 
#    max() 
#lower <- ... %>% 
#    ... %>% 
#    ...
#... <- data.frame(max = seq(from = ..., to = ..., by = 1))
#full_predictions <- ... %>% 
#    mutate(... = predict(..., ...))

# your code here
fail() # No Answer - remove if you provide an answer
head(full_predictions)

In [ ]:

test_that('Solution is correct', {
    expect_true('time_hrs' %in% colnames(full_predictions))
    expect_true('max' %in% colnames(full_predictions))
})
print("Success!")

Question 13.0

Plot these predictions as a blue line over the data points from the training set. You will have to create a single data frame containing the training data set to do this. One way you can do this is by combining X_train and Y_train using the bind_cols function. Name your plot predict_plot.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
predict_plot

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(as.character(rlang::get_expr(predict_plot$mapping$x)), 'max')
    expect_equal(as.character(rlang::get_expr(predict_plot$mapping$y)), 'time_hrs')
    expect_true('GeomLine' %in% c(class(rlang::get_expr(predict_plot$layers[[1]]$geom)), class(rlang::get_expr(predict_plot$layers[[2]]$geom))))
    expect_true('GeomPoint' %in% c(class(rlang::get_expr(predict_plot$layers[[1]]$geom)), class(rlang::get_expr(predict_plot$layers[[2]]$geom))))
    })
print("Success!")
