Worksheet 8 - Regression
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Recognize situations where a simple regression analysis would be appropriate for making predictions.
Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
Interpret the output of a k-nn regression.
In a dataset with two variables, perform k-nearest neighbour regression in R using caret::knnregTrain() to predict the values for a test dataset.
Using R, execute cross-validation to choose the number of neighbours.
Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error, RMSPE).
Describe advantages and disadvantages of the k-nearest neighbour regression approach.
Question 0.0
To predict a value of Y for a new observation using k-nn regression, we identify the k nearest neighbours and then:
A. Assign it the median of the Y values of the k nearest neighbours as the predicted value
B. Assign it the mean of the Y values of the k nearest neighbours as the predicted value
C. Assign it the mode of the Y values of the k nearest neighbours as the predicted value
D. Assign it the majority vote of the Y values of the k nearest neighbours as the predicted value
Save the letter of the answer you think is correct to a variable named answer0.0. Make sure you put quotation marks around the letter and pay attention to case.
Question 0.1
The plot below is a very simple k-nn regression example, where the black dots are the data observations and the blue line is the predictions from a k-nn regression model created from this data with k = 2.
Using the formula for RMSPE (given in the reading) and the graph below, calculate RMSPE for this model by hand (pen and paper, or use R as a calculator). Estimate the values off the graph to one decimal place. Save your answer to a variable named answer0.1.
Marathon training
Source: https://media.giphy.com/media/nUN6InE2CodRm/giphy.gif
What predicts which athletes will perform better than others? Specifically, we are interested in marathon runners: does the maximum distance run per week during training predict the time it takes a runner to finish the race? For this, we will be looking at the marathon.csv file in the data/ folder.
Question 1.0
Load the data and assign it to an object called marathon.
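A minimal sketch of one way to do this, assuming the tidyverse package is available and the file lives at data/marathon.csv as described above:

```r
# load the tidyverse for read_csv (and for plotting later on)
library(tidyverse)

# read the marathon data from the data/ folder
marathon <- read_csv("data/marathon.csv")

# quick look at the first few rows
head(marathon)
```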
Question 2.0
We want to predict race time (time_hrs) given a particular value of maximum distance run per week during training (max). Let's take a subset of size 50 of our marathon data and assign it to an object called marathon_50. With this subset, create a scatterplot to assess the relationship between these two variables. Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called answer2. Discuss with your neighbour the relationship between race time and maximum distance run per week during training based on the scatterplot you create below.
Hint: To take a subset of your data you can use the sample_n() function.
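One possible sketch, assuming the column names max and time_hrs from above (the seed value here is arbitrary, just to make the sample reproducible):

```r
# take a random subset of 50 rows; the seed makes the sample reproducible
set.seed(2019)
marathon_50 <- sample_n(marathon, 50)

# scatterplot of race time versus maximum weekly training distance
answer2 <- ggplot(marathon_50, aes(x = max, y = time_hrs)) +
  geom_point() +
  xlab("Maximum distance run per week (miles)") +
  ylab("Race time (hours)")
answer2
```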
Question 3.0
Suppose we want to predict the race time for someone who ran a maximum distance of 100 miles per week during training. In the plot below we can see that no one has run a maximum distance of 100 miles per week. But if we are interested in prediction, how can we predict with this data? We can use k-nn regression: we take the Y values (the target/response variable) of the k nearest observations and use their average as the prediction.
For this question we want to predict race time based on the 4 closest neighbours to the 100 miles per week during training value.
Fill in the scaffolding below and assign your answer to an object named answer3.
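One way the manual calculation could look, assuming a tidyverse pipeline on marathon_50 (the exact scaffolding in the worksheet may differ):

```r
# distance of each observation's max value from 100 miles/week,
# keep the 4 closest, then average their race times
answer3 <- marathon_50 %>%
  mutate(diff = abs(100 - max)) %>%   # distance from the query point
  arrange(diff) %>%                   # sort by closeness
  slice(1:4) %>%                      # 4 nearest neighbours
  summarise(predicted = mean(time_hrs))
answer3
```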
Question 4.0
For this question, let's instead predict the race time based on the 2 closest neighbours to the 100 miles per week during training value.
Assign your answer to an object named answer4.
Question 5.0 Multiple Choice
Now that you have done k-nearest neighbours predictions manually, which method would you use to choose the k?
A) Choose the k that excludes most outliers
B) Choose the k with the lowest training error
C) Choose the k with the lowest cross-validation error
D) Choose the k that includes the most data points
E) Choose the k with the lowest testing error
Assign your answer to an object called answer5.
Question 6.0
We have seen how to do k-nn regression manually; now we will apply it to the whole dataset using the caret package. For this we first need to create training and testing sets. Remember, we won't touch the test dataset until the end.
For this question, create an object called training_rows that includes the indexes of the rows we will use.
Use 75% of the data as training data.
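A sketch of one common approach, using caret's createDataPartition (which samples row indexes stratified on the response); the seed value is an arbitrary choice for reproducibility:

```r
library(caret)

set.seed(2019)
# sample 75% of row indexes, stratified on the response variable
training_rows <- marathon %>%
  select(time_hrs) %>%
  unlist() %>%
  createDataPartition(p = 0.75, list = FALSE)
```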
Question 7.0
Create the training and testing datasets by filling in the scaffolding below. The scaffolding for the training dataset is given below.
Assign your answers to objects called X_train, Y_train, X_test, and Y_test, respectively.
Hint: For the test dataset you can use the - sign inside the slice() function.
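One way the four objects might be built, assuming the training_rows indexes from Question 6.0:

```r
# predictors and response for the training set
X_train <- marathon %>%
  select(max) %>%
  slice(training_rows) %>%
  data.frame()
Y_train <- marathon %>%
  select(time_hrs) %>%
  slice(training_rows) %>%
  unlist()

# the - sign inside slice() drops the training rows, leaving the test set
X_test <- marathon %>%
  select(max) %>%
  slice(-training_rows) %>%
  data.frame()
Y_test <- marathon %>%
  select(time_hrs) %>%
  slice(-training_rows) %>%
  unlist()
```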
Question 8.0
Now that we have separated the data into training and testing sets, let's choose the k for our k-nearest neighbours algorithm. We can do this using cross-validation, as we've seen before for k-nn classification. In this exercise we will do 3-fold cross-validation, searching for a k from 1 to 250. For this question, name your model object (the output from train) knn_cv.
Question 8.1
Plot the results from cross-validation as a line and point plot, with cross-validation error (as RMSPE) on the y-axis and k on the x-axis. Name your plot object choosing_k.
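A sketch of one way to make this plot, assuming the knn_cv object from Question 8.0 (caret stores one row per candidate k in knn_cv$results, with the cross-validation error in the RMSE column):

```r
# line-and-point plot of cross-validation error against k
choosing_k <- ggplot(knn_cv$results, aes(x = k, y = RMSE)) +
  geom_point() +
  geom_line() +
  xlab("k") +
  ylab("Cross-validation RMSPE")
choosing_k
```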
Question 8.2
Report the best k for k-nn regression for this data set. Save your answer as an object named best_k. We provide scaffolding to help you choose the k from the long list that you came up with:
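The scaffolding might look something like this sketch, which filters knn_cv$results for the row with the smallest cross-validation error:

```r
# k with the lowest cross-validation RMSPE
best_k <- knn_cv$results %>%
  filter(RMSE == min(RMSE)) %>%
  pull(k)
best_k
```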
Question 8.3
Our test error for k = 75 is 0.5687047, true or false? Save your answer as "true" or "false" and name it answer8.3.
Question 9.0
Re-train your k-nn regression model with the best k that you found in Question 8, using the entire training data set. Assign the model to an object called knn_model.
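One possible way to retrain with a fixed k, assuming the best_k object from Question 8.2 (trainControl(method = "none") skips resampling, and the single-row tuneGrid pins k):

```r
# fit k-nn on the full training set with the chosen k, no resampling
knn_model <- train(x = X_train,
                   y = Y_train,
                   method = "knn",
                   tuneGrid = data.frame(k = best_k),
                   trControl = trainControl(method = "none"))
```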
Question 10.0
Using knn_model, predict the test data and save the result to an object called predictions.
Question 11.0
Now, with these predictions, calculate the test error as RMSPE (how well the predictions on the test data match the true values of the test data set). Use the defaultSummary function to obtain the test error as RMSPE, and name the object it returns test_error.
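A sketch of one way this could look, assuming the predictions object from Question 10.0 (defaultSummary expects a data frame with columns named obs and pred; the RMSE entry of its output is the test RMSPE):

```r
# pair up observed and predicted test values, then summarise
test_results <- data.frame(obs = Y_test, pred = predictions)
test_error <- defaultSummary(test_results)
test_error
```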
Question 11.1
The test error (as measured by RMSPE) is larger than the cross-validation error for the best k, true or false? Save your answer as "true" or "false" and name it answer11.1.
Question 11.2
Given that RMSPE is in the units of the target/response variable, the test error seems very large (and thus indicates that our predictions are likely not very good). True or false? Save your answer as "true" or "false" and name it answer11.2.
Question 12.0
Using the knn_model trained on the entire training set (from Question 9.0), predict across the range of max values observed in the training data set. Store the predictions as a column named time_hrs in a data frame named full_predictions. That data frame should also have a column named max that contains the values you predicted across.
Use the min and max functions to find the upper and lower limits of predictor/explanatory variable values in the training data set.
Use the seq function to create the column called max that contains the values you would like to predict across.
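Following the hints above, a possible sketch (the step size of 1 mile is an arbitrary choice for a reasonably smooth prediction line):

```r
# grid of max values spanning the range seen in the training data
max_grid <- seq(min(X_train$max), max(X_train$max), by = 1)

# predict race time at each grid value
full_predictions <- data.frame(max = max_grid) %>%
  mutate(time_hrs = predict(knn_model, data.frame(max = max_grid)))
```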
Question 13.0
Plot these predictions as a blue line over the data points from the training set. You will have to create a single data frame containing the training data set to do this. One way to do that is to combine X_train and Y_train using the bind_cols function. Name your plot predict_plot.
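One way the final plot could be assembled, assuming the full_predictions data frame from Question 12.0:

```r
# recombine predictors and response into one training data frame
training_data <- bind_cols(X_train, time_hrs = Y_train)

# training points with the k-nn prediction line overlaid in blue
predict_plot <- ggplot(training_data, aes(x = max, y = time_hrs)) +
  geom_point() +
  geom_line(data = full_predictions, color = "blue") +
  xlab("Maximum distance run per week (miles)") +
  ylab("Race time (hours)")
predict_plot
```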