
Worksheet 9 - Regression cont'd

Lecture and Tutorial Learning Goals:

By the end of the week, students will be able to:

  • In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).

  • In a dataset with 2 variables, perform simple ordinary least squares regression in R using caret's train with method = "lm" to predict the values for a test dataset.

  • Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
library(gridExtra)

Understanding RMSE and RMSPE

Question 1.0

What does RMSPE stand for?

A. root mean squared prediction error

B. root mean squared percentage error

C. root mean squared performance error

D. root mean squared preference error

Save the letter of the answer you think is correct to a variable named answer1.0. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer1.0

test_that('Solution is correct', {
    expect_equal(digest(answer1.0), '75f1160e72554f4270c809f041c7a776')
})
print("Success!")

Question 1.1

Of those shown below, which is the correct formula for RMSPE?

A. $RMSPE = \sqrt{\frac{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}{1 - n}}$

B. $RMSPE = \sqrt{\frac{1}{n - 1}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

C. $RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$

D. $RMSPE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})}$

Save the letter of the answer you think is correct to a variable named answer1.1. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer1.1

test_that('Solution is correct', {
    expect_equal(digest(answer1.1), '475bf9280aab63a82af60791302736f6')
})
print("Success!")

Question 1.2

Which statement(s) below is/are incorrect?

A. RMSE is a measure of model goodness of fit

B. RMSPE is a measure of how well the model predicts on the training data

C. RMSPE is a measure of how well the model predicts on the testing data

D. RMSE is a measure of how well the model predicts on the training data

E. RMSE is a measure of how well the model predicts on the testing data

Save the letter(s) of the answer(s) you think are incorrect to a variable named answer1.2. Save these as a character vector (scaffolding and an example are given below). Make sure you put quotations around each letter and pay attention to case.

# write your answer as one of these below, replacing ... with the appropriate letters:
# answer1.2 <- c("...")
# answer1.2 <- c("...", "...")
# answer1.2 <- c("...", "...", "...")

# your code here
fail() # No Answer - remove if you provide an answer
answer1.2

test_that('Solution is correct', {
    expect_that(digest(paste(answer1.2, collapse = "")) %in%
        c('a01e8a1915e410d88459f7a2876a96ca', 'ecfc3aab7059f241ec9cbd72e9edeb89'), is_true())
})
print("Success!")
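Note: to make the contrast above concrete, here is a minimal sketch of how the two quantities are typically computed in R. The variable names (y_train, y_test, pred_train, pred_test) are illustrative placeholders, not objects defined in this worksheet:

# Illustrative sketch: y_train / y_test are observed response values,
# pred_train / pred_test are a model's predictions on each data set.
rmse <- sqrt(mean((y_train - pred_train)^2))   # goodness of fit, measured on the training data
rmspe <- sqrt(mean((y_test - pred_test)^2))    # prediction error, measured on the test data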

Understanding Linear Regression

Consider this small and simple data set:

simple_data <- tibble(X = c(1, 2, 3, 6, 7, 7),
                      Y = c(1, 1, 3, 5, 7, 6))
options(repr.plot.width = 3, repr.plot.height = 3)
base <- ggplot(simple_data, aes(x = X, y = Y)) +
    geom_point() +
    scale_x_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    scale_y_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25))
base

Now consider these three potential lines of best fit for the same data set:

line_a <- base + ggtitle("Line A") + geom_abline(intercept = -0.6547, slope = 0.9434, color = "blue")
line_b <- base + ggtitle("Line B") + geom_abline(intercept = 0.1022, slope = 0.8904, color = "purple")
line_c <- base + ggtitle("Line C") + geom_abline(intercept = -0.2547, slope = 0.9434, color = "green")
options(repr.plot.width = 10, repr.plot.height = 3.5)
grid.arrange(line_a, line_b, line_c, ncol = 3)
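If you want to check the rough estimates asked for in the questions below, one way is to compute the quantity exactly. Here is a minimal sketch (avg_sq_dist is a helper name introduced here for illustration):

# Average squared vertical distance between the points in `data`
# and the line with intercept b0 and slope b1.
avg_sq_dist <- function(b0, b1, data) {
    mean((data$Y - (b0 + b1 * data$X))^2)
}
# e.g. avg_sq_dist(-0.6547, 0.9434, simple_data) for Line A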

Question 2.0

Use the graph titled "Line A" to roughly calculate the average squared vertical distance between the points and the blue line. Save your answer to a variable named answer2.0.

We reprint the plot for you in a larger size to make it easier to estimate the locations of the points on the graph.

options(repr.plot.width = 5, repr.plot.height = 5)
line_a

# your code here
fail() # No Answer - remove if you provide an answer
answer2.0

test_that('Solution is correct', {
    expect_that(digest(round(answer2.0, 2)) %in%
        c('75dc8b7b8724a54d1fba4cc109438cfb', '4ec4e6dd2f7793c5d566f50026f262e9', '0d8b1b67a03fc038058a25213d5e9778'), is_true())
})
print("Success!")

Question 2.1

Use the graph titled "Line B" to roughly calculate the average squared vertical distance between the points and the purple line. Save your answer to a variable named answer2.1.

We reprint the plot for you in a larger size to make it easier to estimate the locations of the points on the graph.

line_b

# your code here
fail() # No Answer - remove if you provide an answer
answer2.1

test_that('Solution is correct', {
    expect_that(digest(round(answer2.1, 2)) %in%
        c('62233ed4e6655a993784e4c0886c4550', '1825ac9b036d540ac34abdd1ecb7fc21', '354964e94313ac16ac091669a785eb4f'), is_true())
})
print("Success!")

Question 2.2

Use the graph titled "Line C" to roughly calculate the average squared vertical distance between the points and the green line. Save your answer to a variable named answer2.2.

We reprint the plot for you in a larger size to make it easier to estimate the locations of the points on the graph.

line_c

# your code here
fail() # No Answer - remove if you provide an answer
answer2.2

test_that('Solution is correct', {
    expect_that(digest(round(answer2.2, 2)) %in%
        c('522dbf08f17812fee06f0991cf0481af', 'ee48059132b8cdd8f1a1d9abbdaead78', '37cd4e5174c65a7196eae5fed7c0a61e', '4d308066a8d7253145df19089a026b9e'), is_true())
})
print("Success!")

Question 2.3

Based on your calculations above, which line would linear regression by ordinary least squares choose given our small and simple data set: Line A, B or C? Assign the letter that corresponds to the line to a variable named answer2.3. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer2.3

test_that('Solution is correct', {
    expect_equal(digest(answer2.3), '475bf9280aab63a82af60791302736f6')
})
print("Success!")

Marathon training revisited with linear regression!


Remember our question from last week: what predicts which athletes will perform better than others? Specifically, we are interested in marathon runners, and in how the maximum distance run per week during training predicts the time it takes a runner to finish the race.

This time around, however, we will analyze the data using linear regression. At the end, we will compare our results to what we found last week with k-nn regression.

Question 3.0

Load the data and assign it to an object called marathon.
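If you are unsure where to start, the pattern is the same as last week. A minimal sketch is below; the file path is an assumption, so substitute the location given in your course materials:

# the path "data/marathon.csv" is a hypothetical placeholder, not the actual location
marathon <- read_csv("data/marathon.csv")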

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon)

test_that('Solution is correct', {
    expect_equal(nrow(marathon), 929)
    expect_equal(ncol(marathon), 13)
    expect_that("time_hrs" %in% colnames(marathon), is_true())
    expect_that("max" %in% colnames(marathon), is_true())
})
print("Success!")

Question 3.1

Using all the observations in the data set, create a scatterplot to assess the relationship between race time (time_hrs) and maximum distance run per week during training (max). Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called marathon_eda. Remember to do all the things needed to make this an effective visualization.
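A minimal sketch of one possible approach (the axis labels are illustrative; polish the plot as you see fit):

marathon_eda <- ggplot(marathon, aes(x = max, y = time_hrs)) +
    geom_point(alpha = 0.4) +   # many overlapping points, so add transparency
    xlab("Maximum distance run per week during training") +
    ylab("Race time (hours)")
marathon_eda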

# your code here
fail() # No Answer - remove if you provide an answer
marathon_eda

test_that('Solution is correct', {
    expect_equal(as.character(rlang::get_expr(marathon_eda$mapping$x)), 'max')
    expect_equal(as.character(rlang::get_expr(marathon_eda$mapping$y)), 'time_hrs')
    expect_true('GeomPoint' %in% c(class(rlang::get_expr(marathon_eda$layers[[1]]$geom))))
})
print("Success!")

Question 3.2

Create a training and testing dataset using 75% of the data as training data. Use set.seed(2000) and the max column as the input to createDataPartition (as we did in the last worksheet) so that we end up with the same training data set for linear regression that we had for k-nn regression (so we can compare our results between these two weeks).

At the end of this question you should have 4 objects named X_train, Y_train, X_test and Y_test. A sketch of one possible approach is given below.
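A sketch following the same createDataPartition pattern as last week's worksheet (this assumes the marathon object from Question 3.0; adapt as needed):

set.seed(2000)
# createDataPartition returns a one-column matrix of row indices; take the first column
training_rows <- createDataPartition(marathon$max, p = 0.75, list = FALSE)[, 1]
X_train <- marathon %>% select(max) %>% slice(training_rows) %>% data.frame()
Y_train <- marathon %>% select(time_hrs) %>% slice(training_rows) %>% unlist()
X_test <- marathon %>% select(max) %>% slice(-training_rows) %>% data.frame()
Y_test <- marathon %>% select(time_hrs) %>% slice(-training_rows) %>% unlist()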

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is correct', {
    expect_equal(dim(X_train), c(698, 1))
    expect_equal(class(X_train), 'data.frame')
    expect_equal(dim(X_test), c(231, 1))
    expect_equal(class(X_test), 'data.frame')
    expect_equal(length(Y_train), 698)
    expect_equal(class(Y_train), 'numeric')
    expect_equal(length(Y_test), 231)
    expect_equal(class(Y_test), 'numeric')
})
print("Success!")

Question 3.3

Now use caret's train function with method = "lm" to fit your linear regression model. Name your linear regression model object lm_model.
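A minimal sketch, assuming the X_train and Y_train objects from Question 3.2 (this uses caret's x/y interface):

lm_model <- train(x = X_train, y = Y_train, method = "lm")
lm_model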

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is correct', {
    expect_true(lm_model$results$intercept)
    expect_equal(lm_model$method, 'lm')
    expect_equal(colnames(lm_model$trainingData), c('max', '.outcome'))
    expect_equal(dim(lm_model$trainingData), c(698, 2))
})
print("Success!")

Question 3.4

Now, let's visualize the model predictions as a straight line overlaid on the training data. Use geom_smooth with method = "lm" and se = FALSE to draw the line. Name your plot lm_predictions.
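One possible sketch (train_data is a helper object introduced here for illustration):

# rebuild a training data frame so ggplot can see both columns together
train_data <- bind_cols(X_train, data.frame(time_hrs = Y_train))
lm_predictions <- ggplot(train_data, aes(x = max, y = time_hrs)) +
    geom_point(alpha = 0.4) +
    geom_smooth(method = "lm", se = FALSE) +
    xlab("Maximum distance run per week during training") +
    ylab("Race time (hours)")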

# your code here
fail() # No Answer - remove if you provide an answer
lm_predictions

test_that('Solution is correct', {
    expect_equal(as.character(rlang::get_expr(lm_predictions$mapping$x)), 'max')
    expect_equal(as.character(rlang::get_expr(lm_predictions$mapping$y)), 'time_hrs')
    expect_true('GeomPoint' %in% c(class(rlang::get_expr(lm_predictions$layers[[1]]$geom))))
    expect_true('GeomSmooth' %in% c(class(rlang::get_expr(lm_predictions$layers[[2]]$geom))))
})
print("Success!")

Question 3.5

Calculate the RMSE to assess the goodness of fit of your lm_model (remember, this is how well it predicts on the training data used to fit the model). Return a single numerical value named lm_rmse.
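A minimal sketch, assuming lm_model from Question 3.3 (the same pattern applied to the test set gives the RMSPE asked for in the next question):

train_predictions <- predict(lm_model, X_train)  # predictions on the training data
lm_rmse <- sqrt(mean((Y_train - train_predictions)^2))
lm_rmse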

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmse

test_that('Solution is correct', {
    expect_equal(digest(round(lm_rmse, 2)), 'd89d31a3d09a47e1f9e7b97edbcf7fbb')
})
print("Success!")

Question 3.6

Calculate the RMSPE using the test data. Return a single numerical value named lm_rmspe.

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmspe

test_that('Solution is correct', {
    expect_equal(digest(round(lm_rmspe, 2)), '7635548ab5401636180406281994cba1')
})
print("Success!")

Question 3.7

Compare the RMSPE of k-nn regression and linear regression: which is greater?

A. Linear regression has a greater RMSPE

B. k-nn regression has a greater RMSPE

Save the letter of the answer you think is correct to a variable named answer3.7. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer3.7

test_that('Solution is correct', {
    expect_equal(digest(answer3.7), '75f1160e72554f4270c809f041c7a776')
})
print("Success!")

Question 3.8

Which model does a better job of predicting on the test data set?

A. Linear regression

B. k-nn regression

C. Linear regression, but only very slightly

D. k-nn regression, but only very slightly

Save the letter of the answer you think is correct to a variable named answer3.8. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer3.8

test_that('Solution is correct', {
    expect_equal(digest(answer3.8), 'c1f86f7430df7ddb256980ea6a3b57a4')
})
print("Success!")

Question 3.9

(optional - not graded)

Given that the linear regression model is a straight line, we can write our model as a mathematical equation. We can get the two numbers we need for this (y-intercept and slope) from the finalModel attribute of our model object, as shown below:

# run this cell
lm_model$finalModel

Use the numbers output in the cell above to write the model as a mathematical equation.
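As a template (the symbols below are placeholders, not the actual fitted values), a simple linear regression model has the form

$\hat{Y} = \beta_0 + \beta_1 X$

so here the predicted race time in hours would be $\beta_0 + \beta_1 \times \text{max}$, where $\beta_0$ is the reported (Intercept) and $\beta_1$ is the coefficient on max from the output above.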

YOUR ANSWER HERE