GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2020-spring/materials/worksheet_09/worksheet_09.ipynb
Kernel: R

Worksheet 9 - Regression Continued

Lecture and Tutorial Learning Goals:

By the end of the week, students will be able to:

  • Perform ordinary least squares regression in R using caret’s train with method = "lm" to predict the values for a test dataset.

  • Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.

  • In R, overlay the ordinary least squares regression lines from geom_smooth on a single plot.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
library(gridExtra)
source("tests_worksheet_09.R")
source('cleanup_worksheet_09.R')

Warm-up Questions

Here are some warm-up questions on the topic of multivariate regression to get you thinking before we jump into data analysis. The course readings should help you answer these.

Question 1.0 Multiple Choice:
{points: 1}

In multivariate k-nn regression with one outcome/target variable and two predictor variables, the predictions take the form of what shape?

A. a flat plane

B. a wiggly/flexible plane

C. a straight line

D. a wiggly/flexible line

E. a 4D hyperplane

F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.0. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_1.0()

Question 1.1 Multiple Choice:
{points: 1}

In simple linear regression with one outcome/target variable and one predictor variable, the predictions take the form of what shape?

A. a flat plane

B. a wiggly/flexible plane

C. a straight line

D. a wiggly/flexible line

E. a 4D hyperplane

F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.1. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_1.1()

Question 1.2 Multiple Choice:
{points: 1}

In multivariate linear regression with one outcome/target variable and two predictor variables, the predictions take the form of what shape?

A. a flat plane

B. a wiggly/flexible plane

C. a straight line

D. a wiggly/flexible line

E. a 4D hyperplane

F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer1.2. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_1.2()

Understanding Simple Linear Regression

Consider this small and simple data set:

simple_data <- tibble(X = c(1, 2, 3, 6, 7, 7), Y = c(1, 1, 3, 5, 7, 6))
options(repr.plot.width = 3, repr.plot.height = 3)
base <- ggplot(simple_data, aes(x = X, y = Y)) +
    geom_point() +
    scale_x_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25)) +
    scale_y_continuous(limits = c(0, 7.5), breaks = seq(0, 8), minor_breaks = seq(0, 8, 0.25))
base

Now consider these three potential lines of best fit for the same data set:

line_a <- base +
    ggtitle("Line A") +
    geom_abline(intercept = -0.897, slope = 0.9834, color = "blue")
line_b <- base +
    ggtitle("Line B") +
    geom_abline(intercept = 0.1022, slope = 0.9804, color = "purple")
line_c <- base +
    ggtitle("Line C") +
    geom_abline(intercept = -0.2547, slope = 0.9434, color = "green")
options(repr.plot.width = 10, repr.plot.height = 3.5)
grid.arrange(line_a, line_b, line_c, ncol = 3)

Question 2.0
{points: 1}

Use the graph below titled "Line A" to roughly calculate the average squared vertical distance between the points and the blue line. Read values off the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.0.
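The quantity asked for here is the mean of the squared vertical gaps between each point and the line. As a minimal sketch of that arithmetic, the cell below uses a made-up three-point data set and a made-up candidate line. The names toy_x, toy_y, intercept and slope are hypothetical, and none of the numbers come from the worksheet's data or from Lines A to C.

```r
# Hypothetical toy points and candidate line (NOT the worksheet's data).
toy_x <- c(1, 2, 3)
toy_y <- c(2, 2, 4)
intercept <- 0.5
slope <- 1

# Height of the candidate line at each x value.
line_y <- intercept + slope * toy_x

# Vertical distance from each point to the line, squared, then averaged.
squared_distances <- (toy_y - line_y)^2
avg_squared_distance <- mean(squared_distances)
avg_squared_distance
```

When reading values off a graph instead, line_y would be replaced by the heights you estimate by eye.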

We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

# run this code
options(repr.plot.width = 5, repr.plot.height = 5)
line_a
# your code here
fail() # No Answer - remove if you provide an answer
answer2.0
test_2.0()

Question 2.1
{points: 1}

Use the graph titled "Line B" to roughly calculate the average squared vertical distance between the points and the purple line. Read values off the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.1.

We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

line_b
# your code here
fail() # No Answer - remove if you provide an answer
answer2.1
test_2.1()

Question 2.2
{points: 1}

Use the graph titled "Line C" to roughly calculate the average squared vertical distance between the points and the green line. Read values off the graph to a precision of 0.25 (e.g. 1, 1.25, 1.5, 1.75, 2). Save your answer to a variable named answer2.2.

We reprint the plot for you in a larger size to make it easier to estimate the locations on the graph.

line_c
# your code here
fail() # No Answer - remove if you provide an answer
answer2.2
test_2.2()

Question 2.3
{points: 1}

Based on your calculations above, which line would linear regression by ordinary least squares choose for our small and simple data set: Line A, B or C? Assign the letter that corresponds to the line to a variable named answer2.3. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_2.3()

Marathon Training Revisited with Linear Regression!

Source: https://media.giphy.com/media/BDagLpxFIm3SM/giphy.gif

Remember our question from last week: what predicts which athletes will perform better than others? Specifically, we are interested in marathon runners, and in how the maximum distance run per week during training predicts the time it takes a runner to finish the race.

This time around, however, we will analyze the data using simple linear regression. At the end, we will compare our results to what we found last week with k-nn regression.

Question 3.0
{points: 1}

Load the data and assign it to an object called marathon.

# your code here
fail() # No Answer - remove if you provide an answer
head(marathon)
test_3.0()

Question 3.1
{points: 1}

Create a training and testing dataset using 75% of the data as training data. Use set.seed(2000) and the max column as the input to createDataPartition (as we did in the last worksheet) so that we end up with the same training data set for simple linear regression that we had for k-nn regression (so we can compare our results between these two weeks).

At the end of this question you should have 4 objects named X_train, Y_train, X_test and Y_test.
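As a rough sketch of the splitting pattern, the cell below applies caret's createDataPartition to a made-up data frame. The names toy, train_rows and the toy_* objects are hypothetical. Storing predictors as data frames and outcomes as plain numeric vectors is an assumption about the layout, not something this worksheet states.

```r
library(caret)

# Hypothetical toy data frame (NOT the marathon data).
set.seed(1)
toy <- data.frame(x = 1:40, y = 2 + 3 * (1:40) + rnorm(40))

# createDataPartition returns the row indices chosen for training;
# list = FALSE gives a one-column matrix, and [, 1] turns it into a vector.
train_rows <- createDataPartition(toy$x, p = 0.75, list = FALSE)[, 1]

# Assumed layout: predictors as data frames, outcomes as numeric vectors.
toy_X_train <- data.frame(x = toy$x[train_rows])
toy_Y_train <- toy$y[train_rows]
toy_X_test <- data.frame(x = toy$x[-train_rows])
toy_Y_test <- toy$y[-train_rows]
```

Calling set.seed before the partition is what makes the same rows land in the training set on every run.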

set.seed(2000) # DO NOT CHANGE THIS
# your code here
fail() # No Answer - remove if you provide an answer
test_3.1()

Question 3.2
{points: 1}

Using only the training observations in the data set, create a scatterplot to assess the relationship between race time (time_hrs) and the maximum distance run per week during training (max). Put time_hrs on the y-axis and max on the x-axis. Assign this plot to an object called marathon_eda. Remember to do whatever is necessary to make this an effective visualization.

# your code here
fail() # No Answer - remove if you provide an answer
marathon_eda
test_3.2()

Question 3.3
{points: 1}

Now use caret's train function with method = "lm" to fit your simple linear regression model. Name your simple linear regression model object lm_model.
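For orientation, here is a minimal sketch of the call pattern on made-up data. The names toy_X, toy_Y and toy_model are hypothetical, and the x/y form of train shown here is only one of the ways caret accepts data.

```r
library(caret)

# Hypothetical toy data (NOT the marathon data): y is roughly 2 + 3 * x.
set.seed(1)
toy_X <- data.frame(x = 1:30)
toy_Y <- 2 + 3 * toy_X$x + rnorm(30)

# Fit ordinary least squares through caret's interface.
toy_model <- train(x = toy_X, y = toy_Y, method = "lm")

# The underlying fit is a regular lm object stored in finalModel.
coef(toy_model$finalModel)
```

With method = "lm", the coefficients in finalModel should match what a direct call to lm gives on the same data.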

# your code here
fail() # No Answer - remove if you provide an answer
test_3.3()

Question 3.4
{points: 1}

Now, let's visualize the model predictions as a straight line overlaid on the training data. Use geom_smooth with method = "lm" and se = FALSE to visualize the predictions as a straight line. Name your plot lm_predictions.
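A small sketch of the overlay on made-up data follows. The data frame toy, the object toy_plot and the axis labels are hypothetical placeholders.

```r
library(ggplot2)

# Hypothetical toy data (NOT the marathon data).
set.seed(1)
toy <- data.frame(x = 1:30)
toy$y <- 2 + 3 * toy$x + rnorm(30)

# Scatterplot of the points with the least squares line drawn on top;
# se = FALSE hides the confidence band around the line.
toy_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Hypothetical predictor", y = "Hypothetical outcome")
toy_plot
```

geom_smooth refits the line from whatever data the plot is given, so the line it draws reflects that data rather than a separately stored model object.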

# your code here
fail() # No Answer - remove if you provide an answer
lm_predictions
test_3.4()

Question 3.5
{points: 1}

Calculate the RMSE to assess goodness of fit on your lm_model (remember this is how well it predicts on the training data used to fit the model). Return a single numerical value named lm_rmse.

# train_pred <- predict(lm_model, ...)
# lm_modelvalues <- data.frame(obs = ..., pred = ...)
# ... <- defaultSummary(...)[[1]]
# your code here
fail() # No Answer - remove if you provide an answer
lm_rmse
test_3.5()

Question 3.6
{points: 1}

Calculate RMSPE using the test data. Return a single numerical value named lm_rmspe.
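RMSE and RMSPE use the same formula. They differ only in whether the predictions are made for the training data or for held-out test data. The sketch below shows this on made-up data using plain lm and a hand-written helper instead of caret's defaultSummary. The names toy_train, toy_test, toy_fit and rmse are hypothetical.

```r
# Hypothetical toy data (NOT the marathon data).
toy_train <- data.frame(x = c(1, 2, 3, 4), y = c(1, 3, 2, 4))
toy_test <- data.frame(x = c(5, 6), y = c(6, 4))

# Plain lm stands in for the caret model in this sketch.
toy_fit <- lm(y ~ x, data = toy_train)

# Root mean squared difference between observed and predicted values.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Error on the data used to fit the model (RMSE).
toy_rmse <- rmse(toy_train$y, predict(toy_fit, toy_train))

# Error on data the model has not seen (RMSPE).
toy_rmspe <- rmse(toy_test$y, predict(toy_fit, toy_test))
```

The only change between the two calls is which data frame supplies the observed values and the predictions.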

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmspe
test_3.6()

Question 3.61
{points: 1}

Now, let's visualize the model predictions as a straight line overlaid on the test data. Use geom_smooth with method = "lm" and se = FALSE to visualize the predictions as a straight line. Name your plot lm_predictions_test. Remember to do whatever is necessary to make this an effective visualization.

# your code here
fail() # No Answer - remove if you provide an answer
lm_predictions_test
test_3.61()

Question 3.7
{points: 1}

Compare the test RMSPE of k-nn regression (from last worksheet) to that of simple linear regression. Which is greater?

A. Simple linear regression has a greater RMSPE

B. k-nn regression has a greater RMSPE

C. Neither, they are identical

Save the letter of the answer you think is correct to a variable named answer3.7. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_3.7()

Question 3.8 Multiple Choice:
{points: 1}

Which model does a better job of predicting on the test data set?

A. Simple linear regression

B. k-nn regression

C. Neither, they are identical

Save the letter of the answer you think is correct to a variable named answer3.8. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
test_3.8()

Question 3.9
(optional - not graded)

Given that the linear regression model is a straight line, we can write our model as a mathematical equation. We can get the two numbers we need for this (y-intercept and slope) from the finalModel attribute from our model object as shown below:

# run this cell
lm_model$finalModel

Use the numbers output in the cell above to write the model as a mathematical equation.
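As an illustration of how the two coefficients map onto an equation, the sketch below fits plain lm to made-up points that lie exactly on a known line. The names toy, toy_fit and toy_equation are hypothetical, and plain lm stands in for the finalModel object here.

```r
# Hypothetical toy data lying exactly on the line y = 2 + 3 * x
# (NOT the marathon data).
toy <- data.frame(x = c(0, 1, 2, 3))
toy$y <- 2 + 3 * toy$x

toy_fit <- lm(y ~ x, data = toy)
b <- coef(toy_fit)

# "(Intercept)" is the y-intercept; the coefficient named after the
# predictor is the slope.
toy_equation <- sprintf("y = %.2f + %.2f * x", b[["(Intercept)"]], b[["x"]])
toy_equation
```

For your own model, the outcome and predictor names in the equation would be the ones from your data rather than y and x.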

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.