GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2020-fall/materials/tutorial_09/tutorial_09.ipynb

Kernel: R

Tutorial 9: Regression Continued

Regression learning objectives:

Recognize situations where a simple regression analysis would be appropriate for making predictions.
Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
Interpret the output of a k-nn regression.
In a dataset with two variables, perform k-nearest neighbour regression in R using tidymodels to predict the values for a test dataset.
Using R, execute cross-validation to choose the number of neighbours.
Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean squared prediction error).
In a dataset with > 2 variables, perform k-nn regression in R using tidymodels to predict the values for a test dataset.
In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
Describe advantages and disadvantages of the k-nearest neighbour regression approach.
Perform ordinary least squares regression in R using tidymodels to predict the values for a test dataset.
Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.
In R, overlay the ordinary least squares regression lines from geom_smooth on a single plot.

In [ ]:

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
library(GGally)
source("tests_tutorial_09.R")
source("cleanup_tutorial_09.R")

Predicting credit card balance

Source: https://media.giphy.com/media/LCdPNT81vlv3y/giphy-downsized-large.gif

In this tutorial we will work with a simulated data set containing information that we can use to build a model to predict customer credit card balance. A bank might use such information to predict which customers might be the most profitable to lend to (for example, customers who carry a balance but do not default).

Specifically, we wish to build a model to predict credit card balance (Balance column) based on income (Income column) and credit rating (Rating column).

Question 1.0
{points: 1}

Load the data located at http://faculty.marshall.usc.edu/gareth-james/ISL/Credit.csv and assign it to an object called credit using read_csv().

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(credit)

In [ ]:

test_1.0()

Question 1.1
{points: 1}

Select only the columns of data we are interested in using for our prediction (both the predictors and the response variable). Name the modified data frame credit.

Note: Alternatively, we could leave the other variables in and use our recipe formula below to specify our predictors and response. But for this tutorial, let's select the relevant columns first.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(credit)

In [ ]:

test_1.1()

Question 1.2
{points: 1}

Before we perform exploratory data analysis, we should create our training and testing data sets. First, split the credit data set. Use 60% of the data for training and set the variable we want to predict as the strata argument. Assign your answer to an object called credit_split.

Assign your training data set to an object called credit_training and your testing data set to an object called credit_testing.
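As a rough illustration of the splitting step, the sketch below uses a small simulated data frame instead of the credit data. The names `toy`, `x`, and `y` and the seed are made up for this example. It assumes `initial_split()`, `training()`, and `testing()` from rsample, which `library(tidymodels)` attaches.

```r
library(tidymodels)

set.seed(1) # arbitrary seed, for this illustration only

# Hypothetical toy data: `y` stands in for the response variable
toy <- tibble(x = runif(100), y = rnorm(100))

# Put 60% of the rows in the training set, stratifying on the response
toy_split <- initial_split(toy, prop = 0.6, strata = y)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

nrow(toy_training)
nrow(toy_testing)
```

Stratifying on a numeric response bins its values so that the training and testing sets have similar distributions of `y`.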

In [ ]:

set.seed(2000)
# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.2()

Question 1.3
{points: 1}

Using only the observations in the training data set, create a ggpairs scatterplot of all the columns we are interested in including in our model. Name the plot object credit_eda.
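For reference, a minimal sketch of a ggpairs plot is shown below. It uses R's built-in `mtcars` data set rather than the credit training data, and the object name `toy_eda` and the choice of columns are only for illustration.

```r
library(tidyverse)
library(GGally)

# Pairwise scatterplots, correlations, and densities for three numeric columns
toy_eda <- mtcars %>%
    select(mpg, wt, hp) %>%
    ggpairs()

toy_eda
```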

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
credit_eda

In [ ]:

test_1.3()

Question 1.4 Multiple Choice:
{points: 1}

Looking at the ggpairs plot above, which of the following statements is incorrect?

A. There is a strong positive relationship between the response variable (Balance) and the Rating predictor

B. There is a strong positive relationship between the two predictors (Income and Rating)

C. There is a strong positive relationship between the response variable (Balance) and the Income predictor

D. None of the above

Assign your answer to an object called answer1.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer1.4

In [ ]:

test_1.4()

Question 1.5
{points: 1}

Now that we have our training data, we will fit a linear regression model.

Create and assign your linear regression model specification to an object called lm_spec.
Create a recipe for the model. Assign your answer to an object called credit_recipe.

In [ ]:

set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
print(lm_spec)
print(credit_recipe)

In [ ]:

test_1.5()

Question 1.6
{points: 1}

Now that we have our model specification and recipe, let's put them together in a workflow, and fit our simple linear regression model. Assign the fit to an object called credit_fit.
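The general pattern of combining a model specification and a recipe in a workflow can be sketched on simulated data as follows. All names here (`toy`, `x1`, `x2`, `y`, `toy_spec`, `toy_recipe`, `toy_fit`) are hypothetical and are not the objects the question asks for.

```r
library(tidymodels)

set.seed(1) # arbitrary seed, for this illustration only

# Hypothetical toy data with two predictors and a numeric response
toy <- tibble(
    x1 = runif(50),
    x2 = runif(50),
    y = 2 * x1 - x2 + rnorm(50, sd = 0.1)
)

# Model specification: ordinary least squares via the "lm" engine
toy_spec <- linear_reg() %>%
    set_engine("lm") %>%
    set_mode("regression")

# Recipe: declare the response and the predictors
toy_recipe <- recipe(y ~ x1 + x2, data = toy)

# Workflow: bundle the recipe and model, then fit to the data
toy_fit <- workflow() %>%
    add_recipe(toy_recipe) %>%
    add_model(toy_spec) %>%
    fit(data = toy)

toy_fit
```

Printing the fitted workflow shows the preprocessing steps and the estimated intercept and slopes.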

In [ ]:

set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
credit_fit

In [ ]:

test_1.6()

Question 1.7 Multiple Choice:
{points: 1}

Looking at the slopes/coefficients above for each of the predictors, which of the following mathematical equations is correct for your prediction model?

A. $credit\: card \: balance = -531.116 -7.960*income + 3.985*credit\: card\: rating$

B. $credit\: card \: balance = -531.116 + 3.985*income -7.960*credit\: card\: rating$

C. $credit\: card \: balance = 531.116 -7.960*income - 3.985*credit\: card\: rating$

D. $credit\: card \: balance = 531.116 - 3.985*income + 7.960*credit\: card\: rating$

Assign your answer to an object called answer1.7. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer1.7

Question 1.8
{points: 1}

Calculate the $RMSE$ to assess goodness of fit on credit_fit (remember this is how well it predicts on the training data used to fit the model). Return a single numerical value named lm_rmse.
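To make the metric concrete, here is a small sketch on simulated data. It computes the training RMSE with `metrics()` from yardstick and checks it against the formula $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ computed by hand. The data and every object name (`toy`, `toy_fit`, `toy_preds`, `toy_rmse`, `manual_rmse`) are assumptions made for this example.

```r
library(tidymodels)

set.seed(1) # arbitrary seed, for this illustration only

# Hypothetical toy data with a roughly linear relationship
toy <- tibble(x = runif(50), y = 1 + 2 * x + rnorm(50, sd = 0.3))

toy_fit <- workflow() %>%
    add_recipe(recipe(y ~ x, data = toy)) %>%
    add_model(linear_reg() %>% set_engine("lm")) %>%
    fit(data = toy)

# Predict on the same data used for fitting, keeping the true values alongside
toy_preds <- toy_fit %>%
    predict(toy) %>%
    bind_cols(toy)

# RMSE via yardstick
toy_rmse <- toy_preds %>%
    metrics(truth = y, estimate = .pred) %>%
    filter(.metric == "rmse") %>%
    select(.estimate) %>%
    pull()

# The same quantity computed directly from the definition
manual_rmse <- sqrt(mean((toy_preds$y - toy_preds$.pred)^2))

toy_rmse
manual_rmse
```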

In [ ]:

set.seed(2020) # DO NOT REMOVE

#... <- credit_fit %>%
#         predict(...) %>%
#         bind_cols(...) %>%
#         ...(truth = ..., estimate = ...) %>%
#         filter(.metric == ...) %>%
#         select(...) %>%
#         pull()

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmse

In [ ]:

test_1.8()

Question 1.9
{points: 1}

Calculate $RMSPE$ using the test data. Return a single numerical value named lm_rmspe.
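The only change from a goodness-of-fit calculation is which data the predictions are made on. The sketch below, on simulated data with hypothetical names, fits on a training split and evaluates on the held-out split. It uses yardstick's `rmse()` directly, which is one of several ways to get the value.

```r
library(tidymodels)

set.seed(1) # arbitrary seed, for this illustration only

# Hypothetical toy data, split 60/40 into training and testing sets
toy <- tibble(x = runif(200), y = 1 + 2 * x + rnorm(200, sd = 0.3))
toy_split <- initial_split(toy, prop = 0.6, strata = y)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Fit using the training data only
toy_fit <- workflow() %>%
    add_recipe(recipe(y ~ x, data = toy_training)) %>%
    add_model(linear_reg() %>% set_engine("lm")) %>%
    fit(data = toy_training)

# Evaluate on data the model has not seen
toy_rmspe <- toy_fit %>%
    predict(toy_testing) %>%
    bind_cols(toy_testing) %>%
    rmse(truth = y, estimate = .pred) %>%
    pull(.estimate)

toy_rmspe
```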

In [ ]:

set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmspe

In [ ]:

test_1.9()

Question 1.9.1
{points: 1}

Redo this analysis using k-nn regression instead of linear regression. Use set.seed(2000) at the beginning of this code cell to make it reproducible. Use the same predictors and the same train/test split as you used for linear regression. If you need help, follow the step-by-step instructions below.

Create a recipe called credit_knn_recipe that preprocesses your data.
Create a tuned k-nn model specification. Assign your answer to an object called credit_knn_spec.
Create a 5-fold cross validation split of the training data. Assign your answer to an object called credit_vfold.
Put the recipe and model spec into a workflow. Assign your answer to an object called credit_knn_workflow.
Make a tibble that contains a single column called neighbors that contains all the numbers from 1 to 20. Assign your answer to an object called gridvals.
Tune the number of neighbors using cross validation, and then collect the metrics. Assign your answer to an object called credit_knn_results.
From your results, select the minimum $RMSPE$ and extract a single numerical value. Assign your answer to an object called knn_rmspe.
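The steps above can be sketched end to end on simulated data. Everything prefixed with `toy_` is a hypothetical name for this example, and the choices of standardizing the predictors and using `weight_func = "rectangular"` are assumptions about a typical k-nn regression setup, not necessarily what the hidden tests expect. The sketch also assumes the kknn package is installed.

```r
library(tidymodels)

set.seed(1) # arbitrary seed, for this illustration only

# Hypothetical toy data with two predictors
toy <- tibble(
    x1 = runif(100),
    x2 = runif(100),
    y = 3 * x1 + x2 + rnorm(100, sd = 0.2)
)

# Recipe: standardize predictors so distances are comparable
toy_knn_recipe <- recipe(y ~ ., data = toy) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

# Model specification with the number of neighbors left to be tuned
toy_knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("regression")

# 5-fold cross validation, stratified on the response
toy_vfold <- vfold_cv(toy, v = 5, strata = y)

toy_knn_workflow <- workflow() %>%
    add_recipe(toy_knn_recipe) %>%
    add_model(toy_knn_spec)

# Candidate values of k
toy_grid <- tibble(neighbors = seq(1, 20))

toy_knn_results <- toy_knn_workflow %>%
    tune_grid(resamples = toy_vfold, grid = toy_grid) %>%
    collect_metrics()

# Row(s) with the smallest cross-validated RMSPE
toy_knn_min <- toy_knn_results %>%
    filter(.metric == "rmse") %>%
    filter(mean == min(mean))

toy_knn_min
```

The `mean` column holds the RMSPE averaged over the five folds for each value of `neighbors`.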

In [ ]:

set.seed(2000) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
knn_rmspe

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create objects named credit_knn_spec, credit_knn_recipe, credit_vfold, credit_knn_workflow, and knn_rmspe', {
    expect_true(exists("credit_knn_spec")) 
    expect_true(exists("credit_knn_recipe"))
    expect_true(exists("credit_vfold"))
    expect_true(exists("credit_knn_workflow"))
    expect_true(exists('knn_rmspe'))
    })

Question 1.9.2 Multiple Choice:
{points: 1}

Which of the following reasons is/are the most likely explanation(s) as to why linear regression is better at giving predictions as measured by $RMSPE$ compared to k-nn regression?

A. Even with the best $k$ we can pick, k-nn regression seems to have slightly overfit the training data and doesn't generalize as well to data that wasn't used to train it

B. There is a fairly linear relationship between most/all of the predictors and the target/outcome variable, so linear regression is an appropriate model and fits well

C. A & B

D. None of the above

Assign your answer to an object called answer1.9.2. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer1.9.2

In [ ]:

test_1.9.2()

2. Ames Housing Prices

Source: https://media.giphy.com/media/xUPGGuzpmG3jfeYWIg/giphy.gif

If we take a look at the Business Insider report What do millennials want in a home?, we can see that millennials like newer houses that have their own defined spaces. Today we are going to look at housing data to understand how the sale price of a house is determined. Finding highly detailed housing data with final sale prices is very hard; however, researchers from Truman State University have studied and made available a data set containing multiple variables for the city of Ames, Iowa. The data set describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. You can read more about the data set here. Today we will use 5 different variables to predict the sale price of a house. These variables are:

Lot Area: lot_area
Year Built: year_built
Basement Square Footage: bsmt_sf
First Floor Square Footage: first_sf
Second Floor Square Footage: second_sf

First, load the data with the script given below.

In [ ]:

# run this cell

ames_data <- read_csv('data/ames.csv', col_types = cols()) %>%
    select(lot_area = Lot.Area, 
           year_built = Year.Built, 
           bsmt_sf = Total.Bsmt.SF, 
           first_sf = `X1st.Flr.SF`, 
           second_sf = `X2nd.Flr.SF`, 
           sale_price = SalePrice) %>%
    filter(!is.na(bsmt_sf))

head(ames_data)

Question 2.1
{points: 1}

Split the data into a training data set and a testing data set, based on a 70%-30% train-test split. Use set.seed(2019) as your seed for the split. Remember that we want to predict the sale_price based on all of the other variables.

Assign the objects to ames_split, ames_training, and ames_testing, respectively.

In [ ]:

set.seed(2019) # DO NOT CHANGE!
# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create objects named ames_split, ames_training and ames_testing', {
    expect_true(exists("ames_split")) 
    expect_true(exists("ames_training")) 
    expect_true(exists("ames_testing"))  
    })

Question 2.2
{points: 1}

Let's start by exploring the training data. Use the ggpairs() function from the GGally package to explore the relationships between the different variables.

Assign your plot object to a variable named answer2.2.

In [ ]:

set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
answer2.2

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create a plot named answer2.2', {
    expect_true(exists("answer2.2")) 
})

Question 2.3 Multiple Choice:
{points: 1}

Now that we have seen all the relationships between the variables, which of the following variables would not be a strong predictor for sale_price?

A. bsmt_sf

B. year_built

C. first_sf

D. lot_area

E. second_sf

F. It isn't clear from these plots

Assign your answer to an object called answer2.3. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer2.3

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object called answer2.3', {
    expect_true(exists('answer2.3'))
})

Question 2.4 - Linear Regression
{points: 1}

Fit a linear regression model using tidymodels with ames_training using all the variables in the data set.

Create a model specification called lm_spec.
Create a recipe called ames_recipe.
Create a workflow with your model spec and recipe, and then create the model fit and name it ames_fit.

In [ ]:

set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
ames_fit

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named lm_spec', {
    expect_true(exists("lm_spec")) 
    })
test_that('Did not create an object named ames_recipe', {
    expect_true(exists("ames_recipe")) 
    })
test_that('Did not create an object named ames_fit', {
    expect_true(exists("ames_fit")) 
    })

Question 2.5 True or False:
{points: 1}

Aside from the intercept, all the variables have a positive relationship with the sale_price. This can be interpreted as follows: as the variables decrease, the price of the houses increases.

Assign your answer to an object called answer2.5. Make sure your answer is in lowercase letters and is surrounded by quotation marks (e.g. "true" or "false").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer2.5

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named answer2.5', {
    expect_true(exists("answer2.5")) 
    })

In [ ]:

# run this cell
ames_fit$fit$fit$fit$coefficients

Question 2.6
{points: 3}

Looking at the coefficients and intercept produced from the cell block above, write down the equation for the linear model.

Make sure to use correct math typesetting syntax (i.e., surround your answer with dollar signs, $a = b$ )
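As an illustration of reading an equation off a fitted model, the sketch below fits a linear regression to R's built-in `mtcars` data (not the Ames data) and lists the coefficients by term. It uses `extract_fit_parsnip()` and `tidy()` as an alternative to indexing into the fit object; the names `car_fit` and `car_coefs` are hypothetical.

```r
library(tidymodels)

# Fit a linear regression of mpg on wt and hp using built-in data
car_fit <- workflow() %>%
    add_recipe(recipe(mpg ~ wt + hp, data = mtcars)) %>%
    add_model(linear_reg() %>% set_engine("lm")) %>%
    fit(data = mtcars)

# One row per term: the intercept, then one slope per predictor
car_coefs <- car_fit %>%
    extract_fit_parsnip() %>%
    tidy()

car_coefs %>%
    select(term, estimate)
```

Each estimate then fills one slot of an equation of the form $mpg = b_0 + b_1 \cdot wt + b_2 \cdot hp$, where $b_0$ is the intercept and $b_1$, $b_2$ are the slopes for `wt` and `hp`.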

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 2.7 Multiple Choice:
{points: 1}

Why can we not easily visualize the model above as a line or a plane in a single plot?

A. This is not true, we can actually easily visualize the model

B. The intercept is much larger (6 digits) than the coefficients (single/double digits)

C. There are more than 2 predictors

D. None of the above

Assign your answer to an object called answer2.7. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer2.7

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named answer2.7', {
    expect_true(exists("answer2.7")) 
    })

Question 2.8
{points: 1}

We need to evaluate how well our model is doing. For this question, calculate the $RMSPE$ (a single numerical value) of the linear regression model using the test data set and assign it to an object named ames_rmspe.

In [ ]:

set.seed(2020) # DO NOT REMOVE

# your code here
fail() # No Answer - remove if you provide an answer
ames_rmspe

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named ames_rmspe', {
    expect_true(exists("ames_rmspe")) 
    })

Question 2.9 Multiple Choice:
{points: 1}

Which of the following statements is incorrect?

A. $RMSE$ is a measure of goodness of fit

B. $RMSE$ measures how well the model predicts on data it was trained with

C. $RMSPE$ measures how well the model predicts on data it was not trained with

D. $RMSPE$ measures how well the model predicts on data it was trained with

Assign your answer to an object called answer2.9. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
answer2.9

In [ ]:

# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding 
# when you have the correct answer.
test_that('Did not create an object named answer2.9', {
    expect_true(exists("answer2.9")) 
    })

In [ ]:

source("cleanup_tutorial_09.R")
