
Tutorial 10: Regression Wrap-Up

Regression learning objectives:

  • Recognize situations where a simple regression analysis would be appropriate for making predictions.

  • Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.

  • Interpret the output of a k-nn regression.

  • In a dataset with two variables, perform k-nearest neighbour regression in R using caret::train() to predict the values for a test dataset.

  • Using R, execute cross-validation to choose the number of neighbours.

  • Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean square prediction error).

  • In a dataset with > 2 variables, perform k-nn regression in R using caret's train with method = "knn" to predict the values for a test dataset.

  • In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).

  • Describe advantages and disadvantages of the k-nearest neighbour regression approach.

  • Perform ordinary least squares regression in R using caret's train with method = "lm" to predict the values for a test dataset.

  • Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression from the same dataset.

  • In R, overlay the ordinary least squares regression lines from geom_smooth on a single plot.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
library(GGally)
source("tests_tutorial_09.R")

1. Predicting credit card balance

Source: https://media.giphy.com/media/LCdPNT81vlv3y/giphy-downsized-large.gif

In this tutorial we will work with a simulated data set containing information we can use to build a model to predict customer credit card balance. A bank might use such information to identify which customers are likely to be the most profitable to lend to (for example, customers who carry a balance but do not default).

Specifically, we wish to build a model to predict credit card balance (Balance column) based on income (Income column) and credit rating (Rating column).

Question 1.0
{points: 1}

Load the data located at this URL and assign it to an object called credit.

# your code here
fail() # No Answer - remove if you provide an answer
head(credit)
test_1.0()

Question 1.1
{points: 1}

Select only the columns of data we are interested in using for our prediction (both the predictors and the response variable). Name the modified data frame credit.
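A sketch of the selection step, using the columns named above:

credit <- credit %>%
    select(Income, Rating, Balance)  # keep only the predictors and the response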

# your code here
fail() # No Answer - remove if you provide an answer
head(credit)
test_1.1()

Question 1.2
{points: 1}

Before we perform exploratory data analysis, we should create our training and testing data sets. We will use 60% of the data as training data. Use set.seed(2000) and pass the Balance column as the input to createDataPartition().

At the end of this question you should have 4 objects named X_train, Y_train, X_test and Y_test.
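A minimal sketch of the usual caret pattern (illustrative only, not necessarily the graded solution; it assumes credit contains the Income, Rating, and Balance columns):

set.seed(2000)
# createDataPartition() returns row indices for the training set
split_rows <- as.vector(createDataPartition(credit$Balance, p = 0.6, list = FALSE))
X_train <- credit[split_rows, c("Income", "Rating")]   # training predictors
Y_train <- credit$Balance[split_rows]                  # training response
X_test  <- credit[-split_rows, c("Income", "Rating")]  # test predictors
Y_test  <- credit$Balance[-split_rows]                 # test response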

set.seed(2000)
# your code here
fail() # No Answer - remove if you provide an answer
test_1.2()

Question 1.3
{points: 1}

Using only the observations in the training data set, create a ggpairs scatterplot of all the columns we are interested in including in our model. Name the plot object credit_eda.
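One illustrative way to build such a plot, assuming the X_train and Y_train objects from Question 1.2:

# reunite the training predictors and response, then plot all pairwise relationships
credit_eda <- bind_cols(X_train, tibble(Balance = Y_train)) %>%
    ggpairs()
credit_eda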

# your code here
fail() # No Answer - remove if you provide an answer
credit_eda
test_1.3()

Question 1.4
{points: 3}

Discuss the relationships you observe in the scatter plots above between the response variable and each of the predictors.

YOUR ANSWER HERE

Question 1.5
{points: 1}

Now use caret's train function with method = "lm" to fit your linear regression model. Name your linear regression model object lm_model.
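For reference, the general pattern with caret's x/y interface looks like this (a sketch, assuming the training objects from Question 1.2):

# fit ordinary least squares regression via caret
lm_model <- train(x = X_train, y = Y_train, method = "lm")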

# your code here
fail() # No Answer - remove if you provide an answer
test_1.5()

Question 1.6
{points: 1}

Let's print out a table of the regression slopes/coefficients. To do this, we want to access some attributes of the model object; we provide scaffolding below. To get the slopes/coefficients as a nice data frame, we use the t() (transpose) function to pivot the data and then data.frame() to convert the result to a data frame. Name the resultant data frame lm_coeffs.
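One way to fill in the scaffold below (assuming lm_model from Question 1.5):

lm_coeffs <- lm_model$finalModel$coefficients %>%  # named vector: intercept and slopes
    t() %>%                                        # transpose into a 1-row matrix
    data.frame()                                   # convert to a data frame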

# ... <- ...$finalModel$coefficients
#     %>% ...
#     %>% data.frame()
# your code here
fail() # No Answer - remove if you provide an answer
lm_coeffs
test_1.6()

Question 1.7
{points: 3}

Looking at the slopes/coefficients above from each of the predictors, write a mathematical equation for your prediction model.

A couple hints:

  • surrounding your equation with $ signs in a Markdown cell makes it a LaTeX equation

  • to add white space in a LaTeX equation, use \:
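For example, a model with two predictors could be written as follows, where the coefficients are hypothetical placeholders rather than the values you found:

$\widehat{Balance} = \beta_0 + \beta_1 \: Income + \beta_2 \: Rating$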

YOUR ANSWER HERE

Question 1.8
{points: 1}

Calculate the RMSE to assess goodness of fit of your lm_model (remember, this measures how well the model predicts on the training data used to fit it). Return a single numerical value named lm_rmse.
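One way to compute it directly (a sketch, assuming lm_model, X_train, and Y_train from earlier questions):

# RMSE: prediction error on the same data used to fit the model
lm_rmse <- sqrt(mean((predict(lm_model, X_train) - Y_train)^2))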

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmse
test_1.8()

Question 1.9
{points: 1}

Calculate the RMSPE using the test data. Return a single numerical value named lm_rmspe.
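The calculation mirrors the RMSE above, but uses the held-out test set (a sketch under the same assumptions):

# RMSPE: prediction error on data the model has never seen
lm_rmspe <- sqrt(mean((predict(lm_model, X_test) - Y_test)^2))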

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmspe
test_1.9()

Question 1.9.1
{points: 3}

Redo this analysis using k-nn regression instead of linear regression. Use set.seed(2000) at the beginning of this code cell to make it reproducible. Assign a single numeric value for the RMSPE of your k-nn model as your answer, and name it knn_rmspe. Use the same predictors and train-test data split as you used for linear regression, and use 10-fold cross-validation to choose k.
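A sketch of the workflow; the tuning grid here is an illustrative choice, not one prescribed by the question:

set.seed(2000)
knn_model <- train(x = X_train, y = Y_train, method = "knn",
                   trControl = trainControl(method = "cv", number = 10),  # 10-fold CV
                   tuneGrid = data.frame(k = seq(1, 51, by = 2)))         # candidate k values
knn_rmspe <- sqrt(mean((predict(knn_model, X_test) - Y_test)^2))          # RMSPE on the test set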

set.seed(2000)
# your code here
fail() # No Answer - remove if you provide an answer
knn_rmspe

Question 1.9.2
{points: 3}

Discuss which model gives better predictions and why you think that might be happening.

YOUR ANSWER HERE

2. Ames Housing Prices

Source: https://media.giphy.com/media/xUPGGuzpmG3jfeYWIg/giphy.gif

If we take a look at the Business Insider report What do millennials want in a home?, we can see that millennials like newer houses that have their own defined spaces. Today we are going to be looking at housing data to understand how the sale price of a house is determined. Finding highly detailed housing data that includes final sale prices is very hard; however, researchers from Truman State University have studied and made available a data set containing multiple variables for the city of Ames, Iowa. The data set describes the sales of individual residential properties in Ames, Iowa from 2006 to 2010. You can read more about the data set here. Today we will be looking at 5 different variables to predict the sale price of a house. These variables are:

  • Lot Area: lot_area

  • Year Built: year_built

  • Basement Square Footage: bsmt_sf

  • First Floor Square Footage: first_sf

  • Second Floor Square Footage: second_sf

We are going to be looking at two approaches: linear regression and k-nn regression. First, load the data with the script given below.

ames_data <- read_csv('data/ames.csv', col_types = cols()) %>%
    select(lot_area = Lot.Area,
           year_built = Year.Built,
           bsmt_sf = Total.Bsmt.SF,
           first_sf = `X1st.Flr.SF`,
           second_sf = `X2nd.Flr.SF`,
           sale_price = SalePrice) %>%
    filter(!is.na(bsmt_sf))
head(ames_data)

Question 2.1
{points: 3}

Split the data into a training data set and a testing data set, based on a 70%-30% train-test split. Remember that we want to predict sale_price based on all of the other variables.

Assign the objects to X_train, Y_train, X_test, and Y_test respectively.

Use 2019 as your seed for the split.
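The same createDataPartition pattern from Question 1.2 applies here, now with a 70% split (a sketch; the predictor names come from the loading script above):

set.seed(2019)
split_rows <- as.vector(createDataPartition(ames_data$sale_price, p = 0.7, list = FALSE))
predictors <- c("lot_area", "year_built", "bsmt_sf", "first_sf", "second_sf")
X_train <- ames_data[split_rows, predictors]
Y_train <- ames_data$sale_price[split_rows]
X_test  <- ames_data[-split_rows, predictors]
Y_test  <- ames_data$sale_price[-split_rows]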

set.seed(2019) # DO NOT CHANGE!
# your code here
fail() # No Answer - remove if you provide an answer
test_that('Did not create objects named X_train, Y_train, X_test, Y_test', {
    expect_true(exists("X_train"))
    expect_true(exists("Y_train"))
    expect_true(exists("X_test"))
    expect_true(exists("Y_test"))
})

Question 2.2
{points: 3}

Let's start by exploring the training data. Use the ggpairs() function from the GGally package to explore the relationships between the different variables.

Assign your plot object to a variable named answer2.2.
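As in Question 1.3, one illustrative approach (assuming the objects from Question 2.1):

answer2.2 <- bind_cols(X_train, tibble(sale_price = Y_train)) %>%
    ggpairs()
answer2.2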

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Did not create a plot named answer2.2', {
    expect_true(exists("answer2.2"))
})

Question 2.3
{points: 3}

Now that we have seen the multiple relationships between the variables, which one(s) do you think will be strong predictors for sale_price? On what do you base this? If you think none would be good predictors, explain why.

YOUR ANSWER HERE

Question 2.4 - Linear Regression
{points: 3}

Fit a linear regression model with X_train and Y_train using all the variables in the data set and save it to an object called lm_reg. Extract the coefficients of the model and assign them to an object named lm_coefs.
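A sketch following the same pattern as Questions 1.5 and 1.6:

lm_reg <- train(x = data.frame(X_train), y = Y_train, method = "lm")  # OLS fit via caret
lm_coefs <- lm_reg$finalModel$coefficients %>%
    t() %>%
    data.frame()                                                      # coefficients as a data frame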

# your code here
fail() # No Answer - remove if you provide an answer
test_that('Did not create an object named lm_reg', {
    expect_true(exists("lm_reg"))
})
test_that('Did not create an object named lm_coefs', {
    expect_true(exists("lm_coefs"))
})

Question 2.5
{points: 3}

Provide a brief interpretation of the coefficients above and discuss the relationship between sale_price and each variable (i.e., does each slope indicate a positive or negative relationship between house sale price and the corresponding predictor?).

You don't have to interpret the intercept.

YOUR ANSWER HERE

Question 2.6
{points: 3}

Use the coefficients above to write a mathematical equation for your prediction model. You can round the coefficients to two decimal places for this.

YOUR ANSWER HERE

Question 2.7
{points: 3}

Could we easily visualize the predictions of the model above as a line or a plane in a single plot? If so, explain how. If not, explain why not.

YOUR ANSWER HERE

Question 2.8
{points: 3}

We need to evaluate how well our model is doing. To do this, calculate the RMSPE of the linear regression model and assign it to an object named lm_rmspe.
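As in Question 1.9, the RMSPE comes from predictions on the test set (a sketch, assuming lm_reg and the test objects from Question 2.1):

lm_rmspe <- sqrt(mean((predict(lm_reg, data.frame(X_test)) - Y_test)^2))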

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmspe
test_that('Did not create an object named lm_rmspe', {
    expect_true(exists("lm_rmspe"))
})

Question 2.9
{points: 3}

Above we calculated the RMSPE to assess how well our model predicts on new data. Why did we not calculate the RMSE to answer that question? (Hint: think of the definitions of RMSE and RMSPE.)
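For reference, both quantities are computed with the same formula; the only difference is which observations go into it:

$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$

For RMSE the sum runs over the training observations used to fit the model; for RMSPE it runs over the held-out test observations.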

YOUR ANSWER HERE

Question 2.10
(optional)

"Logarithmically transforming variables in a regression model is a very common way to handle situations where a non-linear relationship exists between the independent and dependent variables. Using the logarithm of one or more variables instead of the un-logged form makes the effective relationship non-linear, while still preserving the linear model." - Linear Regression Models with Logarithmic Transformations

Take the logarithm of the sale_price variable and fit your linear model again. Do you have a lower RMSPE? Do you have the same model as in Question 2.4?
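A sketch of one possible workflow; note the back-transformation with exp() so that the RMSPE is on the original dollar scale and comparable to the earlier models:

set.seed(2019)
log_model <- train(x = data.frame(X_train), y = log(Y_train), method = "lm")  # fit on log(sale_price)
log_preds <- exp(predict(log_model, data.frame(X_test)))                      # back to dollars
log_rmspe <- sqrt(mean((log_preds - Y_test)^2))
log_rmspe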