UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_10/worksheet_10.ipynb
Kernel: R

Worksheet 10 - Regression cont'd

Lecture and Tutorial Learning Goals:

By the end of the week, students will be able to:

  • In a dataset with > 2 variables, perform k-nn regression in R using caret's train with method = "knn" to predict the values for a test dataset.

  • In a dataset with > 2 variables, perform ordinary least squares regression in R using caret's train with method = "lm" to predict the values for a test dataset.

### Run this cell before continuing.
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
library(GGally)

Warm-up questions

Here are some warm-up questions on the topic of multivariate regression to get you thinking before we jump into data analysis. The course readings should help you answer these.

Question 0.0

In multivariate k-nn regression with one outcome/target variable and two predictor variables, the predictions take the form of what shape?

A. a flat plane

B. a wiggly/flexible plane

C. a straight line

D. a wiggly/flexible line

E. a 4D hyperplane

F. a 4D wiggly/flexible hyperplane

Save the letter of the answer you think is correct to a variable named answer0.0. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer0.0

test_that('Solution is correct', {
    expect_equal(digest(answer0.0), '3a5505c06543876fe45598b5e5e5195d')
})
print("Success!")

Question 0.1

You must scale the predictor variables for linear regression once you are working with > 1 predictor (i.e., multivariate linear regression). True or false?

To answer, assign the value "true" or "false" to a variable named answer0.1. Make sure you put quotations around the word and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer0.1

test_that('Solution is correct', {
    expect_equal(digest(answer0.1), 'd2a90307aac5ae8d0ef58e2fe730d38b')
})
print("Success!")

Question 0.2 (optional - not graded)

Is there a case in multivariate regression where you do need to scale the predictors? If so, what is it, and why do you need to do that? If you think there is none, state that.

YOUR ANSWER HERE

Predicting credit card balance

In this worksheet we will work with a simulated data set that contains information we can use to create a model to predict customer credit card balance. A bank might use such information to predict which customers are likely to be the most profitable to lend to (customers that carry a balance but do not default, for example).

Specifically, we wish to build a model to predict credit card balance (Balance column) based on income (Income column), credit rating (Rating column), credit card limit (Limit column) and age (Age column).

Question 1.0

Load the data located at this URL and assign it to an object called credit.

# your code here
fail() # No Answer - remove if you provide an answer
head(credit)

test_that('Solution is correct', {
    expect_equal(nrow(credit), 400)
    expect_equal(ncol(credit), 12)
    expect_that("Income" %in% colnames(credit), is_true())
    expect_that("Balance" %in% colnames(credit), is_true())
})
print("Success!")

Question 1.1

The first column in the data set, named X1, is simply the row numbers and thus not informative. Let's remove it from the data frame using select. Name the modified data frame credit.
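As a hypothetical illustration of dropping a column by name, here is the same idea on the built-in mtcars data. With the tidyverse loaded you could write `mtcars %>% select(-cyl)`; base R's subset() does the same thing without extra packages:

```r
# Drop the cyl column from mtcars (illustration only; the worksheet data
# and column names differ).
smaller <- subset(mtcars, select = -cyl)
ncol(mtcars)                  # 11
ncol(smaller)                 # 10
"cyl" %in% colnames(smaller)  # FALSE
```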

# your code here
fail() # No Answer - remove if you provide an answer
head(credit)

test_that('Solution is correct', {
    expect_equal(nrow(credit), 400)
    expect_equal(ncol(credit), 11)
    expect_that("Income" %in% colnames(credit), is_true())
    expect_that("Balance" %in% colnames(credit), is_true())
    expect_that("X1" %in% colnames(credit), is_false())
})
print("Success!")

Question 1.2

Using all the observations in the data set, create a ggpairs scatterplot of all the columns we are interested in including in our model. Name the plot object credit_eda.
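To see what a pairwise scatterplot matrix looks like, here is a hypothetical sketch on a few mtcars columns. With GGally loaded, `ggpairs(mtcars[, c("mpg", "disp", "hp", "wt")])` draws this kind of grid (with correlations and density plots added); base R's pairs() shows the same basic idea without extra packages:

```r
# One panel per pair of columns, so each predictor can be compared against
# every other (illustration only; the worksheet uses different columns).
pairs(mtcars[, c("mpg", "disp", "hp", "wt")])
```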

# your code here
fail() # No Answer - remove if you provide an answer
credit_eda

test_that('Solution is correct', {
    expect_equal(nrow(credit_eda$data), 400)
    expect_equal(ncol(credit_eda$data), 5)
    expect_true('ggmatrix' %in% c(class(credit_eda)))
})
print("Success!")

Question 1.3

Based on your exploratory data visualization above, are there any columns that do not appear to be somewhat correlated (i.e., increase or decrease as balance increases or decreases) with balance (and thus we might not want to include them in our analysis)?

A. Income

B. Rating

C. Limit

D. Age

E. All appear to correlate well with credit card balance

Save the letter of the answer you think is correct to a variable named answer1.3. Make sure you put quotations around the letter and pay attention to case.

# your code here
fail() # No Answer - remove if you provide an answer
answer1.3

test_that('Solution is correct', {
    expect_equal(digest(answer1.3), 'c1f86f7430df7ddb256980ea6a3b57a4')
})
print("Success!")

Question 1.4

If you answered above that a column should be removed based on your exploratory data analysis, remove it now. In either case, also remove the columns that we are not interested in analyzing (ones we did not specify in the problem statement at the top of the worksheet). Name the modified data frame credit.

# your code here
fail() # No Answer - remove if you provide an answer
head(credit)

test_that('Solution is correct', {
    expect_equal(nrow(credit), 400)
    expect_equal(digest(as.numeric(dim(credit))), 'aea34bfa210afd186bc0977c50cce4bc')
    expect_equal(map(colnames(credit), digest) %in% c('50dfc98e31ea6af6d66703369d562adc',
                                                      'b048a33cce93da4bff391dc74e4dd49c',
                                                      '1fcf72666a688bb8f196a3a47799e640',
                                                      'dabae570c8079ba0d7d2dfab803abbf4'),
                 c(TRUE, TRUE, TRUE, TRUE))
})
print("Success!")

Question 1.5

Now that we have done our exploratory data analysis and cleaned up our data, we should create our training and testing data sets. We will use 60% of the data as training data. Use set.seed(2000) and use the Balance column as the input to createDataPartition().

At the end of this question you should have 4 objects named X_train, Y_train, X_test and Y_test.
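To make the shape of the split concrete, here is a hypothetical 60/40 split of mtcars. caret's createDataPartition() additionally stratifies the sampled row indices on the outcome; plain sample() is the unstratified base-R analogue of the same idea:

```r
set.seed(2000)

# Draw row indices for a 60% training set (illustration only; the worksheet
# data, seed usage, and predictors differ).
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.6 * n))

X_train <- mtcars[train_idx, c("disp", "hp", "wt")]   # predictors, training
Y_train <- mtcars$mpg[train_idx]                      # outcome, training
X_test  <- mtcars[-train_idx, c("disp", "hp", "wt")]  # predictors, test
Y_test  <- mtcars$mpg[-train_idx]                     # outcome, test

nrow(X_train)  # 19 (60% of 32 rows)
nrow(X_test)   # 13
```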

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is correct', {
    expect_equal(dim(X_train), c(241, 3))
    expect_equal(class(X_train), 'data.frame')
    expect_equal(dim(X_test), c(159, 3))
    expect_equal(class(X_test), 'data.frame')
    expect_equal(length(Y_train), 241)
    expect_equal(class(Y_train), 'integer')
    expect_equal(length(Y_test), 159)
    expect_equal(class(Y_test), 'integer')
})
print("Success!")

Question 1.6

Now use caret's train function with method = "lm" to fit your linear regression model. Name your linear regression model object lm_model.
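Under the hood, caret's train with method = "lm" wraps base R's lm(). A hypothetical sketch of the underlying fit on mtcars (different data and column names than the worksheet):

```r
# Fit a linear model of mpg on three predictors; train(x = ..., y = ...,
# method = "lm") produces essentially this fit inside its returned object.
lm_fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
coef(lm_fit)  # intercept plus one slope per predictor (4 values here)
```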

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is correct', {
    expect_true(lm_model$results$intercept)
    expect_equal(lm_model$method, 'lm')
    expect_equal(colnames(lm_model$trainingData), c('Income', 'Rating', 'Limit', '.outcome'))
    expect_equal(dim(lm_model$trainingData), c(241, 4))
})
print("Success!")

Question 1.7

Given that we cannot see in 4 dimensions, instead of creating a visualization of the model we will print out a table of the regression slopes/coefficients. To do this, we need to access some attributes of the model object; we provide scaffolding below. To get the slopes/coefficients as a nice data frame, we use the t() (transpose) function to pivot the data, and then data.frame() to convert it to a data frame. Name the resultant data frame lm_coeffs.
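The same coefficients-to-data-frame pattern, shown on a hypothetical base-R lm fit of mtcars (a caret model stores the fit in $finalModel, so the coefficient vector sits one level deeper there):

```r
lm_fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

# t() turns the named coefficient vector into a 1-row matrix;
# data.frame() then gives one column per coefficient.
lm_coeffs <- data.frame(t(lm_fit$coefficients))
dim(lm_coeffs)  # 1 row, 4 columns: intercept plus three slopes
```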

# ... <- ...$finalModel$coefficients
#     %>% ...
#     %>% data.frame()

# your code here
fail() # No Answer - remove if you provide an answer
lm_coeffs

test_that('Solution is correct', {
    expect_equal(dim(lm_coeffs), c(1, 4))
    expect_equal(class(lm_coeffs), 'data.frame')
})
print("Success!")

Question 1.8 (optional - not graded)

Looking at the slopes/coefficients above from each of the predictors, write a mathematical equation for your prediction model.

A couple hints:

  • surrounding your equation with $ signs in a markdown cell, makes it a LaTeX equation

  • to add white space in a LaTeX equation, use \:

YOUR ANSWER HERE

Question 1.9

Calculate the RMSE to assess goodness of fit of your lm_model (remember, this measures how well it predicts on the training data used to fit the model). Return a single numerical value named lm_rmse.
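The RMSE is the square root of the mean squared residual. A hypothetical worked example on mtcars, computed with base R (caret also provides an RMSE() helper):

```r
lm_fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

# Predict on the SAME data used to fit the model, then take
# sqrt(mean(squared errors)) -- that is the RMSE.
preds <- predict(lm_fit, mtcars)
lm_rmse <- sqrt(mean((mtcars$mpg - preds)^2))
lm_rmse  # a single number, in the units of the outcome
```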

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmse

test_that('Solution is correct', {
    expect_equal(digest(round(lm_rmse, 1)), '65f0553dfb97ae3f80c957eb60271f5b')
})
print("Success!")

Question 2.0

Calculate the RMSPE using the test data. Return a single numerical value named lm_rmspe.
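The RMSPE uses the same square-root-of-mean-squared-error formula as the RMSE, but is computed on held-out test data the model never saw during fitting. A hypothetical sketch on mtcars:

```r
set.seed(2000)

# Split mtcars 60/40, fit on the training portion only
# (illustration only; the worksheet data and split differ).
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.6 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

fit <- lm(mpg ~ disp + hp + wt, data = train)

# Same formula as RMSE, but the errors come from the held-out test set.
lm_rmspe <- sqrt(mean((test$mpg - predict(fit, test))^2))
lm_rmspe
```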

# your code here
fail() # No Answer - remove if you provide an answer
lm_rmspe

test_that('Solution is correct', {
    expect_equal(digest(round(lm_rmspe, 1)), '63735d3afca0c50ccf19bfd8381bff44')
})
print("Success!")

Question 2.1

Redo this analysis using k-nn regression instead of linear regression. Assign a single numeric value for the RMSPE of your k-nn model as your answer, and name it knn_rmspe. Use the same predictors and train/test data split as you used for linear regression, and use 10-fold cross-validation to choose k.
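To make the prediction step concrete, here is a minimal k-nn regression sketch with a hypothetical helper function (caret's method = "knn" does all of this for you, plus cross-validation to choose k):

```r
# Predict the outcome for one new point as the average outcome of its
# k nearest training points (Euclidean distance).
knn_predict <- function(X_train, y_train, x_new, k = 5) {
  # distance from x_new to every training row
  dists <- sqrt(rowSums(sweep(as.matrix(X_train), 2, as.numeric(x_new))^2))
  nearest <- order(dists)[seq_len(k)]
  mean(y_train[nearest])  # the k-nn regression prediction
}

# k-nn relies on distances, so the predictors should be scaled first.
X <- scale(mtcars[, c("disp", "hp", "wt")])
knn_predict(X, mtcars$mpg, X[1, ], k = 5)  # prediction for the first row
```

Note this toy version predicts a training point using itself as one of the neighbours; a real workflow predicts held-out test points, as caret does.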

# your code here
fail() # No Answer - remove if you provide an answer

test_that('Solution is correct', {
    expect_equal(digest(class(knn_rmspe)), '46606ee201b428a3fa6c8a0d3d9e671c')
    expect_that(knn_rmspe < 380, is_true())
    expect_that(knn_rmspe > 320, is_true())
})
print("Success!")

Question 2.2 (optional - not graded)

Discuss which model gives better predictions and why you think that might be happening.

YOUR ANSWER HERE