Path: blob/master/2019-spring/materials/tutorial_10/tutorial_10.ipynb
Tutorial 10: Regression wrap-up
Regression learning objectives:

- Recognize situations where a simple regression analysis would be appropriate for making predictions.
- Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
- Interpret the output of a k-nn regression.
- In a dataset with two variables, perform k-nearest neighbour regression in R using `caret::train()` to predict the values for a test dataset.
- Using R, execute cross-validation to choose the number of neighbours.
- Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean squared prediction error).
- Describe the advantages and disadvantages of the k-nearest neighbour regression approach.
- In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs. RMSPE).
- In a dataset with two variables, perform simple ordinary least squares regression in R using caret's `train()` with `method = "lm"` to predict the values for a test dataset.
- Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression on the same dataset.
- In R, overlay ordinary least squares regression lines from `geom_smooth()` on a single plot.
- In a dataset with more than two variables, perform k-nn regression in R using caret's `train()` with `method = "knn"` to predict the values for a test dataset.
- In a dataset with more than two variables, perform simple ordinary least squares regression in R using caret's `train()` with `method = "lm"` to predict the values for a test dataset.
1. Ames Housing Prices
Source: https://media.giphy.com/media/xUPGGuzpmG3jfeYWIg/giphy.gif
If we take a look at the Business Insider report What do millennials want in a home?, we can see that millennials like newer houses that have their own defined spaces. Today we are going to be looking at housing data to understand how the sale price of a house is determined. Highly detailed housing data that includes final sale prices is very hard to find; however, researchers from Truman State University have studied and made available a dataset containing multiple variables for the city of Ames, Iowa. The data set describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. You can read more about the data set here. Today we will be using 5 different variables to predict the sale price of a house. These variables are:
- Lot Area: `lot_area`
- Year Built: `year_built`
- Basement Square Footage: `bsmt_sf`
- First Floor Square Footage: `first_sf`
- Second Floor Square Footage: `second_sf`
We are going to be looking at two approaches: linear regression and k-nn regression. First, load the data with the script given below.
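The loading script itself lives in a code cell. As a minimal sketch of what that step might look like (the filename `ames_housing.csv` and the use of `readr::read_csv()` are assumptions, not the tutorial's actual script):

```r
library(tidyverse)

# Hypothetical loading step -- the tutorial supplies its own script
ames_data <- read_csv("ames_housing.csv") %>%   # assumed filename
  select(lot_area, year_built, bsmt_sf, first_sf, second_sf, sale_price)

head(ames_data)
```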
Question 1.1
Let's start by exploring the data. Use the `ggpairs()` function from the GGally package to explore the relationships between the different variables.

Assign your plot object to a variable named `answer1.1`.
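A sketch of the call, assuming the loaded data frame is named `ames_data`:

```r
library(GGally)

# Pairwise scatterplots, correlations, and per-variable distributions
answer1.1 <- ggpairs(ames_data)
answer1.1
```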
Question 1.2
Now that we have seen the multiple relationships between the variables, which one(s) do you think will be good predictors for `sale_price`? On what do you base this? If you think none would be good predictors, explain why.
YOUR ANSWER HERE
Question 1.3
Let's split the data into a train dataset and a test dataset, based on a 70%-30% split. Remember that we want to predict `sale_price` based on all of the other variables.

Assign the objects to `X_train`, `Y_train`, `X_test`, and `Y_test` respectively.

Use 2019 as your seed for the split.
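One way to produce this split with caret (assuming the data frame is named `ames_data`):

```r
library(caret)
library(dplyr)

set.seed(2019)  # seed required by the question
train_idx <- as.vector(
  createDataPartition(ames_data$sale_price, p = 0.70, list = FALSE)
)

X_train <- ames_data[train_idx, ]  %>% select(-sale_price)
Y_train <- ames_data$sale_price[train_idx]
X_test  <- ames_data[-train_idx, ] %>% select(-sale_price)
Y_test  <- ames_data$sale_price[-train_idx]
```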
Question 1.4 - Linear Regression
Fit a linear regression model with `X_train` and `Y_train` and save it to an object called `lm_reg`.
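With caret's x/y interface this is a short call; a sketch, assuming the split objects from question 1.3 exist:

```r
library(caret)

# method = "lm" fits ordinary least squares under the hood
lm_reg <- train(x = X_train, y = Y_train, method = "lm")
lm_reg
```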
Question 1.5
Extract the coefficients of the model and provide a brief interpretation of the relationship between `sale_price` and each variable (i.e., does the slope indicate a positive or negative relationship between house sale price and each of the predictors?).

You don't have to interpret the intercept.

Assign the coefficients to an object named `lm_coefs`.
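A caret model stores the fitted `lm` object in its `finalModel` slot, so the coefficients can be pulled out like this:

```r
# Named vector: (Intercept) plus one slope per predictor
lm_coefs <- coef(lm_reg$finalModel)
lm_coefs
```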
YOUR ANSWER HERE
Question 1.51
Use the coefficients above to write a mathematical equation for your prediction model. You can round the coefficients to two decimal places for this.
YOUR ANSWER HERE
Question 1.52 (Optional - not graded)
Could we easily visualize the predictions of the model above as a line or a plane in a single plot? If so, explain how. If not, explain why not.
YOUR ANSWER HERE
Question 1.6
We need to evaluate how well our model is doing. For this, calculate the RMSPE of the linear regression model and assign it to an object named `lm_rmspe`.
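A sketch of the RMSPE computation on the held-out test set:

```r
# RMSPE: root mean squared error of predictions on data the model never saw
test_preds <- predict(lm_reg, newdata = X_test)
lm_rmspe   <- sqrt(mean((Y_test - test_preds)^2))
lm_rmspe
```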
Question 1.7 - KNN Regression
We are now going to do k-nn regression to see how both models compare on this dataset. We first need to scale our variables. Assign the scaled predictors to an object named `scaled_ames_data`.
Optional Question - Won't be graded
Try doing the scaling using the `mutate_at()` function or one of the functions in the `map_*` family.
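One way to do this with `mutate_at()` (a sketch; the predictor list mirrors the five variables introduced above, and `sale_price` is deliberately left unscaled):

```r
library(dplyr)

scaled_ames_data <- ames_data %>%
  mutate_at(vars(lot_area, year_built, bsmt_sf, first_sf, second_sf),
            ~ as.numeric(scale(.)))  # centre and scale each predictor
```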
Question 1.8
Let's split the data into a train dataset and a test dataset, based on a 70%-30% split.

Assign the objects to `X_train`, `Y_train`, `X_test`, and `Y_test` respectively.

Use 2019 as your seed for the split.
Question 1.9
Remember that one of the steps of k-nn regression is choosing which K we are going to use. Use 10-fold cross-validation to determine which K you will use. Assign the best K to an object called `best_k`.
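A sketch using caret's built-in tuning (the candidate grid of odd K values is an assumption; any reasonable range works):

```r
library(caret)

set.seed(2019)
train_control <- trainControl(method = "cv", number = 10)  # 10-fold CV
k_grid        <- data.frame(k = seq(1, 51, by = 2))        # assumed grid

knn_cv <- train(x = X_train, y = Y_train,
                method    = "knn",
                trControl = train_control,
                tuneGrid  = k_grid)

best_k <- knn_cv$bestTune$k
best_k
```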
Question 1.10
Now that we know the best K, we can go ahead and train our k-nn regression model. Assign the model to an object called `knn_reg`.
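Given `best_k`, the final model can be fit by fixing the tuning grid to that single value (a sketch):

```r
library(caret)

# A one-row grid pins caret to the K chosen by cross-validation
knn_reg <- train(x = X_train, y = Y_train,
                 method   = "knn",
                 tuneGrid = data.frame(k = best_k))
```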
Question 1.11
Let's evaluate how well our model is doing. For this, calculate the RMSPE of the k-nn regression model and assign it to an object named `knn_rmspe`.
Question 1.12
Compare the RMSPE of both KNN and Linear regression. Explain why you think one method might be better than the other.
YOUR ANSWER HERE
Question 1.13
Describe and explain one advantage of using Linear Regression over K-NN regression.
YOUR ANSWER HERE
Question 1.14
Above we calculated RMSPE to assess how well our model predicts on new data. Why did we not calculate RMSE to answer that question? (Hint: think of the definitions of RMSE and RMSPE.)
YOUR ANSWER HERE
Question 1.15 (Optional)
"Logarithmically transforming variables in a regression model is a very common way to handle situations where a non-linear relationship exists between the independent and dependent variables. Using the logarithm of one or more variables instead of the un-logged form makes the effective relationship non-linear, while still preserving the linear model." - Linear Regression Models with Logarithmic Transformations
Take the logarithm of the `sale_price` variable and fit your linear model again. Do you have a lower RMSPE? Do you have the same model as in question 1.4?
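A sketch of the log-scale refit; note that predictions must be back-transformed with `exp()` before computing an RMSPE that is comparable on the original dollar scale (that comparison choice is an assumption):

```r
library(caret)

log_lm    <- train(x = X_train, y = log(Y_train), method = "lm")
log_preds <- exp(predict(log_lm, newdata = X_test))  # back to dollars
log_rmspe <- sqrt(mean((Y_test - log_preds)^2))
log_rmspe
```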
2. World Cup 2022
Source: https://media.giphy.com/media/75NksGAgEUicg/giphy.gif
Football: the most loved sport around the world. Canada's football federation is interested in qualifying for the 2022 World Cup in Qatar. The director of Sports Analytics of the federation has told you that we have data on how good players around the world are at penalties, except for players from Mexico, which is the best team in the region. To predict how good the Mexican players are at penalties, we are going to use regression based on the following variables:
- `Finishing`: How good the player is at scoring goals
- `ShotPower`: How hard the player kicks the ball
- `Agility`: How agile the player is
- `HeadingAccuracy`: How good the player is with headers
What is a penalty?
"A penalty kick (commonly known as a penalty or a PK) is a method of restarting play in association football, in which a player is allowed to take a single shot on the goal while it is defended only by the opposing team's goalkeeper. It is awarded when a foul punishable by a direct free kick is committed by a player in his or her own penalty area. The shot is taken from the penalty mark, which is 12 yards (11 m) from the goal line and centred between the touch lines. In practice, penalty kicks result in goals more often than not, even against the best and most experienced goalkeepers. This means that penalty awards are often decisive, especially in low-scoring games." - Wikipedia
Question 2.1
Visualize the relationship between all the variables and the distribution of each variable.
Question 2.2
Do the following:

- Split the data into a train dataset and a test dataset, based on a 60%-40% split.
- Fit a linear regression model on the training data set.
- Print out the coefficients of the model.

Remember that we want to predict `Penalties` based on all of the other variables.
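A sketch of these three steps, assuming the players data frame is named `players_data` (both the name and the seed are assumptions; the question does not fix a seed):

```r
library(caret)

set.seed(2019)                                   # assumed seed
idx <- as.vector(
  createDataPartition(players_data$Penalties, p = 0.60, list = FALSE)
)
players_train <- players_data[idx, ]
players_test  <- players_data[-idx, ]

# Formula interface: Penalties regressed on the four attribute scores
penalty_lm <- train(Penalties ~ Finishing + ShotPower + Agility + HeadingAccuracy,
                    data = players_train, method = "lm")
coef(penalty_lm$finalModel)                      # print the coefficients
```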
Question 2.3
Describe the relationship between the `Finishing` score of a player and their `Penalties` score (i.e., does the slope indicate a positive or negative relationship between penalties and this predictor?).
YOUR ANSWER HERE
Question 2.4
Use the coefficients above to write a mathematical equation for your prediction model. You can round the coefficients to two decimal places for this.
YOUR ANSWER HERE
Question 2.5
If we used k-nn regression on this problem, in place of linear regression, could we come up with an equivalent mathematical description of the model as we have for linear regression?
YOUR ANSWER HERE
Question 2.6
Calculate the RMSPE using the test data.
Question 2.7
Predict the `Penalties` score for each of the players from Mexico. This is the information the Director of Sports Analytics of the federation wants so that the athletic team can make the corresponding technical decisions.
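Assuming the Mexican players' attributes live in a data frame named `mexico_players` (a hypothetical name) with the same four predictor columns:

```r
# One predicted Penalties score per Mexican player
mexico_preds <- predict(penalty_lm, newdata = mexico_players)
mexico_preds
```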
Question 2.8 - Optional
Repeat this process with KNN regression and compare the predictions. Scaling isn't necessary here as all the variables have a possible range of 0-100.