Path: blob/master/2019-spring/materials/tutorial_10/tutorial_10.ipynb
Tutorial 10: Regression wrap-up
Regression learning objectives:

- Recognize situations where a simple regression analysis would be appropriate for making predictions.
- Explain the k-nearest neighbour (k-nn) regression algorithm and describe how it differs from k-nn classification.
- Interpret the output of a k-nn regression.
- In a dataset with two variables, perform k-nearest neighbour regression in R using `caret::train()` to predict the values for a test dataset.
- Using R, execute cross-validation to choose the number of neighbours.
- Using R, evaluate k-nn regression prediction accuracy using a test data set and an appropriate metric (e.g., root mean squared prediction error).
- Describe the advantages and disadvantages of the k-nearest neighbour regression approach.
- In the context of k-nn regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs. RMSPE).
- In a dataset with two variables, perform simple ordinary least squares regression in R using caret's `train()` with `method = "lm"` to predict the values for a test dataset.
- Compare and contrast predictions obtained from k-nearest neighbour regression to those obtained using simple ordinary least squares regression on the same dataset.
- In R, overlay ordinary least squares regression lines from `geom_smooth()` on a single plot.
- In a dataset with more than two variables, perform k-nn regression in R using caret's `train()` with `method = "knn"` to predict the values for a test dataset.
- In a dataset with more than two variables, perform simple ordinary least squares regression in R using caret's `train()` with `method = "lm"` to predict the values for a test dataset.
1. Ames Housing Prices
Source: https://media.giphy.com/media/xUPGGuzpmG3jfeYWIg/giphy.gif
If we take a look at the Business Insider report What do millennials want in a home?, we can see that millennials like newer houses that have their own defined spaces. Today we are going to be looking at housing data to understand how the sale price of a house is determined. Highly detailed housing data that includes final sale prices is very hard to find; however, researchers from Truman State University have studied and made available a dataset containing multiple variables for the city of Ames, Iowa. The data set describes the sale of individual residential properties in Ames, Iowa from 2006 to 2010. You can read more about the data set here. Today we will be using 5 different variables to predict the sale price of a house. These variables are:
- Lot Area: `lot_area`
- Year Built: `year_built`
- Basement Square Footage: `bsmt_sf`
- First Floor Square Footage: `first_sf`
- Second Floor Square Footage: `second_sf`
We are going to be looking at two approaches: linear regression and k-nn regression. First, load the data with the script given below.
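The loading script itself lives in a code cell. As a minimal sketch of what that step might look like (the filename `ames_housing.csv` and the use of `readr::read_csv()` are assumptions, not the tutorial's actual script):

```r
library(tidyverse)

# Hypothetical loading step -- the tutorial supplies its own script
ames_data <- read_csv("ames_housing.csv") %>%   # assumed filename
  select(lot_area, year_built, bsmt_sf, first_sf, second_sf, sale_price)

head(ames_data)
```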
Question 1.1
Let's start by exploring the data. Use the `ggpairs()` function from the GGally package to explore the relationships between the different variables.

Assign your plot object to a variable named `answer1.1`.
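A sketch of the call, assuming the loaded data frame is named `ames_data`:

```r
library(GGally)

# Pairwise scatterplots, correlations, and per-variable distributions
answer1.1 <- ggpairs(ames_data)
answer1.1
```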
Question 1.2
Now that we have seen the multiple relationships between the variables, which one(s) do you think will be good predictors for `sale_price`? On what do you base this? If you think none would be good predictors, explain why.
YOUR ANSWER HERE
Question 1.3
Let's split the data into a train dataset and a test dataset, based on a 70%-30% split. Remember that we want to predict `sale_price` based on all of the other variables.

Assign the objects to `X_train`, `Y_train`, `X_test`, and `Y_test` respectively.

Use 2019 as your seed for the split.
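One way to produce this split with caret (assuming the data frame is named `ames_data`):

```r
library(caret)
library(dplyr)

set.seed(2019)  # seed required by the question
train_idx <- as.vector(
  createDataPartition(ames_data$sale_price, p = 0.70, list = FALSE)
)

X_train <- ames_data[train_idx, ]  %>% select(-sale_price)
Y_train <- ames_data$sale_price[train_idx]
X_test  <- ames_data[-train_idx, ] %>% select(-sale_price)
Y_test  <- ames_data$sale_price[-train_idx]
```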
Question 1.4 - Linear Regression
Fit a linear regression model with `X_train` and `Y_train` and save it to an object called `lm_reg`.
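With caret's x/y interface this is a short call; a sketch, assuming the split objects from question 1.3 exist:

```r
library(caret)

# method = "lm" fits ordinary least squares under the hood
lm_reg <- train(x = X_train, y = Y_train, method = "lm")
lm_reg
```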
Question 1.5
Extract the coefficients of the model and provide a brief interpretation of the relationship between `sale_price` and each variable (i.e., does the slope indicate a positive or negative relationship between house sale price and each of the predictors?).

You don't have to interpret the intercept.

Assign the coefficients to an object named `lm_coefs`.
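A caret model stores the fitted `lm` object in its `finalModel` slot, so the coefficients can be pulled out like this:

```r
# Named vector: (Intercept) plus one slope per predictor
lm_coefs <- coef(lm_reg$finalModel)
lm_coefs
```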
YOUR ANSWER HERE
Question 1.51
Use the coefficients above to write a mathematical equation for your prediction model. You can round the coefficients to two decimal places for this.
YOUR ANSWER HERE
Question 1.52 (Optional - not graded)
Could we easily visualize the predictions of the model above as a line or a plane in a single plot? If so, explain how. If not, explain why not.
YOUR ANSWER HERE
Question 1.6
We need to evaluate how well our model is doing. For this, calculate the RMSPE of the linear regression model and assign it to an object named `lm_rmspe`.
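A sketch of the RMSPE computation on the held-out test set:

```r
# RMSPE: root mean squared error of predictions on data the model never saw
test_preds <- predict(lm_reg, newdata = X_test)
lm_rmspe   <- sqrt(mean((Y_test - test_preds)^2))
lm_rmspe
```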
Question 1.7 - KNN Regression
We are now going to do k-nn regression to see how both models compare on this dataset. We first need to scale our variables. Assign the scaled predictors to an object named `scaled_ames_data`.
Optional Question - Won't be graded
Try doing the scaling using the `mutate_at()` function or one of the functions in the `map_*` family.
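One way to do this with `mutate_at()` (a sketch; the predictor list mirrors the five variables introduced above, and `sale_price` is deliberately left unscaled):

```r
library(dplyr)

scaled_ames_data <- ames_data %>%
  mutate_at(vars(lot_area, year_built, bsmt_sf, first_sf, second_sf),
            ~ as.numeric(scale(.)))  # centre and scale each predictor
```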
Question 1.8
Let's split the data into a train dataset and a test dataset, based on a 70%-30% split.

Assign the objects to `X_train`, `Y_train`, `X_test`, and `Y_test` respectively.

Use 2019 as your seed for the split.
Question 1.9
Remember that one of the steps of k-nn regression is choosing which K we are going to use. Use 10-fold cross-validation to determine which K you will use. Assign the best K to an object called `best_k`.
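A sketch using caret's built-in tuning (the candidate grid of odd K values is an assumption; any reasonable range works):

```r
library(caret)

set.seed(2019)
train_control <- trainControl(method = "cv", number = 10)  # 10-fold CV
k_grid        <- data.frame(k = seq(1, 51, by = 2))        # assumed grid

knn_cv <- train(x = X_train, y = Y_train,
                method    = "knn",
                trControl = train_control,
                tuneGrid  = k_grid)

best_k <- knn_cv$bestTune$k
best_k
```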
Question 1.10
Now that we know the best K, we can go ahead and train our k-nn regression model. Assign the model to an object called `knn_reg`.
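Given `best_k`, the final model can be fit by fixing the tuning grid to that single value (a sketch):

```r
library(caret)

# A one-row grid pins caret to the K chosen by cross-validation
knn_reg <- train(x = X_train, y = Y_train,
                 method   = "knn",
                 tuneGrid = data.frame(k = best_k))
```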
Question 1.11
Let's evaluate how well our model is doing. For this, calculate the RMSPE of the k-nn regression model and assign it to an object named `knn_rmspe`.
Question 1.12
Compare the RMSPE of both KNN and Linear regression. Explain why you think one method might be better than the other.
YOUR ANSWER HERE
Question 1.13
Describe and explain one advantage of using Linear Regression over K-NN regression.
YOUR ANSWER HERE
Question 1.14
Above we calculated RMSPE to assess how well our model predicts on new data. Why did we not calculate RMSE to answer that question? (Hint: think of the definitions of RMSE and RMSPE.)
YOUR ANSWER HERE
Question 1.15 (Optional)
"Logarithmically transforming variables in a regression model is a very common way to handle situations where a non-linear relationship exists between the independent and dependent variables. Using the logarithm of one or more variables instead of the un-logged form makes the effective relationship non-linear, while still preserving the linear model." - Linear Regression Models with Logarithmic Transformations
Take the logarithm of the `sale_price` variable and fit your linear model again. Do you have a lower RMSPE? Do you have the same model as in question 1.4?
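A sketch of the log-scale refit; note that predictions must be back-transformed with `exp()` before computing an RMSPE that is comparable on the original dollar scale (that comparison choice is an assumption):

```r
library(caret)

log_lm    <- train(x = X_train, y = log(Y_train), method = "lm")
log_preds <- exp(predict(log_lm, newdata = X_test))  # back to dollars
log_rmspe <- sqrt(mean((Y_test - log_preds)^2))
log_rmspe
```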
2. World Cup 2022
Source: https://media.giphy.com/media/75NksGAgEUicg/giphy.gif
Football: the most loved sport around the world. Canada's football federation is interested in qualifying for the 2022 World Cup in Qatar. The director of Sports Analytics of the federation has told you that we have data on how good players around the world are at penalties, except for players from Mexico, which is the best team in the region. To predict how good the Mexican players are at penalties, we are going to use regression based on the following variables:
- `Finishing`: How good the player is at scoring goals
- `ShotPower`: How hard the player kicks the ball
- `Agility`: How agile the player is
- `HeadingAccuracy`: How good the player is with headers
What is a penalty?
"A penalty kick (commonly known as a penalty or a PK) is a method of restarting play in association football, in which a player is allowed to take a single shot on the goal while it is defended only by the opposing team's goalkeeper. It is awarded when a foul punishable by a direct free kick is committed by a player in his or her own penalty area. The shot is taken from the penalty mark, which is 12 yards (11 m) from the goal line and centred between the touch lines. In practice, penalty kicks result in goals more often than not, even against the best and most experienced goalkeepers. This means that penalty awards are often decisive, especially in low-scoring games." - Wikipedia
Question 2.1
Visualize the relationship between all the variables and the distribution of each variable.
Question 2.2
Do the following:

- Split the data into a train dataset and a test dataset, based on a 60%-40% split.
- Fit a linear regression model on the training data set.
- Print out the coefficients of the model.

Remember that we want to predict `Penalties` based on all of the other variables.
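A sketch of these three steps, assuming the players data frame is named `players_data` (both the name and the seed are assumptions; the question does not fix a seed):

```r
library(caret)

set.seed(2019)                                   # assumed seed
idx <- as.vector(
  createDataPartition(players_data$Penalties, p = 0.60, list = FALSE)
)
players_train <- players_data[idx, ]
players_test  <- players_data[-idx, ]

# Formula interface: Penalties regressed on the four attribute scores
penalty_lm <- train(Penalties ~ Finishing + ShotPower + Agility + HeadingAccuracy,
                    data = players_train, method = "lm")
coef(penalty_lm$finalModel)                      # print the coefficients
```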
Question 2.3
Describe the relationship between the `Finishing` score of a player and their `Penalties` score (i.e., does the slope indicate a positive or negative relationship between penalties and this predictor?).
YOUR ANSWER HERE
Question 2.4
Use the coefficients above to write a mathematical equation for your prediction model. You can round the coefficients to two decimal places for this.
YOUR ANSWER HERE
Question 2.5
If we used k-nn regression on this problem, in place of linear regression, could we come up with an equivalent mathematical description of the model as we have for linear regression?
YOUR ANSWER HERE
Question 2.6
Calculate the RMSPE using the test data.
Question 2.7
Predict the `Penalties` score for each of the players from Mexico. This is the information the Director of Sports Analytics of the federation wants so that the athletic team can make the corresponding technical decisions.
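Assuming the Mexican players' attributes live in a data frame named `mexico_players` (a hypothetical name) with the same four predictor columns:

```r
# One predicted Penalties score per Mexican player
mexico_preds <- predict(penalty_lm, newdata = mexico_players)
mexico_preds
```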
Question 2.8 - Optional
Repeat this process with KNN regression and compare the predictions. Scaling isn't necessary here as all the variables have a possible range of 0-100.