DSCI 100 - Introduction to Data Science
Lecture 8 - Introduction to regression with k-nearest neighbours
2019-02-28
Welcome back
... and I hope you feel like this after the break!
... not like this!
Reminder of where we are...
Follow the syllabus: https://github.com/UBC-DSCI/dsci-100#schedule
Quiz 2 on Thursday, March 7
practice quiz will be available Monday
I will post the solutions to the past worksheets and tutorials to help you study
like last time, follow the learning objectives when you study!
Project proposals are due this Saturday
Tutorial on Tuesday March 12 will be a dedicated group project working session
Regression prediction problem
What if we want to predict a quantitative value instead of a class label?
For example, the price of a 2000 square foot home (from this reduced data set):
k-nn for regression
As in k-nn classification, we find the $k$-nearest neighbours (here $k = 5$)
k-nn for regression
Then we average the values for the $k$-nearest neighbours, and use that as the prediction:
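A minimal sketch of this computation on a made-up toy data set (the `homes` tibble, the `sqft`/`price` columns, and the numbers below are illustrative, not the course's reduced housing data):

```r
library(tidyverse)

# Toy housing data (hypothetical, for illustration only)
homes <- tibble(
  sqft  = c(1100, 1250, 1400, 1500, 1620, 1750, 1850, 1980, 2100, 2250, 2450, 2800),
  price = c(220000, 250000, 295000, 310000, 340000, 360000, 395000, 430000, 450000, 480000, 520000, 600000)
)

new_sqft <- 2000  # the home we want to predict a price for
k <- 5

# k-nn regression "by hand": take the k homes closest in square footage
# and average their prices to get the prediction
homes %>%
  mutate(dist = abs(sqft - new_sqft)) %>%
  arrange(dist) %>%
  slice(1:k) %>%
  summarize(predicted_price = mean(price))
#> predicted_price: 423000 for this toy data
```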
Regression prediction problem
We still have to answer these two questions:
Is our model any good?
How do we choose $k$?
1. Is our model any good?
Same general strategy
But a different calculation to assess our model
The mathematical formula for the calculation is shown below:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Where:
$n$ is the number of observations
$y_i$ is the observed value for the $i$th observation
$\hat{y}_i$ is the forecasted/predicted value for the $i$th observation
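This maps directly to a one-line helper; a minimal sketch (the function and vector names here are illustrative):

```r
# RMSE, exactly as in the formula above:
# square the errors, average them, then take the square root
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

rmse(c(3, 5, 7), c(2.5, 5.5, 8))
#> 0.7071068
```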
So if we had this predicted blue line from doing k-nn regression:
The red lines are the errors, $y_i - \hat{y}_i$
RMSE is not out of 1, but instead in units of the target/response variable
so, a bit harder to interpret in the context of test error
Final model from k-nn regression
For this model, RMSE is 91620.4; how can we interpret this?
2. How do we choose k?
cross-validation
choose the model with the smallest RMSE
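A sketch of this tuning step with the `caret` package (assuming the toy `homes` data from the earlier sketch; the grid of candidate $k$ values is arbitrary):

```r
library(caret)

set.seed(1234)  # cross-validation shuffles the data, so fix the seed for reproducibility

# 5-fold cross-validation over a grid of candidate k values;
# for a quantitative response, train() scores each k by RMSE
knn_cv <- train(price ~ sqft,
                data = homes,
                method = "knn",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = c(1, 3, 5, 7)))

knn_cv$bestTune  # the k with the smallest cross-validated RMSE
```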
How does $k$ affect k-nn regression?
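One way to see the effect of $k$ is to plot the fitted line for a few values. A sketch building on the toy `homes` data above (`caret::knnreg()` is one possible fitting function, not necessarily the one used in the course):

```r
library(tidyverse)
library(caret)  # for knnreg()

# Predict over a fine grid of sqft values for several k,
# so we can draw each fitted line
grid <- tibble(sqft = seq(min(homes$sqft), max(homes$sqft), length.out = 200))

fits <- map_dfr(c(1, 3, 9), function(k) {
  fit <- knnreg(price ~ sqft, data = homes, k = k)
  grid %>% mutate(price = predict(fit, grid), k = factor(k))
})

# Small k: a jagged line that chases individual points (overfitting).
# Large k: a nearly flat line that averages over everything (underfitting).
ggplot(homes, aes(x = sqft, y = price)) +
  geom_point() +
  geom_line(data = fits, aes(colour = k))
```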
What did we learn?
use k-nn regression when the target/response variable (Y) is quantitative
RMSE as a measure of prediction error
choose k so that our model predicts well on other data sets: not too small a k (overfitting), and not too big (underfitting)
quiz next week!