GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/08_regression1.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 8 - Introduction to regression with k-nearest neighbours

2019-02-28

Welcome back

... and I hope you feel like this after the break!

... not like this!

Reminder of where we are...

  • Follow the syllabus: https://github.com/UBC-DSCI/dsci-100#schedule

  • Quiz 2 on Thursday, March 7

    • practice quiz will be available Monday

    • I will post the solutions to the past worksheets and tutorials to help you study

    • like last time, follow the learning objectives when you study!

  • Project proposals are due this Saturday

  • Tutorial on Tuesday March 12 will be a dedicated group project working session

Regression prediction problem

What if we want to predict a quantitative value instead of a class label?

For example, the price of a 2000 square foot home (from this reduced data set):

k-nn for regression

As in k-nn classification, we find the $k$ nearest neighbours (here, 5)

k-nn for regression

Then we average the values of the $k$ nearest neighbours, and use that as the prediction:
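The two steps above (find the nearest neighbours, then average their values) can be sketched in a few lines. This is a minimal illustration in plain Python, not the tooling used in the course; the function name `knn_predict` and the house data are made up for the example, and only one predictor (square footage) is assumed.

```python
def knn_predict(train_x, train_y, query_x, k=5):
    """Predict the response at query_x as the mean response of the k nearest training points."""
    # Rank training observations by absolute distance to the query (single predictor).
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query_x))[:k]
    # The prediction is the plain average of the neighbours' responses.
    return sum(train_y[i] for i in nearest) / k


# Hypothetical (square feet, price) pairs, invented for illustration only.
sqft = [1200, 1500, 1800, 1900, 2000, 2100, 2200, 2600, 3000]
price = [150000, 180000, 210000, 230000, 240000, 260000, 250000, 320000, 400000]

prediction = knn_predict(sqft, price, 2000, k=5)
print(prediction)  # 238000.0
```

Here the five nearest homes to 2000 square feet are those at 1800, 1900, 2000, 2100 and 2200 square feet, so the prediction is the mean of their five prices.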

Regression prediction problem

We still have to answer these two questions:

  1. Is our model any good?

  2. How do we choose k?

1. Is our model any good?

Same general strategy

But a different calculation to assess our model

The mathematical formula for calculating $RMSE$ is shown below:

$$RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$$

Where:

  • $n$ is the number of observations

  • $y_i$ is the observed value for the $i^{th}$ observation

  • $\hat{y_i}$ is the forecasted/predicted value for the $i^{th}$ observation
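The RMSE definition, with the sum running over the $n$ observations, translates almost line by line into code. The sketch below uses plain Python; the function name `rmse` and the four observed/predicted values are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    # Squared difference (y_i - y_hat_i)^2 for each observation.
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    # Average the squared errors, then take the square root.
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical values: squared errors are 1, 0, 1 and 4, so the mean is 1.5.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

error = rmse(observed, predicted)
print(error)  # sqrt(1.5), roughly 1.22
```

Because of the final square root, the result is in the same units as the response variable.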

So if we had this predicted blue line from doing k-nn regression:

The red lines are $(y_i - \hat{y_i})$ in:

$$RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$$

$RMSE$

  • Not out of 1, but instead in units of the target/response variable

  • so, a bit harder to interpret in the context of test error

Final model from k-nn regression

For this model, $RMSE$ is 91620.4. How can we interpret this?

2. How do we choose k?

  • cross-validation

  • choose the model with the smallest RMSE
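One way to picture this procedure: for each candidate $k$, predict every observation from a model that did not see it, compute the RMSE of those held-out predictions, and keep the $k$ with the smallest value. The sketch below is a plain-Python illustration under several assumptions: one predictor, folds formed by taking every fifth observation, squared errors pooled across folds, and synthetic data. The names `cv_rmse` and `choose_k` are hypothetical.

```python
import math


def knn_predict(train_x, train_y, query_x, k):
    """Mean response of the k training points nearest to query_x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query_x))[:k]
    return sum(train_y[i] for i in nearest) / k


def cv_rmse(xs, ys, k, n_folds=5):
    """RMSE of held-out predictions from n_folds-fold cross-validation."""
    n = len(xs)
    squared_errors = []
    for fold in range(n_folds):
        # Every n_folds-th observation (offset by fold) is held out for validation.
        val_idx = range(fold, n, n_folds)
        held_out = set(val_idx)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        for i in val_idx:
            y_hat = knn_predict(train_x, train_y, xs[i], k)
            squared_errors.append((ys[i] - y_hat) ** 2)
    return math.sqrt(sum(squared_errors) / n)


def choose_k(xs, ys, candidate_ks, n_folds=5):
    """Return the candidate k with the smallest cross-validated RMSE, plus all scores."""
    scores = {k: cv_rmse(xs, ys, k, n_folds) for k in candidate_ks}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Synthetic data: a linear trend plus a small deterministic wiggle.
xs = list(range(30))
ys = [2.0 * x + ((x * 7) % 5 - 2) for x in xs]

candidate_ks = [1, 3, 5, 7, 9]
best_k, scores = choose_k(xs, ys, candidate_ks)
print(best_k, scores)
```

Pooling the squared errors over all folds is one design choice; averaging the per-fold RMSE values is a common alternative and usually gives a similar ranking of $k$.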

How does $k$ affect k-nn regression?
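The two extremes of $k$ make the effect easy to see. In the small sketch below (plain Python, made-up data, hypothetical function name `knn_predict`), $k = 1$ reproduces every training value exactly, while $k = n$ predicts the overall mean everywhere.

```python
def knn_predict(train_x, train_y, query_x, k):
    """Mean response of the k training points nearest to query_x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query_x))[:k]
    return sum(train_y[i] for i in nearest) / k


# Made-up training data with distinct x values.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

# k = 1: each training point is its own nearest neighbour,
# so the fitted values copy the training responses, noise included.
fit_k1 = [knn_predict(xs, ys, x, 1) for x in xs]

# k = n: every prediction averages all responses,
# so the fitted "line" is flat at the overall mean.
fit_kn = [knn_predict(xs, ys, x, len(xs)) for x in xs]

print(fit_k1)  # [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
print(fit_kn)  # [3.5, 3.5, 3.5, 3.5, 3.5, 3.5]
```

A very small $k$ follows the training data too closely, and a very large $k$ ignores the relationship between predictor and response; useful values lie in between.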

What did we learn?

  • use k-nn regression when the target/response variable (Y) is quantitative

  • RMSE as a measure of prediction error

  • choose k so that our model predicts well on other data sets: not too small (overfitting) and not too big (underfitting)

  • quiz next week!