CoCalc -- 08_regression1.ipynb

GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/08_regression1.ipynb

Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 8 - Introduction to regression with k-nearest neighbours

2019-02-28

Welcome back

... and I hope you feel like this after the break!

... not like this!

Reminder of where we are...

Follow the syllabus: https://github.com/UBC-DSCI/dsci-100#schedule
Quiz 2 on Thursday, March 7
- practice quiz will be available Monday
- I will post the solutions to the past worksheets and tutorials to help you study
- like last time, follow the learning objectives when you study!

Project proposals are due this Saturday

Tutorial on Tuesday March 12 will be a dedicated group project working session

Regression prediction problem

What if we want to predict a quantitative value instead of a class label?

For example, the price of a 2000 square foot home (from this reduced data set):

k-nn for regression

As in k-nn classification, we find the $k$ nearest neighbours (here $k = 5$):

k-nn for regression

Then we average the response values of the $k$ nearest neighbours and use that average as the prediction:
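The two steps (find the $k$ nearest neighbours, then average their values) can be sketched in a few lines. This is a minimal illustration in plain Python, not the course's own code: the function name `knn_predict` is hypothetical, the data are made-up toy numbers rather than the housing data set from the slides, and a single predictor is assumed so that distance is just an absolute difference.

```python
def knn_predict(train_x, train_y, query_x, k):
    """Predict a value for query_x as the mean response of its k nearest neighbours."""
    # Sort training observations by distance to the query point.
    # With one predictor, the distance is the absolute difference.
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query_x))
    nearest_y = [y for _, y in by_distance[:k]]
    # The prediction is the average response of those k neighbours.
    return sum(nearest_y) / len(nearest_y)


# Made-up toy data (NOT the data set from the slides):
# house size in square feet and sale price.
sqft = [1200, 1500, 1800, 1900, 2000, 2100, 2200, 2600, 3000]
price = [150000, 180000, 210000, 230000, 240000, 250000, 260000, 320000, 400000]

prediction = knn_predict(sqft, price, 2000, k=5)
print(prediction)  # 238000.0
```

With $k = 5$, the five homes closest to 2000 square feet in this toy data are the ones at 1800, 1900, 2000, 2100 and 2200 square feet, and the prediction is the mean of their prices.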

Regression prediction problem

We still have to answer these two questions:

Is our model any good?
How do we choose k?

1. Is our model any good?

Same general strategy

But a different calculation to assess our model

The formula for calculating the root mean squared error ($RMSE$) is shown below:

$$RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$$

Where:

$n$ is the number of observations
$y_i$ is the observed value for the $i$th observation
$\hat{y_i}$ is the forecasted/predicted value for the $i$th observation
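As a small worked sketch of the calculation, the snippet below computes $RMSE$ over $n$ observations in plain Python. The function name `rmse` and the four observed/predicted values are made up for illustration only.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed values y_i and predictions y_hat_i."""
    # Squared difference (y_i - y_hat_i)^2 for each of the n observations.
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    # Mean of the squared differences, then the square root.
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]

error = rmse(observed, predicted)
print(error)  # sqrt((1 + 0 + 4 + 0) / 4) = sqrt(1.25), about 1.118
```

Because the squared differences are averaged before the square root is taken, the result is in the same units as the response variable.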

So if we had this predicted blue line from doing k-nn regression:

The red lines are $(y_i - \hat{y_i})$ in:

$$RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y_i})^2}$$

$RMSE$

Not a proportion out of 1 (like classification accuracy), but in the units of the target/response variable
so it is a bit harder to interpret as a test error

Final model from k-nn regression

For this model, $RMSE$ is 91620.4. How can we interpret this?

2. How do we choose k?

cross-validation
choose the model with the smallest RMSE
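The two bullet points above can be sketched as follows. This is a simplified illustration in plain Python under several assumptions: the helper names `knn_predict` and `cv_rmse` are hypothetical, the data are made up, folds are assigned by taking every fourth observation (real code would usually shuffle first), and the squared errors are pooled across folds before taking the square root rather than averaging one $RMSE$ per fold.

```python
import math


def knn_predict(train_x, train_y, query_x, k):
    """Mean response of the k training points nearest to query_x (one predictor)."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query_x))
    nearest_y = [y for _, y in by_distance[:k]]
    return sum(nearest_y) / len(nearest_y)


def cv_rmse(xs, ys, k, n_folds):
    """Cross-validated RMSE for a given k, pooling squared errors over all folds."""
    squared_errors = []
    for fold in range(n_folds):
        # Hold out every n_folds-th observation, starting at index `fold`.
        held_out = range(fold, len(xs), n_folds)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        # Predict each held-out observation from the remaining training data.
        for i in held_out:
            y_hat = knn_predict(train_x, train_y, xs[i], k)
            squared_errors.append((ys[i] - y_hat) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up toy data for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.3, 13.8, 16.1, 18.2, 19.7, 22.0, 24.1]

candidates = [1, 3, 5, 7]
results = {k: cv_rmse(xs, ys, k, n_folds=4) for k in candidates}

# Choose the k with the smallest cross-validated RMSE.
best_k = min(results, key=results.get)
print(results)
print(best_k)
```

Each candidate $k$ is scored only on observations that were held out of the training data for that fold, which is what makes the comparison a fair estimate of how the model predicts on new data.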

How does $k$ affect k-nn regression?
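One way to see the effect of $k$ is to look at the two extremes on the training data itself. The sketch below uses made-up toy values and a hypothetical helper `knn_predict`, in plain Python. With $k = 1$ every training point is its own nearest neighbour, so the fit reproduces the training values exactly (overfitting). With $k$ equal to the number of training points, every prediction is the overall mean (underfitting).

```python
def knn_predict(train_x, train_y, query_x, k):
    """Mean response of the k training points nearest to query_x (one predictor)."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query_x))
    nearest_y = [y for _, y in by_distance[:k]]
    return sum(nearest_y) / len(nearest_y)


# Made-up toy data for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

# k = 1: each training point is predicted by itself.
fit_k1 = [knn_predict(xs, ys, x, k=1) for x in xs]

# k = n: every prediction averages all training responses.
fit_kn = [knn_predict(xs, ys, x, k=len(xs)) for x in xs]

print(fit_k1)  # [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
print(fit_kn)  # [3.5, 3.5, 3.5, 3.5, 3.5, 3.5]
```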

What did we learn?

use k-nn regression when the target/response variable (Y) is quantitative
RMSE as a measure of prediction error
choose $k$ so that the model predicts well on data it has not seen: not too small (overfitting) and not too big (underfitting)
quiz next week!
