GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/09_regression2.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 9 - Introduction to linear regression

2019-03-13

News and reminders

  • Tuesday, March 19th - in class peer review session

  • Friday, April 26th at 19:00 - Final exam (format TBD)

Regression prediction problem

What if we want to predict a quantitative value instead of a class label?

Today we will focus on another regression approach - linear regression.

For example, the price of a 2000 square foot home (from this reduced data set):

[Figure: scatter plot of the reduced housing data (price vs. size)]

First we find the line of "best-fit" through the data points:

[Figure: line of best fit drawn through the housing data]

And then we "look up" the value we want to predict off of the line.

[Figure: predicted price of a 2000 square foot home read off the line of best fit]

How do we choose the line of "best fit"? We can draw many lines through the data:

[Figure: many possible lines drawn through the housing data]

We choose the line that minimizes the average squared vertical distance between itself and each of the observed data points.
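To make this concrete, here is a small R sketch (not from the original slides; the homes data below is made up) that scores candidate lines by their average squared vertical distance, and then lets lm() find the minimizing line:

library(tidyverse)

# hypothetical housing data: size in square feet, price in dollars
homes <- tibble(size  = c(1100, 1400, 1800, 2300, 2600),
                price = c(130000, 185000, 250000, 335000, 390000))

# average squared vertical distance between a candidate line and the points
avg_sq_dist <- function(intercept, slope, data) {
    mean((data$price - (intercept + slope * data$size))^2)
}

# a smaller value means a better-fitting line
avg_sq_dist(0, 150, homes)
avg_sq_dist(-60000, 175, homes)

# lm() finds the intercept and slope that minimize this quantity
lm(price ~ size, data = homes)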

Linear vs k-nn regression

Why linear regression?

Advantage of restricting the model to a straight line: interpretability!

Recall that the equation for a straight line is: $Y = \beta_0 + \beta_1 X$

Where:

  • $\beta_0$ is the y-intercept of the line (the value where the line crosses the y-axis)

  • $\beta_1$ is the slope of the line

We can then write:

$\text{house price} = \beta_0 + \beta_1 \cdot \text{house size}$

And finally, fill in the values for $\beta_0$ and $\beta_1$:

$\text{house price} = -64542.2 + 175.9 \cdot \text{house size}$
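Plugging the 2000 square foot home from the earlier example into this equation is a one-line check of the arithmetic (coefficients taken from the slide above):

# fitted coefficients from the slide
beta_0 <- -64542.2
beta_1 <- 175.9

# predicted price of a 2000 square foot home
beta_0 + beta_1 * 2000   # 287257.8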

k-nn regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line.

Why not linear regression (sometimes)?

Models are not like kitten hugs

They are more like suits:

ONE SIZE DOES NOT FIT ALL!

library(tidyverse)
library(repr)

# noisy points arranged (roughly) in a circle
theta <- seq(0, 2 * pi, length.out = 300)
circle <- tibble(X = sin(theta) + 0.75 * runif(300, min = 1, max = 2),
                 Y = cos(theta) + 0.75 * runif(300, min = 1, max = 2))

options(repr.plot.width = 4, repr.plot.height = 4)
circle_plot <- ggplot(circle, aes(x = X, y = Y)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE) +
    xlim(c(-0.5, 2.5)) +
    ylim(c(-0.5, 2.5))

# noisy points following a cosine (zigzag) pattern
zigzag <- tibble(X = seq(0, 3 * pi, length.out = 200),
                 Y = cos(X) + runif(200, min = 1, max = 2))
zigzag_plot <- ggplot(zigzag, aes(x = X, y = Y)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE)

Be cautious when using linear regression on data like this:

circle_plot
[Plot output: the circle data with its straight-line fit]

and this:

zigzag_plot
[Plot output: the zigzag data with its straight-line fit]

A cool app for exploring linear regression further

http://setosa.io/ev/ordinary-least-squares-regression/

What did we learn?

  • linear regression

      • has to be a straight line

  • RMSE vs RMSPE (see the sketch after this list)

  • geom_smooth

  • don't need to use $k$ or cross-validation to fit a linear regression
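As a recap of those last points, here is a minimal sketch (with hypothetical homes data; the size and price column names are assumed) showing that lm() fits with no $k$ and no cross-validation, and how RMSE (error on the data used to fit the line) differs from RMSPE (error on held-out data):

library(tidyverse)

# hypothetical data: price roughly linear in size, plus noise
set.seed(1)
homes <- tibble(size  = runif(100, min = 800, max = 3000),
                price = 175.9 * size - 64542.2 + rnorm(100, sd = 20000))

# simple train/test split; lm() needs no k and no cross-validation
train <- slice(homes, 1:75)
test  <- slice(homes, 76:100)
fit   <- lm(price ~ size, data = train)

# RMSE: error on the data the line was fit to
sqrt(mean((train$price - predict(fit, train))^2))

# RMSPE: error on new, held-out data
sqrt(mean((test$price - predict(fit, test))^2))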