GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/09_regression2.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 9 - Introduction to linear regression

2019-03-13

News and reminders

  • Tuesday, March 19th - in class peer review session

  • Friday, April 26th at 19:00 - Final exam (format TBD)

Regression prediction problem

What if we want to predict a quantitative value instead of a class label?

Today we will focus on another regression approach - linear regression.

For example, the price of a 2000 square foot home (from this reduced data set):

[Figure: scatter plot of the reduced housing data (price vs. size)]

First we find the line of "best-fit" through the data points:

[Figure: line of best fit drawn through the housing data]

And then we "look up" the value we want to predict off of the line.

[Figure: predicted price of a 2000 square foot home read off the line of best fit]

How do we choose the line of "best fit"? We can draw many lines through the data:

[Figure: many possible lines drawn through the housing data]

We choose the line that minimizes the average squared vertical distance between itself and each of the observed data points.
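To make this concrete, here is a small R sketch (not from the original slides; the homes data below is made up) that scores candidate lines by their average squared vertical distance, and then lets lm() find the minimizing line:

library(tidyverse)

# hypothetical housing data: size in square feet, price in dollars
homes <- tibble(size  = c(1100, 1400, 1800, 2300, 2600),
                price = c(130000, 185000, 250000, 335000, 390000))

# average squared vertical distance between a candidate line and the points
avg_sq_dist <- function(intercept, slope, data) {
    mean((data$price - (intercept + slope * data$size))^2)
}

# a smaller value means a better-fitting line
avg_sq_dist(0, 150, homes)
avg_sq_dist(-60000, 175, homes)

# lm() finds the intercept and slope that minimize this quantity
lm(price ~ size, data = homes)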

Linear vs k-nn regression

Why linear regression?

Advantage of restricting the model to a straight line: interpretability!

Recall that the equation for a straight line is: $Y = \beta_0 + \beta_1 X$

Where:

  • $\beta_0$ is the y-intercept of the line (the value where the line crosses the y-axis)

  • $\beta_1$ is the slope of the line

We can then write:

$\text{house price} = \beta_0 + \beta_1 \cdot \text{house size}$

And finally, fill in the values for $\beta_0$ and $\beta_1$:

$\text{house price} = -64542.2 + 175.9 \cdot \text{house size}$
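Plugging the 2000 square foot home from the earlier example into this equation is a one-line check of the arithmetic (coefficients taken from the slide above):

# fitted coefficients from the slide
beta_0 <- -64542.2
beta_1 <- 175.9

# predicted price of a 2000 square foot home
beta_0 + beta_1 * 2000   # 287257.8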

k-nn regression, as simple as it is to implement and understand, has no such interpretability from its wiggly line.

Why not linear regression (sometimes)?

Models are not like kitten hugs

They are more like suits:

ONE SIZE DOES NOT FIT ALL!

library(tidyverse)
library(repr)

# noisy points arranged (roughly) in a circle
theta <- seq(0, 2 * pi, length.out = 300)
circle <- tibble(X = sin(theta) + 0.75 * runif(300, min = 1, max = 2),
                 Y = cos(theta) + 0.75 * runif(300, min = 1, max = 2))

options(repr.plot.width = 4, repr.plot.height = 4)
circle_plot <- ggplot(circle, aes(x = X, y = Y)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE) +
    xlim(c(-0.5, 2.5)) +
    ylim(c(-0.5, 2.5))

# noisy points following a cosine (zigzag) pattern
zigzag <- tibble(X = seq(0, 3 * pi, length.out = 200),
                 Y = cos(X) + runif(200, min = 1, max = 2))
zigzag_plot <- ggplot(zigzag, aes(x = X, y = Y)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", se = FALSE)

Be cautious when using linear regression on data like this:

circle_plot
[Plot output: the circle data with its straight-line fit]

and this:

zigzag_plot
[Plot output: the zigzag data with its straight-line fit]

A cool app for exploring linear regression further

http://setosa.io/ev/ordinary-least-squares-regression/

What did we learn?

  • linear regression

      • has to be a straight line

  • RMSE vs RMSPE (see the sketch after this list)

  • geom_smooth

  • don't need to use $k$ or cross-validation to fit a linear regression
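As a recap of those last points, here is a minimal sketch (with hypothetical homes data; the size and price column names are assumed) showing that lm() fits with no $k$ and no cross-validation, and how RMSE (error on the data used to fit the line) differs from RMSPE (error on held-out data):

library(tidyverse)

# hypothetical data: price roughly linear in size, plus noise
set.seed(1)
homes <- tibble(size  = runif(100, min = 800, max = 3000),
                price = 175.9 * size - 64542.2 + rnorm(100, sd = 20000))

# simple train/test split; lm() needs no k and no cross-validation
train <- slice(homes, 1:75)
test  <- slice(homes, 76:100)
fit   <- lm(price ~ size, data = train)

# RMSE: error on the data the line was fit to
sqrt(mean((train$price - predict(fit, train))^2))

# RMSPE: error on new, held-out data
sqrt(mean((test$price - predict(fit, test))^2))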