UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/06_classfication_intro.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 6 - Classification using K-Nearest Neighbours

First, a little housekeeping

  1. Quiz grading will be finished as soon as possible!

  2. Please fill out the mid-course survey (if you already have, thanks!)

Reminder

Where are we? Where are we going?

source: R for Data Science by Grolemund & Wickham

Classification

Suppose we have past data on cancer tumour cells, with each diagnosis labelled "benign" or "malignant".

Do you think a new cell with Concavity = 4.2 and Perimeter = -1 would be cancerous?

What kind of data analysis question is this?

K-nearest neighbours classification

Predict the label / class for a new observation using the K closest points from our dataset.

  1. Compute the distance between the new observation and each observation in our training set

$$\text{Distance} = \sqrt{(p_{\text{new}} - p_{\text{train}})^2 + (c_{\text{new}} - c_{\text{train}})^2}$$
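As a sketch of step 1 in R, using the new observation from earlier (Perimeter = -1, Concavity = 4.2) and the first training row from the neighbour table shown below (variable names here are illustrative):

```r
# Distance between the new observation and one training point
# (ID 859471 in the table of neighbours)
p_new <- -1
c_new <- 4.2
p_train <- -1.24
c_train <- 4.70

distance <- sqrt((p_new - p_train)^2 + (c_new - c_train)^2)
distance  # roughly 0.55, matching the table's dist_from_new up to rounding
```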

K-nearest neighbours classification

Predict the label / class for a new observation using the K closest points from our dataset.

  2. Sort the data in ascending order according to the distances

  3. Choose the top K rows as "neighbours"

```
##   ID       Perimeter Concavity Class dist_from_new
##   <dbl>        <dbl>     <dbl> <fct>         <dbl>
## 1 859471      -1.24       4.70 B             0.553
## 2 84501001    -0.286      3.99 M             0.744
## 3 8710441     -1.08       2.63 B             1.57
## 4 9013838     -0.461      2.72 M             1.57
## 5 925622       0.638      4.30 M             1.64
```
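The sorting and selection steps can be sketched with dplyr, using just the five rows shown above as a stand-in for the full training set:

```r
library(dplyr)

# The five rows from the slide, as a stand-in training set
cancer <- tibble::tribble(
  ~ID,        ~Perimeter, ~Concavity, ~Class,
  859471,         -1.24,       4.70,  "B",
  84501001,       -0.286,      3.99,  "M",
  8710441,        -1.08,       2.63,  "B",
  9013838,        -0.461,      2.72,  "M",
  925622,          0.638,      4.30,  "M"
)

new_obs <- c(Perimeter = -1, Concavity = 4.2)

neighbours <- cancer %>%
  mutate(dist_from_new = sqrt((Perimeter - new_obs["Perimeter"])^2 +
                              (Concavity - new_obs["Concavity"])^2)) %>%
  arrange(dist_from_new) %>%  # sort ascending by distance
  head(5)                     # keep the K = 5 closest rows

neighbours
```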

K-nearest neighbours classification

Predict the label / class for a new observation using the K closest points from our dataset.

  4. Classify the new observation based on majority vote.

What would the predicted class be?
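With K = 5, the neighbour classes from the table above are B, M, B, M, M. A minimal majority vote in base R:

```r
# Classes of the K = 5 nearest neighbours from the table above
neighbour_classes <- c("B", "M", "B", "M", "M")

# Step 4: the most frequent class wins the vote
names(which.max(table(neighbour_classes)))  # "M" (malignant)
```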

We can go beyond 2 predictors

For two observations $u, v$, each with $m$ variables (columns) labelled $1, \dots, m$,

$$\text{Distance} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \dots + (u_m - v_m)^2}$$
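This generalized distance is a one-liner in R (a sketch; `euclidean_dist` is an illustrative name, not from the lecture):

```r
# Euclidean distance between two observations with m variables each
euclidean_dist <- function(u, v) sqrt(sum((u - v)^2))

euclidean_dist(c(0, 0, 0), c(1, 2, 2))  # sqrt(1 + 4 + 4) = 3
```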

Aside from that, it's the same algorithm!

Standardized Data

What if one variable is much larger than the other? e.g. Salary (10,000+) and Age (0-100)

Standardize: shift and scale so that the average is 0 and the standard deviation is 1.
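A minimal sketch with made-up salary values (R's built-in `scale()` does the same shift-and-scale):

```r
salary <- c(40000, 55000, 90000, 120000)  # illustrative values

# Shift by the mean, then scale by the standard deviation
standardized <- (salary - mean(salary)) / sd(salary)

mean(standardized)  # 0 (up to floating-point error)
sd(standardized)    # 1
```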


Unbalanced Data

What if one label is far more common than another?

E.g. if this is a very rare kind of cancer, we may have far more benign observations

Will K = 7 nearest neighbours ever predict "malignant"?
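A quick check with an illustrative class vector: even in the best case for the rare class, with 3 of the 7 nearest neighbours malignant, the vote still goes to benign.

```r
# 7 nearest neighbours, at most 3 of which are malignant
neighbour_classes <- c("B", "B", "B", "B", "M", "M", "M")

names(which.max(table(neighbour_classes)))  # "B" -- malignant can never win
```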


Oversampling to Rebalance

Replicate the data in the smaller class to increase its count / voting power
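One way to sketch oversampling with dplyr (a toy 6-vs-2 class split; the column name and counts are illustrative):

```r
library(dplyr)

# Toy unbalanced labels: 6 benign, 2 malignant
obs <- data.frame(Class = c(rep("B", 6), rep("M", 2)))

# Resample each class with replacement up to the size of the larger class
rebalanced <- obs %>%
  group_by(Class) %>%
  sample_n(size = 6, replace = TRUE) %>%
  ungroup()

table(rebalanced$Class)  # now 6 B and 6 M
```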

Introduction to the caret package in R

Caret handles computing distances, standardization, balancing, and prediction for us!

  1. Load the libraries and data we need (new: caret)

```r
library(tidyverse)
library(caret)

cancer <- read_csv("data/clean-wdbc.data.csv")
```

Introduction to the caret package in R

  2. Split your table of training data into

    • $Y$ (make this a vector)

    • $X$'s (make this a data.frame, not a tibble)

```r
cancer_train <- cancer %>% select(Perimeter, Concavity)
cancer_labels <- cancer %>% select(Class) %>% unlist()
head(cancer_labels)
```

Introduction to the caret package in R

  3. "Fit" your model:

  • choose $k$ and create a data.frame with one column (named k) and one value

  • use train and feed it $X$, $Y$, the method ("knn"), and $k$

```r
k <- data.frame(k = 5)
model_knn <- train(x = data.frame(cancer_train), y = cancer_labels,
                   method = "knn", tuneGrid = k)
model_knn
```

Introduction to the caret package in R

  4. Predict $\hat{Y}$: call predict with your model object and the new observation (as a data.frame)

```r
new_obs <- data.frame(Perimeter = -1, Concavity = 4.2)
predict(object = model_knn, new_obs)
```

Go forth and ... model?

Class challenge

Suppose we have a new observation in the iris dataset, with

  • petal length = 5

  • petal width = 0.6

Using R and the caret package, how would you classify this observation based on $k = 3$ nearest neighbours?

```r
Y_train <- select(iris, Species) %>% unlist()
X_train <- select(iris, Petal.Length, Petal.Width) %>% data.frame()

k <- data.frame(k = 3)
model_knn <- train(x = X_train, y = Y_train, method = "knn", tuneGrid = k)

new_obs <- data.frame(Petal.Length = 5, Petal.Width = 0.6)
predict(object = model_knn, new_obs)
```

```r
options(repr.plot.width = 6, repr.plot.height = 3)

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()
```