UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/slides/06_classfication_intro.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 6 - Classification using K-Nearest Neighbours

First, a little housekeeping

  1. Quiz grading will be finished as soon as possible!

  2. Please fill out the mid-course survey (if you already have, thanks!)

Reminder

Where are we? Where are we going?

source: R for Data Science by Grolemund & Wickham

Classification

Suppose we have past data on cancer tumour cells, with each diagnosis labelled "benign" or "malignant".

Do you think a new cell with Concavity = 4.2 and Perimeter = -1 would be cancerous?

What kind of data analysis question is this?

K-nearest neighbours classification

Predict the label / class for a new observation using the K closest points from our dataset.

  1. Compute the distance between the new observation and each observation in our training set

$$\text{Distance} = \sqrt{(p_{\text{new}} - p_{\text{train}})^2 + (c_{\text{new}} - c_{\text{train}})^2}$$
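As a sketch of step 1 in R, using the new observation from earlier (Perimeter = -1, Concavity = 4.2) and the first training row from the neighbour table shown below (variable names here are illustrative):

```r
# Distance between the new observation and one training point
# (ID 859471 in the table of neighbours)
p_new <- -1
c_new <- 4.2
p_train <- -1.24
c_train <- 4.70

distance <- sqrt((p_new - p_train)^2 + (c_new - c_train)^2)
distance  # roughly 0.55, matching the table's dist_from_new up to rounding
```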

K-nearest neighbours classification

Predict the label / class for a new observation using the K closest points from our dataset.

  2. Sort the data in ascending order according to the distances

  3. Choose the top K rows as "neighbours"

```
##   ID       Perimeter Concavity Class dist_from_new
##   <dbl>        <dbl>     <dbl> <fct>         <dbl>
## 1 859471      -1.24       4.70 B             0.553
## 2 84501001    -0.286      3.99 M             0.744
## 3 8710441     -1.08       2.63 B             1.57
## 4 9013838     -0.461      2.72 M             1.57
## 5 925622       0.638      4.30 M             1.64
```
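The sorting and selection steps can be sketched with dplyr, using just the five rows shown above as a stand-in for the full training set:

```r
library(dplyr)

# The five rows from the slide, as a stand-in training set
cancer <- tibble::tribble(
  ~ID,        ~Perimeter, ~Concavity, ~Class,
  859471,         -1.24,       4.70,  "B",
  84501001,       -0.286,      3.99,  "M",
  8710441,        -1.08,       2.63,  "B",
  9013838,        -0.461,      2.72,  "M",
  925622,          0.638,      4.30,  "M"
)

new_obs <- c(Perimeter = -1, Concavity = 4.2)

neighbours <- cancer %>%
  mutate(dist_from_new = sqrt((Perimeter - new_obs["Perimeter"])^2 +
                              (Concavity - new_obs["Concavity"])^2)) %>%
  arrange(dist_from_new) %>%  # sort ascending by distance
  head(5)                     # keep the K = 5 closest rows

neighbours
```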

K-nearest neighbours classification

Predict the label / class for a new observation using the K closest points from our dataset.

  4. Classify the new observation based on majority vote.

What would the predicted class be?
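With K = 5, the neighbour classes from the table above are B, M, B, M, M. A minimal majority vote in base R:

```r
# Classes of the K = 5 nearest neighbours from the table above
neighbour_classes <- c("B", "M", "B", "M", "M")

# Step 4: the most frequent class wins the vote
names(which.max(table(neighbour_classes)))  # "M" (malignant)
```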

We can go beyond 2 predictors

For two observations $u, v$, each with $m$ variables (columns) labelled $1, \dots, m$,

$$\text{Distance} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \dots + (u_m - v_m)^2}$$
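This generalized distance is a one-liner in R (a sketch; `euclidean_dist` is an illustrative name, not from the lecture):

```r
# Euclidean distance between two observations with m variables each
euclidean_dist <- function(u, v) sqrt(sum((u - v)^2))

euclidean_dist(c(0, 0, 0), c(1, 2, 2))  # sqrt(1 + 4 + 4) = 3
```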

Aside from that, it's the same algorithm!

Standardized Data

What if one variable is much larger than the other? e.g. Salary (10,000+) and Age (0-100)

Standardize: shift and scale so that the average is 0 and the standard deviation is 1.
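A minimal sketch with made-up salary values (R's built-in `scale()` does the same shift-and-scale):

```r
salary <- c(40000, 55000, 90000, 120000)  # illustrative values

# Shift by the mean, then scale by the standard deviation
standardized <- (salary - mean(salary)) / sd(salary)

mean(standardized)  # 0 (up to floating-point error)
sd(standardized)    # 1
```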


Unbalanced Data

What if one label is far more common than another?

E.g. if this is a very rare kind of cancer, we may have far more benign observations

Will K = 7 nearest neighbours ever predict "malignant"?
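A quick check with an illustrative class vector: even in the best case for the rare class, with 3 of the 7 nearest neighbours malignant, the vote still goes to benign.

```r
# 7 nearest neighbours, at most 3 of which are malignant
neighbour_classes <- c("B", "B", "B", "B", "M", "M", "M")

names(which.max(table(neighbour_classes)))  # "B" -- malignant can never win
```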


Oversampling to Rebalance

Replicate the data in the smaller class to increase its count / voting power
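One way to sketch oversampling with dplyr (a toy 6-vs-2 class split; the column name and counts are illustrative):

```r
library(dplyr)

# Toy unbalanced labels: 6 benign, 2 malignant
obs <- data.frame(Class = c(rep("B", 6), rep("M", 2)))

# Resample each class with replacement up to the size of the larger class
rebalanced <- obs %>%
  group_by(Class) %>%
  sample_n(size = 6, replace = TRUE) %>%
  ungroup()

table(rebalanced$Class)  # now 6 B and 6 M
```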

Introduction to the caret package in R

Caret handles computing distances, standardization, balancing, and prediction for us!

  1. Load the libraries and data we need (new: caret)

```r
library(tidyverse)
library(caret)

cancer <- read_csv("data/clean-wdbc.data.csv")
```

Introduction to the caret package in R

  2. Split your table of training data into

    • $Y$ (make this a vector)

    • $X$'s (make this a data.frame, not a tibble)

```r
cancer_train <- cancer %>% select(Perimeter, Concavity)
cancer_labels <- cancer %>% select(Class) %>% unlist()
head(cancer_labels)
```

Introduction to the caret package in R

  3. "Fit" your model:

  • choose $k$ and create a data.frame with one column (named k) and one value

  • use train and feed it $X$, $Y$, the method ("knn"), and $k$

```r
k <- data.frame(k = 5)
model_knn <- train(x = data.frame(cancer_train), y = cancer_labels,
                   method = "knn", tuneGrid = k)
model_knn
```

Introduction to the caret package in R

  4. Predict $\hat{Y}$: call predict with your model object and the new observation (as a data.frame)

```r
new_obs <- data.frame(Perimeter = -1, Concavity = 4.2)
predict(object = model_knn, new_obs)
```

Go forth and ... model?

Class challenge

Suppose we have a new observation in the iris dataset, with

  • petal length = 5

  • petal width = 0.6

Using R and the caret package, how would you classify this observation based on $k = 3$ nearest neighbours?

```r
Y_train <- select(iris, Species) %>% unlist()
X_train <- select(iris, Petal.Length, Petal.Width) %>% data.frame()

k <- data.frame(k = 3)
model_knn <- train(x = X_train, y = Y_train, method = "knn", tuneGrid = k)

new_obs <- data.frame(Petal.Length = 5, Petal.Width = 0.6)
predict(object = model_knn, new_obs)
```

```r
options(repr.plot.width = 6, repr.plot.height = 3)

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()
```