GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/slides/06_classfication_intro.ipynb
Kernel: R

DSCI 100 - Introduction to Data Science

Lecture 6 - Classification, an introduction using k-nearest neighbours

2019-02-07

First, a little housekeeping

  1. Quiz grading will be finished Monday.

  2. Feedback forms will now be returned to you on the server where you do your homework. At some point today a feedback folder will appear in your home directory; we will put all the forms there.

  3. Please fill out the mid-course survey (and if you already have, THANK-YOU)!

  4. Assignment to groups for the group project has been done (see Canvas), and each group has been given a private GitHub repository.

Reminder

Where are we? Where are we going?

image source: R for Data Science by Grolemund & Wickham

Classification problem

Can we use data we have seen in the past to predict something about the future?

For example, can we predict the diagnosis class of a tumour cell with Concavity = 2 and Perimeter = 2?

K-nearest neighbours classification algorithm

In order to classify a new observation using a k-nearest neighbours classifier, we have to do the following steps (sketched in code after this list):

  1. Compute the distance between the new observation and each observation in our training set.

  2. Sort the data table in ascending order according to the distances.

  3. Choose the top $k$ rows of the sorted table.

  4. Classify the new observation based on a majority vote of the classes in those rows.
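
Here is a minimal sketch of these four steps in tidyverse R. It assumes a training data frame named cancer with columns Concavity, Perimeter, and Class (the names used in the examples that follow), and uses the straight-line Euclidean distance in step 1.

library(tidyverse)

classify_knn <- function(train, new_obs, k) {
  train %>%
    # 1. distance from the new observation to every training observation
    mutate(dist = sqrt((Concavity - new_obs$Concavity)^2 +
                       (Perimeter - new_obs$Perimeter)^2)) %>%
    # 2. sort the table in ascending order of distance
    arrange(dist) %>%
    # 3. keep the top k rows
    slice(1:k) %>%
    # 4. majority vote among the class labels of those k rows
    count(Class) %>%
    arrange(desc(n)) %>%
    slice(1) %>%
    pull(Class)
}

# e.g. classify_knn(cancer, data.frame(Concavity = 2, Perimeter = 2), k = 5)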

Classification problem

How is this problem represented as a data table in R?

Data table for example above

| $Y$ = diagnosis | $X_1$ = Concavity | $X_2$ = Perimeter |
|---|---|---|
| M | 2.1 | 2.3 |
| M | -0.1 | 1.5 |
| B | -0.2 | -0.2 |
| ... | ... | ... |

Where:

  • $Y$ is our class label/target/outcome/response variable

  • the $X$'s are our predictors/features/attributes/explanatory variables, and we have 2 of these

  • we have 569 observations (sets of measurements about tumour cells)
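
As a concrete sketch of how this table would be stored in R, the three rows shown above could be entered as a tibble; the object name cancer is illustrative (it matches the later code examples), and in practice the full 569-row table would be read from a file.

library(tidyverse)

# the three example rows above, stored as a tibble (illustrative only;
# the real data set has 569 observations)
cancer <- tibble(
  Class     = c("M", "M", "B"),   # Y = diagnosis
  Concavity = c(2.1, -0.1, -0.2), # X1
  Perimeter = c(2.3, 1.5, -0.2)   # X2
)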

Data table for example above

| $Y$ = diagnosis | $X_1$ = Concavity | $X_2$ = Perimeter | $X_3$ = Symmetry |
|---|---|---|---|
| M | 2.1 | 2.3 | 2.7 |
| M | -0.1 | 1.5 | -0.2 |
| B | -0.2 | -0.2 | 0.12 |
| ... | ... | ... | ... |

Where:

  • $Y$ is our class label/target/outcome/response variable

  • the $X$'s are our predictors/features/attributes/explanatory variables, and we have 3 of these

  • we have 569 observations (sets of measurements about tumour cells)
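
Continuing the sketch from before (hypothetical code, with values taken from the table above), the third predictor is just one more column:

# add the Symmetry column (X3) to the sketch tibble from before
cancer <- cancer %>%
  mutate(Symmetry = c(2.7, -0.2, 0.12))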

Classification data table (general)

What does our general data table look like in the classification setting?

| $Y$ | $X_1$ | $X_2$ | $X_3$ | ... | $X_p$ |
|---|---|---|---|---|---|
| $y_1$ | $x_{1,1}$ | $x_{1,2}$ | $x_{1,3}$ | ... | $x_{1,p}$ |
| $y_2$ | $x_{2,1}$ | $x_{2,2}$ | $x_{2,3}$ | ... | $x_{2,p}$ |
| ... | ... | ... | ... | ... | ... |
| $y_n$ | $x_{n,1}$ | $x_{n,2}$ | $x_{n,3}$ | ... | $x_{n,p}$ |

Where:

  • $Y$ is our class label/target/outcome/response variable

  • the $X$'s are our predictors/features/attributes/explanatory variables, and we have $p$ of these

  • we have $n$ observations
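
In this general setting, step 1 of the algorithm needs a distance between a new observation $x^*$ and each training observation $x_i$. One standard choice with $p$ predictors (and the one assumed in the sketch earlier) is the Euclidean, straight-line distance:

$$\text{dist}(x^*, x_i) = \sqrt{\sum_{j=1}^{p} \left(x^*_j - x_{i,j}\right)^2}$$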

Introduction to caret package in R

Steps to doing k-nn with caret in R:

  1. Split your data table of training data into $Y$ (make this a vector) and $X$'s (make this a data.frame, not a tibble)

  2. "Fit" your model to the data by:

  • choosing $k$ and creating a data.frame with one column (named k) and one value (your choice for $k$)

  • using train and feeding it $X$, $Y$, the method ("knn"), and $k$

  3. Predict $\hat{Y}$ using your model by using predict and passing it your model object and the new observation (as a data.frame)

Code example:

  1. Split your data table of training data into $Y$ and $X$'s

cancer_train <- cancer %>% select("Perimeter", "Concavity")
cancer_labels <- cancer %>% select(Class) %>% unlist()

  2. "Fit" your model to the data:

k <- data.frame(k = 5)
model_knn <- train(x = data.frame(cancer_train), y = cancer_labels,
                   method = 'knn', tuneGrid = k)

  3. Predict $\hat{Y}$ using your model:

new_obs <- data.frame(Perimeter = -1, Concavity = 4.2)
predict(object = model_knn, new_obs)
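
For a classification model like this one, predict returns the predicted class label for the new observation as a factor; here that would be one of the diagnosis labels from the table above (M or B), i.e. our $\hat{Y}$.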

Unanswered questions at this point:

  1. How do we choose $k$? (answer coming next week...)

  2. Is our model any good?

"All models are wrong, but some are useful" -- George Box

... but we should try to say how useful (more coming next week...)

Go forth and ... model?

# caret needs the e1071 package for classification with train(), so install it first
install.packages("e1071")

Class challenge

Suppose we have a new observation in the iris dataset, with petal length = 5 and petal width = 0.6. Using R and the caret package, how would you classify this observation based on its $k = 3$ nearest neighbours, using the predictors petal length and petal width?

library(tidyverse)
library(caret)
# split the training data into the labels (Y) and the predictors (X)
Y_train <- select(iris, Species) %>% unlist()
X_train <- select(iris, Petal.Length, Petal.Width) %>% data.frame()

# fit the k-nn model with k = 3
k <- data.frame(k = 3)
model_knn <- train(x = X_train, y = Y_train, method = 'knn', tuneGrid = k)

# classify the new observation (petal width = 0.6, as in the challenge)
new_obs <- data.frame(Petal.Length = 5, Petal.Width = 0.6)
predict(object = model_knn, new_obs)
options(repr.plot.width = 6, repr.plot.height = 3)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()
[Output: scatter plot of Petal.Width versus Petal.Length for the iris data, coloured by Species]