GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/worksheet_06/worksheet_06.ipynb

Kernel: R

Worksheet 6 - Classification

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

Recognize situations where a simple classifier would be appropriate for making predictions.
Explain the k-nearest neighbour classification algorithm.
Interpret the output of a classifier.
Compute, by hand, the distance between points when there are two explanatory variables/predictors.
Describe what a training data set is and how it is used in classification.
In a dataset with two explanatory variables/predictors, perform k-nearest neighbour classification in R using caret::train(method = "knn", ...) to predict the class of a single new observation.

This worksheet covers parts of Chapter 6 of the online textbook. You should read this chapter before attempting the worksheet.

In [ ]:

 ### Run this cell before continuing. 
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)

Question 0.1 Multiple Choice:

Which of the following statements is NOT true of a training data set (in the context of classification)?

A. A training data set is a collection of observations, where we know the class of each observation.

B. We can use a training set to explore and build our classifier.

C. The training data set is the underlying collection of observations for which we don't know the true classes.

Assign your answer to an object called answer1.

In [ ]:

# Assign your answer to an object called: answer1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer1), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 0.2 Multiple Choice

Adapted from James et al, "An introduction to statistical learning" (page 53)

Consider the scenario below:

We collect data on 20 similar products. For each product we have recorded whether it was a success or failure (labelled as such by the Sales team), price charged for the product, marketing budget, competition price, customer data, and ten other variables.

Which of the following is a classification problem?

A. We are interested in comparing the profit margins for products that are a success and products that are a failure.

B. We are considering launching a new product and wish to know whether it will be a success or a failure.

C. We wish to group customers based on their preferences and use that knowledge to develop targeted marketing programs.

Assign your answer to an object called answer2.

In [ ]:

# Assign your answer to an object called: answer2
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

1. Breast Cancer Data Set

We will work with the breast cancer data from this week's pre-reading.

Question 1.0

Read the clean-wdbc-data.csv file (found in the worksheet_05 directory) into the notebook and store it as a data frame. Name it cancer.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(cancer)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(nrow(cancer), 569)
    expect_equal(ncol(cancer), 12)
    expect_equal(digest(as.numeric(sum(cancer$Area))), 'a2c1855f3fa92423aa169c350fc95232') # we hid the answer to the test here so you can't see it, but we can still run the test           
})
print("Success!")

Question 1.1 True or False:

After looking at the first six rows of the cancer data frame, we ask you to predict the variable "area" for a new observation. Is this a classification problem?

Assign your answer to an object called answer1.1.

In [ ]:

# Assign your answer to an object called: answer1.1
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer1.1), 'd2a90307aac5ae8d0ef58e2fe730d38b') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.2

Create a scatterplot of the data with Symmetry on the x-axis and Radius on the y-axis. Modify your aesthetics by colouring for Class. As you create this plot, ensure you follow the guidelines for creating effective visualizations.

Assign your plot to an object called cancer_plot.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
cancer_plot

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(cancer_plot$mapping$x)), 'b1b55a59eb094370888619c320746c93')
    expect_equal(digest(rlang::get_expr(cancer_plot$mapping$y)), '6b4117c4cb5b6a1fccd1b1ba44cc4390')
    expect_true(digest(rlang::get_expr(cancer_plot$mapping$colour)) %in% c('a4abb3d43fde633563dd1f5c3ea31f31', 'f9e884084b84794d762a535f3facec85'))
    expect_true('GeomPoint' %in% class(rlang::get_expr(cancer_plot$layers[[1]]$geom)))
    })
print("Success!")

Question 1.3

Just by looking at the scatterplot above, how would you classify an observation with symmetry 1 and radius 1?

a) Benign
b) Malignant

Assign your answer to an object called answer1.3.

In [ ]:

# Assign your answer to an object called: answer1.3
# Make sure the correct answer is written fully (Benign / Malignant)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer1.3), '891e8a631267b478c03e25594808709d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Calculating the distance between points

Using R as a calculator and the formula below, compute the distance between the first and second observation in the breast cancer dataset using the explanatory variables/predictors symmetry and radius.

Recall we can calculate the distance between two points using the following formula: $Distance = \sqrt{(x_a -x_b)^2 + (y_a - y_b)^2}$
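As a quick illustration of the formula, here is a minimal sketch using two made-up points, $a = (0, 0)$ and $b = (3, 4)$. These values are not from the breast cancer data, and the object names (toy_xa, toy_ya, toy_xb, toy_yb) are hypothetical, chosen so they do not clash with the objects you will create below.

```r
# Two made-up points (assumed values, for illustration only)
toy_xa <- 0; toy_ya <- 0
toy_xb <- 3; toy_yb <- 4

# Straight-line (Euclidean) distance between the two points
toy_distance <- sqrt((toy_xa - toy_xb)^2 + (toy_ya - toy_yb)^2)
toy_distance  # 5, because sqrt(9 + 16) = sqrt(25)
```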

Question 1.4

Find coordinates for the two variables and assign them to objects called: xa (Symmetry value for the first row), ya (Radius value for the first row), xb (Symmetry value for the second row) and yb (Radius value for the second row).

Scaffolding for one coordinate (xa) is given. Follow the same pattern for the other three: ya (Radius, row 1), xb (Symmetry, row 2) and yb (Radius, row 2).

In [ ]:

# (xa <- filter(cancer, row_number()==1)  %>%  
#    select(Symmetry) %>%
#    unlist())

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(xa)), '218983ef51880f784c62ff2aedc196f3')
    expect_equal(digest(as.numeric(ya)), 'a1914c10445a398934c0e0015b9b18ae')
    expect_equal(digest(as.numeric(xb)), '5b34d8796880663f75ea423ccb4ea8cd')    
    expect_equal(digest(as.numeric(yb)), '4490c7a115f39cede8cd353713230e95')
    
})
print("Success!")

Question 1.5

Plug in the coordinates into the distance equation.

Assign your answer to an object called distance2.

Fill in the ... in the cell below, then replace the fail() with your finished answer.

In [ ]:

# ... <- sqrt((xa - ...)^2 + (ya - yb)^...)

# your code here
fail() # No Answer - remove if you provide an answer
distance2

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(as.numeric(distance2)), 'ab39ff487bddaa92a62eadbbe3e46da6') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Now we want to calculate the distance between the first and second observation in the breast cancer dataset using 3 explanatory variables/predictors: Symmetry, Radius and Concavity. Again, use the first two rows in the data set as the points you are calculating the distance between (point $a$ is row 1, and point $b$ is row 2).

Question 1.6

Find the coordinates for the third variable (Concavity) and assign them to objects called za and zb. Use the scaffolding given in Question 1.4 as a guide.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {   
    expect_equal(digest(as.numeric(zb)), 'b62bcabaf783e3e2d0745ca4a41219da')    
    expect_equal(digest(as.numeric(za)), '8f22ef4a815b2e1bd4f7ec511bbc30f2') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.6.1

Again, using R as a calculator, calculate the distance between the first and second observation in the breast cancer dataset using 3 explanatory variables/predictors: Symmetry, Radius and Concavity.

Assign your answer to an object called distance3. Use the scaffolding given to calculate distance2 as a guide.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
distance3

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(as.numeric(distance3)), '97c5e6129bc96a23ed7298d78bf7f8b2') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.7

Use the c function to create a vector for the coordinates for each point. Name one vector point_a and the other vector point_b. Within the vector, the order of coordinates should be: Symmetry, Radius, Concavity.

Fill in the ... in the cell below, then replace the fail() with your finished answer.

In [ ]:

# point_a <- filter(cancer, row_number() == 1) %>%
#    select(..., Radius, ...) %>%
#    unlist()

# This is only the scaffolding for one vector (you need to make another one for row number 2)

# your code here
fail() # No Answer - remove if you provide an answer
point_a
point_b

In [ ]:

test_that('Solution is incorrect', {        
    expect_equal(digest(as.numeric(sum(point_a))), '309d3b37c196b24341299aabdac15644')    
    expect_equal(digest(as.numeric(sum(point_b))), '00bb41bc0f538b06f627ffbd9874a6a8') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.8

Calculate the differences between the two vectors, point_a and point_b. The result should be a vector of length 3 named difference.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
difference

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(as.numeric(sum(difference))), 'ef1fc2c1e06df149b42dcfb47596319f') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.9

Square the differences between the two vectors, point_a and point_b. The result should be a vector of length 3 named dif_square. Hint: ^ is the exponent symbol in R.
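R applies arithmetic such as - and ^ to vectors one element at a time, which is what makes these step-by-step calculations short. A small sketch with two made-up vectors shows the pattern (toy_p and toy_q are hypothetical names and values, not worksheet objects):

```r
# Two made-up points with three coordinates each (assumed values)
toy_p <- c(1, 2, 2)
toy_q <- c(3, 5, 8)

toy_diff <- toy_p - toy_q   # elementwise difference: -2 -3 -6
toy_sq   <- toy_diff^2      # elementwise square:      4  9 36
toy_sum  <- sum(toy_sq)     # a single number:         49
toy_root <- sqrt(toy_sum)   # a single number:         7
```

Note that the difference and the squared difference have one element per coordinate, while the sum and its square root are single numbers.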

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
dif_square

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(as.numeric(sum(dif_square))), '0299530505a02b47c2a30af0ecd6026b') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.9.1

Sum the squared differences between the two vectors, point_a and point_b. The result should be a single number (a vector of length 1) named dif_sum.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
dif_sum

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(as.numeric(dif_sum)), '0299530505a02b47c2a30af0ecd6026b') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.9.2

Take the square root of the sum of your squared differences (calculated in Question 1.9.1). The result should be a single number (a vector of length 1) named root_dif_sum.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
root_dif_sum

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(as.numeric(root_dif_sum)), '97c5e6129bc96a23ed7298d78bf7f8b2') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.9.3

If we have more than a few points, calculating distances as we did in the previous questions is VERY slow. Let's use the dist() function to find the distance between the first and second observation in the breast cancer dataset using Symmetry, Radius and Concavity.

Fill in the ... in the cell below, then replace the fail() with your finished answer.

Assign your answer to an object called dist_cancer_two_rows.
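To see what dist() returns, here is a minimal sketch on a made-up data frame with two rows and three numeric columns. The name toy_points, its column names and its values are all assumptions for illustration. By default dist() computes the straight-line (Euclidean) distance between each pair of rows.

```r
# A made-up data frame: each row is a point, each column a coordinate
toy_points <- data.frame(
  u = c(1, 2),
  v = c(1, 3),
  w = c(1, 3)
)

# dist() returns the distances between all pairs of rows
toy_dist <- dist(toy_points)

# The same distance computed by hand with the formula
toy_manual <- sqrt((1 - 2)^2 + (1 - 3)^2 + (1 - 3)^2)

as.numeric(toy_dist)  # 3
toy_manual            # 3
```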

In [ ]:

# ... <- head(cancer, 2)  %>% 
#    select(..., ..., Concavity)  %>% 
#    dist()

# your code here
fail() # No Answer - remove if you provide an answer
dist_cancer_two_rows

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(as.numeric(dist_cancer_two_rows)), '97c5e6129bc96a23ed7298d78bf7f8b2') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 1.9.4 True or False:

Compare distance3, root_dif_sum, and dist_cancer_two_rows.

Are they all the same value?

Assign your answer to an object called answer1.9.4.

In [ ]:

# Assign your answer to an object called: answer1.9.4
# Make sure the correct answer is written in lower-case (true / false)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {    
    expect_equal(digest(answer1.9.4), '05ca18b596514af73f6880309a21b5dd') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Classification - a simple example done manually

Question 2.0

Let's take a random sample of 5 observations from the breast cancer dataset using the sample_n function. To make this random sample reproducible, we will use set.seed(2). This means that the random number generator will start at the same point each time when we run the code and we will always get back the same random samples.
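The effect of set.seed can be checked with a small sketch. It uses base R's sample() on the numbers 1 to 100 rather than sample_n on the cancer data, and the object names are hypothetical:

```r
# Draw 5 numbers after setting the seed
set.seed(2)
toy_first <- sample(1:100, 5)

# Reset the seed to the same value and draw again
set.seed(2)
toy_second <- sample(1:100, 5)

# Both draws are the same because the generator started from the same point
identical(toy_first, toy_second)  # TRUE
```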

We will focus on the predictors Symmetry and Radius only. Thus, we will need to select the columns Symmetry and Radius and Class. Save these 5 rows and 3 columns to a data frame named small_sample.

Finally, create a scatter plot where Symmetry is on the x-axis, and Radius is on the y-axis. Color the points by Class. Name your plot small_sample_plot.

Fill in the ... in the scaffolding provided below.

In [ ]:

#set.seed(2)                           
#... <- sample_n(cancer, 5) %>%  
#    select(...) 

#... <- ...%>%   
#    ggplot(...) + 
#        geom_...() +
#        ...

# your code here
fail() # No Answer - remove if you provide an answer
small_sample_plot

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(small_sample_plot$mapping$x)), 'b1b55a59eb094370888619c320746c93')
    expect_equal(digest(rlang::get_expr(small_sample_plot$mapping$y)), '6b4117c4cb5b6a1fccd1b1ba44cc4390')
    expect_true(digest(rlang::get_expr(small_sample_plot$mapping$colour)) %in% c('a4abb3d43fde633563dd1f5c3ea31f31', 'f9e884084b84794d762a535f3facec85'))
    expect_true('GeomPoint' %in% class(rlang::get_expr(small_sample_plot$layers[[1]]$geom)))
    })
print("Success!")

Question 2.1

Suppose we are interested in classifying a new observation with Symmetry = 0 and Radius = 0.25, but unknown Class. Using the small_sample data frame, add another row with Symmetry = 0, Radius = 0.25, and Class = "unknown".

Fill in the ... in the scaffolding provided below.

Assign your answer to an object called newData.

In [ ]:

# newData <- ... %>%
#    add_row(Symmetry = ..., ... = 0.25, Class = ...)

# your code here
fail() # No Answer - remove if you provide an answer
newData

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(sum(as.numeric(newData$Radius))), '5c1af5711abdeed77edef29de6416924')
    expect_equal(digest(sum(as.numeric(newData$Symmetry))), '730bb9e72036915df0c5f16cb86eb669')
    })
print("Success!")

Question 2.2

Using the subset of 5 observations above, classify this new observation:

Symmetry = 0, Radius = 0.25, unknown Class

using the dist() function for $k = 1$ . Fill in the ... in the scaffolding provided below.

Assign your answer to an object called dist_matrix.

In [ ]:

# From the sample data plus the new observation, select the Symmetry and Radius columns.
# Calculate the distance between each pair of observations.
# Convert the result into a 6 x 6 matrix.

# ... <- newData %>%
#    ...(Symmetry, ...) %>% 
#    ...() %>%                   
#    as.matrix() 

# your code here
fail() # No Answer - remove if you provide an answer
dist_matrix

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.numeric(sum(dist_matrix[1, ]))), digest(sum(bc_matrix[, 1])))
    expect_equal(digest(as.numeric(sum(dist_matrix[2, ]))), digest(sum(bc_matrix[, 2])))
    expect_equal(digest(as.numeric(sum(dist_matrix[5, ]))), digest(sum(bc_matrix[, 5])))
    expect_equal(digest(as.numeric(sum(dist_matrix[6, ]))), digest(sum(bc_matrix[, 6])))
    })
print("Success!")

Question 2.3 Multiple Choice:

In the table above, the row and column numbers reflect the row number from the data frame the dist function was applied to. Thus numbers 1 - 5 are the points/observations from rows 1 - 5 in the small_sample data frame. Row 6 is the new observation, for which we do not know the diagnosis class. The values in dist_matrix are the distances between the points given by the row and column numbers. For example, the distance between point 2 and point 4 is 2.9759541, and the distance between point 3 and point 3 (the same point) is 0.
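Reading a distance matrix can be sketched on a made-up example. Below, toy_points holds three known points plus a "new" point in the last row. All names and values are assumptions for illustration and are unrelated to small_sample.

```r
# Rows 1-3 are known points; row 4 plays the role of the new observation
toy_points <- data.frame(
  x = c(0, 3, 1, 0.2),
  y = c(0, 4, 1, 0.1)
)

# Square matrix of pairwise distances; rows and columns are labelled 1 to 4
toy_matrix <- as.matrix(dist(toy_points))

toy_matrix[1, 2]  # distance between point 1 and point 2: 5
toy_matrix[3, 3]  # distance from a point to itself: 0

# Distances from the new point (row 4) to every other point
new_row <- nrow(toy_points)
toy_to_new <- toy_matrix[new_row, -new_row]

# Label of the closest known point
toy_nearest <- names(which.min(toy_to_new))
toy_nearest  # "1"
```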

Which observation is the nearest to our new point (smallest distance)?

Assign your answer to an object called answer2.3.

In [ ]:

# Assign your answer to an object called: answer2.3
# Make sure the correct answer is a number (1/2/3/4/5). 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.3), '5b58e040ee35f3bcc6023fb7836c842e') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.4 Multiple Choice:

Based on your answer above (with $k = 1$ ) and the dist_matrix table generated in Question 2.2, is the new data point benign or malignant?

Assign your answer to an object called answer2.4.

In [ ]:

# Assign your answer to an object called: answer2.4
# Make sure the correct answer is written fully (Benign / Malignant)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.4), '891e8a631267b478c03e25594808709d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Using the subset of 5 observations above, classify this new observation:

Symmetry = 0, Radius = 0.25, Class = 'unknown'

using the dist() function for $k = 3$ .
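To see how the choice of $k$ can change a prediction, here is a minimal by-hand sketch on made-up data. The data frame toy_train, its labels "A" and "B", the new point at (2, 1) and the helper toy_knn_vote are all hypothetical; this is not how caret implements k-nn, just the idea of finding the nearest points and counting their labels.

```r
# Made-up training points with known labels (assumed values)
toy_train <- data.frame(
  x = c(1.5, 2, 3.5, 6, 7),
  y = c(1,   2, 1,   6, 7),
  label = c("B", "A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Made-up new observation
new_x <- 2
new_y <- 1

# Distance from the new observation to every training point
toy_d <- sqrt((toy_train$x - new_x)^2 + (toy_train$y - new_y)^2)

# Hypothetical helper: most common label among the k nearest points
# (which.max returns the first label if there is a tie)
toy_knn_vote <- function(k) {
  nearest <- order(toy_d)[1:k]
  votes <- table(toy_train$label[nearest])
  names(votes)[which.max(votes)]
}

toy_knn_vote(1)  # "B": the single closest point is labelled B
toy_knn_vote(3)  # "A": two of the three closest points are labelled A
```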

Question 2.5 Multiple Choice:

What are the three closest observations to your new point?

A. 1, 4, 5

B. 1, 3, 2

C. 5, 2, 1

D. 3, 4, 2

Assign your answer to an object called answer2.5.

In [ ]:

# Assign your answer to an object called: answer2.5
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.5), '3a5505c06543876fe45598b5e5e5195d') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.6 Multiple Choice:

Based on your answer above (with $k = 3$ ) and the dist_matrix table generated in Question 2.2, is the new data point benign or malignant?

Assign your answer to an object called answer2.6.

In [ ]:

# Assign your answer to an object called: answer2.6
# Make sure the correct answer is written fully (Benign / Malignant)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.6), '9c8cb5538e7778bf0b1bd53e45fb78c9') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Question 2.7

Compare your answers in 2.4 and 2.6. Are they the same?

Assign your answer to an object called answer2.7.

In [ ]:

# Assign your answer to an object called: answer2.7
# Make sure the correct answer is written in lower-case (yes / no)
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer2.7), '863dfc36ab2bfe97404cc8fc074a5241') # we hid the answer to the test here so you can't see it, but we can still run the test
    
})
print("Success!")

Using `caret` to perform k-nearest neighbours

Now that we understand how k-nearest neighbours classification works, let's get familiar with how we can use the caret R package to do this, so we can get more done faster and with fewer errors.

We'll again focus on Radius and Symmetry as the two predictors. And this time we would like to predict the class of a new observation with Symmetry = 1 and Radius = 0. This one is a bit tricky to do visually from the plot below, and so is a motivating example for us to compute the prediction using k-nn with the caret package. Let's use k = 7.

In [ ]:

cancer_plot

Question 3.0

Create the two objects needed to train a model using the caret package: a data.frame containing the predictors Symmetry and Radius, and a vector containing the Class. Name the data.frame containing the predictors X_train and the vector containing the classes/labels Y_train.

Hints:

use data.frame to make tibbles into data.frames
use unlist to make a single columned tibble into a vector

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(X_train)
head(Y_train)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(class(X_train), 'data.frame')
    expect_equal(ncol(X_train), 2)
    expect_equal(nrow(X_train), 569)
    expect_true('Symmetry' %in% colnames(X_train))
    expect_true('Radius' %in% colnames(X_train))
    expect_equal(class(Y_train), 'character')
    expect_equal(length(Y_train), 569)
    
})
print("Success!")

Question 3.1

Next we "train" our model: we tell caret which columns are the predictors/X's and which is the target/outcome/Y, as well as what value of $k$ we are using. Use $k = 7$ and assign the trained model to an object called model_knn.
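The general shape of a caret k-nn call can be sketched on R's built-in iris data rather than the cancer data. The object names below (toy_X, toy_Y, toy_model, toy_new, toy_prediction) are hypothetical, and the trControl setting is an assumption made here to skip caret's default resampling and fit the model once; it is not something the worksheet asks for.

```r
library(caret)

# Predictors as a data.frame and labels as a vector (built-in iris data)
toy_X <- data.frame(iris[, c("Petal.Length", "Petal.Width")])
toy_Y <- iris$Species

# tuneGrid is a data.frame with one column named k holding the value(s) of k to use
toy_model <- train(
  x = toy_X,
  y = toy_Y,
  method = "knn",
  tuneGrid = data.frame(k = 7),
  trControl = trainControl(method = "none")  # assumption: fit once, no resampling
)

# A new observation must have the same column names as the predictors
toy_new <- data.frame(Petal.Length = 1.5, Petal.Width = 0.3)
toy_prediction <- predict(toy_model, toy_new)
toy_prediction
```

The new observation here sits well inside the cluster of setosa flowers, so its 7 nearest neighbours are all setosa.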

In [ ]:

# k <- ...
# ... <- train(x = ..., y = ..., method = ..., tuneGrid = ...)

# your code here
fail() # No Answer - remove if you provide an answer
print(model_knn)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(as.numeric(model_knn$results$k), 7)
    expect_equal(as.character(model_knn$method), 'knn')
    expect_equal(digest(as.numeric(sum(model_knn$trainingData$Symmetry))), '47d0e881a9a1b19e57f9c068c08765fa')
    expect_equal(digest(as.numeric(sum(model_knn$trainingData$Radius))), '5818709a65b4a5df9cb392b9cc66e32b')
    expect_equal(as.numeric(summary(model_knn$trainingData$.outcome)[1]), 357)
})
print("Success!")

Question 3.2

Create a data.frame with our new observation (Symmetry = 1 and Radius = 0) and predict the label of the new observation using the predict function. Assign the prediction to an object called predicted_knn_7.

In [ ]:

# new_obs <- data.frame(..., ...)
# ... <- predict(object = ..., new_obs)

# your code here
fail() # No Answer - remove if you provide an answer
print(predicted_knn_7)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(predicted_knn_7)), '5f0922939c45ef1054f852e83f91c660')
})
print("Success!")

Looking back at the plot (shown again below), is this what you would have been able to guess visually? And do you think $k = 7$ was the "best" value to choose for $k$ ? Think on this, and we will discuss it next week.

In [ ]:

cancer_plot

Question 3.3

Perform k-nn classification again, using the caret package and $k=7$, to classify a new observation (whose diagnosis class we do not know) with Symmetry = 1, Radius = 0 and Concavity = 1. Store the output of predict in an object called predicted_3_knn_7.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
print(predicted_3_knn_7)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(predicted_3_knn_7)), '5f0922939c45ef1054f852e83f91c660')
})
print("Success!")

Question 3.4

Finally, perform k-nn classification again, using the caret package and $k=7$, to classify a new observation (whose diagnosis class we do not know) for which we have measurements of all the predictors in our training data set (we give you the values in the code below). Store the output of predict in an object called predicted_all_knn_7.

Hint: ID is not a measurement, but a label for each observation. Thus, do not include this in your analysis.

In [ ]:

new_obs_all <- data.frame(Radius = 0, 
                        Texture = 0, 
                        Perimeter = 0, 
                        Area = 0, 
                        Smoothness = 0.5, 
                        Compactness = 0,
                        Concavity = 1,
                        Concave_points = 0,
                        Symmetry = 1, 
                        Fractal_dimension = 0)


# your code here
fail() # No Answer - remove if you provide an answer
print(predicted_all_knn_7)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(predicted_all_knn_7)), '3a5505c06543876fe45598b5e5e5195d')
})
print("Success!")

Reviewing some concepts

Here are two multiple choice questions to finish with. They review and reinforce some key concepts of classification with k-nn:

Question 4.0

In the k-nn classification algorithm, we calculate the distance between the new observation (for which we are trying to predict the class/label/outcome) and each of the observations in the training data set so that we can:

A. Find the $k$ nearest neighbours of the new observation

B. Assess how well our model fits the data

C. Find outliers

D. Assign the new observation to a cluster

Assign your answer to an object called: answer4.0

In [ ]:

# Assign your answer to an object called: answer4.0
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer4.0)), '75f1160e72554f4270c809f041c7a776')
})
print("Success!")

Question 4.1

In the k-nn classification algorithm, we choose the label/class for a new observation by:

A. Taking the mean (average value) label/class of the $k$ nearest neighbours

B. Taking the median (middle value) label/class of the $k$ nearest neighbours

C. Taking the mode (value that appears most often) label/class of the $k$ nearest neighbours

Assign your answer to an object called: answer4.1

In [ ]:

# Assign your answer to an object called: answer4.1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(answer4.1)), '475bf9280aab63a82af60791302736f6')
})
print("Success!")
