Worksheet 7 (fix) - Classification (Part II)
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Describe what a test data set is and how it is used in classification.
Using R, evaluate classification accuracy using a test data set and appropriate metrics.
Using R, execute cross-validation to choose the number of neighbours.
Identify when it is necessary to scale variables before classification, and do this using R.
In a dataset with > 2 attributes, perform k-nearest neighbour classification in R using the caret package to predict the class of a test dataset.
Describe advantages and disadvantages of the k-nearest neighbour classification algorithm.
This worksheet covers parts of Chapter 7 of the online textbook. You should read this chapter before attempting the worksheet.
Question 1 Multiple Choice:
Before applying k-nearest neighbour to a classification task, we need to scale the data. What is the purpose of this step?
A. To help improve the processing power of the knn algorithm.
B. To convert all data observations to numeric values.
C. To ensure all data observations will be on a comparable scale and contribute equal shares to the calculation of the distance between points.
D. None of the above.
Assign your answer to an object called answer1.
1. Fruit Data Example (Part II)
Load the file, fruit_data.csv, into your notebook. mutate() the fruit_name column such that it is a factor. Assign your data to an object called fruit_data.
Let's take a look at the first six observations in the fruit dataset. Run the cell below.
Just by looking at the scatterplot, find the nearest neighbour to the first observation based on mass and width (the first observation has been circled for you). Run the cell below.
Question 1.1 Based on the graph generated, what is the fruit_name of the closest data point to the one circled?
A. apple
B. lemon
C. mandarin
D. orange
Assign your answer to an object called answer1.1.
Question 1.2
Using mass and width, calculate the distance between the first observation and the second observation.
We provide scaffolding to get you started.
Assign your answer to an object called fruit_dist_2.
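If you want to check your work against the scaffolding, here is a minimal sketch of one way this calculation could look, assuming fruit_data has numeric mass and width columns as shown in the data frame above:

```r
library(tidyverse)  # for slice() and the pipe

# Euclidean distance between observations 1 and 2 on mass and width.
first  <- slice(fruit_data, 1)
second <- slice(fruit_data, 2)
fruit_dist_2 <- sqrt((first$mass - second$mass)^2 +
                     (first$width - second$width)^2)
fruit_dist_2
```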
Question 1.3
Calculate the distance between the first and the 44th observation in the fruit dataset using the mass and width variables. You can see from the data frame output in the cell below that observation 44 has mass = 194 g and width = 7.2 cm.
Assign your answer to an object called fruit_dist_44.
Question 1.4
Discuss with the person sitting next to you.
i) What do you notice about the distances you just calculated in Questions 1.2 and 1.3?
(Hint: look at where the observations are on the scatterplot in the cell above this question)
ii) Is it what you would expect? Why or why not?
(Hint: what might happen if we changed grams into kilograms to measure the mass?)
When you finish your discussion, read the cell below:
The distance between the first and second observations is 12.01, and the distance between the first and 44th observations is 2.33. So by the formula, observations 1 and 44 are closer. However, if we look at the scatterplot, the first observation appears closer to the second observation than to the 44th because of the axis scales.
Because the classifier predicts class by identifying the nearest points, the scale of the variables matters. Variables on a large scale compared to variables on a small scale will have a greater effect on the distance between the observations. Here we have width (measured in cm) and mass (in grams). As far as knn is concerned, a difference of 12 g in mass between observation 1 and 2 is large compared to a difference of 1.2 cm in width between observation 1 and 44. Consequently, mass will drive the classification results, and width will have less of an effect. Hence, our distance calculation reflects that. Also, if we measured mass in kilograms, or if we measured width in meters, then we’d get different classification results. Thus we can standardize the data so that all variables will be on a comparable scale.
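As a hedged sketch of what standardization can look like in R (scale() centres a variable and divides it by its standard deviation), using the scaled_ column names that later questions refer to:

```r
# One way to standardize variables: scale() centres each column and
# divides by its standard deviation; as.numeric() drops the matrix wrapper.
fruit_data <- fruit_data %>%
  mutate(scaled_mass  = as.numeric(scale(mass)),
         scaled_width = as.numeric(scale(width)))
```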
Question 1.5
Scale all the variables of the fruit dataset and save them as columns in your data table.
Keep the original dataset name, fruit_data.
Question 1.6
Let's repeat Questions 1.2 and 1.3 with the scaled variables. Calculate the distance between observations 1 and 2 using the scaled mass and width variables. Then calculate the distance between observations 1 and 44 using the same scaled variables.
After you do this, think about how these distances compare to the distances you computed in Questions 1.2 and 1.3 for the same points.
Assign your answers to objects called distance_2 and distance_44, respectively.
Splitting the data into a training and test set
Next, we will be partitioning the data into a training (70%) and testing (30%) set using the caret package. We will put this test set away in a lock box and not touch it again until we have found the best k-nn classifier we can make using the training set.
Question 2.0
To do this we first use the createDataPartition function to get the row numbers of the data we should include in our training set. This function uses a random process, so to ensure replicable results we need to set a seed using set.seed to tell the random number generator where we'd like to start from. Name the object you create training_rows.
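A minimal sketch of what this could look like (the seed value below is arbitrary; any fixed value makes the split replicable):

```r
library(caret)

set.seed(1234)  # arbitrary seed for a replicable split
# 70% of rows, stratified by fruit_name; list = FALSE returns row numbers.
training_rows <- createDataPartition(fruit_data$fruit_name,
                                     p = 0.70, list = FALSE)
```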
Question 2.1
Next we use the slice function to get the rows from the original data frame that match the ones we have in training_rows.
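For illustration, a sketch assuming training_rows was created with list = FALSE as above (it comes back as a one-column matrix, so we flatten it first); the names fruit_train and fruit_test are assumptions:

```r
# Rows listed in training_rows form the training set; the remaining rows
# form the test set (negative indices drop rows).
fruit_train <- fruit_data %>% slice(as.vector(training_rows))
fruit_test  <- fruit_data %>% slice(-as.vector(training_rows))
```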
Using cross-validation to choose k
Let's start with a simple classifier, one that uses only scaled_color_score and scaled_mass as predictors. fruit_name should be the class label. As we build this simple classifier from the training set, let's use cross-validation to choose the best k.
Question 2.2
We now need to take our training data and specify which columns are going to be the predictors and which are going to be the class labels. Name the predictors X_simple and the class labels Y_fruit.
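One way this could look, assuming the training data frame is called fruit_train as in the sketch above (caret's train function expects the predictors as a data frame and the labels as a vector):

```r
# Predictors as a data frame; class labels as a (factor) vector.
X_simple <- fruit_train %>%
  select(scaled_color_score, scaled_mass) %>%
  data.frame()
Y_fruit <- fruit_train %>%
  select(fruit_name) %>%
  unlist()
```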
Question 2.3
Next, we need to create a data frame, named ks, that contains a single column, named k, that holds the k values we'd like to try out. Let's try the values 1, 3, 5, 7, 9 and 11.
Hint - the c function is useful for creating vectors, which are what data frame columns are.
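A minimal sketch of that data frame:

```r
# Candidate numbers of neighbours to evaluate with cross-validation.
ks <- data.frame(k = c(1, 3, 5, 7, 9, 11))
```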
Question 2.4
Next we use the trainControl function. This function passes additional information to the train function we use to create our classifier. Here we would like to set the arguments to method = "cv" (for cross-validation) and number = 10 (for 10-fold cross-validation). Name this object train_control.
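A sketch of that call:

```r
# 10-fold cross-validation settings, passed to train() via trControl below.
train_control <- trainControl(method = "cv", number = 10)
```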
Question 2.5
Now we create our classifier as we did last week, but to do cross-validation as well (so we can assess classifier accuracy based on each k) we supply an additional argument to the train function, trControl. For that argument we pass it the name of the object we created using the trainControl function. Name the classifier choose_k.
Then, to help us choose k, it is very useful to visualize the accuracies as we increase k. This will help us choose the smallest k with the biggest accuracy. To do this, create a line and point plot of accuracy (y-axis) versus k (x-axis). We can get these values from the results attribute of the classifier object using the $ operator. We demonstrate this in the cell below:
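A hedged sketch of this step, assuming the X_simple, Y_fruit, ks and train_control objects created above (the seed value is again arbitrary):

```r
set.seed(1234)  # arbitrary seed; cross-validation folds are chosen randomly
choose_k <- train(x = X_simple, y = Y_fruit,
                  method = "knn",          # k-nearest neighbours
                  tuneGrid = ks,           # the k values to try
                  trControl = train_control)

# The per-k cross-validation accuracies live in the results attribute.
accuracies <- choose_k$results
accuracies
```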
Question 2.6
Now that we have the accuracy and k values in a data frame, create a line and point plot of accuracy (y-axis) versus k (x-axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_plot.
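A minimal ggplot2 sketch, assuming the accuracies data frame from above (caret capitalizes the Accuracy column in its results):

```r
choose_k_plot <- ggplot(accuracies, aes(x = k, y = Accuracy)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbours (k)",
       y = "Cross-validation accuracy")
choose_k_plot
```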
Question 2.7
From the plot of accuracy versus k you created above, which k should we choose?
Assign the value of k we should choose to a variable named answer2.7.
Question 2.8
What is the cross-validation accuracy for the optimal k? Give at least 3 decimal places for your answer.
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer2.8.
Training error as a diagnostic tool
Is this the best we can do with our classifier? Maybe, or maybe not. To get a hint we can use the training error as a diagnostic to tell us if we are underfitting and could afford to make our model more complex, say by including additional predictors.
Question 3.0
Create another simple classifier object (same columns as the classifier above) using the train function that does not use cross-validation, and only a single k value of 5. Name it simple.
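One hedged reading of this, with a one-row tuneGrid fixing k = 5 so there is nothing to tune (note that, unless told otherwise, caret still runs its default bootstrap resampling internally; no k is being chosen by it here):

```r
# Fit a k-nn classifier with k fixed at 5; the single-row tuneGrid
# means there is no grid of k values to cross-validate over.
simple <- train(x = X_simple, y = Y_fruit,
                method = "knn",
                tuneGrid = data.frame(k = 5))
```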
Question 3.1
Use the simple classifier to predict labels for all the observations in the training set (X_simple). Name the predictions training_pred.
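A sketch of that prediction step:

```r
# Predict the class of every training observation with the k = 5 model.
training_pred <- predict(simple, X_simple)
```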
Question 3.2
Use the confusionMatrix function to obtain the training accuracy. The confusionMatrix function takes two arguments, the predictions and the true class labels.
Name the output object training_results.
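A sketch, assuming the objects from the previous questions:

```r
# Compare the predicted labels against the true training labels.
training_results <- confusionMatrix(training_pred, Y_fruit)
training_results
```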
Question 3.3
From the output of the confusionMatrix function, what is the training accuracy? Give the answer to at least 3 decimal places.
Assign the value of the training accuracy to a variable named answer3.2.
Improving the classifier beyond changing k
As long as the training accuracy is not 1 (or very close to it) we may be able to further improve the classifier by adding predictors. This is not a guarantee, but something worth trying. When we do this, we also need to re-choose k, as the optimal k may change with a different number of predictors.
Question 4.0
Create a new classifier called complex that uses scaled_mass, scaled_width, scaled_height and scaled_color_score as predictors. Again, try the values 1, 3, 5, 7, 9 and 11 for k and use 10-fold cross-validation.
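A hedged sketch, reusing ks and train_control from before and the fruit_train data frame from the earlier splitting sketch; the X_complex name is an assumption:

```r
# Predictors now include all four scaled variables.
X_complex <- fruit_train %>%
  select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
  data.frame()

set.seed(1234)  # arbitrary seed for reproducible folds
complex <- train(x = X_complex, y = Y_fruit,
                 method = "knn",
                 tuneGrid = ks,
                 trControl = train_control)
```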
Question 4.1
Get the accuracy and k values from the classifier and create a line and point plot of accuracy (y-axis) versus k (x-axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_again_plot.
Question 4.2
From the plot of accuracy versus k you created above, which k should we choose for this more complex classifier?
Assign the value of k we should choose to a variable named answer4.2.
Question 4.3
What is the cross-validation accuracy for the optimal k for this more complex classifier?
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer4.3.
Question 4.4
Did increasing the classifier complexity improve the cross-validation accuracy?
Answer by assigning the value of "True" or "False" to a variable named answer4.4.
Assessing test accuracy
How good is our model? Assessing the accuracy score on a test data set that was never used to choose our classifier is the only way to know. Let's do that!
Question 5.0
Now that we have chosen the optimal model, re-train your classifier on the entire training data set (i.e., do not use cross-validation) with the "settings" that made it the optimal model (here, k and the number of predictors). Name your classifier object final_classifier.
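A hedged sketch, supposing the more complex model won out; best_k below is a placeholder for whatever k Question 4.2 gave, not a prescribed value:

```r
best_k <- 5  # placeholder: substitute the k chosen in Question 4.2

final_classifier <- train(x = X_complex, y = Y_fruit,
                          method = "knn",
                          tuneGrid = data.frame(k = best_k))
```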
Question 5.1
Now use the final_classifier to predict the labels for the test set, and then calculate the test accuracy. Name the output from the confusionMatrix function test_results.
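A hedged sketch, assuming fruit_test from the earlier splitting sketch and the same four scaled predictors; the X_complex_test and Y_fruit_test names are assumptions:

```r
# Build the test predictors and labels the same way as for training.
X_complex_test <- fruit_test %>%
  select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
  data.frame()
Y_fruit_test <- fruit_test %>%
  select(fruit_name) %>%
  unlist()

test_pred <- predict(final_classifier, X_complex_test)
test_results <- confusionMatrix(test_pred, Y_fruit_test)
test_results
```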
Question 5.2
What is the test accuracy for the final classifier?
Assign the value of the test accuracy for the final classifier to a variable named answer5.2.