Path: blob/master/2019-fall/materials/worksheet_07/worksheet_07.ipynb
Worksheet 7 - Classification (Part II)
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Describe what a test data set is and how it is used in classification.
Using R, evaluate classification accuracy using a test data set and appropriate metrics.
Using R, execute cross-validation to choose the number of neighbours.
Identify when it is necessary to scale variables before classification and do this using R.
In a dataset with > 2 attributes, perform k-nearest neighbour classification in R using the caret package to predict the class of a test dataset.
Describe advantages and disadvantages of the k-nearest neighbour classification algorithm.
This worksheet covers parts of Chapter 7 of the online textbook. You should read this chapter before attempting the worksheet.
Question 0.1 Multiple Choice:
{points: 1}
Before applying k-nearest neighbour to a classification task, we need to scale the data. What is the purpose of this step?
A. To help speed up the knn algorithm.
B. To convert all data observations to numeric values.
C. To ensure all data observations will be on a comparable scale and contribute equal shares to the calculation of the distance between points.
D. None of the above.
Assign your answer to an object called answer1.
Note: we typically standardize (i.e., scale and center) the data before doing classification. For the K-nearest neighbour algorithm specifically, centering has no effect. But it doesn't hurt, and can help with other predictive data analyses, so we will do it below.
1. Fruit Data Example - (Part II)
Question 1.0
{points: 1}
Load the file fruit_data.csv into your notebook. mutate() the fruit_name column so that it is a factor. Assign your data to an object called fruit_data.
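If you are unsure where to start, here is a minimal sketch, assuming the tidyverse is loaded and fruit_data.csv sits in the same directory as the notebook:

    # read the CSV, then convert fruit_name from character to factor
    fruit_data <- read_csv("fruit_data.csv") %>%
        mutate(fruit_name = as_factor(fruit_name))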
Let's take a look at the first six observations in the fruit dataset. Run the cell below.
Run the cell below, and find the nearest neighbour to the first observation based on mass and width, just by looking at the scatterplot (the first observation has been circled for you).
Question 1.1 Multiple Choice:
{points: 1}
Based on the graph generated, what is the fruit_name of the closest data point to the one circled?
A. apple
B. lemon
C. mandarin
D. orange
Assign your answer to an object called answer1.1.
Question 1.2
{points: 1}
Using mass and width, calculate the distance between the first observation and the second observation.
We provide scaffolding to get you started.
Assign your answer to an object called fruit_dist_2.
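If you get stuck, here is one way to compute a two-dimensional Euclidean distance by hand (pulling the rows out with slice is just one option):

    # extract the two observations, then take the square root of the
    # summed squared differences in mass and width
    obs_1 <- fruit_data %>% slice(1)
    obs_2 <- fruit_data %>% slice(2)
    fruit_dist_2 <- sqrt((obs_1$mass - obs_2$mass)^2 +
                         (obs_1$width - obs_2$width)^2)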
Question 1.3
{points: 1}
Calculate the distance between the first and the 44th observation in the fruit dataset using the mass and width variables.
You can see from the data frame output in the cell below that observation 44 has mass = 194 g and width = 7.2 cm.
Assign your answer to an object called fruit_dist_44.
Question 1.4
What do you notice about the answers from Questions 1.2 and 1.3 that you just calculated? Are they what you would expect given the scatterplot above? Why or why not?
Hint: Look at where the observations are on the scatterplot in the cell above this question, and think about what might happen if we changed grams into kilograms to measure the mass.
YOUR ANSWER HERE
Question 1.5
{points: 1}
Scale all the variables of the fruit dataset and save them as columns in your data table.
Save the dataset object using the same name as before, i.e., fruit_data. Make sure to name the new columns scaled_* where * is the old column name.
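A sketch of the pattern for a single column; repeat it for each variable. Note that scale() returns a one-column matrix, so as.numeric() is used here to keep the new column a plain vector:

    # standardize mass: subtract the mean, divide by the standard deviation
    fruit_data <- fruit_data %>%
        mutate(scaled_mass = as.numeric(scale(mass)))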
Question 1.6
{points: 1}
Let's repeat Questions 1.2 and 1.3 with the scaled variables:
calculate the distance with the scaled mass and width variables between observations 1 and 2
calculate the distances with the scaled mass and width variables between observations 1 and 44
After you do this, think about how these distances compare to the distances you computed in Questions 1.2 and 1.3 for the same points.
Assign your answers to objects called distance_2 and distance_44, respectively.
Randomness and Setting Seeds
The remaining material in the worksheet uses functions from the caret library, many of which make use of randomness (for many purposes: resolving ties in the nearest neighbour vote, splitting the data, balancing, etc.). In order to ensure that the steps in the worksheet are reproducible, we need to set a seed, i.e., a numerical "starting value," which determines the sequence of random numbers R will generate.
Below, in many cells, we have included a call to set.seed. Do not remove these lines of code; they are necessary to make sure the autotesting code functions properly.
Optional extra info for those who are curious: the reason we have set.seed in so many places is that Jupyter notebooks are organized into cells that can be run out of order. Since things can be run out of order, the exact sequence of random values that is used in each cell is hard to determine, which makes autotesting really difficult. We had two options: either enforce that you only ever run the code by hitting "Restart & Run All" to ensure that we get the same values of randomness each time, or put set.seed in a lot of places (we chose the latter). One drawback of calling set.seed everywhere is that the numbers that will be generated won't really be random. For the purposes of teaching and learning, that is fine here. But in a typical data analysis, you should really only call set.seed once at the beginning of the analysis, so that your random numbers are actually reasonably random.
2. Splitting the data into a training and test set
Next, we will be partitioning the data into a training (75%) and testing (25%) set using the caret package. We will put this test set away in a lock box and not touch it again until we have found the best k-nn classifier we can make using the training set.
Question 2.0
{points: 1}
To do this we first use the createDataPartition function to get the row numbers of the data we should include in our training set. Name the object you create training_rows.
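A minimal sketch, assuming the split should be stratified by the class label fruit_name:

    set.seed(1234)  # hypothetical seed; keep whatever seed the cell already provides
    # select 75% of the rows, stratified by fruit_name;
    # list = FALSE returns a matrix of row numbers rather than a list
    training_rows <- createDataPartition(fruit_data$fruit_name, p = 0.75, list = FALSE)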
Question 2.1
{points: 1}
Next we use the slice function to get the rows from the original data frame that match the ones we have in training_rows. The goal is to create one object for the training data (training_set) and one for the testing data (testing_set) using the rows that we have designated via createDataPartition.
Use the scaffolding provided. Name the two subsets of data training_set and testing_set.
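One way to fill in that scaffolding (flattening the matrix first, since slice expects a plain integer vector; negative indices drop the training rows, leaving the test set):

    row_ids <- as.vector(training_rows)             # flatten the one-column matrix
    training_set <- fruit_data %>% slice(row_ids)   # rows chosen for training
    testing_set  <- fruit_data %>% slice(-row_ids)  # everything else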
Using cross-validation to choose k
Let's start with a simple classifier, one that uses only scaled_color_score and scaled_mass as predictors. fruit_name should be the class label. As we build this simple classifier from the training set, let's use cross-validation to choose the best k.
Question 2.2
{points: 1}
We now need to take our training data and specify which columns are going to be the predictors and which are going to be the class labels. Name the predictors X_simple and the class labels Y_fruit.
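A sketch, assuming train() expects the predictors as a data frame and the labels as a bare factor vector:

    # predictors: just the two scaled columns, as a data frame
    X_simple <- training_set %>%
        select(scaled_color_score, scaled_mass) %>%
        data.frame()
    # class labels: unlist() turns the one-column selection into a factor vector
    Y_fruit <- training_set %>%
        select(fruit_name) %>%
        unlist()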
Question 2.3
{points: 1}
Next, we need to create a data frame, named ks, that contains a single column, named k, that holds the k-values we'd like to try out. Let's try the values 1, 3, 5, 7, 9 and 11.
Hint: the c function is useful for creating vectors, and a data frame column is just a vector.
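One way to build it:

    # one row per candidate value of k
    ks <- data.frame(k = c(1, 3, 5, 7, 9, 11))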
Question 2.4
{points: 1}
Next, we use the trainControl function. This function passes additional information to the train function we use to create our classifier. Here we would like to set the arguments method = "cv" (for cross-validation) and number = 10 (for 10-fold cross-validation). Name this object train_control.
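A sketch of the call:

    # 10-fold cross-validation
    train_control <- trainControl(method = "cv", number = 10)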
Question 2.5
{points: 1}
Now we create our classifier as we did last week, but to do cross-validation as well (so we can assess classifier accuracy for each k) we supply an additional argument to the train function, trControl. For that argument, we pass the name of the object we created using the trainControl function. Name the classifier choose_k.
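A sketch of the call, assuming the X_simple, Y_fruit, ks and train_control objects from Questions 2.2-2.4 exist:

    set.seed(1234)  # hypothetical seed; keep whatever seed the cell already provides
    choose_k <- train(x = X_simple,
                      y = Y_fruit,
                      method = "knn",            # k-nearest neighbours
                      tuneGrid = ks,             # the k values to try
                      trControl = train_control) # 10-fold cross-validation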
Then, to help us choose k, it is very useful to visualize the accuracies as we increase k. This will help us choose the smallest k with the largest accuracy. To do this, create a line and point plot of accuracy (y-axis) versus k (x-axis). We can get these values from the results attribute of the classifier object using the $ operator. We demonstrate this in the cell below:
Question 2.6
{points: 1}
Now that we have the accuracy and k values in a data frame, create a line and point plot of accuracy (y-axis) versus k (x-axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_plot.
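One possible plot, assuming the tidyverse is loaded; the name accuracies is hypothetical, and the column names k and Accuracy come from choose_k$results:

    accuracies <- choose_k$results  # data frame of k values and their accuracies
    choose_k_plot <- ggplot(accuracies, aes(x = k, y = Accuracy)) +
        geom_point() +
        geom_line() +
        labs(x = "Number of neighbours (k)", y = "Cross-validation accuracy")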
Question 2.7
{points: 1}
If we were judging based on the plot of accuracy versus k you created above, which k should we choose?
Assign the value of k we should choose to a variable named answer2.7.
Note: there may be multiple reasonable answers. Just pick one; any one will suffice.
Question 2.8
{points: 1}
What is the cross-validation accuracy for the optimal k?
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer2.8.
3. Training error as a diagnostic tool
Is this the best we can do with our classifier? Maybe, or maybe not. To get a hint we can use the training error as a diagnostic to tell us if we are underfitting and could afford to make our model more complex, say by including additional predictors.
Question 3.0
{points: 1}
Create another simple classifier object (same columns as the classifier above) using the train function that does not use cross-validation, with only a single value of k: 3. Name it simple.
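A sketch; using trainControl(method = "none") to skip resampling is an assumption about the intended approach, and the single-row tuneGrid pins k at 3:

    simple <- train(x = X_simple,
                    y = Y_fruit,
                    method = "knn",
                    tuneGrid = data.frame(k = 3),              # fixed k = 3
                    trControl = trainControl(method = "none")) # no cross-validation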
Question 3.1
{points: 1}
Use the simple classifier to predict labels for all the observations in the training set (X_simple). Name the predictions training_pred.
Question 3.2
{points: 1}
Use the confusionMatrix function to obtain the training accuracy. The confusionMatrix function takes two arguments: the predictions and the true class labels.
Name the output object training_results.
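A sketch covering both of the last two steps:

    # predict labels for the training observations themselves
    training_pred <- predict(simple, X_simple)
    # compare predictions against the true labels; accuracy appears in the output
    training_results <- confusionMatrix(training_pred, Y_fruit)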
Question 3.3
{points: 1}
From the output of the confusionMatrix function, what is the training accuracy? Give the answer to at least 3 decimal places.
Assign the value of the training accuracy to a variable named answer3.3.
4. Improving the classifier beyond changing k
As long as the training accuracy is not 1 (or very close to it), we may be able to further improve the classifier by adding predictors. This is not a guarantee, but something worth trying. When we do this, we also need to re-choose k, as the optimal k may change with a different number of predictors.
Question 4.0
{points: 1}
Create a new classifier called complex that uses scaled_mass, scaled_width, scaled_height and scaled_color_score as predictors. Again, try the values 1, 3, 5, 7, 9 and 11 for k and use 10-fold cross-validation.
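A sketch, where X_complex is a hypothetical name for the wider predictor data frame:

    X_complex <- training_set %>%
        select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
        data.frame()
    complex <- train(x = X_complex,
                     y = Y_fruit,
                     method = "knn",
                     tuneGrid = ks,             # same 1, 3, 5, 7, 9, 11 grid
                     trControl = train_control) # same 10-fold cross-validation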
Question 4.1
{points: 1}
Get the accuracy and k values from the classifier, and name the result k_accuracies_again. Use the scaffolding provided below. Then create a line and point plot of Accuracy (vertical axis) versus k (horizontal axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_again_plot.
Question 4.2
{points: 1}
From the plot of accuracy versus k you created above, which k should we choose for this more complex classifier?
Assign the value of k we should choose to a variable named answer4.2.
Question 4.3
{points: 1}
What is the cross-validation accuracy for the optimal k for this more complex classifier?
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer4.3.
Question 4.4
{points: 1}
Did increasing the classifier complexity improve the cross-validation accuracy?
Answer by assigning the value "True" or "False" to a variable named answer4.4.
5. Assessing test accuracy
How good is our model? Assessing the accuracy score on a test data set that was never used to choose our classifier is the only way to know. Let's do that!
Question 5.0
{points: 1}
Now that we have chosen the optimal model, re-train your classifier on the entire training data set (i.e., do not use cross-validation) with the "settings" that made it the optimal model (here, k and the number of predictors). Name your classifier object final_classifier.
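A sketch, assuming the more complex (four-predictor) model won out; best_k is a hypothetical stand-in for whichever k you chose in Question 4.2:

    final_classifier <- train(x = X_complex,
                              y = Y_fruit,
                              method = "knn",
                              tuneGrid = data.frame(k = best_k),         # best_k is hypothetical
                              trControl = trainControl(method = "none")) # no resampling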
Question 5.1
{points: 1}
Now use the final_classifier to predict the labels for the test set, and then calculate the test accuracy. Name the output from the confusionMatrix function test_results.
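A sketch; X_test and Y_test are hypothetical names, built from testing_set the same way the training predictors and labels were built from training_set:

    X_test <- testing_set %>%
        select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
        data.frame()
    Y_test <- testing_set %>%
        select(fruit_name) %>%
        unlist()
    test_pred <- predict(final_classifier, X_test)      # predict test labels
    test_results <- confusionMatrix(test_pred, Y_test)  # test accuracy lives here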
Question 5.2
{points: 1}
What is the test accuracy for the final classifier?
Assign the value of the test accuracy for the final classifier to a variable named answer5.2.