Path: blob/master/2019-fall/materials/worksheet_07/worksheet_07.ipynb
Worksheet 7 - Classification (Part II)
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Describe what a test data set is and how it is used in classification.
Using R, evaluate classification accuracy using a test data set and appropriate metrics.
Using R, execute cross-validation to choose the number of neighbours.
Identify when it is necessary to scale variables before classification and do this using R.
In a dataset with > 2 attributes, perform k-nearest neighbour classification in R using the caret package to predict the class of a test dataset.
Describe advantages and disadvantages of the k-nearest neighbour classification algorithm.
This worksheet covers parts of Chapter 7 of the online textbook. You should read this chapter before attempting the worksheet.
Question 0.1 Multiple Choice:
{points: 1}
Before applying k-nearest neighbour to a classification task, we need to scale the data. What is the purpose of this step?
A. To help speed up the knn algorithm.
B. To convert all data observations to numeric values.
C. To ensure all data observations will be on a comparable scale and contribute equal shares to the calculation of the distance between points.
D. None of the above.
Assign your answer to an object called answer1.
Note: we typically standardize (i.e., scale and center) the data before doing classification. For the K-nearest neighbour algorithm specifically, centering has no effect. But it doesn't hurt, and can help with other predictive data analyses, so we will do it below.
1. Fruit Data Example - (Part II)
Question 1.0
{points: 1}
Load the file fruit_data.csv into your notebook. mutate() the fruit_name column so that it is a factor. Assign your data to an object called fruit_data.
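If you are unsure where to start, here is a minimal sketch, assuming the tidyverse is loaded and fruit_data.csv sits in the same directory as the notebook:

    # read the CSV, then convert fruit_name from character to factor
    fruit_data <- read_csv("fruit_data.csv") %>%
        mutate(fruit_name = as_factor(fruit_name))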
Let's take a look at the first six observations in the fruit dataset. Run the cell below.
Run the cell below, and find the nearest neighbour to the first observation based on mass and width, just by looking at the scatterplot (the first observation has been circled for you).
Question 1.1 Multiple Choice:
{points: 1}
Based on the graph generated, what is the fruit_name of the closest data point to the one circled?
A. apple
B. lemon
C. mandarin
D. orange
Assign your answer to an object called answer1.1.
Question 1.2
{points: 1}
Using mass and width, calculate the distance between the first observation and the second observation.
We provide scaffolding to get you started.
Assign your answer to an object called fruit_dist_2.
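If you get stuck, here is one way to compute a two-dimensional Euclidean distance by hand (pulling the rows out with slice is just one option):

    # extract the two observations, then take the square root of the
    # summed squared differences in mass and width
    obs_1 <- fruit_data %>% slice(1)
    obs_2 <- fruit_data %>% slice(2)
    fruit_dist_2 <- sqrt((obs_1$mass - obs_2$mass)^2 +
                         (obs_1$width - obs_2$width)^2)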
Question 1.3
{points: 1}
Calculate the distance between the first and the 44th observation in the fruit dataset using the mass and width variables.
You can see from the data frame output in the cell below that observation 44 has mass = 194 g and width = 7.2 cm.
Assign your answer to an object called fruit_dist_44.
Question 1.4
What do you notice about the answers from Questions 1.2 and 1.3 that you just calculated? Are they what you would expect given the scatterplot above? Why or why not?
Hint: Look at where the observations are on the scatterplot in the cell above this question, and think about what might happen if we changed grams into kilograms to measure the mass.
YOUR ANSWER HERE
Question 1.5
{points: 1}
Scale all the variables of the fruit dataset and save them as columns in your data table.
Save the dataset object using the same name as before, i.e., fruit_data. Make sure to name the new columns scaled_* where * is the old column name.
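A sketch of the pattern for a single column; repeat it for each variable. Note that scale() returns a one-column matrix, so as.numeric() is used here to keep the new column a plain vector:

    # standardize mass: subtract the mean, divide by the standard deviation
    fruit_data <- fruit_data %>%
        mutate(scaled_mass = as.numeric(scale(mass)))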
Question 1.6
{points: 1}
Let's repeat Questions 1.2 and 1.3 with the scaled variables:
calculate the distance with the scaled mass and width variables between observations 1 and 2
calculate the distances with the scaled mass and width variables between observations 1 and 44
After you do this, think about how these distances compare to the distances you computed in Questions 1.2 and 1.3 for the same points.
Assign your answers to objects called distance_2 and distance_44, respectively.
Randomness and Setting Seeds
The remaining material in the worksheet uses functions from the caret library, many of which make use of randomness (for many purposes: resolving ties in the nearest neighbour vote, splitting the data, balancing, etc.). In order to ensure that the steps in the worksheet are reproducible, we need to set a seed, i.e., a numerical "starting value," which determines the sequence of random numbers R will generate.
Below, in many cells, we have included a call to set.seed. Do not remove these lines of code; they are necessary to make sure the autotesting code functions properly.
Optional extra info for those who are curious: the reason we have set.seed in so many places is that Jupyter notebooks are organized into cells that can be run out of order. Since things can be run out of order, the exact sequence of random values that is used in each cell is hard to determine, which makes autotesting really difficult. We had two options: either enforce that you only ever run the code by hitting "Restart & Run All" to ensure that we get the same values of randomness each time, or put set.seed in a lot of places (we chose the latter). One drawback of calling set.seed everywhere is that the numbers that will be generated won't really be random. For the purposes of teaching and learning, that is fine here. But in a typical data analysis, you should really only call set.seed once at the beginning of the analysis, so that your random numbers are actually reasonably random.
2. Splitting the data into a training and test set
Next, we will be partitioning the data into a training (75%) and testing (25%) set using the caret package. We will put this test set away in a lock box and not touch it again until we have found the best k-nn classifier we can make using the training set.
Question 2.0
{points: 1}
To do this we first use the createDataPartition function to get the row numbers of the data we should include in our training set. Name the object you create training_rows.
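A minimal sketch, assuming the split should be stratified by the class label fruit_name:

    set.seed(1234)  # hypothetical seed; keep whatever seed the cell already provides
    # select 75% of the rows, stratified by fruit_name;
    # list = FALSE returns a matrix of row numbers rather than a list
    training_rows <- createDataPartition(fruit_data$fruit_name, p = 0.75, list = FALSE)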
Question 2.1
{points: 1}
Next we use the slice function to get the rows from the original data frame that match the ones we have in training_rows. The goal is to create one object for the training data (training_set) and one for the testing data (testing_set) using the rows that we have designated via createDataPartition.
Use the scaffolding provided. Name the two subsets of data training_set and testing_set.
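One way to fill in that scaffolding (flattening the matrix first, since slice expects a plain integer vector; negative indices drop the training rows, leaving the test set):

    row_ids <- as.vector(training_rows)             # flatten the one-column matrix
    training_set <- fruit_data %>% slice(row_ids)   # rows chosen for training
    testing_set  <- fruit_data %>% slice(-row_ids)  # everything else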
Using cross-validation to choose k
Let's start with a simple classifier, one that uses only scaled_color_score and scaled_mass as predictors. fruit_name should be the class label. As we build this simple classifier from the training set, let's use cross-validation to choose the best k.
Question 2.2
{points: 1}
We now need to take our training data and specify which columns are going to be the predictors and which are going to be the class labels. Name the predictors X_simple and the class labels Y_fruit.
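A sketch, assuming train() expects the predictors as a data frame and the labels as a bare factor vector:

    # predictors: just the two scaled columns, as a data frame
    X_simple <- training_set %>%
        select(scaled_color_score, scaled_mass) %>%
        data.frame()
    # class labels: unlist() turns the one-column selection into a factor vector
    Y_fruit <- training_set %>%
        select(fruit_name) %>%
        unlist()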
Question 2.3
{points: 1}
Next, we need to create a data frame, named ks, that contains a single column, named k, that holds the k-values we'd like to try out. Let's try the values 1, 3, 5, 7, 9 and 11.
Hint: the c function is useful for creating vectors, and a data frame column is just a vector.
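One way to build it:

    # one row per candidate value of k
    ks <- data.frame(k = c(1, 3, 5, 7, 9, 11))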
Question 2.4
{points: 1}
Next, we use the trainControl function. This function passes additional information to the train function we use to create our classifier. Here we would like to set the arguments method = "cv" (for cross-validation) and number = 10 (for 10-fold cross-validation). Name this object train_control.
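A sketch of the call:

    # 10-fold cross-validation
    train_control <- trainControl(method = "cv", number = 10)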
Question 2.5
{points: 1}
Now we create our classifier as we did last week, but to do cross-validation as well (so we can assess classifier accuracy for each k) we supply an additional argument to the train function, trControl. For that argument, we pass the name of the object we created using the trainControl function. Name the classifier choose_k.
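A sketch of the call, assuming the X_simple, Y_fruit, ks and train_control objects from Questions 2.2-2.4 exist:

    set.seed(1234)  # hypothetical seed; keep whatever seed the cell already provides
    choose_k <- train(x = X_simple,
                      y = Y_fruit,
                      method = "knn",            # k-nearest neighbours
                      tuneGrid = ks,             # the k values to try
                      trControl = train_control) # 10-fold cross-validation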
Then, to help us choose k, it is very useful to visualize the accuracies as we increase k. This will help us choose the smallest k with the largest accuracy. To do this, create a line and point plot of accuracy (y-axis) versus k (x-axis). We can get these values from the results attribute of the classifier object using the $ operator. We demonstrate this in the cell below:
Question 2.6
{points: 1}
Now that we have the accuracy and k values in a data frame, create a line and point plot of accuracy (y-axis) versus k (x-axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_plot.
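One possible plot, assuming the tidyverse is loaded; the name accuracies is hypothetical, and the column names k and Accuracy come from choose_k$results:

    accuracies <- choose_k$results  # data frame of k values and their accuracies
    choose_k_plot <- ggplot(accuracies, aes(x = k, y = Accuracy)) +
        geom_point() +
        geom_line() +
        labs(x = "Number of neighbours (k)", y = "Cross-validation accuracy")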
Question 2.7
{points: 1}
If we were judging based on the plot of accuracy versus k you created above, which k should we choose?
Assign the value of k we should choose to a variable named answer2.7.
Note: there may be multiple reasonable answers. Just pick one; any one will suffice.
Question 2.8
{points: 1}
What is the cross-validation accuracy for the optimal k?
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer2.8.
3. Training error as a diagnostic tool
Is this the best we can do with our classifier? Maybe, or maybe not. To get a hint we can use the training error as a diagnostic to tell us if we are underfitting and could afford to make our model more complex, say by including additional predictors.
Question 3.0
{points: 1}
Create another simple classifier object (same columns as the classifier above) using the train function that does not use cross-validation, with only a single value of k: 3. Name it simple.
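A sketch; using trainControl(method = "none") to skip resampling is an assumption about the intended approach, and the single-row tuneGrid pins k at 3:

    simple <- train(x = X_simple,
                    y = Y_fruit,
                    method = "knn",
                    tuneGrid = data.frame(k = 3),              # fixed k = 3
                    trControl = trainControl(method = "none")) # no cross-validation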
Question 3.1
{points: 1}
Use the simple classifier to predict labels for all the observations in the training set (X_simple). Name the predictions training_pred.
Question 3.2
{points: 1}
Use the confusionMatrix function to obtain the training accuracy. The confusionMatrix function takes two arguments: the predictions and the true class labels.
Name the output object training_results.
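A sketch covering both of the last two steps:

    # predict labels for the training observations themselves
    training_pred <- predict(simple, X_simple)
    # compare predictions against the true labels; accuracy appears in the output
    training_results <- confusionMatrix(training_pred, Y_fruit)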
Question 3.3
{points: 1}
From the output of the confusionMatrix function, what is the training accuracy? Give the answer to at least 3 decimal places.
Assign the value of the training accuracy to a variable named answer3.3.
4. Improving the classifier beyond changing k
As long as the training accuracy is not 1 (or very close to it), we may be able to further improve the classifier by adding predictors. This is not a guarantee, but something worth trying. When we do this, we also need to re-choose k, as the optimal k may change with a different number of predictors.
Question 4.0
{points: 1}
Create a new classifier called complex that uses scaled_mass, scaled_width, scaled_height and scaled_color_score as predictors. Again, try the values 1, 3, 5, 7, 9 and 11 for k and use 10-fold cross-validation.
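A sketch, where X_complex is a hypothetical name for the wider predictor data frame:

    X_complex <- training_set %>%
        select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
        data.frame()
    complex <- train(x = X_complex,
                     y = Y_fruit,
                     method = "knn",
                     tuneGrid = ks,             # same 1, 3, 5, 7, 9, 11 grid
                     trControl = train_control) # same 10-fold cross-validation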
Question 4.1
{points: 1}
Get the accuracy and k values from the classifier, and name the result k_accuracies_again. Use the scaffolding provided below. Then create a line and point plot of Accuracy (vertical axis) versus k (horizontal axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_again_plot.
Question 4.2
{points: 1}
From the plot of accuracy versus k you created above, which k should we choose for this more complex classifier?
Assign the value of k we should choose to a variable named answer4.2.
Question 4.3
{points: 1}
What is the cross-validation accuracy for the optimal k for this more complex classifier?
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer4.3.
Question 4.4
{points: 1}
Did increasing the classifier complexity improve the cross-validation accuracy?
Answer by assigning the value "True" or "False" to a variable named answer4.4.
5. Assessing test accuracy
How good is our model? Assessing the accuracy score on a test data set that was never used to choose our classifier is the only way to know. Let's do that!
Question 5.0
{points: 1}
Now that we have chosen the optimal model, re-train your classifier on the entire training data set (i.e., do not use cross-validation) with the "settings" that made it the optimal model (here, k and the number of predictors). Name your classifier object final_classifier.
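A sketch, assuming the more complex (four-predictor) model won out; best_k is a hypothetical stand-in for whichever k you chose in Question 4.2:

    final_classifier <- train(x = X_complex,
                              y = Y_fruit,
                              method = "knn",
                              tuneGrid = data.frame(k = best_k),         # best_k is hypothetical
                              trControl = trainControl(method = "none")) # no resampling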
Question 5.1
{points: 1}
Now use the final_classifier to predict the labels for the test set, and then calculate the test accuracy. Name the output from the confusionMatrix function test_results.
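A sketch; X_test and Y_test are hypothetical names, built from testing_set the same way the training predictors and labels were built from training_set:

    X_test <- testing_set %>%
        select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
        data.frame()
    Y_test <- testing_set %>%
        select(fruit_name) %>%
        unlist()
    test_pred <- predict(final_classifier, X_test)      # predict test labels
    test_results <- confusionMatrix(test_pred, Y_test)  # test accuracy lives here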
Question 5.2
{points: 1}
What is the test accuracy for the final classifier?
Assign the value of the test accuracy for the final classifier to a variable named answer5.2.