Worksheet 7 (fix) - Classification (Part II)
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Describe what a test data set is and how it is used in classification.
Using R, evaluate classification accuracy using a test data set and appropriate metrics.
Using R, execute cross-validation to choose the number of neighbours.
Identify when it is necessary to scale variables before classification, and do this using R.
In a dataset with > 2 attributes, perform k-nearest neighbour classification in R using the caret package to predict the class of a test dataset.
Describe advantages and disadvantages of the k-nearest neighbour classification algorithm.
This worksheet covers parts of Chapter 7 of the online textbook. You should read this chapter before attempting the worksheet.
Question 1 Multiple Choice:
Before applying k-nearest neighbour to a classification task, we need to scale the data. What is the purpose of this step?
A. To help improve the processing power of the knn algorithm.
B. To convert all data observations to numeric values.
C. To ensure all data observations will be on a comparable scale and contribute equal shares to the calculation of the distance between points.
D. None of the above.
Assign your answer to an object called answer1.
1. Fruit Data Example (Part II)
Load the file, fruit_data.csv, into your notebook. mutate() the fruit_name column such that it is a factor. Assign your data to an object called fruit_data.
Let's take a look at the first six observations in the fruit dataset. Run the cell below.
Just by looking at the scatterplot, find the nearest neighbour to the first observation based on mass and width (the first observation has been circled for you). Run the cell below.
Question 1.1 Based on the graph generated, what is the fruit_name of the closest data point to the one circled?
A. apple
B. lemon
C. mandarin
D. orange
Assign your answer to an object called answer1.1.
Question 1.2
Using mass and width, calculate the distance between the first observation and the second observation.
We provide scaffolding to get you started.
Assign your answer to an object called fruit_dist_2.
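If you want to check your work against the scaffolding, here is a minimal sketch of one way this calculation could look, assuming fruit_data has numeric mass and width columns as shown in the data frame above:

```r
library(tidyverse)  # for slice() and the pipe

# Euclidean distance between observations 1 and 2 on mass and width.
first  <- slice(fruit_data, 1)
second <- slice(fruit_data, 2)
fruit_dist_2 <- sqrt((first$mass - second$mass)^2 +
                     (first$width - second$width)^2)
fruit_dist_2
```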
Question 1.3
Calculate the distance between the first and the 44th observation in the fruit dataset using the mass and width variables. You can see from the data frame output in the cell below that observation 44 has mass = 194 g and width = 7.2 cm.
Assign your answer to an object called fruit_dist_44.
Question 1.4
Discuss with the person sitting next to you.
i) What do you notice about the distances you just calculated in Questions 1.2 and 1.3?
(Hint: look at where the observations are on the scatterplot in the cell above this question)
ii) Is it what you would expect? Why or why not?
(Hint: what might happen if we changed grams into kilograms to measure the mass?)
When you finish your discussion, read the cell below:
The distance between the first and second observations is 12.01, and the distance between the first and 44th observations is 2.33. So by the formula, observations 1 and 44 are closer. However, if we look at the scatterplot, the first observation appears closer to the second observation than to the 44th because of the axis scales.
Because the classifier predicts class by identifying the nearest points, the scale of the variables matters. Variables on a large scale compared to variables on a small scale will have a greater effect on the distance between the observations. Here we have width (measured in cm) and mass (in grams). As far as knn is concerned, a difference of 12 g in mass between observation 1 and 2 is large compared to a difference of 1.2 cm in width between observation 1 and 44. Consequently, mass will drive the classification results, and width will have less of an effect. Hence, our distance calculation reflects that. Also, if we measured mass in kilograms, or if we measured width in meters, then we’d get different classification results. Thus we can standardize the data so that all variables will be on a comparable scale.
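As a hedged sketch of what standardization can look like in R (scale() centres a variable and divides it by its standard deviation), using the scaled_ column names that later questions refer to:

```r
# One way to standardize variables: scale() centres each column and
# divides by its standard deviation; as.numeric() drops the matrix wrapper.
fruit_data <- fruit_data %>%
  mutate(scaled_mass  = as.numeric(scale(mass)),
         scaled_width = as.numeric(scale(width)))
```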
Question 1.5
Scale all the variables of the fruit dataset and save them as columns in your data table.
Keep the original dataset name, fruit_data.
Question 1.6
Let's repeat Questions 1.2 and 1.3 with the scaled variables. Calculate the distance between observations 1 and 2 using the scaled mass and width variables. Then calculate the distance between observations 1 and 44 using the same scaled variables.
After you do this, think about how these distances compare to the distances you computed in Questions 1.2 and 1.3 for the same points.
Assign your answers to objects called distance_2 and distance_44, respectively.
Splitting the data into a training and test set
Next, we will be partitioning the data into a training (70%) and testing (30%) set using the caret package. We will put this test set away in a lock box and not touch it again until we have found the best k-nn classifier we can make using the training set.
Question 2.0
To do this we first use the createDataPartition function to get the row numbers of the data we should include in our training set. This function uses a random process, so to ensure replicable results we need to set a seed using set.seed to tell the random number generator where we'd like to start from. Name the object you create training_rows.
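A minimal sketch of what this could look like (the seed value below is arbitrary; any fixed value makes the split replicable):

```r
library(caret)

set.seed(1234)  # arbitrary seed for a replicable split
# 70% of rows, stratified by fruit_name; list = FALSE returns row numbers.
training_rows <- createDataPartition(fruit_data$fruit_name,
                                     p = 0.70, list = FALSE)
```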
Question 2.1
Next we use the slice function to get the rows from the original data frame that match the ones we have in training_rows.
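For illustration, a sketch assuming training_rows was created with list = FALSE as above (it comes back as a one-column matrix, so we flatten it first); the names fruit_train and fruit_test are assumptions:

```r
# Rows listed in training_rows form the training set; the remaining rows
# form the test set (negative indices drop rows).
fruit_train <- fruit_data %>% slice(as.vector(training_rows))
fruit_test  <- fruit_data %>% slice(-as.vector(training_rows))
```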
Using cross-validation to choose k
Let's start with a simple classifier, one that uses only scaled_color_score and scaled_mass as predictors. fruit_name should be the class label. As we build this simple classifier from the training set, let's use cross-validation to choose the best k.
Question 2.2
We now need to take our training data and specify which columns are going to be the predictors and which are going to be the class labels. Name the predictors X_simple and the class labels Y_fruit.
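One way this could look, assuming the training data frame is called fruit_train as in the sketch above (caret's train function expects the predictors as a data frame and the labels as a vector):

```r
# Predictors as a data frame; class labels as a (factor) vector.
X_simple <- fruit_train %>%
  select(scaled_color_score, scaled_mass) %>%
  data.frame()
Y_fruit <- fruit_train %>%
  select(fruit_name) %>%
  unlist()
```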
Question 2.3
Next, we need to create a data frame, named ks, that contains a single column, named k, that holds the k values we'd like to try out. Let's try the values 1, 3, 5, 7, 9 and 11.
Hint - the c function is useful for creating vectors, which are what data frame columns are.
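A minimal sketch of that data frame:

```r
# Candidate numbers of neighbours to evaluate with cross-validation.
ks <- data.frame(k = c(1, 3, 5, 7, 9, 11))
```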
Question 2.4
Next we use the trainControl function. This function passes additional information to the train function we use to create our classifier. Here we would like to set the arguments to method = "cv" (for cross-validation) and number = 10 (for 10-fold cross-validation). Name this object train_control.
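A sketch of that call:

```r
# 10-fold cross-validation settings, passed to train() via trControl below.
train_control <- trainControl(method = "cv", number = 10)
```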
Question 2.5
Now we create our classifier as we did last week, but to do cross-validation as well (so we can assess classifier accuracy based on each k) we supply an additional argument to the train function, trControl. For that argument we pass it the name of the object we created using the trainControl function. Name the classifier choose_k.
Then, to help us choose k, it is very useful to visualize the accuracies as we increase k. This will help us choose the smallest k with the biggest accuracy. To do this, create a line and point plot of accuracy (y-axis) versus k (x-axis). We can get these values from the results attribute of the classifier object using the $ operator. We demonstrate this in the cell below:
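A hedged sketch of this step, assuming the X_simple, Y_fruit, ks and train_control objects created above (the seed value is again arbitrary):

```r
set.seed(1234)  # arbitrary seed; cross-validation folds are chosen randomly
choose_k <- train(x = X_simple, y = Y_fruit,
                  method = "knn",          # k-nearest neighbours
                  tuneGrid = ks,           # the k values to try
                  trControl = train_control)

# The per-k cross-validation accuracies live in the results attribute.
accuracies <- choose_k$results
accuracies
```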
Question 2.6
Now that we have the accuracy and k values in a data frame, create a line and point plot of accuracy (y-axis) versus k (x-axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_plot.
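A minimal ggplot2 sketch, assuming the accuracies data frame from above (caret capitalizes the Accuracy column in its results):

```r
choose_k_plot <- ggplot(accuracies, aes(x = k, y = Accuracy)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbours (k)",
       y = "Cross-validation accuracy")
choose_k_plot
```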
Question 2.7
From the plot of accuracy versus k you created above, which k should we choose?
Assign the value of k we should choose to a variable named answer2.7.
Question 2.8
What is the cross-validation accuracy for the optimal k? Give at least 3 decimal places for your answer.
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer2.8.
Training error as a diagnostic tool
Is this the best we can do with our classifier? Maybe, or maybe not. To get a hint we can use the training error as a diagnostic to tell us if we are underfitting and could afford to make our model more complex, say by including additional predictors.
Question 3.0
Create another simple classifier object (same columns as the classifier above) using the train function that does not use cross-validation, and only a single k value of 5. Name it simple.
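One hedged reading of this, with a one-row tuneGrid fixing k = 5 so there is nothing to tune (note that, unless told otherwise, caret still runs its default bootstrap resampling internally; no k is being chosen by it here):

```r
# Fit a k-nn classifier with k fixed at 5; the single-row tuneGrid
# means there is no grid of k values to cross-validate over.
simple <- train(x = X_simple, y = Y_fruit,
                method = "knn",
                tuneGrid = data.frame(k = 5))
```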
Question 3.1
Use the simple classifier to predict labels for all the observations in the training set (X_simple). Name the predictions training_pred.
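A sketch of that prediction step:

```r
# Predict the class of every training observation with the k = 5 model.
training_pred <- predict(simple, X_simple)
```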
Question 3.2
Use the confusionMatrix function to obtain the training accuracy. The confusionMatrix function takes two arguments, the predictions and the true class labels.
Name the output object training_results.
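A sketch, assuming the objects from the previous questions:

```r
# Compare the predicted labels against the true training labels.
training_results <- confusionMatrix(training_pred, Y_fruit)
training_results
```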
Question 3.3
From the output of the confusionMatrix function, what is the training accuracy? Give the answer to at least 3 decimal places.
Assign the value of the training accuracy to a variable named answer3.2.
Improving the classifier beyond changing k
As long as the training accuracy is not 1 (or very close to it) we may be able to further improve the classifier by adding predictors. This is not a guarantee, but something worth trying. When we do this, we also need to re-choose k, as the optimal k may change with a different number of predictors.
Question 4.0
Create a new classifier called complex that uses scaled_mass, scaled_width, scaled_height and scaled_color_score as predictors. Again, try the values 1, 3, 5, 7, 9 and 11 for k and use 10-fold cross-validation.
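A hedged sketch, reusing ks and train_control from before and the fruit_train data frame from the earlier splitting sketch; the X_complex name is an assumption:

```r
# Predictors now include all four scaled variables.
X_complex <- fruit_train %>%
  select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
  data.frame()

set.seed(1234)  # arbitrary seed for reproducible folds
complex <- train(x = X_complex, y = Y_fruit,
                 method = "knn",
                 tuneGrid = ks,
                 trControl = train_control)
```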
Question 4.1
Get the accuracy and k values from the classifier and create a line and point plot of accuracy (y-axis) versus k (x-axis). Remember to do all the things for making your visualization effective. Name your plot object choose_k_again_plot.
Question 4.2
From the plot of accuracy versus k you created above, which k should we choose for this more complex classifier?
Assign the value of k we should choose to a variable named answer4.2.
Question 4.3
What is the cross-validation accuracy for the optimal k for this more complex classifier?
Assign the value of the cross-validation accuracy for the optimal k to a variable named answer4.3.
Question 4.4
Did increasing the classifier complexity improve the cross-validation accuracy?
Answer by assigning the value of "True" or "False" to a variable named answer4.4.
Assessing test accuracy
How good is our model? Assessing the accuracy score on a test data set that was never used to choose our classifier is the only way to know. Let's do that!
Question 5.0
Now that we have chosen the optimal model, re-train your classifier on the entire training data set (i.e., do not use cross-validation) with the "settings" that made it the optimal model (here, k and the number of predictors). Name your classifier object final_classifier.
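A hedged sketch, supposing the more complex model won out; best_k below is a placeholder for whatever k Question 4.2 gave, not a prescribed value:

```r
best_k <- 5  # placeholder: substitute the k chosen in Question 4.2

final_classifier <- train(x = X_complex, y = Y_fruit,
                          method = "knn",
                          tuneGrid = data.frame(k = best_k))
```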
Question 5.1
Now use the final_classifier to predict the labels for the test set, and then calculate the test accuracy. Name the output from the confusionMatrix function test_results.
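A hedged sketch, assuming fruit_test from the earlier splitting sketch and the same four scaled predictors; the X_complex_test and Y_fruit_test names are assumptions:

```r
# Build the test predictors and labels the same way as for training.
X_complex_test <- fruit_test %>%
  select(scaled_mass, scaled_width, scaled_height, scaled_color_score) %>%
  data.frame()
Y_fruit_test <- fruit_test %>%
  select(fruit_name) %>%
  unlist()

test_pred <- predict(final_classifier, X_complex_test)
test_results <- confusionMatrix(test_pred, Y_fruit_test)
test_results
```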
Question 5.2
What is the test accuracy for the final classifier?
Assign the value of the test accuracy for the final classifier to a variable named answer5.2.