Path: blob/master/2020-fall/materials/tutorial_06/tutorial_06.ipynb
Tutorial 6: Classification
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Recognize situations where a simple classifier would be appropriate for making predictions.
Explain the k-nearest neighbour classification algorithm.
Interpret the output of a classifier.
Compute, by hand, the distance between points when there are two explanatory variables/predictors.
Describe what a training data set is and how it is used in classification.
In a dataset with two explanatory variables/predictors, perform k-nearest neighbour classification in R using `tidymodels` to predict the class of a single new observation.
Question 0.1 Multiple Choice:
{points: 1}
Before applying k-nearest neighbour to a classification task, we need to scale the data. What is the purpose of this step?
A. To help speed up the knn algorithm.
B. To convert all data observations to numeric values.
C. To ensure all data observations will be on a comparable scale and contribute equal shares to the calculation of the distance between points.
D. None of the above.
Assign your answer to an object called `answer0.1`. Make sure the correct answer is an uppercase letter. Surround your answer with quotation marks (e.g. `"F"`).
Note: we typically standardize (i.e., scale and center) the data before doing classification. For the K-nearest neighbour algorithm specifically, centering has no effect. But it doesn't hurt, and can help with other predictive data analyses, so we will do it below.
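As a quick sketch of what standardization does (with made-up numbers, not the fruit data):

```r
# Standardize two made-up variables by hand: subtract the mean
# (centering), then divide by the standard deviation (scaling).
mass  <- c(150, 160, 170, 180)  # hypothetical masses in grams
width <- c(7.0, 7.4, 7.8, 8.2)  # hypothetical widths in cm

mass_scaled  <- (mass - mean(mass)) / sd(mass)
width_scaled <- (width - mean(width)) / sd(width)

# After standardizing, each variable has mean 0 and standard
# deviation 1, so both contribute comparably to any distance.
mean(mass_scaled)  # 0
sd(mass_scaled)    # 1
```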
1. Fruit Data Example
In the agricultural industry, cleaning, sorting, grading, and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size and shape, attributes which help determine the quality of the food. Sorting can be done by humans, but it is tedious and time-consuming. Automatic sorting could save time and money. Images of the food products are captured and analysed to determine visual characteristics.
The dataset contains observations of fruit described with four features: 1) mass, 2) width, 3) height, and 4) color score.
Question 1.0
{points: 1}
Load the file, `fruit_data.csv`, into your notebook. `mutate()` the `fruit_name` column such that it is a factor, using the `as_factor()` function.
Assign your data to an object called `fruit_data`.
Let's take a look at the first six observations in the fruit dataset. Run the cell below.
Question 1.0.1
{points: 1}
Which of the columns are categorical?
A. Fruit label, width, fruit subtype
B. Fruit name, color score, height
C. Fruit label, fruit subtype, fruit name
D. Color score, mass, width
Assign your answer to an object called `answer1.0.1`. Make sure the correct answer is an uppercase letter. Remember to surround your answer with quotation marks (e.g. `"E"`).
Run the cell below, and find the nearest neighbour based on mass and width to the first observation just by looking at the scatterplot (the first observation has been circled for you).
Question 1.1 Multiple Choice:
{points: 1}
Based on the graph generated, what is the `fruit_name` of the closest data point to the one circled?
A. apple
B. lemon
C. mandarin
D. orange
Assign your answer to an object called `answer1.1`. Make sure the correct answer is an uppercase letter. Surround your answer with quotation marks (e.g. `"F"`).
Question 1.2
{points: 1}
Using mass and width, calculate the distance between the first and the second observation. We provide scaffolding to get you started.
Assign your answer to an object called `fruit_dist_2`.
Question 1.3
{points: 1}
Calculate the distance between the first and the 44th observation in the fruit dataset using the mass and width variables.
Assign your answer to an object called `fruit_dist_44`.
Let's circle these three observations on the plot from earlier.
What do you notice about your answers from Questions 1.2 and 1.3? Is it what you would expect given the scatterplot above? Why or why not? Discuss with your neighbour.
Hint: Look at where the observations are on the scatterplot in the cell above this question. What might happen if we changed grams into kilograms to measure the mass?
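To see why units matter, here is a small base-R sketch using made-up mass and width values (not the actual fruit observations):

```r
# Euclidean distance between two points (hypothetical values).
p1 <- c(mass = 192, width = 8.4)  # grams and cm
p2 <- c(mass = 180, width = 8.0)

dist_grams <- sqrt(sum((p1 - p2)^2))
dist_grams  # about 12.01: the 12 g mass difference dominates

# Measure the same masses in kilograms and the pair looks close:
p1_kg <- c(mass = 0.192, width = 8.4)
p2_kg <- c(mass = 0.180, width = 8.0)

dist_kg <- sqrt(sum((p1_kg - p2_kg)^2))
dist_kg     # about 0.40: now the width difference dominates
```

The nearest neighbour of a point can change entirely just by switching units, which is exactly why we standardize first.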
Question 1.4
{points: 1}
The distance between the first and second observation is 12.01, and the distance between the first and 44th observation is 2.33. By the formula, observations 1 and 44 are closer; however, on the scatterplot the first observation appears closer to the second observation than to the 44th.
Which of the following statements is correct?
A. A difference of 12 g in mass between observation 1 and 2 is large compared to a difference of 1.2 cm in width between observation 1 and 44. Consequently, mass will drive the classification results, and width will have less of an effect.
B. If we measured mass in kilograms, then we’d get different nearest neighbours.
C. We should standardize the data so that all variables will be on a comparable scale.
D. All of the above.
Assign your answer to an object called `answer1.4`. Make sure the correct answer is an uppercase letter. Surround your answer with quotation marks (e.g. `"F"`).
Question 1.5
{points: 1}
Let's create a `tidymodels` recipe to standardize (i.e., center and scale) all of the variables in the fruit dataset. Centering will make sure that every variable has a mean of 0, and scaling will make sure that every variable has a standard deviation of 1. We will use the `step_scale` and `step_center` preprocessing steps in the recipe. Then `bake` the recipe so that we can examine the output.
Specify your recipe with class variable `fruit_name` and predictors `mass`, `width`, `height`, and `color_score`.
Name the recipe `fruit_data_recipe`, and name the preprocessed data `fruit_data_scaled`.
Question 1.6
{points: 1}
Let's repeat Questions 1.2 and 1.3 with the scaled variables:
calculate the distance between observations 1 and 2 using the scaled mass and width variables
calculate the distance between observations 1 and 44 using the scaled mass and width variables
After you do this, think about how these distances compare to the distances you computed in Questions 1.2 and 1.3 for the same points.
Assign your answers to objects called `distance_2` and `distance_44`, respectively.
Question 1.7
{points: 1}
Make a scatterplot with scaled mass on the horizontal axis and scaled color score on the vertical axis. Color the points by fruit name.
Assign your plot to an object called `fruit_plot`. Make sure to do all the things needed to make an effective visualization.
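A generic `ggplot2` scatterplot pattern, on toy data with placeholder labels (all names here are illustrative):

```r
library(tidyverse)

# Toy data standing in for the fruit data.
toy_data <- tibble(x = c(-1, 0, 1, 2),
                   y = c(0.5, -0.2, 1.1, 0.3),
                   group = factor(c("a", "a", "b", "b")))

# Map x and y to position and the categorical variable to colour,
# and give every aesthetic a human-readable label.
toy_plot <- ggplot(toy_data, aes(x = x, y = y, colour = group)) +
  geom_point() +
  labs(x = "Scaled mass", y = "Scaled color score", colour = "Group")
```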
Question 1.8 Multiple Choice:
{points: 1}
Suppose we have a new observation in the fruit dataset with scaled mass 0.5 and scaled color score 0.5.
Just by looking at the scatterplot, how would you classify this observation using K-nearest neighbours with K = 3?
A. Apple
B. Orange
C. Mandarin
D. Lemon
Assign your answer to an object called `answer1.8`. Make sure your answer is an uppercase letter. Remember to surround your answer with quotation marks (e.g. `"F"`).
Question 1.9
{points: 1}
Now, let's use the `tidymodels` package in R to predict `fruit_name` for another new observation. The new observation we are interested in has mass 150 g and color score 0.73.
First, create the K-nearest neighbour model specification. Specify that we want neighbors and that we want to use the straight-line distance. Name this model specification `knn_spec`.
Then create a new recipe named `fruit_data_recipe_2` that centers and scales the predictors, but only uses `mass` and `color_score` as predictors.
Combine this recipe with your model specification in a `workflow`, and fit to the `fruit_data` dataset.
Name the fitted model `fruit_fit`.
Question 1.10
{points: 1}
Create a new tibble where `mass = 150` and `color_score = 0.73`, and call it `new_fruit`. Then pass `fruit_fit` and `new_fruit` to the `predict` function to predict the class for the new fruit observation. Save your prediction to an object named `fruit_predicted`.
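Taken together, the general tidymodels pattern for Questions 1.9 and 1.10 looks roughly like this (toy data and an illustrative K, not the graded answer):

```r
library(tidymodels)

# Toy training data: two well-separated classes.
toy_data <- tibble(class = factor(c("a", "a", "a", "b", "b", "b")),
                   x = c(1, 2, 3, 8, 9, 10),
                   y = c(1, 2, 3, 8, 9, 10))

# Model specification: K-nearest neighbours with straight-line
# (Euclidean) distance, i.e. a "rectangular" weight function.
toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 3) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Recipe: standardize the predictors.
toy_recipe <- recipe(class ~ x + y, data = toy_data) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# Workflow: recipe + model specification, fit on the training data.
toy_fit <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(toy_spec) %>%
  fit(data = toy_data)

# Predict the class of a new observation.
new_obs <- tibble(x = 2, y = 2)
predict(toy_fit, new_obs)
```

Because the recipe is inside the workflow, `predict` takes the new observation on the original (unscaled) measurement scale and standardizes it for us.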
Question 1.11 Multiple Choice:
{points: 1}
If we plot the original data considering the prediction given by K-nearest neighbours above, it appears that the classification model did a decent job of predicting this new fruit observation. The black point below looks like it could be an orange or a lemon. Which of the following might we want to do to help with tricky prediction cases such as this?
A. Zoom in on the black point and see which point is next closest to it, to determine whether it is an orange or a lemon
B. Visualize the data in a 3D plot
C. Consider other useful additional predictors/explanatory variables
D. None of the above
Assign your answer to an object called `answer1.11`. Make sure your answer is a letter, in uppercase, and is surrounded by quotation marks (e.g. `"F"`).
You can use the code below to visualize the observation whose label we just tried to predict.
Question 1.12
{points: 1}
Now do K-nearest neighbours classification again with the same data set, same K, and same new observation. However, this time, let's use all the columns in the dataset as predictors (except for the categorical `fruit_label` and `fruit_subtype` variables).
We have provided the `new_fruit_all` dataframe below, which encodes the predictors for our new observation. Your job is to use K-nearest neighbours to predict the class of this point. You can reuse the model specification you created earlier.
Assign your answer (the output of `predict`) to an object called `fruit_all_predicted`.
Question 1.13 Multiple Choice:
{points: 1}
Compare the predictions from Questions 1.10 and 1.12. Why did the prediction change?
A. A different K-nearest neighbour model specification was utilized
B. New predictors (`height` and `width`) were added to the training data set, which helped to further differentiate the fruits (e.g., lemons are taller and skinnier than oranges, which are shorter and fatter)
C. The values for `color_score` and `mass` were changed, which helped to differentiate oranges from lemons
D. None of the above
Assign your answer to an object called `answer1.13`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).
2. Wheat Seed Dataset
X-ray images can be used to analyze and sort seeds. In this data set, we have 7 measurements from x-ray images from 3 varieties of wheat seeds (Kama, Rosa and Canadian).
Question 2.0
{points: 1}
Let's use `tidymodels` with this data to perform K-nearest neighbours classification of the wheat variety of a new seed, given the observed measurements (from an x-ray image) shown below. Specify that we want neighbors to perform the classification.
Seven measurements were taken for each wheat kernel:
area A,
perimeter P,
compactness C = 4πA/P²,
length of kernel,
width of kernel,
asymmetry coefficient,
length of kernel groove.
The data set is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt. The last column in the data set is the variety label. The mapping for the numbers to varieties is listed below:
1 == Kama
2 == Rosa
3 == Canadian
Assign your answer to an object called `seed_predict`.
Hints:
`colnames()` can be used to specify the column names of a data frame.
The wheat variety column appears numerical, but you want it to be treated as categorical for this analysis, so `as_factor()` might be helpful.
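One way those hints might come together when reading the seeds file (a sketch; the column names here are our own choices, not prescribed):

```r
library(tidyverse)

# The seeds file is whitespace-separated with no header row, so we
# read it without column names and then supply our own.
seed_data <- read_table2("seeds_dataset.txt", col_names = FALSE)
colnames(seed_data) <- c("area", "perimeter", "compactness",
                         "kernel_length", "kernel_width",
                         "asymmetry_coefficient", "groove_length",
                         "variety")

# The variety column is coded 1/2/3 but is really categorical,
# so convert it to a factor before classifying.
seed_data <- seed_data %>%
  mutate(variety = as_factor(variety))
```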
Question 2.1 Multiple Choice:
{points: 1}
What is the classification of the `new_seed` observation?
A. Kama
B. Rosa
C. Canadian
Assign your answer to an object called `answer2.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).