Path: blob/master/2021-summer/materials/tutorial_06/tutorial_06.ipynb
Tutorial 6: Classification
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Recognize situations where a simple classifier would be appropriate for making predictions.
Explain the k-nearest neighbour classification algorithm.
Interpret the output of a classifier.
Compute, by hand, the distance between points when there are two explanatory variables/predictors.
Describe what a training data set is and how it is used in classification.
In a dataset with two explanatory variables/predictors, perform k-nearest neighbour classification in R using tidymodels to predict the class of a single new observation.
Question 0.1 Multiple Choice:
{points: 1}
Before applying k-nearest neighbour to a classification task, we need to scale the data. What is the purpose of this step?
A. To help speed up the knn algorithm.
B. To convert all data observations to numeric values.
C. To ensure all data observations will be on a comparable scale and contribute equal shares to the calculation of the distance between points.
D. None of the above.
Assign your answer to an object called answer0.1. Make sure the correct answer is an uppercase letter. Surround your answer with quotation marks (e.g. "F").
Note: we typically standardize (i.e., scale and center) the data before doing classification. For the K-nearest neighbour algorithm specifically, centering has no effect. But it doesn't hurt, and can help with other predictive data analyses, so we will do it below.
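As a quick illustration of what standardization does, here is a minimal sketch with made-up mass values (not taken from the fruit dataset):

```r
# Minimal sketch with hypothetical values: standardizing one variable by hand.
mass <- c(150, 170, 190)                      # masses in grams (made up)
scaled_mass <- (mass - mean(mass)) / sd(mass) # center, then scale

mean(scaled_mass)  # 0: centering sets the average to zero
sd(scaled_mass)    # 1: scaling sets the standard deviation to one
```

After this step, a difference of 1 unit means the same thing for every variable, which is exactly what the distance calculation needs.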
1. Fruit Data Example
In the agricultural industry, cleaning, sorting, grading, and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size and shape, attributes that help determine the quality of the food. Sorting can be done by humans, but it is tedious and time-consuming, so automatic sorting could help save time and money. Images of the food products are captured and analysed to determine visual characteristics.
The dataset contains observations of fruit described with four features: 1) mass (in g), 2) width (in cm), 3) height (in cm), and 4) color score (on a scale from 0 to 1).
Question 1.0
{points: 1}
Load the file fruit_data.csv into your notebook. mutate() the fruit_name column such that it is a factor, using the as_factor() function.
Assign your data to an object called fruit_data.
Let's take a look at the first few observations in the fruit dataset. Run the cell below.
Question 1.0.1 Multiple Choice:
{points: 1}
Which of the columns should we treat as categorical variables?
A. Fruit label, width, fruit subtype
B. Fruit name, color score, height
C. Fruit label, fruit subtype, fruit name
D. Color score, mass, width
Assign your answer to an object called answer1.0.1. Make sure the correct answer is an uppercase letter. Remember to surround your answer with quotation marks (e.g. "E").
Run the cell below, and find the nearest neighbour based on mass and width to the first observation just by looking at the scatterplot (the first observation has been circled for you).
Question 1.1 Multiple Choice:
{points: 1}
Based on the graph generated, what is the fruit_name of the closest data point to the one circled?
A. apple
B. lemon
C. mandarin
D. orange
Assign your answer to an object called answer1.1. Make sure the correct answer is an uppercase letter. Surround your answer with quotation marks (e.g. "F").
Question 1.2
{points: 1}
Using mass and width, calculate the distance between the first and second observations. We provide scaffolding to get you started.
Assign your answer to an object called fruit_dist_2.
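If the scaffolding isn't enough, here is a minimal sketch of the straight-line (Euclidean) distance calculation. The mass and width values below are assumptions copied in by hand (chosen to be consistent with the 12.01 figure quoted in Question 1.4); substitute the actual first and second rows of fruit_data.

```r
# Straight-line (Euclidean) distance between two observations on
# mass and width. The values are assumed stand-ins, not guaranteed
# to match fruit_data exactly.
mass_1  <- 192
width_1 <- 8.4   # assumed first observation
mass_2  <- 180
width_2 <- 8.0   # assumed second observation

fruit_dist_2 <- sqrt((mass_1 - mass_2)^2 + (width_1 - width_2)^2)
fruit_dist_2  # approximately 12.01
```

Notice that the squared mass difference (144) dwarfs the squared width difference (0.16), which foreshadows the discussion in Question 1.4.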
Question 1.3
{points: 1}
Calculate the distance between the first and the 44th observations in the fruit dataset using the mass and width variables.
Assign your answer to an object called fruit_dist_44.
Let's circle these three observations on the plot from earlier.
What do you notice about the answers you just calculated in Questions 1.2 and 1.3? Is it what you would expect given the scatterplot above? Why or why not? Discuss with your neighbour.
Hint: Look at where the observations sit on the scatterplot in the cell above this question, and consider what might happen if we measured mass in kilograms instead of grams.
Question 1.4 Multiple Choice:
{points: 1}
The distance between the first and second observations is 12.01, and the distance between the first and 44th observations is 2.33. By the formula, observations 1 and 44 are closer; however, on the scatterplot the first observation appears closer to the second observation than to the 44th.
Which of the following statements is correct?
A. A difference of 12 g in mass between observation 1 and 2 is large compared to a difference of 1.2 cm in width between observation 1 and 44. Consequently, mass will drive the classification results, and width will have less of an effect.
B. If we measured mass in kilograms, then we’d get different nearest neighbours.
C. We should standardize the data so that all variables will be on a comparable scale.
D. All of the above.
Assign your answer to an object called answer1.4. Make sure the correct answer is an uppercase letter. Surround your answer with quotation marks (e.g. "F").
Question 1.5
{points: 1}
Let's create a tidymodels recipe to standardize (i.e., center and scale) all of the variables in the fruit dataset. Centering will make sure that every variable has an average of 0, and scaling will make sure that every variable has a standard deviation of 1. We will use the step_scale and step_center preprocessing steps in the recipe. Then bake the recipe so that we can examine the output.
Specify your recipe with class variable fruit_name and predictors mass, width, height, and color_score.
Name the recipe fruit_data_recipe, and name the preprocessed data fruit_data_scaled.
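The recipe-then-bake pattern can be sketched as follows. The tiny fruit_data tibble below is made up so the sketch runs on its own; in the tutorial you would use the data frame you loaded in Question 1.0 instead.

```r
library(tidymodels)

# Toy stand-in for fruit_data (hypothetical values) so this sketch
# runs on its own.
fruit_data <- tibble(
  fruit_name  = factor(c("apple", "lemon", "orange", "mandarin")),
  mass        = c(192, 120, 160, 80),
  width       = c(8.4, 6.0, 7.4, 5.8),
  height      = c(7.3, 8.2, 7.2, 4.3),
  color_score = c(0.55, 0.72, 0.77, 0.81)
)

# Recipe: fruit_name is the class variable, everything else a predictor.
fruit_data_recipe <- recipe(fruit_name ~ mass + width + height + color_score,
                            data = fruit_data) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# prep() estimates the means and standard deviations from the data;
# bake() applies them to produce the standardized data frame.
fruit_data_scaled <- fruit_data_recipe %>%
  prep() %>%
  bake(fruit_data)
```

Each predictor column of fruit_data_scaled now has mean 0 and standard deviation 1.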
Question 1.6
{points: 1}
Let's repeat Questions 1.2 and 1.3 with the scaled variables:
calculate the distance between observations 1 and 2 using the scaled mass and width variables
calculate the distance between observations 1 and 44 using the scaled mass and width variables
After you do this, think about how these distances compare to the distances you computed in Questions 1.2 and 1.3 for the same points.
Assign your answers to objects called distance_2 and distance_44, respectively.
Question 1.7
{points: 1}
Make a scatterplot of scaled mass on the horizontal axis and scaled color score on the vertical axis. Color the points by fruit name.
Assign your plot to an object called fruit_plot. Make sure to do all the things needed to make an effective visualization.
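A hedged sketch of such a plot is below. The toy fruit_data_scaled tibble is made up so the code runs on its own; in the tutorial you would pipe in the standardized data from Question 1.5 instead.

```r
library(tidyverse)

# Toy standardized data (made-up values) standing in for the real
# fruit_data_scaled from Question 1.5.
fruit_data_scaled <- tibble(
  fruit_name  = factor(c("apple", "lemon", "orange")),
  mass        = c(0.9, -1.1, 0.2),
  color_score = c(-1.0, 0.3, 0.7)
)

# Scatterplot with human-readable axis and legend labels.
fruit_plot <- fruit_data_scaled %>%
  ggplot(aes(x = mass, y = color_score, colour = fruit_name)) +
  geom_point() +
  labs(x = "Mass (standardized)",
       y = "Color score (standardized)",
       colour = "Fruit name")

fruit_plot
```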
Question 1.8
{points: 3}
Suppose we have a new observation in the fruit dataset with scaled mass 0.5 and scaled color score 0.5.
Just by looking at the scatterplot, how would you classify this observation using K-nearest neighbours if you use K = 3? Explain how you arrived at your answer.
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
Question 1.9
{points: 1}
Now, let's use the tidymodels package to predict fruit_name for another new observation. The new observation we are interested in has mass 150 g and color score 0.73.
First, create the K-nearest neighbour model specification. Specify the number of neighbors we want, set the engine to "kknn", and state that we want to use the straight-line distance. Name this model specification knn_spec.
Then create a new recipe named fruit_data_recipe_2 that centers and scales the predictors, but only uses mass and color_score as predictors.
Combine the model specification and the new recipe in a workflow, and fit it to the fruit_data dataset.
Name the fitted model fruit_fit.
Question 1.10
{points: 1}
Create a new tibble where mass = 150 and color_score = 0.73 and call it new_fruit. Then, pass fruit_fit and new_fruit to the predict function to predict the class for the new fruit observation. Save your prediction to an object named fruit_predicted.
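The spec–recipe–workflow–fit–predict pipeline from Questions 1.9 and 1.10 can be sketched end to end as follows. The toy fruit_data below is made up so the code runs on its own, and K = 5 is a placeholder for whichever number of neighbours the question intends; weight_func = "rectangular" gives unweighted votes, and the kknn engine's default Minkowski power of 2 corresponds to the straight-line distance.

```r
library(tidymodels)

# Toy stand-in for fruit_data (hypothetical values).
fruit_data <- tibble(
  fruit_name  = factor(c("apple", "apple", "lemon", "lemon",
                         "orange", "orange", "mandarin", "mandarin")),
  mass        = c(192, 180, 116, 120, 160, 154, 80, 84),
  color_score = c(0.55, 0.59, 0.74, 0.72, 0.77, 0.79, 0.81, 0.80)
)

# Model specification: K = 5 is a placeholder value.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Recipe using only mass and color_score as predictors.
fruit_data_recipe_2 <- recipe(fruit_name ~ mass + color_score,
                              data = fruit_data) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())

# Workflow: bundle the recipe and model, then fit to the data.
fruit_fit <- workflow() %>%
  add_recipe(fruit_data_recipe_2) %>%
  add_model(knn_spec) %>%
  fit(data = fruit_data)

# Predict the class of the new observation; the workflow applies the
# same centering/scaling to new_fruit automatically.
new_fruit <- tibble(mass = 150, color_score = 0.73)
fruit_predicted <- predict(fruit_fit, new_fruit)
```

predict() returns a one-row tibble with a .pred_class column holding the predicted fruit_name.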
Question 1.11
{points: 3}
Revisiting fruit_plot and considering the prediction given by K-nearest neighbours above, do you think the classification model did a "good" job of predicting? Could you have done better? Given what we know so far in the course, what might we want to do to help with tricky prediction cases such as this?
You can use the code below to visualize the observation whose label we just tried to predict.
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
Question 1.12
{points: 1}
Now do K-nearest neighbours classification again with the same data set, same K, and same new observation. However, this time, let's use all the columns in the dataset as predictors (except for the categorical fruit_label and fruit_subtype variables).
We have provided the new_fruit_all dataframe below, which encodes the predictors for our new observation. Your job is to use K-nearest neighbours to predict the class of this point. You can reuse the model specification you created earlier.
Assign your answer (the output of predict) to an object called fruit_all_predicted.
Question 1.13
{points: 3}
Did your second classification on the same data set with the same K change the prediction? If so, why do you think this happened?
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.
2. Wheat Seed Dataset
X-ray images can be used to analyze and sort seeds. In this data set, we have 7 measurements from x-ray images from 3 varieties of wheat seeds (Kama, Rosa and Canadian).
Question 2.0
{points: 3}
Let's use tidymodels to perform K-nearest neighbours to classify the wheat variety of seeds. The data set is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt. Download the data set directly from this URL using the read_table2() function, which is helpful when the columns are separated by one or more white spaces.
The seven measurements taken for each wheat kernel are listed below:
area A,
perimeter P,
compactness C = 4πA/P^2,
length of kernel,
width of kernel,
asymmetry coefficient,
length of kernel groove.
The last column in the data set is the variety label. The mapping for the numbers to varieties is listed below:
1 == Kama
2 == Rosa
3 == Canadian
Use tidymodels with this data to perform K-nearest neighbours to classify the wheat variety of a new seed we measure (from an x-ray image), with values for the measurements listed above. Specify the number of neighbors we want to perform the classification. Don't forget to perform any necessary preprocessing!
Assign your answer to an object called seed_predict.
Hints:
colnames() can be used to specify the column names of a data frame.
The wheat variety column appears numerical, but you want it to be treated as categorical for this analysis, thus as_factor() might be helpful.
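One way the hints could fit together is sketched below. The column names are assumptions based on the measurement list above (the file itself has no header row), and the code needs internet access to fetch the data.

```r
library(tidyverse)

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"

# Read the whitespace-separated file; col_names = FALSE because the
# file has no header row.
seed_data <- read_table2(url, col_names = FALSE)

# Assumed column names, following the order of the measurement list
# above, with the variety label last.
colnames(seed_data) <- c("area", "perimeter", "compactness",
                         "kernel_length", "kernel_width",
                         "asymmetry_coefficient", "groove_length",
                         "variety")

# The variety label looks numeric (1/2/3) but should be categorical.
seed_data <- seed_data %>%
  mutate(variety = as_factor(variety))
```

From here, the recipe/workflow steps mirror the fruit example: standardize the predictors, specify the K-nearest neighbour model, fit, and predict on the new seed.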
Question 2.1 Multiple Choice:
{points: 1}
What is the classification of the new_seed observation?
A. Kama
B. Rosa
C. Canadian
Assign your answer to an object called answer2.1. Make sure your answer is in uppercase and is surrounded by quotation marks (e.g. "F").