Worksheet 6 - Classification
Lecture and Tutorial Learning Goals:
After completing this week's lecture and tutorial work, you will be able to:
Recognize situations where a simple classifier would be appropriate for making predictions.
Explain the k-nearest neighbour classification algorithm.
Interpret the output of a classifier.
Compute, by hand, the distance between points when there are two explanatory variables/predictors.
Describe what a training data set is and how it is used in classification.
In a dataset with two explanatory variables/predictors, perform k-nearest neighbour classification in R using caret::train(method = "knn", ...) to predict the class of a single new observation.
This worksheet covers parts of Chapter 6 of the online textbook. You should read this chapter before attempting the worksheet.
Question 0.1 Multiple Choice:
{points: 1}
Which of the following statements is NOT true of a training data set (in the context of classification)?
A. A training data set is a collection of observations for which we know the true classes.
B. We can use a training set to explore and build our classifier.
C. The training data set is the underlying collection of observations for which we don't know the true classes.
Assign your answer to an object called answer0.1.
Question 0.2 Multiple Choice
{points: 1}
(Adapted from James et al., "An Introduction to Statistical Learning", page 53)
Consider the scenario below:
We collect data on 20 similar products. For each product we have recorded whether it was a success or failure (labelled as such by the Sales team), price charged for the product, marketing budget, competition price, customer data, and ten other variables.
Which of the following is a classification problem?
A. We are interested in comparing the profit margins for products that are a success and products that are a failure.
B. We are considering launching a new product and wish to know whether it will be a success or a failure.
C. We wish to group customers based on their preferences and use that knowledge to develop targeted marketing programs.
Assign your answer to an object called answer0.2.
1. Breast Cancer Data Set
We will work with the breast cancer data from this week's pre-reading.
Question 1.0
{points: 1}
Read the clean-wdbc-data.csv file (found in the data directory) into the notebook and store it as a data frame. Name it cancer.
Question 1.1 True or False:
{points: 1}
After looking at the first six rows of the cancer data frame, suppose we asked you to predict the variable "area" for a new observation. Is this a classification problem?
Assign your answer (either "true" or "false") to an object called answer1.1.
Question 1.2
{points: 1}
Create a scatterplot of the data with Symmetry on the x-axis and Radius on the y-axis. Modify your aesthetics by colouring the points by Class. As you create this plot, ensure you follow the guidelines for creating effective visualizations.
Assign your plot to an object called cancer_plot.
Question 1.3
{points: 1}
Just by looking at the scatterplot above, how would you classify an observation with symmetry 1 and radius 1?
Benign
Malignant
Assign your answer to an object called answer1.3.
We will now compute the distance between the first and second observations in the breast cancer dataset using the explanatory variables/predictors Symmetry and Radius. Recall that we can calculate the distance between two points (xa, ya) and (xb, yb) using the following formula:
distance = sqrt((xa - xb)^2 + (ya - yb)^2)
Question 1.4
{points: 1}
First, extract the coordinates for the two observations and assign them to objects called:
xa (Symmetry value for the first row)
ya (Radius value for the first row)
xb (Symmetry value for the second row)
yb (Radius value for the second row)
Scaffolding for xa is given.
Question 1.5
{points: 1}
Plug the coordinates into the distance equation.
Assign your answer to an object called answer1.5.
Fill in the ... in the cell below. Copy and paste your finished answer into the fail().
Question 1.6
{points: 1}
Now we'll do the same thing with 3 explanatory variables/predictors: Symmetry, Radius and Concavity. Again, use the first two rows in the data set as the points you are calculating the distance between (point a is row 1, and point b is row 2).
Find the coordinates for the third variable (Concavity) and assign them to objects called za and zb. Use the scaffolding given in Question 1.4 as a guide.
Question 1.7
{points: 1}
Again, calculate the distance between the first and second observation in the breast cancer dataset using 3 explanatory variables/predictors: Symmetry, Radius and Concavity.
Assign your answer to an object called answer1.7. Use the scaffolding given to calculate answer1.5 as a guide.
Question 1.8
{points: 1}
Let's do this without explicitly making coordinate variables!
Create a vector of the coordinates for each point. Name one vector point_a and the other vector point_b. Within the vector, the order of coordinates should be: Symmetry, Radius, Concavity.
Fill in the ... in the cell below. Copy and paste your finished answer into the fail().
Question 1.9
{points: 1}
Calculate the differences between the two vectors, point_a and point_b. The result should be a vector of length 3 named difference.
Question 1.10
{points: 1}
Square the differences between the two vectors, point_a and point_b. The result should be a vector of length 3 named dif_square. Hint: ^ is the exponent symbol in R.
Question 1.10.1
{points: 1}
Sum the squared differences between the two vectors, point_a and point_b. The result should be a double named dif_sum.
Hint: the sum function in R returns the sum of the elements of a vector.
Question 1.10.2
{points: 1}
Take the square root of the sum of your squared differences. The result should be a double named root_dif_sum.
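The step-by-step calculation in Questions 1.9 to 1.10.2 can be sketched with toy vectors (illustrative values, not the actual cancer observations):

```r
# Toy 3-dimensional points (made-up values for illustration)
point_a <- c(0, 0, 0)
point_b <- c(1, 2, 2)

difference   <- point_a - point_b   # element-wise differences: -1 -2 -2
dif_square   <- difference^2        # squared differences:       1  4  4
dif_sum      <- sum(dif_square)     # sum of squares:            9
root_dif_sum <- sqrt(dif_sum)       # Euclidean distance:        3
```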
Question 1.10.3
{points: 1}
If we have more than a few points, calculating distances as we did in the previous questions is VERY slow. Let's use the dist() function to find the distance between the first and second observations in the breast cancer dataset using Symmetry, Radius and Concavity.
Fill in the ... in the cell below. Copy and paste your finished answer into the fail().
Assign your answer to an object called dist_cancer_two_rows.
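By default, dist() computes Euclidean distances between every pair of rows of its input. A minimal sketch with the same toy points as above (not the cancer data):

```r
# Each row is one point; dist() returns all pairwise Euclidean distances
points <- rbind(c(0, 0, 0),
                c(1, 2, 2))
dist(points)   # 3, matching sqrt(1 + 4 + 4)
```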
Question 1.10.4 True or False:
{points: 1}
Compare answer1.7, root_dif_sum, and dist_cancer_two_rows.
Are they all the same value?
Assign your answer (either "true" or "false") to an object called answer1.10.4.
2. Classification - A Simple Example Done Manually
Question 2.0.0
{points: 1}
Let's take a random sample of 5 observations from the breast cancer dataset using the sample_n function. To make this random sample reproducible, we will use set.seed(2). This means that the random number generator will start at the same point each time we run the code, so we will always get back the same random sample.
We will focus on the predictors Symmetry and Radius only. Thus, we will need to select the columns Symmetry, Radius, and Class. Save these 5 rows and 3 columns to a data frame named small_sample.
Fill in the ... in the scaffolding provided below.
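The reproducibility idea behind set.seed can be illustrated with a simple sketch (generic draws, not the worksheet's sample):

```r
# With the same seed, the same "random" draw is produced every run
set.seed(2)
first_draw <- sample(1:100, 5)

set.seed(2)
second_draw <- sample(1:100, 5)

identical(first_draw, second_draw)   # TRUE
```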
Question 2.0.1
{points: 1}
Finally, create a scatter plot where Symmetry is on the x-axis and Radius is on the y-axis. Colour the points by Class. Name your plot small_sample_plot.
Fill in the ... in the scaffolding provided below.
Question 2.1
{points: 1}
Suppose we are interested in classifying a new observation with Symmetry = 0 and Radius = 0.25, but unknown Class. Using the small_sample data frame, add another row with Symmetry = 0, Radius = 0.25, and Class = "unknown".
Fill in the ... in the scaffolding provided below.
Assign your answer to an object called newData.
Question 2.2
{points: 1}
Compute the distance between each pair of the 6 observations in the newData data frame using the dist() function, based on two variables: Symmetry and Radius. Fill in the ... in the scaffolding provided below.
Assign your answer to an object called dist_matrix.
Question 2.3 Multiple Choice:
{points: 1}
In the table above, the row and column numbers reflect the row numbers from the data frame the dist function was applied to. Thus numbers 1 - 5 are the points/observations from rows 1 - 5 in the small_sample data frame. Row 6 is the new observation whose diagnosis class we do not know. The values in dist_matrix are the distances between the points of the corresponding row and column numbers. For example, the distance between point 2 and point 4 is 0.8155068, and the distance between point 3 and point 3 (the same point) is 0.
Which observation is the nearest to our new point?
Assign your answer to an object called answer2.3.
Question 2.4 Multiple Choice:
{points: 1}
Use the K-nearest neighbour classification algorithm with K = 1 to classify the new observation using your answers to Questions 2.2 & 2.3. Is the new data point predicted to be benign or malignant?
Assign your answer to an object called answer2.4
.
Question 2.5 Multiple Choice:
{points: 1}
Using your answers to Questions 2.2 & 2.3, what are the three closest observations to your new point?
A. 1, 3, 2
B. 1, 4, 2
C. 5, 2, 4
D. 3, 4, 2
Assign your answer to an object called answer2.5
.
Question 2.6 Multiple Choice:
{points: 1}
We will now use the K-nearest neighbour classification algorithm with K = 3 to classify the new observation using your answers to Questions 2.2 & 2.3. Is the new data point predicted to be benign or malignant?
Assign your answer to an object called answer2.6
.
Question 2.7
{points: 1}
Compare your answers in 2.4 and 2.6. Are they the same?
Assign your answer (either "yes" or "no") to an object called answer2.7
.
3. Using caret to perform k-nearest neighbours
Now that we understand how K-nearest neighbours classification works, let's get familiar with the caret R package so we can run classification analyses faster and with fewer errors.
We'll again focus on Radius and Symmetry as the two predictors. This time, we would like to predict the class of a new observation with Symmetry = 1 and Radius = 0. This one is a bit tricky to do visually from the plot below, and so is a motivating example for us to compute the prediction using k-nn with the caret package. Let's use K = 7.
Question 3.0
{points: 1}
Using the cancer data, create the two objects needed to train a model using the caret package:
A data.frame containing the predictors Symmetry and Radius, and
A vector containing the Class.
Name the data.frame containing the predictors X_train and the vector containing the classes/labels Y_train.
Hints:
use the data.frame function, which makes tibbles into data.frames
use the unlist function to make a single-column tibble into a vector
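The two hints can be illustrated with a tiny made-up tibble (toy values, not the cancer data):

```r
library(tibble)

# Toy data standing in for the cancer tibble
toy <- tibble(Symmetry = c(0.1, 0.2),
              Radius   = c(1.0, 2.0),
              Class    = c("B", "M"))

# Hint 1: data.frame() turns a tibble into a data.frame
X_toy <- data.frame(toy[, c("Symmetry", "Radius")])
class(X_toy)   # "data.frame"

# Hint 2: unlist() turns a single-column tibble into a plain vector
Y_toy <- unlist(toy[, "Class"])
is.vector(Y_toy)   # TRUE
```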
Question 3.1
{points: 1}
Next, use the train function to train the K-nearest neighbours model. Make sure you pass the correct arguments to tell caret which columns are the predictors and which is the target/outcome, as well as the value of K we are using. Save the output as an object called model_knn.
Note: because caret is designed to make it easy to try a few different values of K, you need to specify K as a data.frame. We have provided the scaffolding for you to do this below.
Question 3.2
{points: 1}
Create a data.frame with our single new observation (Symmetry = 1 and Radius = 0), naming it new_obs. Predict the label of the new observation using the predict function. Store the output of predict in an object called predicted_knn_7.
Looking back at the plot (shown again below), is this what you would have been able to guess visually? And do you think K = 7 was the "best" value to choose? Think on this, and we will discuss it next week. No answer is required in this worksheet.
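The overall train-then-predict workflow can be sketched on a tiny made-up training set (toy values and toy object names, not the worksheet's cancer objects; trControl = trainControl(method = "none") skips resampling so the sketch runs on very little data):

```r
library(caret)

# Toy training set (illustrative values only)
X_train_toy <- data.frame(Symmetry = c(0, 1, 0, 1, 0.5, 0.9),
                          Radius   = c(0, 0, 1, 1, 0.1, 0.8))
Y_train_toy <- factor(c("B", "M", "B", "M", "B", "M"))

model_toy <- train(x = X_train_toy, y = Y_train_toy,
                   method = "knn",
                   tuneGrid = data.frame(k = 3),           # K is given as a data.frame
                   trControl = trainControl(method = "none"))

# Predict the class of one new observation
new_obs_toy <- data.frame(Symmetry = 0.5, Radius = 0.5)
predict(model_toy, new_obs_toy)
```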
Question 3.3
{points: 1}
Perform K-nearest neighbour classification again, using the caret package and K = 7 to classify a new observation where we measure Symmetry = 1, Radius = 0 and Concavity = 1.
store the training predictors in an object called X_train_3
store the training labels in an object called Y_train_3
store the output of train in an object called model_knn_3
store the new observation in an object called new_obs_3
store the output of predict in an object called predicted_3_knn_7
Question 3.4
{points: 1}
Finally, perform K-nearest neighbour classification again, using the caret package and K = 7 to classify a new observation where we have measurements for all the predictors in our training data set (we give you the values in the code below).
store the training predictors in an object called X_train_all
store the training labels in an object called Y_train_all
store the output of train in an object called model_knn_all
store the new observation in an object called new_obs_all
store the output of predict in an object called predicted_all_knn_7
Hint: ID is not a measurement, but a label for each observation. Thus, do not include this in your analysis.
4. Reviewing Some Concepts
We will conclude with two multiple choice questions to reinforce some key concepts when doing classification with K-nearest neighbours.
Question 4.0
{points: 1}
In the K-nearest neighbours classification algorithm, we calculate the distance between the new observation (for which we are trying to predict the class/label/outcome) and each of the observations in the training data set so that we can:
A. Find the K
nearest neighbours of the new observation
B. Assess how well our model fits the data
C. Find outliers
D. Assign the new observation to a cluster
Assign your answer (e.g. "E") to an object called answer4.0.
Question 4.1
{points: 1}
In the K-nearest neighbours classification algorithm, we choose the label/class for a new observation by:
A. Taking the mean (average value) label/class of the K nearest neighbours
B. Taking the median (middle value) label/class of the K nearest neighbours
C. Taking the mode (value that appears most often) label/class of the K nearest neighbours
Assign your answer (e.g., "E") to an object called answer4.1.