GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-spring/materials/tutorial_06/tutorial_06.ipynb

Kernel: R

Tutorial 6: Classification

In [ ]:

### Run this cell before continuing. 
library(digest)
library(tidyverse)
library(testthat)
library(repr)
library(caret)

2. Fruit Dataset

In the agricultural industry, cleaning, sorting, grading and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size and shape, attributes that help determine the quality of the food. Sorting can be done by humans, but it is tedious and time-consuming. Automatic sorting by machine could save both time and money. Images of the food products are captured and analysed to determine their visual characteristics.

The dataset contains observations of fruit described by four features: 1) mass, 2) width, 3) height and 4) color score. The dataset "fruit_data_scaled.csv" has already been scaled as part of the data preparation. Scaling will be discussed in more detail next week.

Question 1.0

Read the data in fruit_data_scaled.csv into the notebook. Name it fruit_data.
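
As a minimal sketch of reading a CSV file with `read_csv` from the tidyverse, the cell below writes a tiny made-up file to a temporary location and reads it back. The file contents and the names `tmp` and `toy` are assumptions for illustration only; they are not the fruit data.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (illustration only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1.5", "b,2.5"), tmp)

# read_csv() returns a tibble with one column per CSV column.
toy <- read_csv(tmp)
head(toy)
```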

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(fruit_data)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(nrow(fruit_data), 59)
    expect_equal(ncol(fruit_data), 7)
    expect_equal(digest(as.numeric(sum(fruit_data$scaled_width))), '3043b93a18750881f7178956ba203cfd')
})
print("Success!")

Question 1.1

Which of the columns are categorical?

A. Fruit label, width, fruit subtype

B. Fruit name, color score, height

C. Fruit label, fruit subtype, fruit name

D. Color score, mass, width

Assign your answer to an object called answer1.1

In [ ]:

# Assign your answer to an object called: answer1.1
# Make sure your answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(answer1.1), '475bf9280aab63a82af60791302736f6') # we hid the answer to the test here so you can't see it, but we can still run the test
})
print("Success!")

Question 1.2

Change the variable fruit_name to a factor (using as.factor). Your object should still be named fruit_data.
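
As a rough illustration of what `as.factor` does, the sketch below converts a character column of a made-up data frame (called `toy` here, not the fruit data) into a factor. The data frame keeps its name and shape; only the column type changes.

```r
# Toy data frame, made up for illustration (not the tutorial's fruit data).
toy <- data.frame(
    label = c("a", "b", "a", "c"),
    value = c(1.2, 0.4, 2.1, 0.9),
    stringsAsFactors = FALSE
)

# Before conversion the column holds plain character data.
class(toy$label)

# Convert the column to a factor, overwriting it in place.
toy$label <- as.factor(toy$label)

# The column is now categorical, with one level per distinct value.
is.factor(toy$label)
levels(toy$label)
```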

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(fruit_data)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(ncol(fruit_data), 7)
    expect_equal(nrow(fruit_data), 59)
    expect_true(is.factor(fruit_data$fruit_name))
    })
print("Success!")

Question 1.3

Make a scatterplot with scaled mass on the x-axis and scaled color score on the y-axis. Color the points by fruit name.

Assign your plot to an object called fruit_plot. Make sure to follow the practices for an effective visualization, such as labelling your axes.
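
The general shape of a coloured scatterplot in ggplot2 can be sketched with R's built-in `iris` data, which stands in for the fruit data here. The object name `iris_plot` and the label text are assumptions chosen for this example.

```r
library(ggplot2)

# Built-in iris data stands in for the fruit data in this sketch.
iris_plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
    geom_point() +
    # Human-readable axis and legend labels make the plot easier to interpret.
    labs(x = "Petal length (cm)", y = "Petal width (cm)", color = "Species")

iris_plot
```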

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
fruit_plot

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(fruit_plot$mapping$x)), '3986fbac2d38b023685e9f642b0979d6')
    expect_equal(digest(rlang::get_expr(fruit_plot$mapping$y)), 'cd2f8d5f1bef36aa7fd8c33f0843ed08')
    expect_true(digest(rlang::get_expr(fruit_plot$mapping$colour)) %in% c('7aaa2f0f12c253204a1e8490580820db', 'f9e884084b84794d762a535f3facec85'))
    expect_true('GeomPoint' %in% class(rlang::get_expr(fruit_plot$layers[[1]]$geom)))
    })
print("Success!")

Question 1.4

Suppose we have a new observation in the fruit dataset with scaled mass 0.5 and scaled colour score 0.5.

Add this new data point in black to the scatterplot below.

Assign your new plot to an object called fruit_plot_new. Again, make sure to label your axes!

In [ ]:

# Add this layer to the fruit_plot object and fill in the missing parts
# geom_point(aes(x = ..., y = ...), color = "black", size = 2.5)

# your code here
fail() # No Answer - remove if you provide an answer
fruit_plot_new

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(rlang::get_expr(fruit_plot_new$mapping$x)), '3986fbac2d38b023685e9f642b0979d6')
    expect_equal(digest(rlang::get_expr(fruit_plot_new$mapping$y)), 'cd2f8d5f1bef36aa7fd8c33f0843ed08')
    expect_true(digest(rlang::get_expr(fruit_plot_new$mapping$colour)) %in% c('7aaa2f0f12c253204a1e8490580820db', 'f9e884084b84794d762a535f3facec85'))
    expect_true('GeomPoint' %in% class(rlang::get_expr(fruit_plot_new$layers[[1]]$geom)))
    expect_true('GeomPoint' %in% class(rlang::get_expr(fruit_plot_new$layers[[2]]$geom)))
    })
print("Success!")

Question 1.5

Just by looking at the scatterplot, how would you classify this observation using k-nearest neighbours with $k=3$? Explain how you arrived at your answer.
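
For reference, the k-nearest neighbours rule can be sketched by hand in base R: compute the distance from the new point to every training point, keep the $k$ closest, and take a majority vote. The training points, the new point, and all object names below are made up for illustration and are unrelated to the fruit data.

```r
# Toy training data, made up for illustration (not the fruit data).
train <- data.frame(
    x = c(0.0, 0.2, 0.1, 1.0, 1.1, 0.9),
    y = c(0.0, 0.1, 0.3, 1.0, 0.8, 1.2),
    class = c("A", "A", "A", "B", "B", "B"),
    stringsAsFactors = FALSE
)
new_point <- c(x = 0.3, y = 0.2)
k <- 3

# Euclidean distance from the new point to every training point.
distances <- sqrt((train$x - new_point[["x"]])^2 + (train$y - new_point[["y"]])^2)

# Classes of the k closest training points.
nearest <- train$class[order(distances)[1:k]]

# Majority vote among those k neighbours.
votes <- table(nearest)
predicted <- names(votes)[which.max(votes)]
predicted
```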

YOUR ANSWER HERE

Now, let's use the caret package in R to classify the fruit name with k-nn for another new observation, one where scaled mass is -0.3 and scaled colour score is -0.4. Use scaled color score and scaled mass as the predictors/explanatory variables and choose $k=5$.
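
The overall caret workflow (separate predictors and outcome, fit the model with a chosen $k$, then predict) can be sketched on R's built-in `iris` data, which stands in for the fruit data. The object names `X`, `Y`, `k_grid`, `knn_model`, `new_obs` and `prediction` are assumptions for this example, and this is one way to call `train`, not necessarily the only one.

```r
library(caret)

# Built-in iris data stands in for the fruit data in this sketch.
# Predictors as a plain data frame, outcome as a factor vector.
X <- iris[, c("Petal.Length", "Petal.Width")]
Y <- iris$Species

# caret takes the candidate values of k as a data frame with a column named k.
k_grid <- data.frame(k = 5)

set.seed(1)  # the resampling that train() performs is random
knn_model <- train(x = X, y = Y, method = "knn", tuneGrid = k_grid)

# A new observation must use the same column names as the predictors.
new_obs <- data.frame(Petal.Length = 1.4, Petal.Width = 0.2)
prediction <- predict(knn_model, new_obs)
prediction
```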

Question 1.6

Split the fruit_data data frame into the predictors (name it X_train) and the class/outcome (name it Y_train). X_train should be a plain data frame rather than a tibble, and Y_train should be a vector.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
head(X_train)
head(Y_train)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(ncol(X_train), 2)
    expect_equal(nrow(X_train), 59)
    expect_equal('tbl' %in% class(X_train), FALSE)
    expect_equal(length(Y_train), 59)
    })
print("Success!")

Question 1.7

Specify $k$ and create the k-nn model object (name it fruit_class).

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
print(fruit_class)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(as.numeric(fruit_class$results$k), 5)
    expect_equal(as.character(fruit_class$method), 'knn')
    expect_equal(digest(as.numeric(sum(fruit_class$trainingData$scaled_color))), '2a93b1da4bcda1113ef03e891938eac4')
    expect_equal(digest(as.numeric(sum(fruit_class$trainingData$scaled_mass))), '9c46762a9ec19d9658dc07063ead8f30')
    expect_equal(as.numeric(summary(fruit_class$trainingData$.outcome)[1]), 19)
})
print("Success!")

Question 1.8

Use the fruit_class model to predict the class for the new fruit observation (where scaled mass is -0.3 and scaled colour score is -0.4). Save your prediction to an object named fruit_predicted.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
print(fruit_predicted)

In [ ]:

test_that('Solution is incorrect', {
    expect_equal(digest(as.character(fruit_predicted)), '17f79d7a98f732174cc5a86dc56380d6')
})
print("Success!")

Question 1.9

Revisiting fruit_plot and considering the prediction given by k-nn above, do you think k-nn did a "good" job predicting? Could you have done better? Given what we know so far in the course, what might we want to do to help with tricky prediction cases such as this?

In [ ]:

fruit_plot +
    geom_point(aes(x = -0.3, y = -0.4), color = "black", size = 2.5)

YOUR ANSWER HERE

Question 2.0

Now perform k-nn classification in R again with the same data set, the same $k$, and the same new observation (with the additional measurements given below), but this time use every column in the dataset except fruit_label, fruit_name and fruit_subtype as a predictor.
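
With more predictors, the distance that k-nn uses is computed over every predictor column rather than just two. The sketch below, using two made-up observations (`a` and `b`, with hypothetical feature names), shows how the Euclidean distance between the same pair of points differs depending on which columns are included.

```r
# Two toy observations with four features each (values made up).
a <- c(mass = 0, width = 0, height = 0, color = 0)
b <- c(mass = 3, width = 4, height = 0, color = 0)

# Distance using only two of the features.
two_features <- c("mass", "color")
dist_two <- sqrt(sum((a[two_features] - b[two_features])^2))

# Distance using all four features.
dist_all <- sqrt(sum((a - b)^2))

dist_two  # 3
dist_all  # 5
```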

In [ ]:

new_fruit_all <- data.frame(scaled_mass = -0.3, 
                            scaled_width = -0.5, 
                            scaled_height = 1.0,
                            scaled_color = -0.4)


# your code here
fail() # No Answer - remove if you provide an answer

Question 2.1

Did your second classification on the same data set with the same $k$ change the prediction? If so, why do you think this happened?

YOUR ANSWER HERE

Question 3.0

X-ray images can be used to analyze and sort seeds. In this data set, we have 7 measurements from x-ray images of 3 varieties of wheat seeds (Kama, Rosa and Canadian). Use k-nn to classify the wheat variety of a new seed whose measurements (from an x-ray image) are given below. Choose $k = 5$ to perform the classification.

The following seven measurements were taken for each wheat kernel:

area $A$,
perimeter $P$,
compactness $C = 4\pi A / P^2$,
length of kernel,
width of kernel,
asymmetry coefficient,
length of kernel groove.
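
As a quick check of the compactness formula $C = 4\pi A / P^2$, the sketch below evaluates it for a circle, for which it should equal 1 because $A = \pi r^2$ and $P = 2\pi r$. The function name `compactness` and the radius are assumptions chosen for this example.

```r
# Compactness as defined above: C = 4 * pi * A / P^2.
compactness <- function(area, perimeter) {
    4 * pi * area / perimeter^2
}

# For a circle of radius r, A = pi * r^2 and P = 2 * pi * r,
# so the compactness equals 1 up to floating-point rounding.
r <- 2
circle_compactness <- compactness(pi * r^2, 2 * pi * r)
circle_compactness
```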

The data set is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt. The last column in the data set is the variety label. The mapping for the numbers to varieties is listed below:

1 == Kama
2 == Rosa
3 == Canadian

Hints:

colnames() can be used to specify the column names of a data frame.
The wheat variety column appears numerical, but it should be treated as categorical in this analysis, so as.factor() might be helpful.
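
The two hints can be sketched on a tiny made-up table with no header row. The three rows and the names `raw` and `toy` below are assumptions for illustration and are not taken from the seeds file. The last step uses `factor` with `labels`, an optional alternative to `as.factor` that attaches the variety names from the mapping above.

```r
# Three made-up rows with no header (illustration only, not the real file).
raw <- "15.2\t1\n14.9\t1\n11.8\t3"
toy <- read.table(text = raw, header = FALSE)

# Without a header, R assigns placeholder names; replace them with colnames().
colnames(toy) <- c("area", "variety")

# Treat the numeric codes as categories, labelled with the mapping given above.
toy$variety <- factor(toy$variety,
                      levels = c(1, 2, 3),
                      labels = c("Kama", "Rosa", "Canadian"))

toy
```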

In [ ]:

new_seed <- data.frame(area = 12.1,
                        perimeter = 14.2,
                        compactness = 0.9,
                        length = 4.9,
                        width = 2.8,
                        asymmetry_coefficient = 3.0, 
                        groove_length = 5.1)

# your code here
fail() # No Answer - remove if you provide an answer

Question 3.1

In 2-3 sentences, describe in your own words your findings from the classification task above.

YOUR ANSWER HERE
