UBC-DSCI
GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2020-spring/materials/tutorial_06/tutorial_06.ipynb
Kernel: R

Tutorial 6: Classification

### Run this cell before continuing.
library(tidyverse)
library(repr)
library(caret)
source('tests_tutorial_06.R')
source("cleanup_tutorial_06.R")

1. Fruit Dataset

In the agricultural industry, cleaning, sorting, grading, and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size, and shape, attributes that help determine the quality of the food. Sorting can be done by humans, but it is tedious and time-consuming. Automatic sorting could help save time and money: images of the food products are captured and analysed to determine their visual characteristics.

The dataset contains observations of fruit described by four features: 1) mass, 2) width, 3) height, and 4) color score. The dataset fruit_data_scaled.csv in the data folder has already been scaled as part of the data preparation.
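To make "scaled" concrete, here is a minimal sketch of standardizing a single column with base R's `scale()`. It assumes the preparation step centered each feature to mean 0 and scaled it to standard deviation 1; the tutorial does not say which method was actually used, and the `toy_mass` values are made up for illustration.

```r
# Hypothetical raw masses (made-up values, not taken from fruit_data_scaled.csv)
toy_mass <- c(120, 150, 180, 210, 240)

# scale() subtracts the mean and divides by the standard deviation;
# as.numeric() drops the matrix attributes that scale() returns
toy_mass_scaled <- as.numeric(scale(toy_mass))

mean(toy_mass_scaled) # approximately 0
sd(toy_mass_scaled)   # approximately 1
```

Putting every predictor on a comparable scale matters for K-nearest neighbours because the method relies on distances between observations.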

Question 1.0
{points: 1}

Read the data in fruit_data_scaled.csv into the notebook. Name it fruit_data.

# your code here
fail() # No Answer - remove if you provide an answer
head(fruit_data)
test_1.0()

Question 1.1
{points: 1}

Which of the columns are categorical?

A. Fruit label, width, fruit subtype

B. Fruit name, color score, height

C. Fruit label, fruit subtype, fruit name

D. Color score, mass, width

Assign your answer (e.g. "E") to an object called answer1.1

# Make sure the correct answer is an uppercase letter.
# Surround your answer with quotation marks.
# Replace the fail() with your answer.

# your code here
fail() # No Answer - remove if you provide an answer
test_1.1()

Question 1.2
{points: 1}

Change the variable fruit_name to a factor (using as.factor). The data frame should still be named fruit_data.

# your code here
fail() # No Answer - remove if you provide an answer
head(fruit_data)
test_1.2()
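As a hedged illustration of the pattern (on a made-up data frame, not the fruit data), the sketch below overwrites a character column with its factor version while keeping the data frame's name. The names `toy_data`, `kind`, and `value` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data frame (made-up values)
toy_data <- data.frame(
  kind = c("a", "b", "a"),
  value = c(1.2, 0.4, -0.7),
  stringsAsFactors = FALSE
)

# Replace the character column with a factor, reusing the same data frame name
toy_data <- toy_data %>%
  mutate(kind = as.factor(kind))

class(toy_data$kind)
levels(toy_data$kind)
```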

Question 1.3
{points: 1}

Make a scatterplot of scaled mass on the horizontal axis and scaled color score on the vertical axis. Color the points by fruit name.

Assign your plot to an object called fruit_plot. Make sure to do all the things to make an effective visualization.

# your code here
fail() # No Answer - remove if you provide an answer
fruit_plot
test_1.3()

Question 1.4
{points: 1}

Suppose we have a new observation in the fruit dataset with scaled_mass = 0.5 and scaled_color = 0.5.

Label this new data point in black on the scatterplot below.

Assign your new plot to an object called fruit_plot_new. Again, make sure to label your axes!

# Add this layer to the fruit_plot object and fill in the missing parts
# geom_point(aes(x = ..., y = ...), color = "black", size = 2.5)

# your code here
fail() # No Answer - remove if you provide an answer
fruit_plot_new
test_1.4()
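For readers unfamiliar with adding layers to an existing ggplot object, here is a small sketch on made-up data. The names `toy_data`, `toy_plot`, and `toy_plot_new` are hypothetical, and the coordinates are arbitrary.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, not the fruit data)
toy_data <- data.frame(
  x = c(-1, 0, 1, 2),
  y = c(0.5, -0.5, 1, 0),
  group = factor(c("a", "a", "b", "b"))
)

# Base scatterplot with labelled axes and legend
toy_plot <- ggplot(toy_data, aes(x = x, y = y, color = group)) +
  geom_point() +
  labs(x = "x (scaled)", y = "y (scaled)", color = "Group")

# Add one extra point as a second layer; fixed settings such as
# color and size go outside aes() so they are not mapped to a legend
toy_plot_new <- toy_plot +
  geom_point(aes(x = 0.5, y = 0.5), color = "black", size = 2.5)
```

Because the original plot object is left unchanged, the same base plot can be reused to highlight other points later.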

Question 1.5
{points: 3}

Just by looking at the scatterplot, how would you classify this observation using K-nearest neighbours if you use K = 3? Explain how you arrived at your answer.
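As a reminder of what K-nearest neighbours does behind the scenes, the sketch below works through the procedure by hand in base R on made-up data (not the fruit data, so it does not answer this question). It assumes straight-line (Euclidean) distance and a simple majority vote; all names and values are hypothetical.

```r
# Hypothetical toy training set (made-up values)
toy_train <- data.frame(
  x = c(0.0, 0.2, 1.0, 1.2, 0.9),
  y = c(0.0, 0.1, 1.0, 0.9, 1.1),
  label = c("a", "a", "b", "b", "b")
)
new_point <- c(x = 0.8, y = 0.8)

# Euclidean distance from the new point to every training observation
distances <- sqrt((toy_train$x - new_point[["x"]])^2 +
                  (toy_train$y - new_point[["y"]])^2)

# Labels of the K = 3 closest observations
k <- 3
nearest_labels <- toy_train$label[order(distances)[1:k]]

# Majority vote among those neighbours
votes <- table(nearest_labels)
predicted <- names(votes)[which.max(votes)]
predicted
```

On a scatterplot, the same reasoning amounts to finding the three points closest to the new observation and checking which class is most common among them.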

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.6
{points: 1}

Now, let's use the caret package in R to predict fruit_name for another new observation. The new observation we are interested in has scaled_mass = -0.3 and scaled_color = -0.4. Use scaled_mass and scaled_color as the predictors/explanatory variables and choose K = 5.

To begin, split the fruit_data data frame into the predictors (name it X_train) and the class/outcome (name it Y_train).

# your code here
fail() # No Answer - remove if you provide an answer
head(X_train)
head(Y_train)
test_1.6()
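One possible way to separate predictors from the outcome is sketched below on a made-up data frame. It assumes the model-fitting function expects the predictors as a data frame and the outcome as a plain vector; the names `toy_data`, `toy_X`, and `toy_Y` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not the fruit data)
toy_data <- data.frame(
  x1 = c(0.1, -0.4, 1.3, 0.7),
  x2 = c(-1.0, 0.2, 0.5, 0.9),
  label = factor(c("a", "a", "b", "b"))
)

# Predictors: a data frame holding only the explanatory columns
toy_X <- toy_data %>%
  select(x1, x2) %>%
  data.frame()

# Outcome: a factor vector rather than a one-column data frame
toy_Y <- toy_data %>%
  pull(label)

head(toy_X)
head(toy_Y)
```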

Question 1.7
{points: 1}

Specify K and create the K-nearest neighbour model object. Name it fruit_class.

# your code here
fail() # No Answer - remove if you provide an answer
print(fruit_class)
test_1.7()

Question 1.8
{points: 1}

Use the fruit_class model and the predict function to predict the class for the new fruit observation, where scaled_mass = -0.3 and scaled_color = -0.4. Save your prediction to an object named fruit_predicted.

# This is the new observation to predict
new_fruit <- data.frame(scaled_color = -0.4, scaled_mass = -0.3)

# your code here
fail() # No Answer - remove if you provide an answer
print(fruit_predicted)
test_1.8()
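The fit-then-predict workflow from the last few questions can be sketched end to end on made-up data. This is one way to do it with caret, not necessarily the approach the course expects: `trainControl(method = "none")` is an assumption added here so that the model is fitted once with the given K, without resampling. All `toy_*` names and values are hypothetical.

```r
library(caret)

# Hypothetical toy training data: two well-separated groups (made-up values)
toy_X <- data.frame(
  x1 = c(0.0, 0.1, 0.2, 0.1, 0.0, 0.2, 2.0, 2.1, 2.2, 2.1, 2.0, 2.2),
  x2 = c(0.0, 0.2, 0.1, 0.0, 0.1, 0.2, 2.0, 2.2, 2.1, 2.0, 2.1, 2.2)
)
toy_Y <- factor(rep(c("a", "b"), each = 6))

# Fix K through tuneGrid; method = "none" skips resampling and fits a single model
toy_model <- train(
  x = toy_X,
  y = toy_Y,
  method = "knn",
  tuneGrid = data.frame(k = 3),
  trControl = trainControl(method = "none")
)

# The new observation must use the same column names as the predictors
toy_new <- data.frame(x1 = 1.9, x2 = 2.0)
toy_predicted <- predict(toy_model, toy_new)
print(toy_predicted)
```

Here the new point sits inside the second cluster, so its three nearest neighbours all carry the label "b".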

Question 1.9
{points: 3}

Revisiting fruit_plot and considering the prediction given by K-nearest neighbours above, do you think the classification model did a "good" job predicting? Could you have done (or could you do) better? Given what we know thus far in the course, what might we want to do to help with tricky prediction cases such as this?

You can use the code below to visualize the observation whose label we just tried to predict:

fruit_plot + geom_point(aes(x = -0.3, y = -0.4), color = "black", size = 2.5)

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

Question 1.10
{points: 3}

Now do K-nearest neighbours classification again with the same data set, same K, and same new observation. However, this time, let's use all the columns in the dataset as predictors (except for the categorical fruit_label, fruit_name, and fruit_subtype variables).

We have provided the new_fruit_all data.frame below, which encodes the predictors for our new observation. Your job is to use K-nearest neighbours to predict the class of this point.

# This is the new observation to predict
new_fruit_all <- data.frame(scaled_mass = -0.3,
                            scaled_width = -0.5,
                            scaled_height = 1.0,
                            scaled_color = -0.4)

# your code here
fail() # No Answer - remove if you provide an answer

Question 1.11
{points: 3}

Did your second classification on the same data set with the same K change the prediction? If so, why do you think this happened?

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.

2. Wheat Seed Dataset

X-ray images can be used to analyze and sort seeds. In this data set, we have 7 measurements from x-ray images from 3 varieties of wheat seeds (Kama, Rosa and Canadian).

Question 2.0
{points: 3}

Let's use caret with this data to perform K-nearest neighbours classification and predict the wheat variety of a new seed from its observed measurements (taken from an x-ray image), shown below. Choose K = 5 to perform the classification.

The following seven measurements were taken for each wheat kernel:

  1. area A,

  2. perimeter P,

  3. compactness $C = 4\pi A / P^2$,

  4. length of kernel,

  5. width of kernel,

  6. asymmetry coefficient,

  7. length of kernel groove.
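To get a feel for the compactness measure in the list above, the short sketch below evaluates the formula for a circle, where it equals 1; less round shapes give smaller values. The helper name `compactness` is hypothetical and is not part of the dataset or of caret.

```r
# Compactness as defined in the list: 4 * pi * A / P^2
compactness <- function(area, perimeter) {
  4 * pi * area / perimeter^2
}

# Sanity check with a circle of radius 2 (area = pi * r^2, perimeter = 2 * pi * r)
r <- 2
circle_compactness <- compactness(pi * r^2, 2 * pi * r)
circle_compactness # 1, up to floating-point rounding
```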

The data set is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt. The last column in the data set is the variety label. The mapping from numbers to varieties is listed below:

  • 1 == Kama

  • 2 == Rosa

  • 3 == Canadian

Hints:

  • colnames() can be used to specify the column names of a data frame.

  • the wheat variety column appears numerical, but you want it to be treated as categorical for this analysis, thus as.factor() might be helpful.
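The two hints can be sketched together on a tiny made-up example. It assumes a whitespace-delimited layout with no header row and uses base R's `read.table()` on inline text rather than the real URL; the three rows and the column names are hypothetical and show only two measurements instead of seven.

```r
# Three made-up rows: two measurements followed by a numeric class code
toy_text <- "15.2 14.8 1
17.6 15.9 2
12.0 13.3 3"

# header = FALSE because this toy input has no header row
toy_seeds <- read.table(text = toy_text, header = FALSE)

# Assign column names (hypothetical names for this toy example)
colnames(toy_seeds) <- c("area", "perimeter", "variety")

# Treat the numeric class code as categorical
toy_seeds$variety <- as.factor(toy_seeds$variety)
levels(toy_seeds$variety)
```

Without the conversion to a factor, the class codes would be treated as ordinary numbers rather than as category labels.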

# This is the new observation to predict
new_seed <- data.frame(area = 12.1,
                       perimeter = 14.2,
                       compactness = 0.9,
                       length = 4.9,
                       width = 2.8,
                       asymmetry_coefficient = 3.0,
                       groove_length = 5.1)

# your code here
fail() # No Answer - remove if you provide an answer

Question 2.1
{points: 3}

In 2-3 sentences, describe in your own words your findings from the classification task above.

DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER.