Tutorial 6: Classification
2. Fruit Dataset
In the agricultural industry, cleaning, sorting, grading and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size and shape, attributes that help determine the quality of the food. Sorting can be done by humans, but it is tedious and time-consuming. Automatic sorting by machine could save time and money. Images of the food products are captured and analysed to determine visual characteristics.
The dataset contains observations of fruit described with four features: 1) mass, 2) width, 3) height, and 4) color score. The dataset "fruit_data_scaled.csv" has already been scaled as part of the data preparation. Scaling will be discussed in more detail next week.
Question 1.0
Read the data in fruit_data_scaled.csv into the notebook. Name it fruit_data.
Question 1.1
Which of the columns are categorical?
A. Fruit label, width, fruit subtype
B. Fruit name, color score, height
C. Fruit label, fruit subtype, fruit name
D. Color score, mass, width
Assign your answer to an object called answer1.1
Question 1.2
Change the variable fruit_name to a factor (using as.factor). Your object should still be named fruit_data.
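As a minimal sketch of this kind of conversion, the snippet below uses a small made-up data frame. The name toy_fruit and all of its values are hypothetical and are not taken from the tutorial data.

```r
# Hypothetical toy data; the real tutorial data lives in fruit_data_scaled.csv.
toy_fruit <- data.frame(
  fruit_name = c("apple", "lemon", "apple", "orange"),
  mass = c(0.2, -1.1, 0.4, 0.9),
  stringsAsFactors = FALSE
)

# Convert the character column to a factor so R treats it as categorical.
toy_fruit$fruit_name <- as.factor(toy_fruit$fruit_name)

# Check the result: the column is now a factor with one level per distinct name.
is.factor(toy_fruit$fruit_name)
levels(toy_fruit$fruit_name)
```

Overwriting the column in place keeps the data frame under its original name, which is what the question asks for.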
Question 1.3
Make a scatterplot of scaled color score and scaled mass. Color the points by fruit name. Assign your plot to an object called fruit_plot. Make sure to follow the practices for an effective visualization, such as labelling your axes.
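One possible way to build such a plot is sketched below with ggplot2. The data frame toy_fruit, its values, and the column names mass and color_score are assumptions made for illustration; the actual column names in fruit_data may differ.

```r
library(ggplot2)

# Hypothetical toy data standing in for fruit_data (values are made up).
toy_fruit <- data.frame(
  mass = c(-1.0, -0.8, 0.9, 1.1, 0.1, -0.2),
  color_score = c(-0.9, -1.1, 1.0, 0.8, 0.2, 0.0),
  fruit_name = as.factor(c("lemon", "lemon", "apple", "apple", "orange", "orange"))
)

# Map the class to the colour aesthetic and give each axis and the legend
# a human-readable label.
toy_plot <- ggplot(toy_fruit, aes(x = mass, y = color_score, colour = fruit_name)) +
  geom_point() +
  labs(x = "Mass (scaled)", y = "Color score (scaled)", colour = "Fruit name")

toy_plot
```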
Question 1.4
Suppose we have a new observation in the fruit dataset with scaled mass 0.5 and scaled colour score 0.5. Label this new data point in black on the scatterplot below. Assign your new plot to an object called fruit_plot_new. Again, make sure to label your axes!
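A single extra point can be layered onto a ggplot2 scatterplot with annotate(). The sketch below assumes made-up data and assumed column names (mass, color_score); only the coordinates (0.5, 0.5) come from the question.

```r
library(ggplot2)

# Hypothetical toy data; column names and values are assumed for illustration.
toy_fruit <- data.frame(
  mass = c(-1.0, 0.9, 0.1),
  color_score = c(-0.9, 1.0, 0.2),
  fruit_name = as.factor(c("lemon", "apple", "orange"))
)

toy_plot_new <- ggplot(toy_fruit, aes(x = mass, y = color_score)) +
  geom_point(aes(colour = fruit_name)) +
  # annotate() adds one point that does not come from the data frame.
  annotate("point", x = 0.5, y = 0.5, colour = "black", size = 3) +
  labs(x = "Mass (scaled)", y = "Color score (scaled)", colour = "Fruit name")

toy_plot_new
```

Because the new point is drawn with a fixed colour rather than a mapped one, it does not appear in the legend and stands out from the labelled classes.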
Question 1.5
Just by looking at the scatterplot, how would you classify this observation using k-nearest neighbours if you use . Explain how you arrived at your answer.
YOUR ANSWER HERE
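The visual reasoning behind k-nearest neighbours can also be written out by hand: compute the distance from the new point to every labelled point, keep the k closest, and take a majority vote. The sketch below does this in base R on made-up data; the data frame, its column names, and the value k = 3 are all assumptions for illustration.

```r
# Hypothetical toy data (not the tutorial data): two predictors and a class.
toy_fruit <- data.frame(
  mass = c(0.4, 0.7, 0.6, -1.0, -0.8, 1.5),
  color_score = c(0.6, 0.4, 0.9, -0.9, -1.2, -0.5),
  fruit_name = c("apple", "apple", "orange", "lemon", "lemon", "orange")
)

new_point <- c(mass = 0.5, color_score = 0.5)
k <- 3  # assumed for illustration only

# Euclidean distance from every labelled point to the new observation.
distances <- sqrt(
  (toy_fruit$mass - new_point[["mass"]])^2 +
    (toy_fruit$color_score - new_point[["color_score"]])^2
)

# Labels of the k closest points, followed by a majority vote.
nearest_labels <- toy_fruit$fruit_name[order(distances)[1:k]]
predicted <- names(which.max(table(nearest_labels)))
predicted
```

In this toy data the three closest points are two apples and one orange, so the vote goes to "apple".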
Now, let's use the caret package in R to classify the fruit name of another new observation with k-nn, one where scaled mass is -0.3 and scaled colour score is -0.4. Use scaled color score and scaled mass as the predictors/explanatory variables and choose .
Question 1.6
Split the fruit_data data frame into the predictors (name it X_train) and the class/outcome (name it Y_train).
Question 1.7
Specify and create the k-nn model object (name it fruit_class).
Question 1.8
Use the fruit_class model to predict the class for the new fruit observation (where scaled mass is -0.3 and scaled colour score is -0.4). Save your prediction to an object named fruit_predicted.
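The three steps above (split, fit, predict) might look roughly like the following sketch with caret. Everything here is illustrative: the toy data, the object names X_toy, Y_toy, toy_model and toy_predicted, the column names, and the choice k = 3 are assumptions, and trainControl(method = "none") is used only to fit one model without resampling on such a tiny data set.

```r
library(caret)

# Hypothetical toy data standing in for fruit_data (values are made up).
toy_fruit <- data.frame(
  mass = c(-1.0, -0.8, -0.6, -0.9, 1.0, 0.8, 1.2, 0.9),
  color_score = c(-0.9, -1.1, -0.7, -0.5, 1.0, 0.7, 0.9, 1.3),
  fruit_name = factor(c("lemon", "lemon", "lemon", "lemon",
                        "apple", "apple", "apple", "apple"))
)

# Predictors as a data frame, outcome as a factor vector.
X_toy <- toy_fruit[, c("mass", "color_score")]
Y_toy <- toy_fruit$fruit_name

# Fit a single k-nn model; k = 3 is assumed, and method = "none" skips resampling.
toy_model <- train(
  x = X_toy,
  y = Y_toy,
  method = "knn",
  tuneGrid = data.frame(k = 3),
  trControl = trainControl(method = "none")
)

# The new observation must use the same column names as the predictors.
new_obs <- data.frame(mass = -0.3, color_score = -0.4)
toy_predicted <- predict(toy_model, new_obs)
toy_predicted
```

In this toy data the three nearest neighbours of the new observation are all lemons, so the model predicts "lemon"; the result on the real data may differ.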
Question 1.9
Revisiting fruit_plot and considering the prediction given by k-nn above, do you think k-nn did a "good" job predicting? Could you have done better? Given what we know thus far in the course, what might we want to do to help with tricky prediction cases such as this?
YOUR ANSWER HERE
Question 2.0
Now do k-nn classification in R again, with the same data set, the same k, and the same new observation (with the additional information given below), but this time let's use all the columns in the dataset (except for fruit_label, fruit_name and fruit_subtype) as predictors.
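One way to keep every column except the three label columns is to filter on column names, as in this base R sketch. The toy data frame and the feature column names (mass, width, height, color_score) are assumptions; only fruit_label, fruit_name and fruit_subtype are named in the text.

```r
# Hypothetical two-row data frame with the same kind of columns (values made up).
toy_fruit <- data.frame(
  fruit_label = c(1, 2),
  fruit_name = c("apple", "lemon"),
  fruit_subtype = c("granny_smith", "unknown"),
  mass = c(0.3, -0.7),
  width = c(0.1, -0.4),
  height = c(0.2, 0.9),
  color_score = c(-0.5, 0.6)
)

# Drop the label columns by name; everything left over is a predictor.
label_cols <- c("fruit_label", "fruit_name", "fruit_subtype")
X_all <- toy_fruit[, !(names(toy_fruit) %in% label_cols)]

names(X_all)
```

Selecting by exclusion means the new observation must supply a value for every remaining column, which is why the question provides additional information.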
Question 2.1
Did your second classification on the same data set with the same k change the prediction? If so, why do you think this happened?
YOUR ANSWER HERE
Question 3.0
X-ray images can be used to analyze and sort seeds. In this data set, we have 7 measurements from x-ray images from 3 varieties of wheat seeds (Kama, Rosa and Canadian). Let's use this data to perform k-nn to classify the wheat variety of a new seed we measure with the given observed measurements (from an x-ray image) shown below. Choose to perform the classification.
The following seven measurements were taken for each wheat kernel:
area A,
perimeter P,
compactness C = 4πA/P^2,
length of kernel,
width of kernel,
asymmetry coefficient,
length of kernel groove.
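To get a feel for the compactness measure in the list above, here is a small worked example. The function name compactness is hypothetical; the formula is the one given in the list. A circle attains the value 1, and less round shapes score lower.

```r
# Compactness as defined in the list above: C = 4 * pi * A / P^2.
compactness <- function(area, perimeter) {
  4 * pi * area / perimeter^2
}

# Sanity check on a circle of radius r, where A = pi * r^2 and P = 2 * pi * r.
r <- 2
circle_c <- compactness(pi * r^2, 2 * pi * r)
circle_c  # 1, up to floating-point error

# A unit square (A = 1, P = 4) is less compact than a circle.
square_c <- compactness(1, 4)
square_c  # pi / 4
```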
The data set is available here: https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt. The last column in the data set is the variety label. The mapping for the numbers to varieties is listed below:
1 == Kama
2 == Rosa
3 == Canadian
Hints:
colnames() can be used to specify the column names of a data frame.
The wheat variety column appears numerical, but you want it to be treated as categorical for this analysis, thus as.factor() might be helpful.
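The two hints can be sketched on a pair of made-up rows shaped like the seeds file (seven measurements followed by a numeric label). The values, the object name toy_seeds, and the column names are all hypothetical. Note that this sketch uses factor() with explicit labels, a close relative of as.factor() that also attaches the variety names from the mapping above.

```r
# Two made-up rows in the same shape as the seeds file (7 measurements + label).
toy_seeds <- read.table(
  text = "15.0 14.5 0.90 5.6 3.3 2.2 5.2 1
          19.0 16.5 0.88 6.3 3.8 3.6 6.1 2",
  header = FALSE
)

# The file has no header row, so assign column names (these names are assumed).
colnames(toy_seeds) <- c(
  "area", "perimeter", "compactness", "kernel_length",
  "kernel_width", "asymmetry", "groove_length", "variety"
)

# Treat the numeric label as categorical and attach the variety names.
toy_seeds$variety <- factor(
  toy_seeds$variety,
  levels = c(1, 2, 3),
  labels = c("Kama", "Rosa", "Canadian")
)

str(toy_seeds)
```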
Question 3.1
In 2-3 sentences, and in your own words, describe your findings from the classification task above.
YOUR ANSWER HERE