GitHub Repository: UBC-DSCI/dsci-100-assets
Path: blob/master/2019-fall/materials/tutorial_07/tutorial_07.ipynb
²⁰⁵¹ views

Kernel: R

Tutorial 7: Classification (Part II)

Handwritten Digit Classification using R

Source: https://media.giphy.com/media/UwrdbvJz1CNck/giphy.gif

MNIST is a computer vision dataset that consists of images of handwritten digits like these:

It also includes labels for each image, telling us which digit it is. For example, the labels for the above images are 5, 0, 4, and 1.

In this tutorial, we’re going to train a classifier to look at images and predict what digits they are. Our goal isn’t to train a really elaborate model that achieves state-of-the-art performance, but rather to dip a toe into using classification with pixelated images. As such, we’re going to keep working with the simple K-nearest neighbour classifier we have been exploring in the last two weeks.

Using image data for classification

As mentioned earlier, every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. Both the training set and test set contain images and their corresponding labels.

Each image is 28 pixels by 28 pixels. We can interpret this as a big matrix of numbers:

We can flatten this matrix into a vector of 28x28 = 784 numbers and give it a class label (here 1 for the number one). It doesn’t matter how we flatten the array, as long as we’re consistent between images. From this perspective, the MNIST images are just a bunch of points in a 784-dimensional vector space, with a very rich structure.

We do this for every image of the digits we have, and we create a data table like the one shown below that we can use for classification. Note, like any other classification problem that we have seen before, we need many observations for each class. This problem is also a bit different from the first classification problem we have encountered (Wisonsin breast cancer data set), in that we have more than two classes (here we have 10 classes, one for each digit from 0 to 9).

This information is taken from: https://tensorflow.rstudio.com/tensorflow/articles/tutorial_mnist_beginners.html

In [ ]:

###
### Run this cell before continuing.
###

library(repr)
library(tidyverse)
library(caret)
source('tests_tutorial_07.R')

# functions needed to work with images
# code below sourced from: https://gist.github.com/daviddalpiaz/ae62ae5ccd0bada4b9acd6dbc9008706
# helper function for visualization
show_digit = function(arr784, col = gray(12:1 / 12), ...) {
  image(matrix(as.matrix(arr784[-785]), nrow = 28)[, 28:1], col = col, ...)
}

# load image files
load_image_file = function(filename) {
  ret = list()
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  n    = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  nrow = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  ncol = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  x = readBin(f, 'integer', n = n * nrow * ncol, size = 1, signed = FALSE)
  close(f)
  data.frame(matrix(x, ncol = nrow * ncol, byrow = TRUE))
}

# load label files
load_label_file = function(filename) {
  f = file(filename, 'rb')
  readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  n = readBin(f, 'integer', n = 1, size = 4, endian = 'big')
  y = readBin(f, 'integer', n = n, size = 1, signed = FALSE)
  close(f)
  y <- data.frame(y)
}

Question 1.0 Multiple Choice:
{points: 1}

How many rows and columns does the array of an image have?

A. 784 columns and 1 row

B. 28 columns and 1 row

C. 18 columns and 18 rows

D. 28 columns and 28 rows

Assign your answer to an object called answer1.0.

In [ ]:

# Assign your answer to an object called: answer1.0
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.0()

Question 1.1 Multiple Choice:
{points: 1}

Once we linearize the array, how many rows represent a number?

A. 28

B. 784

C. 1

D. 18

Assign your answer to an object called answer1.1.

In [ ]:

# Assign your answer to an object called: answer1.1
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [ ]:

test_1.1()

2. Exploring the Data

Before we move on to do the modeling component, it is always required that we take a look at our data and understand the problem and the structure of the data well. We can start this part by loading the images and taking a look at the first rows of the dataset. You can load the data set by running the cell below. The load_image_file function we call to load the images was written for you and is in the first code cell of this notebook (so you have to make sure you run that cell before this one so R knows about this function). load_image_file takes only one argument, the path to the file you want to load.

In [ ]:

# Load images. 
# Run this cell. 
training_data <- load_image_file("data/train-images-idx3-ubyte")
testing_data  <- load_image_file("data/t10k-images-idx3-ubyte")

Look at the first 6 rows of training_data. What do you notice?

In [ ]:

head(training_data)

There are no class labels! This data set has already been split into the X's (which you loaded above) and the labels, which you will load by running the cell below. The load_label_file function we call to load the labels was written for you and is in the first code cell of this notebook (so you have to make sure you run that cell before this one so R knows about this function). load_label_file takes only one argument, the path to the file you want to load.

In [ ]:

# Next, we will load the labels.
# Run this cell. 
#training_labels <- as.factor(load_label_file("data/train-labels-idx1-ubyte"))
training_labels <- load_label_file("data/train-labels-idx1-ubyte")  %>% 
    mutate(y = as.factor(y))
testing_labels <- load_label_file("data/t10k-labels-idx1-ubyte") %>% 
    mutate(y = as.factor(y))

Look at the first 6 labels of training_labels using the head() function.

In [ ]:

# Use this cell to view the first 6 labels.
# Run this cell.
head(training_labels)

Question 2.0
{points: 1}

How many rows does the training data set have? Note, each row is a different number in the postal code system.

Use nrow(). Note, the testing data set should have fewer rows than the training data set.

Assign your answer to an object called number_of_rows.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
number_of_rows

In [ ]:

test_2.0()

Question 2.1
{points: 1}

For mutli-class classification with k-nn it is important for the classes to have about the same number of observations in each class. For example, if 90% of our training set observationas were labeled as 2's, then k-nn classification predict 2 almost every time and we would get an accuracy score of 90% even though our classifier wasn't really doing a great job.

Use the group_by and summarize function to get the counts for each group and see if the data set is balanced across the classes (has roughly equal numbers of observation for each class). Name the output counts. counts should be a data frame with 2 columns, y and n (the column n should have the counts for how many observations there were for each class group).

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer
counts

In [ ]:

test_2.1()

Question 2.2
{points: 3}

Are the classes roughly balanced?

YOUR ANSWER HERE

To view an image in the notebook, you can use the show_digit function (we gave you the code for this function in the first code cell in the notebook, All you have to do to use it is run the cell below). The show_digit function takes two arguments:

the row number of the observation who's value you would like to see
an empty value (i.e., nothing) to say you would like all of the column values for that row

The code we provide below will show you the image for the observation in the 200th row from the training data set.

In [ ]:

# Run this cell to get the images for the 200th row from the training data set.
options(repr.plot.height = 4, repr.plot.width = 3.3)
show_digit(slice(training_data, 200))

Question 2.3
{points: 3}

Show the image for row 102.

In [ ]:

# your code here
fail() # No Answer - remove if you provide an answer

If you are unsure as to what number the plot is depicting (because the handwriting is messy) you can use slice to get the label from the training_labels:

In [ ]:

# run this cell to get the training label for the 200th row
training_labels %>% 
    slice(200)

Question 2.4
{points: 1}

What is the class label for row 102?

Assign your answer to an object called label_102.

In [ ]:

# Assign your answer to an object called: label_102
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer
label_102

In [ ]:

test_2.4()

3. Splitting the Data

Question 3.0
{points: 3}

Since this is such a large data set, we will only use a subset of the data set, specifically 1,000 rows of training_data. There are 10 classes in the data set, so we group_by the class, y, and then use sample_n to get a random sample of 100 of the observations for each class. To ensure the X's and Y's match up when we do these, we use bind_cols to combine the training_data and training_labels data frames. We provide the code for how to do this in the training set, you will have to do this yourself for the test set (hint - use what we did for the training set as a guide). For the test set, sample only 50 from each class.

Additionally, after subsetting the data (to get a smaller sample of the data) split the training data into small_X_train and small_Y_train. Do the same for the test set.

At the end of this question you should have the following 6 data frames:

small_training_data
small_X_train
small_Y_train
small_testing_data
small_X_test
small_Y_test

hint - remember to make the small_X_ objects into data.frame's and the small_Y_ objects into vectors of type factor.

In [ ]:

# Set the seed. Don't remove this!
set.seed(1234) 

small_training_data <- bind_cols(training_data, training_labels) %>% 
    group_by(y) %>% 
    sample_n(100)  %>% 
    ungroup()
# your code here
fail() # No Answer - remove if you provide an answer

Question 3.1
{points: 3}

In the previous question, we split the data into two datasets, one for training purposes and one for testing purposes. Do you think this is a good idea? If yes, why do we do this? If no, explain why this is not a good idea.

YOUR ANSWER HERE

Which $k$ should we use?

As you learned from the worksheet, we can use cross-validation on the training data set to select which $k$ is the most optimal for our data set for k-nn classification.

Question 3.2
{points: 3}

To get all the marks in this question, you will have to:

set a seed to make your analysis reproducible
Apply 3-fold cross-validation to our small training data
- Test the following $k$ 's: 1, 3, 5, 7, 9, 11
Plot the $k$ vs the accuracy
- Assign this plot to an object called cross_val_plot

note - this will take 5-15 minutes to run... so we recommend you split the classifier training and cross validation in one cell and plotting into another cell (so you can tweak and re-run the plot code without re-training the classifier each time. Another hint is to make your training data very small, get the code working and then re-run the code with your training data the size you actually want it to be.

In [ ]:

# Set the seed. Don't remove this!
set.seed(1234) 

# your code here
fail() # No Answer - remove if you provide an answer

Question 3.3
{points: 3}

Based on the plot from Question 8, which $k$ would you choose and how can you be sure about your decision? In your answer you should reference why we do cross-validation.

YOUR ANSWER HERE

4. Let's build our model

Question 4.0
{points: 3}

Now that we have explored our data, separated the data into training and testing sets and applied cross-validation to choose the best $k$ , we can build our final model.

In [ ]:

# Set the seed. Don't remove this!
set.seed(1234) 

# your code here
fail() # No Answer - remove if you provide an answer

Question 4.1
{points: 3}

Use your final model to predict on the test dataset and report the accuracy of this prediction.

In [ ]:

# Set the seed. Don't remove this!
set.seed(1234) 

# your code here
fail() # No Answer - remove if you provide an answer

Question 4.2
{points: 3}

Print out 3 images and true labels from the test set that were predicted correctly. Use the show_digit function we gave you above to print out the images.

In [ ]:

# Set the seed. Don't remove this!
set.seed(1234) 

# your code here
fail() # No Answer - remove if you provide an answer

Question 4.3
{points: 3}

Print out 3 images and true labels from the test set that were NOT predicted correctly. For the incorrectly labelled images also print out the predicted labels. Use the show_digit function we gave you above to print out the images.

In [ ]:

# Set the seed. Don't remove this!
set.seed(1234) 

# your code here
fail() # No Answer - remove if you provide an answer

Question 4.4
{points: 3}

Do you notice any differences between the images that were predicted correctly versus the images that were not?

YOUR ANSWER HERE

Question 4.5
{points: 3}

What does this accuracy mean? Is it good enough that you would use this model for the Canada Post? Can you imagine a way we might improve our classifier's accuracy?

YOUR ANSWER HERE

Tutorial 7: Classification (Part II)

Handwritten Digit Classification using R

Using image data for classification

2. Exploring the Data

3. Splitting the Data

Which $k$ should we use?

4. Let's build our model

Product

Resources

Company

Tutorial 7: Classification (Part II)

Handwritten Digit Classification using R

Using image data for classification

2. Exploring the Data

3. Splitting the Data

Which kkk should we use?

4. Let's build our model

Which $k$ should we use?