INFO204 Assignment 1 - Handwritten Digits Classification
10 Marks
Group Members (Fill in details)
| Name | StudentID |
|---|---|
| Jack Royal | 6974461 |
| Ben Simmers | 7550991 |
In this assignment, we use Scikit Learn's handwritten "digits" dataset to practise a number of skills, including data manipulation, visualization using PCA, classification, and performance evaluation using ROC and cross validation.
You can form a group with a fellow student to do this assignment together. Submit your completed notebook through Blackboard by **11:59pm Monday 27 August.** Submit one notebook only per group.
Here are some useful scikit-learn resources for your reference:
## Part 1. Data Manipulation and Visualization
For the first part of the assignment, complete the following tasks [3 marks]:
1. Import Sklearn's datasets utilities to load in the "digits" dataset. Use "X" to store the digit arrays and "y" to store the class labels.
2. Report the dataset's information:
   - names: attribute names, class names;
   - number of instances: total, per class;
   - images: display an instance for each digit class as an image.

   As an example, to display X[0] as a digit image, try
   `plt.imshow(X[0].reshape(8,8).astype('uint8'), cmap=plt.cm.gray)`
3. Use PCA to extract the first two principal components and visualize the transformed dataset, coloured by class label. Comment on the separability of the classes.
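The Part 1 tasks could be approached along the following lines. This is only a sketch, not a model answer: the helper name `plot_digits_overview`, the `tab10` colour map, and the figure sizes are illustrative choices, and `feature_names` is read with `.get` on the assumption that older scikit-learn releases may not provide it.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target  # X: flattened 8x8 images, y: digit labels

# Names: attribute (pixel) names and class names.
feature_names = digits.get("feature_names")  # assumed absent in older releases
print("first attributes:", feature_names[:3] if feature_names else "not available")
print("classes:", list(digits.target_names))

# Number of instances: total and per class.
class_counts = np.bincount(y)
print("total instances:", X.shape[0])
for digit, count in enumerate(class_counts):
    print(f"class {digit}: {count}")

# Project the 64-dimensional data onto the first two principal components.
X_2d = PCA(n_components=2).fit_transform(X)


def plot_digits_overview(X, y, X_2d):
    """Show one image per class, then the 2-D PCA scatter coloured by label."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs on its own

    fig, axes = plt.subplots(1, 10, figsize=(12, 1.5))
    for digit, ax in enumerate(axes):
        first = np.flatnonzero(y == digit)[0]  # index of the first instance of this class
        ax.imshow(X[first].reshape(8, 8), cmap=plt.cm.gray)
        ax.set_title(str(digit))
        ax.axis("off")

    plt.figure(figsize=(7, 6))
    points = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=10)
    plt.colorbar(points, label="digit class")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()
```

Calling `plot_digits_overview(X, y, X_2d)` in a notebook cell draws both figures; how much the class clusters overlap in the scatter plot is what the separability comment should address.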
## Part 2. Confusion matrix
Now, randomly split the "digits" dataset into a training set (70%) and a testing set (30%), and employ the k-nearest neighbour (kNN) classifier and the support vector classifier (SVC) to classify the dataset. For each classifier, report the corresponding confusion matrix and comment on the result. [2 marks]
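A minimal sketch of this workflow is shown below. The hyperparameters (`n_neighbors=5`, `gamma=0.001`) and the fixed `random_state=0` are assumptions made for reproducibility of the illustration, not values prescribed by the assignment.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Random 70% / 30% split; random_state is fixed only so the run is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Illustrative settings; both are candidates for tuning later.
classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(kernel="rbf", gamma=0.001),
}

confusion = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Rows are true classes, columns are predicted classes.
    confusion[name] = confusion_matrix(y_test, y_pred)
    accuracy = confusion[name].trace() / confusion[name].sum()
    print(f"{name}: accuracy = {accuracy:.3f}")
    print(confusion[name])
```

Off-diagonal entries show which digits are mistaken for which, so they are the natural starting point for the comment on each classifier.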
## Part 3. ROC and AUC
Let us now focus on Class "8" for an obvious reason. Change the multi-class classification problem into a binary, 8 vs non-8, classification problem. Using 10-fold cross-validation, calculate the average ROC curve and AUC value for the kNN and SVC classifiers. Tune the classifier parameters and report the best outcome.
To be exact, follow these steps:
1. If necessary, convert our data arrays X, y for the new problem. [1 mark]
2. Calculate ROC and AUC for the binary 8 vs non-8 classification using a random split. [2 marks]
3. Employ 10-fold CV to tune the classifiers and generate the best average ROC and AUC results. [2 marks]
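The three steps above can be sketched as follows. Everything beyond the steps themselves is an assumption for illustration: the helper `positive_scores`, the parameter grids, the stratified and shuffled folds, and averaging the ROC by interpolating each fold's curve onto a shared false-positive-rate grid (one common way to average ROC curves, not necessarily the intended one).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Step 1: relabel for the binary problem (1 = digit 8, 0 = any other digit).
y8 = (y == 8).astype(int)


def positive_scores(clf, X):
    """Continuous score for the positive class (hypothetical helper)."""
    if hasattr(clf, "decision_function"):
        return clf.decision_function(X)
    return clf.predict_proba(X)[:, 1]


# Step 2: ROC and AUC on a single random 70/30 split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y8, test_size=0.3, random_state=0)
split_auc = {}
for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVC", SVC(gamma=0.001))]:
    clf.fit(X_tr, y_tr)
    scores = positive_scores(clf, X_te)
    fpr, tpr, _ = roc_curve(y_te, scores)  # points of the ROC curve for plotting
    split_auc[name] = roc_auc_score(y_te, scores)
    print(f"{name}: AUC on random split = {split_auc[name]:.4f}")

# Step 3: tune with 10-fold CV, then average the ROC over the same folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
grids = {  # illustrative grids only
    "kNN": (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9, 15]}),
    "SVC": (SVC(), {"C": [0.1, 1, 10], "gamma": [1e-4, 1e-3, 1e-2]}),
}
mean_fpr = np.linspace(0, 1, 101)  # shared grid for averaging the curves
cv_results = {}
for name, (estimator, grid) in grids.items():
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=cv)
    search.fit(X, y8)
    fold_tprs = []
    for train_idx, test_idx in cv.split(X, y8):
        clf = clone(search.best_estimator_).fit(X[train_idx], y8[train_idx])
        fpr, tpr, _ = roc_curve(y8[test_idx], positive_scores(clf, X[test_idx]))
        fold_tprs.append(np.interp(mean_fpr, fpr, tpr))
    mean_tpr = np.mean(fold_tprs, axis=0)
    mean_tpr[0] = 0.0  # the averaged curve should start at the origin
    cv_results[name] = (search.best_params_, search.best_score_, mean_tpr)
    print(f"{name}: best params = {search.best_params_}, "
          f"mean CV AUC = {search.best_score_:.4f}")
```

Plotting `mean_fpr` against each stored `mean_tpr` gives the averaged ROC curves. Note that the mean AUC reported by the search comes from the same folds used to pick the parameters, so it is likely to be slightly optimistic.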