Jack Royal & Ben Simmers

¹⁶⁷ views

Kernel: Python 2 (SageMath)

INFO204 Assignment 1 - Handwritten Digits Classification

10 Marks

Group Members (Fill in details)

Name	StudentID
Jack Royal	6974461
Ben Simmers	7550991

In this assignment, we use Scikit Learn's handwritten "digits" dataset to practise a number of skills, including data manipulation, visualization using PCA, classification, and performance evaluation using ROC and cross validation.

You can form a group with a fellow student to do this assignment together. Submit your completed notebook through Blackboard by **11:59pm Monday 27 August. ** Submit one notebook only per group.

Here are some useful scikit.learn resources for your reference:

## Part 1. Data Manipulation and Visualization For the first part of the assignment, complete the following tasks [3 marks]:

Import Sklearn's datasets utilities to load in the "digits" dataset. Use "X" to store digit arrays, "y" class labels.
Report the dataset's information:
- names: attribute names, class names;
- number of instances: total, per class;
- images: display an instance for each digit class as an image

As an example, to display X[0] as a digit image, try

plt.imshow(X[0].reshape(8,8).astype('uint8'), cmap=plt.cm.gray)

Use PCA to extract the first two principal components and visualize the transformed dataset using class labels. Comment on the seperability of the classes.

In [1]:

# import all necessary packages
import collections, numpy
import numpy as np
import pylab as plt
from sklearn import datasets, utils
from pprint import pprint
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
# load in the digits datasets 
ds = datasets.load_digits()
X = ds.data
y = ds.target
# other tasks...
total = y.shape
print "Total Number of Instances: " , total[0]
instances = collections.Counter(y)
print "Instances Per Class: ", instances.items()

print("Attribute Names: {}".format(ds.keys()))

print("Class Names: {}".format(ds['target_names']))

images_and_labels = list(zip(ds.images, ds.target))
for index, (image, label) in enumerate(images_and_labels[:10]):
    plt.subplot(2, 5, index + 1)
    plt.axis('off')
    plt.imshow(image.reshape(8,8).astype('uint8'), cmap=plt.cm.gray)
    plt.title('Class %i ' % label)
print
print "We found after seperating the data into a training and testing set it was made up of classes that formed numbers in the range of 0 - 9, with there being ten in total. They were made up of similar data and we need to go through it, manipulating the data, to make these images more clear."

Out[1]:

Total Number of Instances:  1797
Instances Per Class:  [(0, 178), (1, 182), (2, 177), (3, 183), (4, 181), (5, 182), (6, 181), (7, 179), (8, 174), (9, 180)]
Attribute Names: ['images', 'data', 'target_names', 'DESCR', 'target']
Class Names: [0 1 2 3 4 5 6 7 8 9]

We found after seperating the data into a training and testing set it was made up of classes that formed numbers in the range of 0 - 9, with there being ten in total. They were made up of similar data and we need to go through it, manipulating the data, to make these images more clear.

In [2]:

#3. Use PCA to extract the first two principal components and visualize the transformed dataset using class labels. Comment on the seperability of the classes.
def pcanalysis(n):
    pca = PCA(n_components=n)
    pca = pca.fit(X)
    pc = pca.transform(X)
    plt.scatter(pc[:,0],pc[:,1], c=y )
    plt.colorbar()
    plt.show()
 

pcanalysis(2)

print "There is clear seperability between some of the classes as you would expect for example classes 1 and 0 look very different when written so you wouldnt expect many data points to overlap. Some classes are quite similar as shown by the heavy cluster in the left side of the plot 9 is quite scattered and is similar to 6 especially which you would expect when looking at the numbers drawn. However overall many of the classes overlap with a number of data points indicating a lack of seperability overall although some clear clusters are seen"

Out[2]:

There is clear seperability between some of the classes as you would expect for example classes 1 and 0 look very different when written so you wouldnt expect many data points to overlap. Some classes are quite similar as shown by the heavy cluster in the left side of the plot 9 is quite scattered and is similar to 6 especially which you would expect when looking at the numbers drawn. However overall many of the classes overlap with a number of data points indicating a lack of seperability overall although some clear clusters are seen

Part 2. Confusion matrix

Now, randomly split the "digits" dataset into a training set (70%) and testing set (30%), and employ the $k$ -nearest neighbour classifier and the support vector classifier (SVC) to classify the dataset. For each classifier, report the corresponding confusion matrix and comment on the result. [2 marks]

In [3]:

# code for Part 2
from sklearn.neighbors import KNeighborsClassifier
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)

knnclf = KNeighborsClassifier(n_neighbors=5).fit(X_tr,y_tr)
knnpr = knnclf.predict(X_te)
print "KNN"
print confusion_matrix(y_te,knnpr)

svcclf = SVC(kernel='rbf', probability=True).fit(X_tr,y_tr)
svcpredict = svcclf.predict(X_te)
print "SVC"
print confusion_matrix(y_te,svcpredict)
print
print "Here we see the two matrixs, the first classified by k-NN and the latter by SVC. The values in the SVC classifier tend to be approximately half that of the k-NN."

Out[3]:

KNN
[[58  0  0  0  0  0  0  0  0  0]
 [ 0 58  0  0  0  0  0  0  0  0]
 [ 0  0 43  0  0  0  0  0  0  0]
 [ 0  0  0 49  0  0  0  1  1  0]
 [ 0  0  0  0 52  0  0  0  0  0]
 [ 0  0  0  0  0 63  0  0  0  0]
 [ 0  0  0  0  0  0 57  0  0  0]
 [ 0  0  0  0  0  0  0 48  0  0]
 [ 0  0  0  1  0  0  0  0 53  0]
 [ 0  0  0  1  1  1  0  0  0 53]]
SVC
[[25  0 33  0  0  0  0  0  0  0]
 [ 0 29 29  0  0  0  0  0  0  0]
 [ 0  0 43  0  0  0  0  0  0  0]
 [ 0  0  9 42  0  0  0  0  0  0]
 [ 0  0 24  0 28  0  0  0  0  0]
 [ 0  0 55  0  0  8  0  0  0  0]
 [ 0  0 21  0  0  0 36  0  0  0]
 [ 0  0 16  0  0  0  0 32  0  0]
 [ 0  0 50  0  0  0  0  0  4  0]
 [ 0  0 39  0  0  0  0  0  0 17]]

Here we see the two matrixs, the first classified by k-NN and the latter by SVC. The values in the SVC classifier tend to be approximately half that of the k-NN.

Part 3. ROC and AUC

Let us now focus on Class "8" for an obvious reason. Change the multi-class classification problem into a binary, 8 vs non-8, classification problem. Use a 10-fold cross validation process, calculate the average ROC and AUC values for the kNN and SVC classifiers. Tune the classifier parameters and report the best outcome.

To be exact, follow these steps:

If necessary, convert our data arrays X,y for the new problem. [1 mark]
Calculate ROC and AUC for the binary 8 vs non-8 classification using a random split. [2 marks]
Employ 10-fold CV to tune the classifiers and generate the best average ROC and AUC results. [2 marks]

In [0]:

In [4]:

# code for Part 3a - binary classification conversion
for i in range(0,1797):
    if(y[i] == 8):
        y[i] = 1
    else:
        y[i] = 0

In [5]:

# code for Part 3b - ROC and AUC 
from sklearn.utils import shuffle
from sklearn.metrics import roc_curve, auc

n_samples, n_features = X.shape
X, y = shuffle(X, y)
tr_rows=range(n_samples/2)
te_rows=range(n_samples/2,n_samples)

clf = SVC(kernel='rbf', probability=True)
probas_ = clf.fit(X[tr_rows], y[tr_rows]).predict_proba(X[te_rows])
# Compute ROC curve and area the curve for the "1" class

fpr, tpr, thresholds = roc_curve(y[te_rows], probas_[:, 1])
roc_auc = auc(fpr, tpr)     # calculate AUC

plt.plot(fpr, tpr, color='darkorange',
         lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.legend(loc="lower right");   # ';' used to suppress text output
plt.show()

Out[5]:

In [6]:

#Use code from the ROC with cross validation
#generate knn of different classifiers, nest loop using code from roc with cross validation
#generate svc change the kernel
#appeand mean to array and plot the best one by finding max
#KNN
from sklearn.model_selection import KFold
import numpy as np
from scipy import interp
nneighbors = [1,3,5,8,10,15,20,25,30]
kfold = KFold(n_splits = 10)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
for k in nneighbors:
    knnclf = KNeighborsClassifier(n_neighbors=k)
    
    for tr,te in kfold.split(X):
        probas_ = knnclf.fit(X[tr], y[tr]).predict_proba(X[te])
        fpr, tpr, thresholds = roc_curve(y[te], probas_[:, 1])
        tprs.append(interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)

        
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)

print mean_auc

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)
plt.title('Best Average ROC/AUC KNN Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('FPR')
plt.legend(loc="lower right");
plt.ylabel('TPR')

Out[6]:

0.990246081315

Text(0,0.5,u'TPR')

In [7]:

#Use code from the ROC with cross validation
#generate knn of different classifiers, nest loop using code from roc with cross validation
#generate svc change the kernel
#appeand mean to array and plot the best one by finding max
#KNN
from sklearn.model_selection import KFold
import numpy as np
from scipy import interp
kernel = ['poly','rbf','sigmoid','precomputed']
kfold = KFold(n_splits = 10)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
for k in kernel:
    clf = SVC(kernel=k, probability=True)
    
    for tr,te in kfold.split(X):
        probas_ = knnclf.fit(X[tr], y[tr]).predict_proba(X[te])
        fpr, tpr, thresholds = roc_curve(y[te], probas_[:, 1])
        tprs.append(interp(mean_fpr, fpr, tpr))
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)

        
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)

print mean_auc

plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
         lw=2, alpha=.8)
plt.title('Best Average ROC/AUC SVC Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('FPR')
plt.legend(loc="lower right");
plt.ylabel('TPR')

Out[7]:

0.993247857683

Text(0,0.5,u'TPR')

In [0]:

In [0]:

INFO204 Assignment 1 - Handwritten Digits Classification

10 Marks

Part 2. Confusion matrix

Part 3. ROC and AUC

Product

Resources

Company