Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download

Jack Royal & Ben Simmers

167 views
Kernel: Python 2 (SageMath)

INFO204 Assignment 1 - Handwritten Digits Classification

10 Marks

Group Members (Fill in details)

NameStudentID
Jack Royal6974461
Ben Simmers7550991

In this assignment, we use Scikit Learn's handwritten "digits" dataset to practise a number of skills, including data manipulation, visualization using PCA, classification, and performance evaluation using ROC and cross validation.

You can form a group with a fellow student to do this assignment together. Submit your completed notebook through Blackboard by **11:59pm Monday 27 August. ** Submit one notebook only per group.

Here are some useful scikit.learn resources for your reference:

## Part 1. Data Manipulation and Visualization For the first part of the assignment, complete the following tasks [3 marks]:

  1. Import Sklearn's datasets utilities to load in the "digits" dataset. Use "X" to store digit arrays, "y" class labels.

  2. Report the dataset's information:

    • names: attribute names, class names;

    • number of instances: total, per class;

    • images: display an instance for each digit class as an image

As an example, to display X[0] as a digit image, try

plt.imshow(X[0].reshape(8,8).astype('uint8'), cmap=plt.cm.gray)
  1. Use PCA to extract the first two principal components and visualize the transformed dataset using class labels. Comment on the seperability of the classes.

# import all necessary packages import collections, numpy import numpy as np import pylab as plt from sklearn import datasets, utils from pprint import pprint from sklearn.model_selection import train_test_split from sklearn.svm import SVC from sklearn.metrics import confusion_matrix from sklearn.decomposition import PCA # load in the digits datasets ds = datasets.load_digits() X = ds.data y = ds.target # other tasks... total = y.shape print "Total Number of Instances: " , total[0] instances = collections.Counter(y) print "Instances Per Class: ", instances.items() print("Attribute Names: {}".format(ds.keys())) print("Class Names: {}".format(ds['target_names'])) images_and_labels = list(zip(ds.images, ds.target)) for index, (image, label) in enumerate(images_and_labels[:10]): plt.subplot(2, 5, index + 1) plt.axis('off') plt.imshow(image.reshape(8,8).astype('uint8'), cmap=plt.cm.gray) plt.title('Class %i ' % label) print print "We found after seperating the data into a training and testing set it was made up of classes that formed numbers in the range of 0 - 9, with there being ten in total. They were made up of similar data and we need to go through it, manipulating the data, to make these images more clear."
Total Number of Instances: 1797 Instances Per Class: [(0, 178), (1, 182), (2, 177), (3, 183), (4, 181), (5, 182), (6, 181), (7, 179), (8, 174), (9, 180)] Attribute Names: ['images', 'data', 'target_names', 'DESCR', 'target'] Class Names: [0 1 2 3 4 5 6 7 8 9] We found after seperating the data into a training and testing set it was made up of classes that formed numbers in the range of 0 - 9, with there being ten in total. They were made up of similar data and we need to go through it, manipulating the data, to make these images more clear.
Image in a Jupyter notebook
#3. Use PCA to extract the first two principal components and visualize the transformed dataset using class labels. Comment on the seperability of the classes. def pcanalysis(n): pca = PCA(n_components=n) pca = pca.fit(X) pc = pca.transform(X) plt.scatter(pc[:,0],pc[:,1], c=y ) plt.colorbar() plt.show() pcanalysis(2) print "There is clear seperability between some of the classes as you would expect for example classes 1 and 0 look very different when written so you wouldnt expect many data points to overlap. Some classes are quite similar as shown by the heavy cluster in the left side of the plot 9 is quite scattered and is similar to 6 especially which you would expect when looking at the numbers drawn. However overall many of the classes overlap with a number of data points indicating a lack of seperability overall although some clear clusters are seen"
Image in a Jupyter notebook
There is clear seperability between some of the classes as you would expect for example classes 1 and 0 look very different when written so you wouldnt expect many data points to overlap. Some classes are quite similar as shown by the heavy cluster in the left side of the plot 9 is quite scattered and is similar to 6 especially which you would expect when looking at the numbers drawn. However overall many of the classes overlap with a number of data points indicating a lack of seperability overall although some clear clusters are seen

Part 2. Confusion matrix

Now, randomly split the "digits" dataset into a training set (70%) and testing set (30%), and employ the kk-nearest neighbour classifier and the support vector classifier (SVC) to classify the dataset. For each classifier, report the corresponding confusion matrix and comment on the result. [2 marks]

# code for Part 2 from sklearn.neighbors import KNeighborsClassifier X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3) knnclf = KNeighborsClassifier(n_neighbors=5).fit(X_tr,y_tr) knnpr = knnclf.predict(X_te) print "KNN" print confusion_matrix(y_te,knnpr) svcclf = SVC(kernel='rbf', probability=True).fit(X_tr,y_tr) svcpredict = svcclf.predict(X_te) print "SVC" print confusion_matrix(y_te,svcpredict) print print "Here we see the two matrixs, the first classified by k-NN and the latter by SVC. The values in the SVC classifier tend to be approximately half that of the k-NN."
KNN [[58 0 0 0 0 0 0 0 0 0] [ 0 58 0 0 0 0 0 0 0 0] [ 0 0 43 0 0 0 0 0 0 0] [ 0 0 0 49 0 0 0 1 1 0] [ 0 0 0 0 52 0 0 0 0 0] [ 0 0 0 0 0 63 0 0 0 0] [ 0 0 0 0 0 0 57 0 0 0] [ 0 0 0 0 0 0 0 48 0 0] [ 0 0 0 1 0 0 0 0 53 0] [ 0 0 0 1 1 1 0 0 0 53]] SVC [[25 0 33 0 0 0 0 0 0 0] [ 0 29 29 0 0 0 0 0 0 0] [ 0 0 43 0 0 0 0 0 0 0] [ 0 0 9 42 0 0 0 0 0 0] [ 0 0 24 0 28 0 0 0 0 0] [ 0 0 55 0 0 8 0 0 0 0] [ 0 0 21 0 0 0 36 0 0 0] [ 0 0 16 0 0 0 0 32 0 0] [ 0 0 50 0 0 0 0 0 4 0] [ 0 0 39 0 0 0 0 0 0 17]] Here we see the two matrixs, the first classified by k-NN and the latter by SVC. The values in the SVC classifier tend to be approximately half that of the k-NN.

Part 3. ROC and AUC

Let us now focus on Class "8" for an obvious reason. Change the multi-class classification problem into a binary, 8 vs non-8, classification problem. Use a 10-fold cross validation process, calculate the average ROC and AUC values for the kNN and SVC classifiers. Tune the classifier parameters and report the best outcome.

To be exact, follow these steps:

  • If necessary, convert our data arrays X,y for the new problem. [1 mark]

  • Calculate ROC and AUC for the binary 8 vs non-8 classification using a random split. [2 marks]

  • Employ 10-fold CV to tune the classifiers and generate the best average ROC and AUC results. [2 marks]

# code for Part 3a - binary classification conversion for i in range(0,1797): if(y[i] == 8): y[i] = 1 else: y[i] = 0
# code for Part 3b - ROC and AUC from sklearn.utils import shuffle from sklearn.metrics import roc_curve, auc n_samples, n_features = X.shape X, y = shuffle(X, y) tr_rows=range(n_samples/2) te_rows=range(n_samples/2,n_samples) clf = SVC(kernel='rbf', probability=True) probas_ = clf.fit(X[tr_rows], y[tr_rows]).predict_proba(X[te_rows]) # Compute ROC curve and area the curve for the "1" class fpr, tpr, thresholds = roc_curve(y[te_rows], probas_[:, 1]) roc_auc = auc(fpr, tpr) # calculate AUC plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc) plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('FPR') plt.ylabel('TPR') plt.title('ROC curve') plt.legend(loc="lower right"); # ';' used to suppress text output plt.show()
Image in a Jupyter notebook
#Use code from the ROC with cross validation #generate knn of different classifiers, nest loop using code from roc with cross validation #generate svc change the kernel #appeand mean to array and plot the best one by finding max #KNN from sklearn.model_selection import KFold import numpy as np from scipy import interp nneighbors = [1,3,5,8,10,15,20,25,30] kfold = KFold(n_splits = 10) tprs = [] aucs = [] mean_fpr = np.linspace(0, 1, 100) for k in nneighbors: knnclf = KNeighborsClassifier(n_neighbors=k) for tr,te in kfold.split(X): probas_ = knnclf.fit(X[tr], y[tr]).predict_proba(X[te]) fpr, tpr, thresholds = roc_curve(y[te], probas_[:, 1]) tprs.append(interp(mean_fpr, fpr, tpr)) tprs[-1][0] = 0.0 roc_auc = auc(fpr, tpr) aucs.append(roc_auc) mean_tpr = np.mean(tprs, axis=0) mean_tpr[-1] = 1.0 mean_auc = auc(mean_fpr, mean_tpr) std_auc = np.std(aucs) print mean_auc plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') plt.plot(mean_fpr, mean_tpr, color='b', label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc), lw=2, alpha=.8) plt.title('Best Average ROC/AUC KNN Classifier') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('FPR') plt.legend(loc="lower right"); plt.ylabel('TPR')
0.990246081315
Text(0,0.5,u'TPR')
Image in a Jupyter notebook
#Use code from the ROC with cross validation #generate knn of different classifiers, nest loop using code from roc with cross validation #generate svc change the kernel #appeand mean to array and plot the best one by finding max #KNN from sklearn.model_selection import KFold import numpy as np from scipy import interp kernel = ['poly','rbf','sigmoid','precomputed'] kfold = KFold(n_splits = 10) tprs = [] aucs = [] mean_fpr = np.linspace(0, 1, 100) for k in kernel: clf = SVC(kernel=k, probability=True) for tr,te in kfold.split(X): probas_ = knnclf.fit(X[tr], y[tr]).predict_proba(X[te]) fpr, tpr, thresholds = roc_curve(y[te], probas_[:, 1]) tprs.append(interp(mean_fpr, fpr, tpr)) tprs[-1][0] = 0.0 roc_auc = auc(fpr, tpr) aucs.append(roc_auc) mean_tpr = np.mean(tprs, axis=0) mean_tpr[-1] = 1.0 mean_auc = auc(mean_fpr, mean_tpr) std_auc = np.std(aucs) print mean_auc plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--') plt.plot(mean_fpr, mean_tpr, color='b', label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc), lw=2, alpha=.8) plt.title('Best Average ROC/AUC SVC Classifier') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('FPR') plt.legend(loc="lower right"); plt.ylabel('TPR')
0.993247857683
Text(0,0.5,u'TPR')
Image in a Jupyter notebook