Project 1: k-Nearest Neighbors
So many points,
some near, some far,
who are my true neighbors?
Introduction
In this project, you will build a k-nearest neighbor classifier.
How to submit: You can submit your code using the red Submit button above. This button will send any code below that is surrounded by #<GRADED> and #</GRADED> tags to the autograder, which will then run several tests over your code. By clicking on the Details dropdown next to the Submit button, you will be able to view your submission report once the autograder has finished running. This submission report contains a summary of the tests you have passed or failed, as well as a log of any errors generated by your code when we ran it.
Note that this may take a while depending on how long your code takes to run! Once your code is submitted you may navigate away from the page as you desire -- the most recent submission report will always be available from the Details menu.
Evaluation: Your code will be autograded for technical correctness and--on some assignments--speed. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. Furthermore, any code not surrounded by #<GRADED> and #</GRADED> tags will not be run by the autograder. However, the correctness of your implementation -- not the autograder's output -- will be the final judge of your score. If necessary, we will review and grade assignments individually to ensure that you receive due credit for your work.
Academic Integrity: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. If you do, we will pursue the strongest consequences available to us.
Getting Help: You are not alone! If you find yourself stuck on something, contact the course staff for help. Office hours, section, and the Piazza are there for your support; please use them. If you can't make our office hours, let us know and we will schedule more. We want these projects to be rewarding and instructional, not frustrating and demoralizing. But, we don't know when or how to help unless you ask.
Libraries: Before we get started, we need to install a few libraries. You can do this by executing the following code.
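The exact setup code is not shown in this handout; as a minimal sketch, assuming the standard scientific Python stack (numpy, scipy, and matplotlib are assumptions, not confirmed by the text), it might look like the following. Later code sketches in this document assume these imports are in place.

# install the packages first if needed, e.g. pip install numpy scipy matplotlib
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt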
k-Nearest Neighbors implementation in Python
Our first goal is to build a k-nearest neighbor classifier for handwritten digit classification and face recognition.
Data: We first obtain some data for testing your code. The data resides in the files faces.mat and digits.mat, which hold the datasets for the experiments that follow. First, let us define a function that loads a data set.
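A minimal sketch of such a loader, assuming the .mat files store the four arrays under the variable names xTr, yTr, xTe, and yTe (the names and the one-example-per-row layout are assumptions based on the surrounding text):

def loaddata(filename):
    # read the .mat file into a dict of numpy arrays
    data = loadmat(filename)
    xTr = data["xTr"]             # training vectors, one example per row (layout assumed)
    yTr = data["yTr"].flatten()   # training labels as a 1-D array
    xTe = data["xTe"]             # testing vectors
    yTe = data["yTe"].flatten()   # testing labels
    return xTr, yTr, xTe, yTe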
Here, xTr are the training vectors with labels yTr and xTe are the testing vectors with labels yTe. As a reminder, to predict the label or class of an image in xTe, we will look for the k-nearest neighbors in xTr and predict a label based on their labels in yTr. For evaluation, we will compare these labels against the true labels provided in yTe.
Visualizing data
Let us take a look at our data. The following script will take the first 10 training images from the face data set and visualize them.
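A sketch of what that script might look like; the 38x31 image shape is a placeholder assumption, so adjust it to the actual dimensions of the face images:

xTr, yTr, xTe, yTe = loaddata("faces.mat")
plt.figure(figsize=(10, 2))
for i in range(10):
    plt.subplot(1, 10, i + 1)
    # each row is assumed to be a flattened grayscale image
    plt.imshow(xTr[i].reshape(38, 31), cmap="gray")
    plt.axis("off")
plt.show()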
Implementation
The following questions will ask you to finish these functions in a pre-defined order.
(a) Implement the functions innerproduct and l2distance. You may reuse your own code from the previous project.
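One possible vectorized sketch, based on the identity ||x - z||^2 = ||x||^2 - 2<x,z> + ||z||^2 (the function bodies here are my own sketch, not the reference solution):

def innerproduct(X, Z=None):
    # matrix of inner products between the rows of X and the rows of Z
    if Z is None:
        Z = X
    return X @ Z.T

def l2distance(X, Z=None):
    # pairwise Euclidean distances via ||x - z||^2 = ||x||^2 - 2<x,z> + ||z||^2
    if Z is None:
        Z = X
    G = innerproduct(X, Z)
    s1 = np.sum(X ** 2, axis=1)[:, np.newaxis]   # ||x||^2 as a column vector
    s2 = np.sum(Z ** 2, axis=1)[np.newaxis, :]   # ||z||^2 as a row vector
    D2 = np.maximum(s1 - 2 * G + s2, 0)          # clamp tiny negative round-off errors
    return np.sqrt(D2)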
(b) Implement the function findknn, which should find the k nearest neighbors of a set of vectors within a given training data set. The call of
[I,D]=findknn(xTr,xTe,k);
should result in two matrices I and D, both of dimensions k×n, where n is the number of input vectors in xTe. The matrix I(i,j) is the index of the i-th nearest neighbor of the vector xTe(j,:).
So, for example, if we set i=I(1,3), then xTr(i,:) is the first nearest neighbor of vector xTe(3,:). The second matrix D returns the corresponding distances. So D(i,j) is the distance of xTe(j,:) to its i-th nearest neighbor.
The distance function seems to be doing its job just fine. The code below attempts to reduce the number of operations further; that might not be necessary at this point, so I may revisit it later.
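A sketch of findknn built on l2distance, assuming the k×n layout described above (this is one straightforward way to do it, not necessarily the most efficient):

def findknn(xTr, xTe, k):
    # D[i, j] = distance from training point i to test point j
    D = l2distance(xTr, xTe)
    # sorting each column puts the nearest training points first
    I = np.argsort(D, axis=0)[:k, :]   # indices of the k nearest neighbors
    Dk = np.sort(D, axis=0)[:k, :]     # the corresponding distances
    return I, Dk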
We can visualize the 10 nearest training neighbors of some of the test points.
(c) The function analyze should compute various metrics to evaluate a classifier. The call of
result=analyze(kind,truth,preds);
should output the accuracy or absolute loss in the variable result. The type of output required can be specified in the input argument kind as "abs" or "acc". The input variables truth and preds should contain vectors of the true and predicted labels, respectively.
For example, the call
>> analyze('acc',[1 2 1 2],[1 2 1 1])
should return an accuracy of 0.75. Here, the true labels are 1,2,1,2 and the predicted labels are 1,2,1,1. So the first three examples are classified correctly, and the last one is wrong --- 75% accuracy.
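A sketch of analyze under the spec above (whether the absolute loss should be averaged or summed is not stated, so taking the mean is an assumption here):

def analyze(kind, truth, preds):
    truth = np.asarray(truth).flatten()
    preds = np.asarray(preds).flatten()
    if kind == "acc":
        # fraction of labels predicted correctly
        return np.mean(truth == preds)
    if kind == "abs":
        # mean absolute difference between true and predicted labels (mean assumed)
        return np.mean(np.abs(truth - preds))
    raise ValueError("kind must be 'acc' or 'abs'")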
(d) Implement the function knnclassifier, which should perform k-nearest neighbor classification on a given test data set. The call
preds=knnclassifier(xTr,yTr,xTe,k)
should output the predictions for the data in xTe, i.e. preds[i] will contain the prediction for xTe[i,:].
Notes for knnclassifier: this page gives some information about choosing a classification based on a voting scheme, and also discusses how to choose a good value for k.
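A sketch of a majority-vote implementation on top of findknn (breaking ties toward the smallest label is a design choice here, not a requirement from the handout):

def knnclassifier(xTr, yTr, xTe, k):
    yTr = yTr.flatten()
    I, _ = findknn(xTr, xTe, k)      # k x n matrix of neighbor indices
    preds = np.zeros(xTe.shape[0])
    for j in range(xTe.shape[0]):
        # majority vote among the labels of the k nearest neighbors of test point j
        labels, counts = np.unique(yTr[I[:, j]], return_counts=True)
        preds[j] = labels[np.argmax(counts)]   # np.unique sorts, so ties pick the smallest label
    return preds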
You can compute the actual classification accuracy on the test set by calling
>> analyze("acc",yTe,knnclassifier(xTr,yTr,xTe,3))
(e) This script runs the k-nearest neighbor classifier over the faces and digits data sets. The faces data set has 40 classes, the digits data set 10. What classification accuracy would you expect from a random classifier?
(f) (optional) Sometimes a k-NN classifier can result in a draw, when the majority vote is not clearly defined. Can you improve your accuracy by falling back onto a k-NN classifier with a lower k in such a case?
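One possible approach, sketched below (the name knnclassifier_tiebreak is hypothetical, and this is just one way to break ties, not the required solution): when the vote among the k neighbors is tied, revote using one fewer neighbor until the tie disappears.

def knnclassifier_tiebreak(xTr, yTr, xTe, k):
    # like knnclassifier, but a tied vote falls back to (k-1)-NN, then (k-2)-NN, ...
    yTr = yTr.flatten()
    I, _ = findknn(xTr, xTe, k)
    preds = np.zeros(xTe.shape[0])
    for j in range(xTe.shape[0]):
        kk = k
        while True:
            labels, counts = np.unique(yTr[I[:kk, j]], return_counts=True)
            if np.sum(counts == counts.max()) == 1 or kk == 1:
                preds[j] = labels[np.argmax(counts)]
                break
            kk -= 1   # tie: drop the farthest neighbor and revote
    return preds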
(g) Edit the function competition
, which reads in a training and testing set and makes predictions. Inside this function you are free to use any combination or variation of the k-nearest neighbor classifier. Can you beat my submission on our secret training and testing set?
Evaluation
For this project, you will be ranked on the following measures:
- Percentage of test cases passed
- Average of:
  - Accuracy on the faces test dataset, and
  - Accuracy on the digits test dataset
- Accuracy on the secret test dataset