
KNN algorithm

  • One of the simplest supervised machine learning algorithms: it simply calculates the distance from a new data point to all the training data points.

  • K can be any positive integer; K = 3 means "find the 3 nearest points".

  • The KNN algorithm starts by calculating the distance (Euclidean or Manhattan) of point X from all the training points.

  • Finally, it assigns the data point to the class to which the majority of the K nearest points belong.

Note

The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance, the KNN algorithm searches the training dataset for the k most similar instances. The target attribute of those most similar instances is summarized and returned as the prediction for the unseen instance.
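For intuition, here is a minimal from-scratch sketch of that procedure (illustrative only; the function name and toy data are made up): compute the Euclidean distance to every training point, then take a majority vote among the k nearest.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two well-separated clusters
Xt = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
yt = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(Xt, yt, np.array([2, 2])))  # -> 0
print(knn_predict(Xt, yt, np.array([9, 9])))  # -> 1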


## Import the necessary modules from specific libraries.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
k=pd.read_excel('F:\\ML & Data Visualization\\Personality.xlsx')
if(k["Domain"]="Progrmmer")== 0,
k.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 4 columns):
Height(Foot)    29 non-null float64
Weight(Kg)      29 non-null int64
Age             29 non-null int64
Domain          29 non-null object
dtypes: float64(1), int64(2), object(1)
memory usage: 1008.0+ bytes
k.shape
(29, 4)
## Identify the target variable
k['Domain'], a = pd.factorize(k['Domain'])
k.head()
print(a)
print(k['Domain'].unique())
Index(['Programmer', 'Data Scientist', 'Data Analyst', 'Delivery', 'Sales'], dtype='object')
[0 1 2 3 4]
  • As we can see, the values have been encoded into five different numeric labels (0–4), one per Domain.
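Because pd.factorize returns both the integer codes and the Index of unique labels, `a` can be used at any point to map codes back to the original names, e.g.:

# positional indexing of the Index `a` with the integer codes
# recovers the original Domain strings
print(a[k['Domain'].values])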

Identify the predictor variables and encode any string variables to equivalent integer codes

k.head()
## Select the predictor features and the target variable
X = k.iloc[:, :-1]
y = k.iloc[:, -1]
## Train test split:
# split data randomly into 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
## Training / model fitting
## Instantiate the KNN model with 4 neighbors.
model = KNeighborsClassifier(n_neighbors=4)
## Fit the model on the training data.
model.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=4, p=2, weights='uniform')
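As the repr above shows, the default metric='minkowski' with p=2 is the ordinary Euclidean distance; p=1 gives the Manhattan distance mentioned earlier. A quick sketch of comparing the two on this split (results will depend on the data):

for p, name in [(2, 'Euclidean'), (1, 'Manhattan')]:
    m = KNeighborsClassifier(n_neighbors=4, p=p)
    m.fit(X_train, y_train)
    print(name, 'accuracy:', m.score(X_test, y_test))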
## Model evaluation:
from sklearn.metrics import accuracy_score
# use the model to make predictions with the test data
y_pred = model.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
#accuracy = accuracy_score(y_test, y_pred)
#print('Accuracy: {:.2f}'.format(accuracy))
Misclassified samples: 3

Insights

  • Only 3 samples were misclassified, which is expected since this is a very simple dataset with fairly separable classes. But there you have it: that's how to implement K-Nearest Neighbors with scikit-learn.
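To see which Domain classes are being confused with one another, a confusion matrix is a quick follow-up (rows are true classes, columns are predicted classes, in the code order 0–4 produced by pd.factorize):

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))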

import numpy as np
X_new = np.array([[6, 40, 28]])
prediction = model.predict(X_new)
print("Prediction: {}".format(prediction))
# `a` (from pd.factorize) maps the integer code back to the original Domain name
print("Predicted target name: {}".format(a[prediction[0]]))
Prediction: [2]
Predicted target name: Data Analyst
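One caveat: KNN is distance based, so features on larger numeric scales (Weight in kg, Age in years) can dominate Height measured in feet. A sketch of standardizing the features before fitting (illustrative; this may change the accuracies reported above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean / unit variance, then fit KNN,
# so no single feature dominates the distance computation
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))
scaled_model.fit(X_train, y_train)
print('Scaled accuracy:', scaled_model.score(X_test, y_test))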

How to decide the value of n_neighbors

Choosing a large value of K increases execution time and can lead to underfitting, while a small value of K can lead to overfitting. There is no guaranteed way to find the best value of K.

from sklearn.metrics import accuracy_score
for K in range(1, 21):
    neigh = KNeighborsClassifier(n_neighbors=K)
    neigh.fit(X_train, y_train)
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test, y_pred) * 100, "% for K-Value:", K)
Accuracy is  33.33333333333333 % for K-Value: 1
Accuracy is  50.0 % for K-Value: 2
Accuracy is  50.0 % for K-Value: 3
Accuracy is  50.0 % for K-Value: 4
Accuracy is  33.33333333333333 % for K-Value: 5
Accuracy is  16.666666666666664 % for K-Value: 6
Accuracy is  33.33333333333333 % for K-Value: 7
Accuracy is  33.33333333333333 % for K-Value: 8
Accuracy is  16.666666666666664 % for K-Value: 9
Accuracy is  33.33333333333333 % for K-Value: 10
Accuracy is  0.0 % for K-Value: 11
Accuracy is  0.0 % for K-Value: 12
Accuracy is  0.0 % for K-Value: 13
Accuracy is  0.0 % for K-Value: 14
Accuracy is  0.0 % for K-Value: 15
Accuracy is  0.0 % for K-Value: 16
Accuracy is  16.666666666666664 % for K-Value: 17
Accuracy is  16.666666666666664 % for K-Value: 18
Accuracy is  16.666666666666664 % for K-Value: 19
Accuracy is  16.666666666666664 % for K-Value: 20
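With only 6 test samples, each accuracy above moves in steps of roughly 16.7%, so these numbers are noisy. Cross-validation over the full dataset gives a steadier estimate per K (a sketch, assuming 5-fold CV):

from sklearn.model_selection import cross_val_score

for K in range(1, 21):
    # mean accuracy over 5 folds for each candidate K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=5)
    print("K =", K, "mean CV accuracy:", round(scores.mean() * 100, 2), "%")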