
KNN algorithm

  • One of the simplest supervised machine learning algorithms: it simply calculates the distance from a new data point to all the training data points.

  • K can be any positive integer; K = 3 means "find the 3 nearest points".

  • The KNN algorithm starts by calculating the distance (Euclidean or Manhattan) of point X from all the training points.

  • Finally, it assigns the data point to the class to which the majority of the K nearest points belong.

Note

The model for KNN is the entire training dataset. When a prediction is required for an unseen data instance, the KNN algorithm searches the training dataset for the k most similar instances. The target attribute of those most similar instances is summarized and returned as the prediction for the unseen instance.
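For intuition, here is a minimal from-scratch sketch of that procedure (illustrative only; the function name and toy data are made up): compute the Euclidean distance to every training point, then take a majority vote among the k nearest.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two well-separated clusters
Xt = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
yt = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(Xt, yt, np.array([2, 2])))  # -> 0
print(knn_predict(Xt, yt, np.array([9, 9])))  # -> 1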


## Import the necessary modules from specific libraries.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
k=pd.read_excel('F:\\ML & Data Visualization\\Personality.xlsx')
if(k["Domain"]="Progrmmer")== 0,
k.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 4 columns):
Height(Foot)    29 non-null float64
Weight(Kg)      29 non-null int64
Age             29 non-null int64
Domain          29 non-null object
dtypes: float64(1), int64(2), object(1)
memory usage: 1008.0+ bytes
k.shape
(29, 4)
## Identify the target variable
k['Domain'], a = pd.factorize(k['Domain'])
k.head()
print(a)
print(k['Domain'].unique())
Index(['Programmer', 'Data Scientist', 'Data Analyst', 'Delivery', 'Sales'], dtype='object')
[0 1 2 3 4]
  • As we can see, the values have been encoded into five different numeric labels (0–4), one per Domain.
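Because pd.factorize returns both the integer codes and the Index of unique labels, `a` can be used at any point to map codes back to the original names, e.g.:

# positional indexing of the Index `a` with the integer codes
# recovers the original Domain strings
print(a[k['Domain'].values])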

Identify the predictor variables and encode any string variables to equivalent integer codes

k.head()
## Select the predictor features and the target variable
X = k.iloc[:, :-1]
y = k.iloc[:, -1]
## Train test split:
# split data randomly into 80% training and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
## Training / model fitting
## Instantiate the KNN model with 4 neighbors.
model = KNeighborsClassifier(n_neighbors=4)
## Fit the model on the training data.
model.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=4, p=2, weights='uniform')
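As the repr above shows, the default metric='minkowski' with p=2 is the ordinary Euclidean distance; p=1 gives the Manhattan distance mentioned earlier. A quick sketch of comparing the two on this split (results will depend on the data):

for p, name in [(2, 'Euclidean'), (1, 'Manhattan')]:
    m = KNeighborsClassifier(n_neighbors=4, p=p)
    m.fit(X_train, y_train)
    print(name, 'accuracy:', m.score(X_test, y_test))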
## Model evaluation:
from sklearn.metrics import accuracy_score
# use the model to make predictions with the test data
y_pred = model.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
#accuracy = accuracy_score(y_test, y_pred)
#print('Accuracy: {:.2f}'.format(accuracy))
Misclassified samples: 3

Insights

  • Only 3 samples were misclassified, which is expected since this is a very simple dataset with fairly separable classes. But there you have it: that's how to implement K-Nearest Neighbors with scikit-learn.
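To see which Domain classes are being confused with one another, a confusion matrix is a quick follow-up (rows are true classes, columns are predicted classes, in the code order 0–4 produced by pd.factorize):

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))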

import numpy as np
X_new = np.array([[6, 40, 28]])
prediction = model.predict(X_new)
print("Prediction: {}".format(prediction))
# `a` (from pd.factorize) maps the integer code back to the original Domain name
print("Predicted target name: {}".format(a[prediction[0]]))
Prediction: [2]
Predicted target name: Data Analyst
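One caveat: KNN is distance based, so features on larger numeric scales (Weight in kg, Age in years) can dominate Height measured in feet. A sketch of standardizing the features before fitting (illustrative; this may change the accuracies reported above):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean / unit variance, then fit KNN,
# so no single feature dominates the distance computation
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))
scaled_model.fit(X_train, y_train)
print('Scaled accuracy:', scaled_model.score(X_test, y_test))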

How to decide the value of n_neighbors

Choosing a large value of K increases execution time and can lead to underfitting, while a small value of K can lead to overfitting. There is no guaranteed way to find the best value of K.

from sklearn.metrics import accuracy_score
for K in range(1, 21):
    neigh = KNeighborsClassifier(n_neighbors=K)
    neigh.fit(X_train, y_train)
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test, y_pred) * 100, "% for K-Value:", K)
Accuracy is  33.33333333333333 % for K-Value: 1
Accuracy is  50.0 % for K-Value: 2
Accuracy is  50.0 % for K-Value: 3
Accuracy is  50.0 % for K-Value: 4
Accuracy is  33.33333333333333 % for K-Value: 5
Accuracy is  16.666666666666664 % for K-Value: 6
Accuracy is  33.33333333333333 % for K-Value: 7
Accuracy is  33.33333333333333 % for K-Value: 8
Accuracy is  16.666666666666664 % for K-Value: 9
Accuracy is  33.33333333333333 % for K-Value: 10
Accuracy is  0.0 % for K-Value: 11
Accuracy is  0.0 % for K-Value: 12
Accuracy is  0.0 % for K-Value: 13
Accuracy is  0.0 % for K-Value: 14
Accuracy is  0.0 % for K-Value: 15
Accuracy is  0.0 % for K-Value: 16
Accuracy is  16.666666666666664 % for K-Value: 17
Accuracy is  16.666666666666664 % for K-Value: 18
Accuracy is  16.666666666666664 % for K-Value: 19
Accuracy is  16.666666666666664 % for K-Value: 20
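With only 6 test samples, each accuracy above moves in steps of roughly 16.7%, so these numbers are noisy. Cross-validation over the full dataset gives a steadier estimate per K (a sketch, assuming 5-fold CV):

from sklearn.model_selection import cross_val_score

for K in range(1, 21):
    # mean accuracy over 5 folds for each candidate K
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=5)
    print("K =", K, "mean CV accuracy:", round(scores.mean() * 100, 2), "%")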