GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_07/code/starter-code/starter-code-7 - (done) (KNN and then fed into Train, Test, Split modell).ipynb

Kernel: Python 3

In [1]:

from sklearn import datasets, neighbors, metrics
import pandas as pd
import seaborn as sns

In [2]:

%matplotlib inline

Load in the Data

In [5]:

## Load in the data
iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
irisdf['target'] = iris.target

## Apply a 'color map' for plotting purposes (one color per target class)
cmap = {0: 'r', 1: 'g', 2: 'b'}
irisdf['ctarget'] = irisdf.target.map(cmap)

## Do some plotting to illustrate the data
irisdf.plot('petal length (cm)', 'petal width (cm)', kind='scatter', c=irisdf.ctarget)
irisdf.head()

Out[5]:

In [6]:

irisdf.describe()

Out[6]:

In [8]:

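def my_classifier(row):
    ## Crude first rule: short petals are class 0, everything else is class 1
    if row['petal length (cm)'] < 2:
        return 0
    else:
        return 1

## Apply the rule to every row (axis=1 passes one row at a time)
predictions = irisdf.apply(my_classifier, axis=1)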

In [9]:

irisdf['predictions'] = predictions

## Accuracy: the fraction of rows where the prediction matches the target
print((irisdf.target == irisdf.predictions).mean())

Out[9]:

0.6666666666666666

Starter Code

Work on improving the classifier below.

In [10]:

def my_classifier(row):
    if row['petal length (cm)'] < 2:
        return 0
    ## Fill in other if/elif statements here by looking at the plot and data above
    else:
        return 2

predictions = irisdf.apply(my_classifier, axis=1)

irisdf['predictions'] = predictions

## Accuracy: the fraction of rows where the prediction matches the target
print((irisdf.target == irisdf.predictions).mean())

Out[10]:

0.6666666666666666
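
One possible way to extend the rule is sketched below. The two thresholds (2.5 cm petal length, 1.75 cm petal width) and the name rule_classifier are illustrative assumptions read off the scatter plot, not values given in the lesson, and other cut-offs may work as well or better.

```python
from sklearn import datasets
import pandas as pd

# Rebuild the same DataFrame the notebook uses.
iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
irisdf['target'] = iris.target

def rule_classifier(row):
    # Illustrative thresholds (assumed, not tuned).
    if row['petal length (cm)'] < 2.5:
        return 0  # short petals
    elif row['petal width (cm)'] < 1.75:
        return 1  # mid-sized petals
    else:
        return 2  # widest petals

rule_predictions = irisdf.apply(rule_classifier, axis=1)

# Fraction of rows where the rule agrees with the true class.
rule_accuracy = (rule_predictions == irisdf['target']).mean()
print(rule_accuracy)
```

Because the thresholds were chosen while looking at the same 150 rows, this accuracy describes the fit to this data rather than performance on new flowers.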

Using distance: KNN implementation

In [ ]:

# zip pairs up elements from two sequences: list(zip(first, second))

x = [1, 2, 3]
y = ['a', 'b', 'c']

list(zip(x, y))

for i in zip(x, y):
    print (i)
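
The zip idiom above is also useful when writing a distance-based classifier by hand. The following is a minimal sketch of what a K-nearest-neighbors prediction involves, assuming Euclidean distance and a simple majority vote. The function knn_predict and the toy data are hypothetical and are not how scikit-learn implements it internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, paired with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels = ['a', 'a', 'a', 'b', 'b', 'b']

print(knn_predict(toy_points, toy_labels, (1.1, 1.0)))  # query near the first cluster
print(knn_predict(toy_points, toy_labels, (5.1, 5.0)))  # query near the second cluster
```

There is no training step beyond storing the points; all the work happens at prediction time, which is why K and the distance weighting are the main things to tune.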

In [12]:

from sklearn import datasets, neighbors, metrics
import pandas as pd

iris = datasets.load_iris()
X = iris.data
y = iris.target

# n_neighbors is our option in KNN. We'll tune this value to attempt to improve our prediction.
knn = neighbors.KNeighborsClassifier(n_neighbors=3, weights='uniform')
knn.fit(X, y)

print(pd.DataFrame(list(zip(knn.predict(X), y)), columns = ['predicted','actual']))
print('Accuracy = {}'.format(knn.score(X, iris.target)))

Out[12]:

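     predicted  actual
0            0       0
1            0       0
2            0       0
3            0       0
4            0       0
5            0       0
6            0       0
7            0       0
8            0       0
9            0       0
10           0       0
11           0       0
12           0       0
13           0       0
14           0       0
15           0       0
16           0       0
17           0       0
18           0       0
19           0       0
20           0       0
21           0       0
22           0       0
23           0       0
24           0       0
25           0       0
26           0       0
27           0       0
28           0       0
29           0       0
..         ...     ...
120          2       2
121          2       2
122          2       2
123          2       2
124          2       2
125          2       2
126          2       2
127          2       2
128          2       2
129          2       2
130          2       2
131          2       2
132          2       2
133          1       2
134          2       2
135          2       2
136          2       2
137          2       2
138          2       2
139          2       2
140          2       2
141          2       2
142          2       2
143          2       2
144          2       2
145          2       2
146          2       2
147          2       2
148          2       2
149          2       2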

[150 rows x 2 columns]
Accuracy = 0.96

Do we see a change when using more neighbors?

In [13]:

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X, y)

print(pd.DataFrame(list(zip(knn.predict(X), y)), columns=['predicted', 'actual']))
print('Accuracy = {}'.format(knn.score(X, iris.target)))

Out[13]:

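     predicted  actual
0            0       0
1            0       0
2            0       0
3            0       0
4            0       0
5            0       0
6            0       0
7            0       0
8            0       0
9            0       0
10           0       0
11           0       0
12           0       0
13           0       0
14           0       0
15           0       0
16           0       0
17           0       0
18           0       0
19           0       0
20           0       0
21           0       0
22           0       0
23           0       0
24           0       0
25           0       0
26           0       0
27           0       0
28           0       0
29           0       0
..         ...     ...
120          2       2
121          2       2
122          2       2
123          2       2
124          2       2
125          2       2
126          2       2
127          2       2
128          2       2
129          2       2
130          2       2
131          2       2
132          2       2
133          2       2
134          2       2
135          2       2
136          2       2
137          2       2
138          2       2
139          2       2
140          2       2
141          2       2
142          2       2
143          2       2
144          2       2
145          2       2
146          2       2
147          2       2
148          2       2
149          2       2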

[150 rows x 2 columns]
Accuracy = 0.9666666666666667
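
Both accuracies so far are measured on the same rows the model was fitted on, so they may flatter the model. As a rough sketch, the comparison across K can be repeated with cross-validation, where each fold is scored on rows held out from fitting. The list of K values and the choice of 5 folds are arbitrary assumptions for illustration.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

cv_means = {}
for k in [1, 3, 5, 7, 9]:
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights='uniform')
    # 5-fold cross-validation: each fold is scored on rows the model did not see.
    scores = cross_val_score(knn, X, y, cv=5)
    cv_means[k] = scores.mean()
    print('k = {}: mean CV accuracy = {:.3f}'.format(k, cv_means[k]))
```

On a dataset this small the differences between neighboring values of K are likely to be within the fold-to-fold noise, so they should be read with caution.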

Do we see a change in performance when using the distance weight?

In [14]:

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance')  # closer neighbors get a larger vote
knn.fit(X, y)
print(knn.score(X, iris.target))

Out[14]:

1.0
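
A perfect score here is less impressive than it looks. With weights='distance', every training row is at distance zero from itself, so its own label dominates the vote when the model is scored on the data it was fitted on. A sketch of a fairer check, assuming a 75/25 split with an arbitrary random_state, compares training accuracy with accuracy on held-out rows:

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state is an arbitrary choice so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)

train_accuracy = knn.score(X_train, y_train)  # each row is its own nearest neighbor
test_accuracy = knn.score(X_test, y_test)     # rows the model has never seen
print('train:', train_accuracy)
print('test: ', test_accuracy)
```

The held-out number is the one to compare between 'uniform' and 'distance' weighting.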

Solving for K

Cross-validated grid search is one way to choose K; it is not the only one. The weights option ('distance' instead of 'uniform') can be tuned at the same time: either add it to the parameter grid, as the cell below does, or set it directly on the estimator.

In [17]:

from sklearn.model_selection import GridSearchCV

## Parameters to tune!
tuned_parameters = [{'n_neighbors': [3, 5, 7],
                    'weights': ['distance','uniform']}]

## How many folds to use for validation?
n_folds = 5

knn = neighbors.KNeighborsClassifier()
grid_search = GridSearchCV(knn, tuned_parameters, cv = n_folds)
grid_search.fit(X, iris.target)

Out[17]:

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_neighbors': [3, 5, 7], 'weights': ['distance', 'uniform']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

What does this output?

In [18]:

grid_search.cv_results_

Out[18]:

/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:122: FutureWarning: You are accessing a training score ('mean_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:122: FutureWarning: You are accessing a training score ('split0_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:122: FutureWarning: You are accessing a training score ('split1_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:122: FutureWarning: You are accessing a training score ('split2_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:122: FutureWarning: You are accessing a training score ('split3_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:122: FutureWarning: You are accessing a training score ('split4_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)
/anaconda3/lib/python3.6/site-packages/sklearn/utils/deprecation.py:122: FutureWarning: You are accessing a training score ('std_train_score'), which will not be available by default any more in 0.21. If you need training scores, please set return_train_score=True
  warnings.warn(*warn_args, **warn_kwargs)

{'mean_fit_time': array([0.00055041, 0.00060692, 0.0009181 , 0.00061622, 0.00034871,
        0.00030766]),
 'mean_score_time': array([0.0009201 , 0.00067348, 0.00114713, 0.00100803, 0.00071225,
        0.00065012]),
 'mean_test_score': array([0.96666667, 0.96666667, 0.96666667, 0.97333333, 0.98      ,
        0.98      ]),
 'mean_train_score': array([1.        , 0.96      , 1.        , 0.97      , 1.        ,
        0.97333333]),
 'param_n_neighbors': masked_array(data=[3, 3, 5, 5, 7, 7],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_weights': masked_array(data=['distance', 'uniform', 'distance', 'uniform',
                    'distance', 'uniform'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'n_neighbors': 3, 'weights': 'distance'},
  {'n_neighbors': 3, 'weights': 'uniform'},
  {'n_neighbors': 5, 'weights': 'distance'},
  {'n_neighbors': 5, 'weights': 'uniform'},
  {'n_neighbors': 7, 'weights': 'distance'},
  {'n_neighbors': 7, 'weights': 'uniform'}],
 'rank_test_score': array([4, 4, 4, 3, 1, 1], dtype=int32),
 'split0_test_score': array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
        0.96666667]),
 'split0_train_score': array([1.        , 0.95833333, 1.        , 0.96666667, 1.        ,
        0.96666667]),
 'split1_test_score': array([0.96666667, 0.96666667, 1.        , 1.        , 1.        ,
        1.        ]),
 'split1_train_score': array([1.        , 0.95833333, 1.        , 0.96666667, 1.        ,
        0.96666667]),
 'split2_test_score': array([0.93333333, 0.93333333, 0.9       , 0.93333333, 0.96666667,
        0.96666667]),
 'split2_train_score': array([1.        , 0.96666667, 1.        , 0.975     , 1.        ,
        0.975     ]),
 'split3_test_score': array([0.96666667, 0.96666667, 0.96666667, 0.96666667, 0.96666667,
        0.96666667]),
 'split3_train_score': array([1.        , 0.96666667, 1.        , 0.975     , 1.        ,
        0.98333333]),
 'split4_test_score': array([1., 1., 1., 1., 1., 1.]),
 'split4_train_score': array([1.        , 0.95      , 1.        , 0.96666667, 1.        ,
        0.975     ]),
 'std_fit_time': array([2.43202733e-04, 5.85316720e-04, 5.32751645e-04, 3.64376125e-04,
        8.75857317e-05, 2.19125223e-05]),
 'std_score_time': array([0.00033249, 0.00016714, 0.00029082, 0.00017725, 0.00014781,
        0.00020538]),
 'std_test_score': array([0.02108185, 0.02108185, 0.03651484, 0.02494438, 0.01632993,
        0.01632993]),
 'std_train_score': array([0.        , 0.0062361 , 0.        , 0.00408248, 0.        ,
        0.0062361 ])}
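
The FutureWarning messages above recommend asking for training scores explicitly. A minimal sketch of that, assuming the same parameter grid and 5 folds, passes return_train_score=True and keeps only a few columns of cv_results_ so the table is easier to read. The selection in summary_columns is an illustrative choice.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV
import pandas as pd

iris = datasets.load_iris()
X, y = iris.data, iris.target

param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['distance', 'uniform']}

# return_train_score=True requests the training-fold scores explicitly.
grid_search = GridSearchCV(
    neighbors.KNeighborsClassifier(), param_grid, cv=5, return_train_score=True
)
grid_search.fit(X, y)

# Keep a handful of columns from the full results dictionary.
summary_columns = ['param_n_neighbors', 'param_weights',
                   'mean_train_score', 'mean_test_score', 'rank_test_score']
summary = pd.DataFrame(grid_search.cv_results_)[summary_columns]
print(summary.sort_values('rank_test_score'))
```

Exact scores may differ from the output above depending on the scikit-learn version, since fold construction and defaults have changed between releases.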

In [20]:

pd.DataFrame(grid_search.cv_results_).sort_values('mean_test_score', ascending=False)

Out[20]:

What is our best test accuracy? What do we expect our out-of-sample performance to look like?

In [22]:

grid_search.best_estimator_

Out[22]:

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='distance')
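
Alongside best_estimator_, a fitted GridSearchCV exposes best_params_ and best_score_, which speak to the two questions above. The sketch below refits the same search so that it runs on its own; the grid and fold count are copied from the earlier cell.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X, y = iris.data, iris.target

param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['distance', 'uniform']}
grid_search = GridSearchCV(neighbors.KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X, y)

best_params = grid_search.best_params_   # the winning parameter combination
best_cv_score = grid_search.best_score_  # its mean accuracy across the 5 folds
print(best_params)
print(best_cv_score)
```

best_score_ is a reasonable first guess at out-of-sample accuracy, though it is the maximum over several candidates and so tends to be slightly optimistic. A separate held-out set gives a cleaner estimate.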

Let's build the model and look at it in more detail

In [23]:

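from sklearn.model_selection import train_test_split

## Hold out 25% of the rows (the default) for testing
x_train, x_test, y_train, y_test = train_test_split(X, iris.target)

## Refit the best estimator on the training rows only, then predict the held-out rows
## (this refits the same object that grid_search holds)
knn_final = grid_search.best_estimator_
knn_final.fit(x_train, y_train)
preds = knn_final.predict(x_test)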

In [24]:

from sklearn.metrics import classification_report

## classification_report returns a formatted string, so print it directly
results = classification_report(y_true=y_test, y_pred=preds)
print(results)

Out[24]:

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        13
          1       0.89      0.89      0.89         9
          2       0.94      0.94      0.94        16

avg / total       0.95      0.95      0.95        38

In [31]:

pd.DataFrame(list(zip(preds, y_test)), columns=['predicted', 'actual'])

Out[31]:
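
A confusion matrix summarizes the same predicted/actual pairs in a more compact form. The sketch below is self-contained, so it makes its own split with an arbitrary random_state and hard-codes the settings reported for the best estimator (n_neighbors=7, weights='distance'). Its counts will therefore not match the classification report above exactly.

```python
from sklearn import datasets, neighbors
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import pandas as pd

iris = datasets.load_iris()
# random_state is an arbitrary choice so the split is repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# Same settings as the best estimator reported by the grid search.
knn_final = neighbors.KNeighborsClassifier(n_neighbors=7, weights='distance')
knn_final.fit(x_train, y_train)
preds = knn_final.predict(x_test)

# Rows are the actual classes, columns the predicted ones.
matrix = confusion_matrix(y_test, preds, labels=[0, 1, 2])
matrix_df = pd.DataFrame(matrix, index=iris.target_names, columns=iris.target_names)
print(matrix_df)
```

Off-diagonal cells show which pairs of classes get mixed up, which the per-class precision and recall only hint at.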
