YStrano
GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-08/code/solution-code/solution-code-8.ipynb
Kernel: Python 2
from sklearn import datasets, neighbors, metrics
import pandas as pd
import seaborn as sns
%matplotlib inline
iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
irisdf['target'] = iris.target

cmap = {'0': 'r', '1': 'g', '2': 'b'}
irisdf['ctarget'] = irisdf.target.apply(lambda x: cmap[str(x)])
irisdf.plot('petal length (cm)', 'petal width (cm)', kind='scatter', c=irisdf.ctarget)
print irisdf.describe()

def my_classifier(row):
    if row['petal length (cm)'] < 2:
        return 0
    else:
        return 1

predictions = irisdf.apply(my_classifier, axis=1)
       sepal length (cm)  sepal width (cm)  petal length (cm)
count         150.000000        150.000000         150.000000
mean            5.843333          3.054000           3.758667
std             0.828066          0.433594           1.764420
min             4.300000          2.000000           1.000000
25%             5.100000          2.800000           1.600000
50%             5.800000          3.000000           4.350000
75%             6.400000          3.300000           5.100000
max             7.900000          4.400000           6.900000

       petal width (cm)      target
count        150.000000  150.000000
mean           1.198667    1.000000
std            0.763161    0.819232
min            0.100000    0.000000
25%            0.300000    0.000000
50%            1.300000    1.000000
75%            1.800000    2.000000
max            2.500000    2.000000
Image in a Jupyter notebook
irisdf['predictions'] = predictions
print float(len(irisdf[irisdf.target == irisdf.predictions])) / len(irisdf)
0.666666666667
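The hand-rolled accuracy ratio above can also be computed with the `metrics` module that is already imported at the top of the notebook. A minimal sketch (written for Python 3, while the notebook itself runs a Python 2 kernel):

```python
from sklearn import datasets, metrics
import pandas as pd

iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
irisdf['target'] = iris.target

# one-rule classifier: setosa has distinctly short petals (< 2 cm)
predictions = (irisdf['petal length (cm)'] >= 2).astype(int)

# fraction of rows where the prediction matches the true label
print(metrics.accuracy_score(irisdf['target'], predictions))  # same 0.666... as above
```

This agrees with the manual count because `accuracy_score` is exactly the proportion of matching labels.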

More specific solution

For the class, this solution is as simple as it really needs to be in order to get a very good prediction score. But: why, or when, does this fail? What attributes make this a great data set for learning classification algorithms? What makes it not as great?

def my_classifier(row):
    if row['petal length (cm)'] < 2:
        return 0
    elif row['petal length (cm)'] < 5:
        return 1
    else:
        return 2

predictions = irisdf.apply(my_classifier, axis=1)
irisdf['predictions'] = predictions
print float(len(irisdf[irisdf.target == irisdf.predictions])) / len(irisdf)
0.946666666667
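The same three-way rule can be written without a row-wise `apply`: `pd.cut` with the two thresholds as bin edges gives identical predictions. A sketch (Python 3; `right=False` makes the bins half-open on the right, matching the strict `<` comparisons above):

```python
import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
irisdf['target'] = iris.target

# bins [0, 2), [2, 5), [5, inf) encode the two petal-length thresholds;
# labels=False returns the bin index (0, 1, or 2) directly
predictions = pd.cut(irisdf['petal length (cm)'],
                     bins=[0, 2, 5, np.inf],
                     labels=False, right=False)

print((predictions == irisdf['target']).mean())  # same 0.9466... as above
```

Vectorized binning like this is both faster and harder to get subtly wrong than a hand-written chain of `if`/`elif`.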

Using distance: KNN implementation

from sklearn import datasets, neighbors, metrics
import pandas as pd

iris = datasets.load_iris()

# n_neighbors is our option in KNN. We'll tune this value to attempt to improve our prediction.
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(iris.data[:, 2:], iris.target)

print knn.predict(iris.data[:, 2:])
print iris.target
print knn.score(iris.data[:, 2:], iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
0.96
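Note that the 0.96 above is measured on the same rows the model was fit on, which is an optimistic estimate. A held-out split is more honest; a sketch using the modern sklearn API (`train_test_split` lives in `sklearn.model_selection` in current releases, `cross_validation` in the older version this notebook targets):

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, 2:], iris.target  # petal features only, as above

# hold out 30% of the rows, stratified so each species is represented
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on rows the model never saw
```

The held-out score will vary with `random_state`, but that variation is itself informative; the cross-validation used later in this notebook averages it away.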

Do we see a change in performance when using the distance weighting?

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(iris.data[:, 2:], iris.target)

print knn.predict(iris.data[:, 2:])
print iris.target
print knn.score(iris.data[:, 2:], iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
0.993333333333
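The jump to 0.993 is expected rather than impressive: when scoring on the training set, every point is its own nearest neighbor at distance zero, and inverse-distance weighting gives that zero-distance neighbor overwhelming weight. (The score here is likely not exactly 1.0 because iris contains identical petal measurements with different species labels.) A toy sketch of the effect, using deliberately noisy made-up labels:

```python
import numpy as np
from sklearn import neighbors

# 1-D toy set with alternating labels: hard for a plain majority vote
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 1, 0, 1, 0, 1])

for w in ('uniform', 'distance'):
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights=w)
    knn.fit(X, y)
    # with 'distance', each training point outvotes its neighbors
    # via its own zero-distance (effectively infinite) weight
    print(w, knn.score(X, y))
```

This is why a perfect training score under `weights='distance'` says nothing about generalization; cross-validation, as in the next section, is the fair comparison.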

Solution: solving for k

This is only one approach to the problem. Adding the 'distance' weighting (instead of 'uniform') would only be additive; note that the code would need some editing to handle it properly inside the grid search. Alternatively, make that change directly on the estimator.

from sklearn import grid_search, cross_validation
import matplotlib.pyplot as plt

# some n_list! keep in mind cross validation
# recall: what's an effective way to create a numerical list in python?
k = range(2, 100)
params = {'n_neighbors': k}

kf = cross_validation.KFold(len(irisdf), n_folds=5)
gs = grid_search.GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid=params,
    cv=kf,
)
gs.fit(iris.data, iris.target)
gs.grid_scores_
Mean cross-validated accuracy by n_neighbors (consecutive values with identical scores collapsed):

n_neighbors   mean     std
2             0.90667  0.09752
3-4           0.90667  0.09286
5             0.91333  0.08327
6             0.90667  0.09286
7             0.92000  0.08589
8             0.91333  0.08844
9-10          0.92000  0.09092
11            0.91333  0.08589
12            0.89333  0.10625
13            0.90667  0.08273
14-15         0.90000  0.09428
16            0.88667  0.11851
17            0.88000  0.12754
18            0.86667  0.12111
19            0.88667  0.11662
20-23         0.86667  0.13499
24            0.84667  0.17075
25            0.86000  0.14667
26            0.84667  0.17075
27            0.84667  0.15434
28            0.82000  0.18809
29            0.80000  0.19437
30            0.78667  0.21250
31            0.77333  0.21848
32            0.74000  0.27520
33            0.75333  0.26043
34            0.72000  0.30155
35            0.69333  0.31510
36            0.68000  0.33506
37            0.68667  0.30955
38-40         0.64000  0.38320
41            0.38667  0.42248
42-47         0.37333  0.43123
48-50         0.36000  0.41655
51            0.36000  0.42864
52-53         0.34667  0.42667
54            0.33333  0.41312
55            0.34667  0.42667
56            0.32000  0.40089
57            0.32667  0.40683
58            0.32000  0.40089
59            0.32667  0.40683
60            0.24667  0.36246
61            0.20667  0.28783
62-74         0.10667  0.13233
75-80         0.08667  0.11851
81-84         0.07333  0.10414
85-99         0.06667  0.10328
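The note above about folding 'distance' into the search can be sketched with the current sklearn API (`grid_search` and `cross_validation` were replaced by `model_selection` in sklearn 0.18). Shuffling the folds also matters: iris rows are ordered by class, so unshuffled contiguous folds put near-single-class blocks into the test split, which contributes to the collapse of the scores above at large k:

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import GridSearchCV, KFold

iris = datasets.load_iris()

# search both hyperparameters at once
params = {
    'n_neighbors': list(range(2, 100)),
    'weights': ['uniform', 'distance'],
}
kf = KFold(n_splits=5, shuffle=True, random_state=42)

gs = GridSearchCV(neighbors.KNeighborsClassifier(), param_grid=params, cv=kf)
gs.fit(iris.data, iris.target)
print(gs.best_params_, round(gs.best_score_, 3))
```

With shuffled folds the best mean score lands well above the figures in the table above, which is a reminder that the fold construction is itself a modeling choice.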
# s[1] is the mean validation score of each grid point
plt.plot(k, [s[1] for s in gs.grid_scores_])
[<matplotlib.lines.Line2D at 0x110d8d6d0>]
Image in a Jupyter notebook

Zoom in to look at the fit before the first dip, around k = 25:

plt.plot(k[:25], [s[1] for s in gs.grid_scores_][:25])
[<matplotlib.lines.Line2D at 0x110e95450>]
Image in a Jupyter notebook