GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_07/code/solution-code/solution-code-7.ipynb
Kernel: Python 3
from sklearn import datasets, neighbors, metrics
import pandas as pd
import seaborn as sns
%matplotlib inline

iris = datasets.load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
irisdf['target'] = iris.target

# map each class label to a plotting color
cmap = {'0': 'r', '1': 'g', '2': 'b'}
irisdf['ctarget'] = irisdf.target.apply(lambda x: cmap[str(x)])
irisdf.plot('petal length (cm)', 'petal width (cm)', kind='scatter', c=irisdf.ctarget)
print(irisdf.describe())

# a first, deliberately simple classifier: one threshold on petal length
def my_classifier(row):
    if row['petal length (cm)'] < 2:
        return 0
    else:
        return 1

predictions = irisdf.apply(my_classifier, axis=1)

irisdf['predictions'] = predictions
# fraction of rows where the prediction matches the true label
print(len(irisdf[irisdf.target == irisdf.predictions]) / len(irisdf))
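Since metrics is already imported, the same accuracy could also be computed with scikit-learn's helper; a minimal equivalent (this call is an addition, not part of the original lesson code):

# equivalent accuracy using the metrics module imported above
print(metrics.accuracy_score(irisdf.target, irisdf.predictions))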

More specific solution

For the class, this solution is as simple as it really needs to be in order to get a very good prediction score. But: why, or when, does this fail? What attributes make this a great data set for learning classification algorithms? What makes it less great?
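One quick way to see where a single threshold breaks down is to compare the range of petal lengths within each class; a minimal sketch (this groupby summary is an addition for illustration):

# per-class petal length ranges: class 0 is cleanly separated,
# while classes 1 and 2 overlap, so no single cut is perfect
print(irisdf.groupby('target')['petal length (cm)'].agg(['min', 'max']))

Setosa (class 0) is separable on this feature alone, while versicolor and virginica overlap near 5 cm, which is where the three-way classifier below has to draw its second boundary.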

# refine the rule: a second threshold separates classes 1 and 2
def my_classifier(row):
    if row['petal length (cm)'] < 2:
        return 0
    elif row['petal length (cm)'] < 5:
        return 1
    else:
        return 2

predictions = irisdf.apply(my_classifier, axis=1)
irisdf['predictions'] = predictions
print(len(irisdf[irisdf.target == irisdf.predictions]) / len(irisdf))

Using distance: KNN implementation

from sklearn import datasets, neighbors, metrics
import pandas as pd

iris = datasets.load_iris()

# n_neighbors is our option in KNN. We'll tune this value to attempt to improve our prediction.
knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform')
# fit on the petal measurements only (columns 2 and 3)
knn.fit(iris.data[:, 2:], iris.target)

print(knn.predict(iris.data[:, 2:]))
print(iris.target)
print(knn.score(iris.data[:, 2:], iris.target))

Do we see a change in performance when using the distance weighting?

knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(iris.data[:, 2:], iris.target)
print(knn.predict(iris.data[:, 2:]))
print(iris.target)
print(knn.score(iris.data[:, 2:], iris.target))
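Keep in mind that scoring on the training data flatters weights='distance': each training point is its own nearest neighbor at distance zero, so (barring duplicated points with different labels) the training score comes out essentially perfect. A held-out split gives a more honest comparison; a minimal sketch, with train_test_split and the random_state value added for illustration:

from sklearn.model_selection import train_test_split

# hold out 30% of the rows the model never sees during fit
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, 2:], iris.target, test_size=0.3, random_state=42)

for w in ['uniform', 'distance']:
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights=w)
    knn.fit(X_train, y_train)
    print(w, knn.score(X_test, y_test))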

Solution for choosing k

This is only one approach to the problem. Adding the 'distance' weighting (instead of 'uniform') would only be additive: either include it in the grid search, which needs a small edit to the parameter grid (see the sketch at the end of this notebook), or make the change directly on the estimator.

from sklearn.model_selection import GridSearchCV, KFold
import matplotlib.pyplot as plt

# some n_list! keep in mind cross validation
# recall: what's an effective way to create a numerical list in python?
k = list(range(2, 100))
params = {'n_neighbors': k}

kf = KFold(n_splits=5)
gs = GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid=params,
    cv=kf,
)
gs.fit(iris.data, iris.target)

# mean cross-validated accuracy for each value of k
scores = gs.cv_results_['mean_test_score']
plt.plot(k, scores)

Zoom in to look at the fit before the first dip, around k = 25:

plt.plot(k[:25], scores[:25])
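As noted above, the 'distance' weighting can also be folded into the search itself; a minimal sketch of the expanded grid (the 'weights' key is an addition to the original parameter grid):

# search over both k and the weighting scheme in one pass
params = {
    'n_neighbors': list(range(2, 100)),
    'weights': ['uniform', 'distance'],
}
gs = GridSearchCV(
    estimator=neighbors.KNeighborsClassifier(),
    param_grid=params,
    cv=KFold(n_splits=5),
)
gs.fit(iris.data, iris.target)
print(gs.best_params_, gs.best_score_)

With two grid dimensions, cv_results_['mean_test_score'] has one entry per (k, weights) pair, so the plotting code above would need the extra handling the note about editing alludes to.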