GitHub Repository: YStrano/DataScience_GA
Path: blob/master/april_18/lessons/lesson-11-flex/code/Clustering with Scikit-Learn-Solutions.ipynb
Kernel: Python 2

Clustering with Sklearn

In this notebook we'll practice clustering algorithms with Scikit-Learn.

Data sets

We'll use the following datasets:

There are many clustering data sets you can use for practice!

%matplotlib inline
from collections import Counter
import random

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
## Create some synthetic data
from scipy.stats import multivariate_normal

# Note: a covariance matrix should be symmetric. The unequal off-diagonal
# entries below (0.5 vs. 0.2) are kept exactly as in the original notebook.
data = []
dist = multivariate_normal(mean=[0,0], cov=[[1,0],[0,1]])
for i in range(150):
    data.append(dist.rvs())
dist = multivariate_normal(mean=[5,5], cov=[[1,0.5],[0.2,1]])
for i in range(150):
    data.append(dist.rvs())
dist = multivariate_normal(mean=[9,9], cov=[[1,0.5],[0.2,1]])
for i in range(150):
    data.append(dist.rvs())
dist = multivariate_normal(mean=[-10,5], cov=[[3,0.5],[0.2,2]])
for i in range(150):
    data.append(dist.rvs())

df = pd.DataFrame(data, columns=["x", "y"])
df.head()
plt.scatter(df['x'], df['y'])
plt.show()
[Figure: scatter plot of the synthetic data, x vs. y]
def annulus(inner_radius, outer_radius, n=30, color='b'):
    """Generate n points with class `color` between the inner radius and the outer radius."""
    data = []
    diff = outer_radius - inner_radius
    for _ in range(n):
        # Pick an angle and radius
        angle = 2 * np.pi * random.random()
        r = inner_radius + diff * random.random()
        x = r * np.cos(angle)
        y = r * np.sin(angle)
        data.append((x, y))
    # Return a data frame for convenience
    xs, ys = zip(*data)
    df = pd.DataFrame()
    df["x"] = xs
    df["y"] = ys
    df["color"] = color
    return df

df1 = annulus(2, 6, 200, color='r')
df2 = annulus(8, 10, 300, color='b')
df_circ = pd.concat([df1, df2])
plt.scatter(df_circ['x'], df_circ['y'], c=df_circ['color'])
plt.show()
[Figure: scatter plot of the two annuli, x vs. y, colored by class]

K-Means with sklearn

# Fit a k-means estimator
estimator = KMeans(n_clusters=2)
X = df[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(labels)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
# Plot the data
def set_colors(labels, colors='rgbykcm'):
    colored_labels = []
    for label in labels:
        colored_labels.append(colors[label])
    return colored_labels

colors = set_colors(labels)
plt.scatter(df['x'], df['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
[Figure: synthetic data, x vs. y, colored by k-means cluster (k=2)]

Let's try it with k=4 this time.

estimator = KMeans(n_clusters=4)
X = df[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df['x'], df['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({0: 150, 1: 150, 2: 150, 3: 150})
[Figure: synthetic data, x vs. y, colored by k-means cluster (k=4)]
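
The move from k=2 to k=4 above was made by eye. One common way to support such a choice, not used in the original notebook, is to compare the `inertia_` attribute of fitted `KMeans` models across several values of k and look for the point where the decrease flattens out. The sketch below is only an illustration: it uses `make_blobs` as stand-in data with centers near the notebook's Gaussian means, and the fixed `random_state` and `n_init=10` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data (assumption): four blobs centred near the notebook's Gaussian means.
centers = [[0, 0], [5, 5], [9, 9], [-10, 5]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)

inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances from each point to its nearest centroid
    inertias[k] = model.inertia_

for k in sorted(inertias):
    print(k, round(inertias[k], 1))
```

Inertia keeps shrinking as k grows, so the useful signal is where the drop levels off rather than where the value is smallest.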

Let's try the circular data.

estimator = KMeans(n_clusters=2)
X = df_circ[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df_circ['x'], df_circ['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({0: 261, 1: 239})
[Figure: annulus data, x vs. y, colored by k-means cluster (k=2)]

Ouch! Not so great on this dataset: the two clusters hold 261 and 239 points, while the rings hold 200 and 300. Now let's try some real data.

of_df = pd.read_csv("../assets/datasets/old-faithful.csv")
of_df.head()
of_df.plot.scatter(x="eruption_time", y="wait_time")
plt.show()
[Figure: scatter plot of the Old Faithful data, eruption_time vs. wait_time]
# Fit a k-means estimator
estimator = KMeans(n_clusters=2)
X = of_df[["eruption_time", "wait_time"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
Counter({0: 172, 1: 100})
# Plot the data
colors = set_colors(labels)
plt.scatter(of_df["eruption_time"], of_df["wait_time"], c=colors)
plt.xlabel("eruption_time")
plt.ylabel("wait_time")
plt.show()
[Figure: Old Faithful data, eruption_time vs. wait_time, colored by k-means cluster (k=2)]

Exercise: k-means

For the Iris dataset, fit and plot k-means models to:

  • sepal_length and petal_length, for k=2 and k=3

  • sepal_width and petal_width, for k=2 and k=3

Bonus: Compare your classifications to the known species. How well do the labels match up?

After: Check out the 3D-example here

iris = pd.read_csv("../assets/datasets/iris.data")
sns.pairplot(iris, hue="species")
plt.show()
iris.tail()
[Figure: seaborn pairplot of the Iris features, colored by species]
## Exercise Answers here
from sklearn import metrics

# Fit a k-means estimator
estimator = KMeans(n_clusters=3)
X = iris[["sepal_length", "petal_length"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(iris["petal_length"], iris["sepal_length"], c=colors)
plt.xlabel("petal_length")
plt.ylabel("sepal_length")
plt.show()
Counter({0: 58, 1: 51, 2: 41})
[Figure: Iris data, petal_length vs. sepal_length, colored by k-means cluster (k=3)]
# Peek at the cluster assigned to the first, middle, and last rows
print(labels[0])
print(labels[len(labels) // 2])
print(labels[-1])
1
0
0
# Cluster ids are arbitrary: this mapping matches the run shown above
# and may need to change if k-means is refit.
label_map = {"Iris-setosa": 1, "Iris-versicolor": 0, "Iris-virginica": 2}
true_labels = []
for row in iris.itertuples():
    true_labels.append(label_map[row.species])

number_correct = 0
for t, l in zip(true_labels, labels):
    if t == l:
        number_correct += 1
print(number_correct / float(len(iris)))
# metrics.adjusted_rand_score(true_labels, labels)
0.88
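
The hard-coded `label_map` above only works for one particular run, because k-means assigns cluster ids arbitrarily. Two alternatives that do not depend on the numbering are sketched here: the adjusted Rand score (the call commented out above), and an accuracy computed after matching clusters to species with `scipy.optimize.linear_sum_assignment`. This is a sketch under assumptions, not the notebook's method: it loads Iris through `sklearn.datasets.load_iris` instead of the CSV, and fixes `random_state` and `n_init` for repeatability.

```python
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, confusion_matrix

iris_data = load_iris()
X = iris_data.data[:, [0, 2]]  # columns 0 and 2: sepal length, petal length
y_true = iris_data.target

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Permutation-invariant agreement between species and clusters
ari = adjusted_rand_score(y_true, labels)

# Rows are true species, columns are cluster ids
cm = confusion_matrix(y_true, labels)
# linear_sum_assignment minimises cost, so negate the counts to maximise matches
row_ind, col_ind = linear_sum_assignment(-cm)
accuracy = cm[row_ind, col_ind].sum() / float(len(y_true))

print("adjusted Rand score:", round(ari, 3))
print("accuracy after matching:", round(accuracy, 3))
```

The matching step picks the one-to-one assignment of clusters to species that maximises the number of agreeing points, which is what the manual `label_map` approximates by inspection.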

DBSCAN

# Fit a DBSCAN estimator
estimator = DBSCAN(eps=0.85, min_samples=10)
X = df[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df['x'], df['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({2: 150, 0: 143, 1: 143, 3: 126, -1: 38})
[Figure: synthetic data, x vs. y, colored by DBSCAN cluster (eps=0.85)]
# Fit a DBSCAN estimator
estimator = DBSCAN(eps=0.8, min_samples=10)
X = df[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df['x'], df['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({2: 145, 0: 143, 1: 140, 3: 107, -1: 65})
[Figure: synthetic data, x vs. y, colored by DBSCAN cluster (eps=0.8)]
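
In the `Counter` outputs above, the label `-1` is DBSCAN's marker for noise points, not a cluster. The `set_colors` helper still works on it only because Python's negative indexing turns `colors[-1]` into the last palette entry (`'m'`). A minimal sketch of handling noise explicitly follows; the `make_blobs` stand-in data, the grey noise color, and the variable names are assumptions for illustration.

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Stand-in data (assumption): four blobs centred near the notebook's Gaussian means.
centers = [[0, 0], [5, 5], [9, 9], [-10, 5]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)

labels = DBSCAN(eps=0.8, min_samples=10).fit(X).labels_

# Noise points carry the label -1 and should not be counted as a cluster
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels.tolist())) - (1 if n_noise > 0 else 0)
print(Counter(labels.tolist()))
print("clusters:", n_clusters, "noise points:", n_noise)

# Give noise its own colour instead of relying on negative indexing
palette = 'rgbykcm'
colors = ['0.6' if label == -1 else palette[label % len(palette)]
          for label in labels.tolist()]
```

The resulting `colors` list can be passed to `plt.scatter(..., c=colors)` in the same way as the output of `set_colors`.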
# Fit a DBSCAN estimator
estimator = DBSCAN(eps=2, min_samples=10)
X = df_circ[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df_circ['x'], df_circ['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({1: 300, 0: 200})
[Figure: annulus data, x vs. y, colored by DBSCAN cluster (eps=2)]
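
The cluster sizes above (300 and 200) match the sizes of the two rings, whereas k-means split the same data 261 to 239. To put a number on that difference, the sketch below scores both algorithms against known ring membership with the adjusted Rand score. It is an illustration under assumptions: the rings come from `sklearn.datasets.make_circles` instead of the notebook's `annulus` helper, so the scale differs and `eps=0.15` is chosen for that scale.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Stand-in rings (assumption): outer radius 1, inner radius 0.5, light noise
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.02, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# 1.0 means perfect agreement with the true rings, values near 0 mean chance level
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)

print("k-means ARI:", round(ari_kmeans, 3))
print("DBSCAN ARI:", round(ari_dbscan, 3))
```

K-means separates points by distance to two centroids, which amounts to cutting the plane with a straight line, so it cannot isolate one ring from the other. DBSCAN grows clusters through dense neighborhoods and can follow the ring shape.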

Much better than k-means on this dataset! Let's try to cook up something that DBSCAN doesn't work as well on.

## Create some synthetic data
# Note: as before, the covariance matrices are not symmetric as written;
# the values are kept exactly as in the original notebook.
data = []
dist = multivariate_normal(mean=[0,0], cov=[[6,12],[1,6]])
for i in range(50):
    data.append(dist.rvs())
dist = multivariate_normal(mean=[10,10], cov=[[1,1.1],[0.2,0.6]])
for i in range(400):
    data.append(dist.rvs())

df2 = pd.DataFrame(data, columns=["x", "y"])
df2.head()
plt.scatter(df2['x'], df2['y'])
plt.show()
[Figure: scatter plot of the second synthetic data set, x vs. y]
# Fit a DBSCAN estimator
estimator = DBSCAN(eps=2, min_samples=10)
X = df2[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df2['x'], df2['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({1: 400, 0: 43, -1: 7})
[Figure: second synthetic data set, x vs. y, colored by DBSCAN cluster (eps=2)]
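
The `eps` values in the DBSCAN cells above (0.8, 0.85, and 2) were picked by hand. A common heuristic for choosing `eps`, which the original notebook does not use, is to sort every point's distance to its k-th nearest neighbor (with k equal to `min_samples`) and look for the bend in that curve. The sketch below computes those distances on stand-in `make_blobs` data; the data and the percentiles printed are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Stand-in data (assumption): four blobs centred near the notebook's Gaussian means.
centers = [[0, 0], [5, 5], [9, 9], [-10, 5]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)

min_samples = 10

# Querying with the training data returns each point as its own first neighbour
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Distance to the min_samples-th neighbour, sorted ascending for plotting
k_distances = np.sort(distances[:, -1])
print(np.percentile(k_distances, [50, 90, 99]))
```

Plotting `k_distances` with `plt.plot(k_distances)` gives the usual k-distance curve. Points to the right of the bend are the ones DBSCAN would tend to label as noise for an `eps` chosen at the bend.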

Exercise: DBSCAN

For the Iris dataset, fit and plot DBSCAN models to:

  • sepal_length and petal_length

  • sepal_width and petal_width

Bonus: Compare your classifications to the known species. How well do the labels match up?

## Exercise Answers here
# Fit a DBSCAN estimator
estimator = DBSCAN(eps=0.5, min_samples=10)
X = iris[["sepal_length", "petal_length"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(iris["petal_length"], iris["sepal_length"], c=colors)
plt.xlabel("petal_length")
plt.ylabel("sepal_length")
plt.show()
Counter({1: 91, 0: 50, -1: 9})
[Figure: Iris data, petal_length vs. sepal_length, colored by DBSCAN cluster]
## Exercise Answers here
# Fit a DBSCAN estimator
estimator = DBSCAN(eps=0.5, min_samples=10)
X = iris[["sepal_width", "petal_width"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(iris["petal_width"], iris["sepal_width"], c=colors)
plt.xlabel("petal_width")
plt.ylabel("sepal_width")
plt.show()
Counter({1: 100, 0: 49, -1: 1})
[Figure: Iris data, petal_width vs. sepal_width, colored by DBSCAN cluster]

Hierarchical Clustering

# Hierarchical: Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering

# Fit an estimator
estimator = AgglomerativeClustering(n_clusters=4)
X = df[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df['x'], df['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({1: 156, 0: 150, 2: 150, 3: 144})
[Figure: synthetic data, x vs. y, colored by agglomerative cluster (n_clusters=4)]
# Hierarchical: Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering

# Fit an estimator
estimator = AgglomerativeClustering(n_clusters=2)
X = df_circ[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
colors = set_colors(labels)
plt.scatter(df_circ['x'], df_circ['y'], c=colors)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Counter({0: 314, 1: 186})
[Figure: annulus data, x vs. y, colored by agglomerative cluster (n_clusters=2)]
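
With its default settings, `AgglomerativeClustering` uses Ward linkage, and the 314 to 186 split above does not match the ring sizes of 200 and 300. The estimator also accepts a `linkage` parameter, and single linkage, which merges clusters by their closest pair of points, can follow chained shapes such as rings. The sketch below compares the four linkage options on stand-in rings from `make_circles`; it is an illustration under that assumption, not a rerun of the notebook's data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Stand-in rings (assumption): outer radius 1, inner radius 0.5, light noise
X, y_true = make_circles(n_samples=500, factor=0.5, noise=0.02, random_state=0)

scores = {}
for linkage in ["ward", "complete", "average", "single"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)
    # Agreement with the true ring membership (1.0 is a perfect match)
    scores[linkage] = adjusted_rand_score(y_true, labels)
    print(linkage, round(scores[linkage], 3))
```

Single linkage is sensitive to noise that bridges two groups, so it is not a general replacement for Ward linkage. It simply suits data where clusters are separated by a clear gap.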
## Silhouette Coefficient
from sklearn import metrics

estimator = KMeans(n_clusters=4)
X = df[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
print(metrics.silhouette_score(X, labels, metric='euclidean'))
Counter({0: 151, 1: 150, 2: 150, 3: 149})
0.695206988177
estimator = DBSCAN(eps=1.2, min_samples=10)
X = df[["x", "y"]]
estimator.fit(X)
# Clusters are given in the labels_ attribute
labels = estimator.labels_
print(Counter(labels))
print(metrics.silhouette_score(X, labels, metric='euclidean'))
Counter({1: 297, 0: 150, 2: 144, -1: 9})
0.658136576865

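The silhouette coefficient ranges from -1 to 1, and bigger is better. By this measure k-means (about 0.695) clustered this data set better than DBSCAN (about 0.658). Note that the DBSCAN score treats the 9 noise points labeled -1 as if they were a cluster of their own.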
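
`silhouette_score` has no notion of noise, so DBSCAN's `-1` points are scored as if they formed one more cluster. One way to check how much that affects the comparison is to score DBSCAN twice, once on all points and once on the clustered points only. The sketch below does this on stand-in `make_blobs` data; the data, the fixed `random_state`, and the variable names are assumptions, so the numbers will not match the outputs above.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data (assumption): four blobs centred near the notebook's Gaussian means.
centers = [[0, 0], [5, 5], [9, 9], [-10, 5]]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=1.2, min_samples=10).fit_predict(X)

score_kmeans = silhouette_score(X, kmeans_labels, metric='euclidean')

# Noise (-1) is treated as one more cluster here
score_dbscan_all = silhouette_score(X, dbscan_labels, metric='euclidean')

# Score only the points DBSCAN actually assigned to a cluster
clustered = dbscan_labels != -1
score_dbscan_clustered = silhouette_score(
    X[clustered], dbscan_labels[clustered], metric='euclidean')

print("k-means:", round(score_kmeans, 3))
print("DBSCAN, all points:", round(score_dbscan_all, 3))
print("DBSCAN, clustered points only:", round(score_dbscan_clustered, 3))
```

Dropping the noise points changes which points are being scored, so the two DBSCAN numbers answer slightly different questions. Reporting both alongside the noise count keeps the comparison transparent.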