GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_09/code/01_intro-to-kmeans - (done).ipynb
Kernel: Python 3
from IPython.display import Image
from IPython.core.display import HTML

Intro to clustering and k-means

LEARNING OBJECTIVES

After this lesson, you will be able to:

  • Format and preprocess data for clustering

  • Perform a K-Means clustering analysis

  • Evaluate clusters for fit

Clustering is a powerful tool that will make you a stronger data scientist.

What's the difference between supervised and unsupervised learning?

Classification - create a model to predict which group a point belongs to

Clustering - find groups that exist in the data already

How could unsupervised learning or clustering be useful?

Helpful uses for clustering:

  • Find items with similar behavior (users, products, voters, etc)

  • Market segmentation

  • Understand complex systems

  • Discover meaningful categories for your data

  • Reduce the number of classes by grouping (e.g. bourbons, scotches -> whiskeys)

  • Reduce the dimensions of your problem

  • Pre-processing! Create labels for supervised learning

Great. Clustering is useful.

Any ideas for how to algorithmically tell which groups are different?

K Means


My First™ Clustering Algorithm

  1. Pick a value for k (the number of clusters to create)

  2. Initialize k 'centroids' (starting points) in your data

  3. Create your clusters. Assign each point to the nearest centroid.

  4. Make your clusters better. Move each centroid to the center of its cluster.

  5. Repeat steps 3-4 until your centroids converge.

These tutorial images come from Andrew Moore's CS class at Carnegie Mellon. His slide deck is online here: https://www.autonlab.org/tutorials/kmeans11.pdf. He also links to more of his tutorials on the first page.

Let's practice a toy example by hand. Take a 1-dimensional array of numbers:

import numpy as np

arr = [2, 5, 6, 8, 12, 15, 18, 28, 30]
your_initial_cluster = np.random.choice(arr, 3)
your_initial_cluster
array([ 2, 30, 6])
arr1 = []
for i in range(len(your_initial_cluster)):
    arr2 = []
    for e in range(len(arr)):
        x = your_initial_cluster[i] - arr[e]
        arr2.append(x)
    arr1.append(arr2)
print(arr1)
[[0, -3, -4, -6, -10, -13, -16, -26, -28], [28, 25, 24, 22, 18, 15, 12, 2, 0], [4, 1, 0, -2, -6, -9, -12, -22, -24]]
all_dist = []
for i, center in enumerate(your_initial_cluster):
    distances = []
    for num in arr:
        distance = np.sqrt((num - center)**2)
        distances.append(distance)
    all_dist.append(distances)
print(all_dist)
[[0.0, 3.0, 4.0, 6.0, 10.0, 13.0, 16.0, 26.0, 28.0], [28.0, 25.0, 24.0, 22.0, 18.0, 15.0, 12.0, 2.0, 0.0], [4.0, 1.0, 0.0, 2.0, 6.0, 9.0, 12.0, 22.0, 24.0]]

You run this a few times, basically until you hit convergence...

Take the distance between each observation and the centers, assign each observation to its closest center, average each group to get new centers, and run again.

Mock Distance Code

distance = np.sqrt((x-center)**2)

Pick k=3 random starting centroids and let the other points fall into clusters based on the closest centroid. Then, reassign your centroids based on the mean value of your cluster and repeat the process.
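To make steps 3-5 concrete, here is a minimal sketch (an addition, not part of the original notebook) that finishes the toy example: assign each number in arr to its nearest centroid, move each centroid to the mean of its cluster, and repeat until nothing moves.

# Minimal k-means loop on the 1-D toy data; assumes arr and your_initial_cluster
# from the cells above.
centroids = [float(c) for c in your_initial_cluster]
while True:
    # Step 3: assign each point to its nearest centroid
    clusters = {i: [] for i in range(len(centroids))}
    for num in arr:
        nearest = min(range(len(centroids)), key=lambda i: abs(num - centroids[i]))
        clusters[nearest].append(num)
    # Step 4: move each centroid to the mean of its cluster
    new_centroids = [np.mean(points) if points else centroids[i]
                     for i, points in clusters.items()]
    # Step 5: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
print(centroids)
print(clusters)

Different draws for your_initial_cluster can converge to different final clusters, which is exactly the point of the next question.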

Check with your neighbors. Do you have the same clusters? Why or why not?

K Means is a powerful algorithm, but different starting points may give you different clusters. You won't necessarily get an optimal cluster.

Metrics for assessing your clusters

Inertia -- sum of squared errors for each cluster

  • ranges from 0 to very high values

  • low inertia = dense clusters

Silhouette Score -- measure of how far apart clusters are

  • ranges from -1 to 1

  • high silhouette score = clusters are well separated

Inertia -- sum of squared errors for each cluster

  • low inertia = dense cluster

$\sum_{j=0}^{n} (x_j - \mu_i)^2$

where $\mu_i$ is a cluster centroid. (K-means explicitly tries to minimize this.)

.inertia_ is an attribute of sklearn's kmeans models.

Silhouette Score -- measure of how far apart clusters are

  • ranges from -1 to 1

  • high silhouette score = clusters are well separated

The definition is a little involved*, but intuitively the score is based on how much closer data points are to their own cluster than to the nearest neighboring cluster.

We can calculate it in sklearn with metrics.silhouette_score(X_scaled, labels, metric='euclidean').

* https://en.wikipedia.org/wiki/Silhouette_(clustering)
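As a rough illustration (an addition, not from the original notebook): for each point, the score is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. sklearn also exposes these per-sample values, assuming X_scaled and labels as defined in the guided practice below:

# Per-sample silhouette values for a fitted clustering (hypothetical usage;
# assumes X_scaled and labels from the guided practice below)
sample_scores = metrics.silhouette_samples(X_scaled, labels, metric='euclidean')
print(sample_scores.mean())   # this mean is exactly metrics.silhouette_score(...)
print(sample_scores.min())    # values near -1 suggest probably-misassigned points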

How do I know which K to pick?

Sometimes you have good context:

  • I need to create 3 profiles for marketing to target

Other times you have to figure it out:

  • My scatter plots show 2 linearly separable clusters
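One common heuristic, sketched below (an addition, not part of the original notebook), is to sweep over candidate values of k and watch how inertia and the silhouette score change. It assumes a scaled feature matrix X_scaled like the one built in the guided practice that follows:

# Hypothetical elbow / silhouette sweep over candidate values of k
for k in range(2, 8):
    km = cluster.KMeans(n_clusters=k, n_init=10).fit(X_scaled)
    sil = metrics.silhouette_score(X_scaled, km.labels_, metric='euclidean')
    print(k, km.inertia_, sil)

Inertia keeps shrinking as k grows, so look for the 'elbow' where it stops dropping quickly, or for the k with the highest silhouette score.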

%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn import cluster, datasets, preprocessing, metrics
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

Guided practice

Let's do some clustering with the iris dataset.

# Check out the dataset and our target values
df = pd.read_csv("../assets/datasets/iris.csv")
print(df['Name'].value_counts())
df.head(5)
Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: Name, dtype: int64

Let's plot the data to see the distributions:

cols = df.columns[:-1]
sns.pairplot(df[cols])
<seaborn.axisgrid.PairGrid at 0x116cbd7b8>
Image in a Jupyter notebook

Next, since each of our features has different units and ranges, let's do some preprocessing:

# MinMaxScaler rescales each feature to the [0, 1] range,
# squashing every feature onto one common scale.
# Could you z-score instead? (See the sketch below.)
X_scaled = preprocessing.MinMaxScaler().fit_transform(df[cols])
pd.DataFrame(X_scaled, columns=cols).describe()
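To answer the question in the comment above: yes, you can z-score instead. A hedged alternative sketch (not what this notebook uses) with sklearn's StandardScaler:

# Alternative preprocessing: z-score each feature (mean 0, standard deviation 1)
X_standardized = preprocessing.StandardScaler().fit_transform(df[cols])
pd.DataFrame(X_standardized, columns=cols).describe()

Either scaler works for k-means; the point is that no single feature should dominate the Euclidean distances just because of its units.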

Now that we've formatted our data and understand its structures, we can finally go ahead and cluster.

We're going to set k = 2, given the pattern we saw in the graphs above.

# ?cluster.KMeans  -- run this to pull up the docstring for KMeans

k = 2  # number of clusters
kmeans = cluster.KMeans(n_clusters=k, n_init=10)  # n_init = number of random re-initializations; the best run is kept
kmeans.fit(X_scaled)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
kmeans.predict(X_scaled)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

We can use the fitted model's attributes to get the cluster labels, the centroid locations, and the inertia:

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_
inertia
12.14368828157972
labels
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
centroids
array([[0.545 , 0.36333333, 0.6620339 , 0.65666667], [0.19611111, 0.59083333, 0.07864407, 0.06 ]])
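As a sanity check (a quick sketch, not part of the original notebook), we can recompute the inertia formula from earlier by hand and confirm it matches what sklearn reports:

# Sum of squared distances from each point to its assigned centroid
manual_inertia = sum(((X_scaled[labels == i] - c)**2).sum() for i, c in enumerate(centroids))
print(manual_inertia)  # should match kmeans.inertia_ up to floating-point error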

And to compute the clusters' silhouette coefficient:

metrics.silhouette_score(X_scaled, labels, metric='euclidean')
0.6294675561906644

...and we're done! You've completed your first clustering analysis.

Let's see how it looks. First, let's put the labels column into our dataframe.

df['label'] = labels
df.head()

Let's plot each cluster in a different color. Seaborn has a 'hue' parameter we can use for this.

cols = df.columns[:-2]
sns.pairplot(df, x_vars=cols, y_vars=cols, hue='label')
<seaborn.axisgrid.PairGrid at 0x1196ebe48>
Image in a Jupyter notebook

For comparison, here's the data colored by name of the plant.

sns.pairplot(df, x_vars=cols, y_vars=cols, hue='Name')
<seaborn.axisgrid.PairGrid at 0x1a219e7668>
Image in a Jupyter notebook

Independent practice

A) Repeat our clustering analysis for the foods nutrients dataset (below). There are no "true" labels for this one!

B) Then go back up and separate our iris observations into different numbers of clusters.

  • How do the inertia and silhouette scores change?

  • What if you don't scale your features?

  • Is there a 'right' k? Why or why not?

Repeat this for the foods nutrients dataset.
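Before moving on to the foods data, here is one hedged way to start the scaling question above (a sketch, not the notebook's solution): refit k-means on the raw, unscaled iris features and compare.

# Hypothetical comparison: k-means on the unscaled iris features
kmeans_raw = cluster.KMeans(n_clusters=2, n_init=10).fit(df[cols].values)
print(kmeans_raw.inertia_)   # not comparable to the scaled run -- inertia depends on the units
print(metrics.silhouette_score(df[cols].values, kmeans_raw.labels_, metric='euclidean'))

The silhouette score is comparable across the two runs; the inertia values are not, because inertia is measured in the (squared) units of the features.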

# http://people.sc.fsu.edu/~jburkardt/datasets/hartigan/file06.txt
import pandas as pd

foods = pd.read_csv('../assets/datasets/nutrients.txt', sep=r'\s+')
foods.head()

k = 2
X_scaled = preprocessing.MinMaxScaler().fit_transform(foods.drop('Name', axis=1))
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(X_scaled)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_

foods['label'] = labels
foods.head()

cols = foods.columns[1:-1]
sns.pairplot(foods, x_vars=cols, y_vars=cols, hue='label');
Image in a Jupyter notebook
foods['cluster'] = kmeans.predict(X_scaled)
foods
# http://people.sc.fsu.edu/~jburkardt/datasets/hartigan/file06.txt
import pandas as pd

foods = pd.read_csv('../assets/datasets/nutrients.txt', sep=r'\s+')
foods.head()

for k in range(1, 10):
    X_scaled = preprocessing.MinMaxScaler().fit_transform(foods.drop('Name', axis=1))
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans.fit(X_scaled)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_
    inertia = kmeans.inertia_
    foods['label'] = labels
    foods.head()
    cols = foods.columns[1:-1]
    print(inertia)
    sns.pairplot(foods, x_vars=cols, y_vars=cols, hue='label');
8.520998875679423
5.069321339929418
3.366621653614521
2.560840907325132
1.948224606433282
1.5306011662906922
1.128824420809651
0.8058956050644697
0.6572453278863741
Image in a Jupyter notebook (nine pairplots, one per value of k)

Further reading

  • The sklearn documentation has a great summary of many other clustering algorithms.

  • DBSCAN is one popular alternative.

  • This PyData talk is a good overview of clustering, different algorithms, and how to think about the quality of your clusters.