Intro to clustering and k-means
LEARNING OBJECTIVES
After this lesson, you will be able to:
Format and preprocess data for clustering
Perform a K-Means clustering analysis
Evaluate clusters for fit
Being able to create clusters is a powerful skill that will make you a stronger data scientist.
What's the difference between supervised and unsupervised learning?

Classification - create a model to predict which group a point belongs to
Clustering - find groups that exist in the data already
How could unsupervised learning or clustering be useful?
Helpful uses for clustering:
Find items with similar behavior (users, products, voters, etc)
Market segmentation
Understand complex systems
Discover meaningful categories for your data
Reduce the number of classes by grouping (e.g. bourbons, scotches -> whiskeys)
Reduce the dimensions of your problem
Pre-processing! Create labels for supervised learning
Great. Clustering is useful.
Any ideas for how an algorithm could tell these groups apart?

K-Means
My First™ Clustering Algorithm
1. Pick a value for k (the number of clusters to create).
2. Initialize k 'centroids' (starting points) in your data.
3. Create your clusters: assign each point to the nearest centroid.
4. Make your clusters better: move each centroid to the center of its cluster.
5. Repeat steps 3-4 until your centroids converge.
These tutorial images come from Andrew Moore's CS class at Carnegie Mellon. His slide deck is online here: https://www.autonlab.org/tutorials/kmeans11.pdf. He also links to more of his tutorials on the first page.
Let's practice a toy example by hand. Take a 1-dimensional array of numbers:
You run this a few times, basically until you hit convergence: take the distance between each observation and the centers, assign each observation to its closest center, then take the average of each group to get new centers and run again.
Mock Distance Code
Pick k=3 random starting centroids and let the other points fall into clusters based on the closest centroid. Then, reassign your centroids based on the mean value of your cluster and repeat the process.
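Here's a minimal sketch of that procedure on a made-up 1-dimensional array (the data, k=3, and the random seed are all just for illustration):

```python
import numpy as np

# Made-up 1-D data with three visible groups
points = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 24.0, 25.0, 26.0])
k = 3

# Pick k random starting centroids from the data
rng = np.random.default_rng(42)
centroids = rng.choice(points, size=k, replace=False)

for _ in range(100):
    # Assign each point to its nearest centroid
    distances = np.abs(points[:, None] - centroids[None, :])
    labels = distances.argmin(axis=1)
    # Move each centroid to the mean of its cluster (leave it put if the cluster is empty)
    new_centroids = np.array([
        points[labels == j].mean() if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])
    if np.allclose(new_centroids, centroids):
        break  # converged
    centroids = new_centroids

print("centroids:", centroids)
print("labels:", labels)
```

Rerun it with a different seed and you may land on a different (possibly worse) clustering.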
Check with your neighbors. Do you have the same clusters? Why or why not?
K-Means is a powerful algorithm, but different starting points may give you different clusters. You won't necessarily get an optimal clustering.
Metrics for assessing your clusters
Inertia -- sum of squared errors for each cluster
ranges from 0 to very high values
low inertia = dense clusters
Silhouette Score -- measure of how far apart clusters are
ranges from -1 to 1
high silhouette score = clusters are well separated
Inertia -- sum of squared errors for each cluster
low inertia = dense clusters
$$\sum_{i=0}^{n} \min_{\mu_j \in C} \left( \lVert x_i - \mu_j \rVert^2 \right)$$
where $\mu_j$ is a cluster centroid. (K-means explicitly tries to minimize this.)
.inertia_ is an attribute of sklearn's KMeans models.
Silhouette Score -- measure of how far apart clusters are
ranges from -1 to 1
high silhouette score = clusters are well separated
The definition is a little involved, but intuitively the score is based on how much closer data points are to their own clusters than to the nearest neighbor cluster.
We can calculate it in sklearn with metrics.silhouette_score(X_scaled, labels, metric='euclidean').
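A quick sketch of both metrics together, on synthetic blob data (made up here just to have something to score):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated blobs, just for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, random_state=0).fit(X)

print(km.inertia_)  # sum of squared distances to the nearest centroid; lower = denser clusters
print(metrics.silhouette_score(X, km.labels_, metric='euclidean'))  # near 1 = well separated
```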
How do I know which K to pick?
Sometimes you have good context:
I need to create 3 profiles for marketing to target
Other times you have to figure it out:
My scatter plots show 2 linearly separable clusters
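In the "figure it out" case, one common move (and what the independent practice below asks you to explore) is to fit several values of k and compare inertia and silhouette scores. A sketch, with synthetic blobs standing in for your scaled feature matrix:

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # stand-in for your scaled features

# Inertia always drops as k grows, so look for the k where it stops dropping quickly
# and the silhouette score is high
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={metrics.silhouette_score(X, km.labels_):.3f}")
```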
Guided practice
Let's do some clustering with the iris dataset.
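A minimal sketch of loading it into a dataframe, using sklearn's bundled copy of iris (the original notebook may have loaded it differently):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['name'] = pd.Categorical.from_codes(iris.target, iris.target_names)  # species labels
df.head()
```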
Let's plot the data to see the distributions:
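One way to do this, assuming the df from the sketch above, is seaborn's pairplot:

```python
import seaborn as sns

# Pairwise scatter plots of the four measurements, colored by species
sns.pairplot(df, hue='name')
```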
Next, since our features have different units and ranges, let's do some preprocessing:
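A sketch with StandardScaler, assuming the four numeric columns from above:

```python
from sklearn.preprocessing import StandardScaler

features = df.drop(columns='name')                   # the four numeric measurements
X_scaled = StandardScaler().fit_transform(features)  # each column: mean 0, variance 1
```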
Now that we've formatted our data and understand its structure, we can finally go ahead and cluster.
We're going to set k = 2, given the pattern we were seeing above in our graphs.
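Something along these lines:

```python
from sklearn.cluster import KMeans

k = 2
kmeans = KMeans(n_clusters=k, random_state=0)  # random_state pins the (random) initialization
kmeans.fit(X_scaled)
```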
We can use Scikit's built-in functions to determine the locations of the labels, centroids, and cluster inertia:
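For example:

```python
labels = kmeans.labels_              # cluster assignment for each observation
centroids = kmeans.cluster_centers_  # centroid coordinates, in scaled space
inertia = kmeans.inertia_            # sum of squared distances to the nearest centroid
print(inertia)
```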
And to compute the clusters' silhouette coefficient:
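Using the call from the metrics section above:

```python
from sklearn import metrics

metrics.silhouette_score(X_scaled, labels, metric='euclidean')
```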
...and we're done! You've completed your first clustering analysis.
Let's see how it looks. First, let's put the labels column into our dataframe.
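For instance (the column name 'cluster' is just a choice):

```python
df['cluster'] = labels
df.head()
```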
Let's plot each cluster in a different color. Seaborn has a 'hue' parameter we can use for this.
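A sketch; which pair of features the original plot used isn't shown, so petal length vs. petal width is a guess:

```python
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='cluster')
```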
For comparison, here's the data colored by name of the plant.
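Same sketch, with hue switched to the species label:

```python
sns.scatterplot(data=df, x='petal length (cm)', y='petal width (cm)', hue='name')
```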
Independent practice
A) Repeat our clustering analysis for the foods nutrients dataset (below). There are no "true" labels for this one!
B) Then go back up and separate our iris observations into different numbers of clusters.
How do the inertia and silhouette scores change?
What if you don't scale your features?
Is there a 'right' k? Why or why not?
Repeat this for the foods nutrients dataset.
Further reading
The sklearn documentation has a great summary of many other clustering algorithms.
DBSCAN is one popular alternative.
This PyData talk is a good overview of clustering, different algorithms, and how to think about the quality of your clusters.