GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_09/code/01_intro-to-kmeans - (done).ipynb
Kernel: Python 3
from IPython.display import Image
from IPython.core.display import HTML

Intro to clustering and k-means

LEARNING OBJECTIVES

After this lesson, you will be able to:

  • Format and preprocess data for clustering

  • Perform a K-Means clustering analysis

  • Evaluate clusters for fit

Clustering is a powerful tool that will make you a stronger data scientist.

What's the difference between supervised and unsupervised learning?

Classification - create a model to predict which group a point belongs to

Clustering - find groups that exist in the data already

How could unsupervised learning or clustering be useful?

Helpful uses for clustering:

  • Find items with similar behavior (users, products, voters, etc)

  • Market segmentation

  • Understand complex systems

  • Discover meaningful categories for your data

  • Reduce the number of classes by grouping (e.g. bourbons, scotches -> whiskeys)

  • Reduce the dimensions of your problem

  • Pre-processing! Create labels for supervised learning

Great. Clustering is useful.

Any ideas for how to algorithmically tell which groups are different?

K Means


My First™ Clustering Algorithm

  1. Pick a value for k (the number of clusters to create)

  2. Initialize k 'centroids' (starting points) in your data

  3. Create your clusters. Assign each point to the nearest centroid.

  4. Make your clusters better. Move each centroid to the center of its cluster.

  5. Repeat steps 3-4 until your centroids converge.

These tutorial images come from Andrew Moore's CS class at Carnegie Mellon. His slide deck is online here: https://www.autonlab.org/tutorials/kmeans11.pdf. He also links to more of his tutorials on the first page.

Let's practice a toy example by hand. Take a 1-dimensional array of numbers:

import numpy as np

arr = [2, 5, 6, 8, 12, 15, 18, 28, 30]
your_initial_cluster = np.random.choice(arr, 3)
your_initial_cluster
array([ 2, 30, 6])
arr1 = []
for i in range(len(your_initial_cluster)):
    arr2 = []
    for e in range(len(arr)):
        x = your_initial_cluster[i] - arr[e]
        arr2.append(x)
    arr1.append(arr2)
print(arr1)
[[0, -3, -4, -6, -10, -13, -16, -26, -28], [28, 25, 24, 22, 18, 15, 12, 2, 0], [4, 1, 0, -2, -6, -9, -12, -22, -24]]
all_dist = []
for i, center in enumerate(your_initial_cluster):
    distances = []
    for num in arr:
        distance = np.sqrt((num - center)**2)
        distances.append(distance)
    all_dist.append(distances)
print(all_dist)
[[0.0, 3.0, 4.0, 6.0, 10.0, 13.0, 16.0, 26.0, 28.0], [28.0, 25.0, 24.0, 22.0, 18.0, 15.0, 12.0, 2.0, 0.0], [4.0, 1.0, 0.0, 2.0, 6.0, 9.0, 12.0, 22.0, 24.0]]

You run this a few times, basically until you hit convergence...

Take the distance between each observation and the centers, assign each observation to its closest center, average each group to get new centers, and run again.

Mock Distance Code

distance = np.sqrt((x-center)**2)

Pick k=3 random starting centroids and let the other points fall into clusters based on the closest centroid. Then, reassign your centroids based on the mean value of your cluster and repeat the process.
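To make steps 3-5 concrete, here is a minimal sketch (an addition, not part of the original notebook) that finishes the toy example: assign each number in arr to its nearest centroid, move each centroid to the mean of its cluster, and repeat until nothing moves.

# Minimal k-means loop on the 1-D toy data; assumes arr and your_initial_cluster
# from the cells above.
centroids = [float(c) for c in your_initial_cluster]
while True:
    # Step 3: assign each point to its nearest centroid
    clusters = {i: [] for i in range(len(centroids))}
    for num in arr:
        nearest = min(range(len(centroids)), key=lambda i: abs(num - centroids[i]))
        clusters[nearest].append(num)
    # Step 4: move each centroid to the mean of its cluster
    new_centroids = [np.mean(points) if points else centroids[i]
                     for i, points in clusters.items()]
    # Step 5: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
print(centroids)
print(clusters)

Different draws for your_initial_cluster can converge to different final clusters, which is exactly the point of the next question.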

Check with your neighbors. Do you have the same clusters? Why or why not?

K Means is a powerful algorithm, but different starting points may give you different clusters. You won't necessarily get an optimal cluster.

Metrics for assessing your clusters

Inertia -- sum of squared errors for each cluster

  • ranges from 0 to very high values

  • low inertia = dense clusters

Silhouette Score -- measure of how far apart clusters are

  • ranges from -1 to 1

  • high silhouette score = clusters are well separated

Inertia -- sum of squared errors for each cluster

  • low inertia = dense cluster

$\sum_{j=0}^{n} (x_j - \mu_i)^2$

where $\mu_i$ is a cluster centroid. (K-means explicitly tries to minimize this.)

.inertia_ is an attribute of sklearn's kmeans models.

Silhouette Score -- measure of how far apart clusters are

  • ranges from -1 to 1

  • high silhouette score = clusters are well separated

The definition is a little involved*, but intuitively the score is based on how much closer data points are to their own cluster than to the nearest neighboring cluster.

We can calculate it in sklearn with metrics.silhouette_score(X_scaled, labels, metric='euclidean').

* https://en.wikipedia.org/wiki/Silhouette_(clustering)
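As a rough illustration (an addition, not from the original notebook): for each point, the score is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. sklearn also exposes these per-sample values, assuming X_scaled and labels as defined in the guided practice below:

# Per-sample silhouette values for a fitted clustering (hypothetical usage;
# assumes X_scaled and labels from the guided practice below)
sample_scores = metrics.silhouette_samples(X_scaled, labels, metric='euclidean')
print(sample_scores.mean())   # this mean is exactly metrics.silhouette_score(...)
print(sample_scores.min())    # values near -1 suggest probably-misassigned points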

How do I know which K to pick?

Sometimes you have good context:

  • I need to create 3 profiles for marketing to target

Other times you have to figure it out:

  • My scatter plots show 2 linearly separable clusters
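One common heuristic, sketched below (an addition, not part of the original notebook), is to sweep over candidate values of k and watch how inertia and the silhouette score change. It assumes a scaled feature matrix X_scaled like the one built in the guided practice that follows:

# Hypothetical elbow / silhouette sweep over candidate values of k
for k in range(2, 8):
    km = cluster.KMeans(n_clusters=k, n_init=10).fit(X_scaled)
    sil = metrics.silhouette_score(X_scaled, km.labels_, metric='euclidean')
    print(k, km.inertia_, sil)

Inertia keeps shrinking as k grows, so look for the 'elbow' where it stops dropping quickly, or for the k with the highest silhouette score.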

%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn import cluster, datasets, preprocessing, metrics
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

Guided practice

Let's do some clustering with the iris dataset.

# Check out the dataset and our target values
df = pd.read_csv("../assets/datasets/iris.csv")
print(df['Name'].value_counts())
df.head(5)
Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: Name, dtype: int64

Let's plot the data to see the distributions:

cols = df.columns[:-1]
sns.pairplot(df[cols])
<seaborn.axisgrid.PairGrid at 0x116cbd7b8>
Image in a Jupyter notebook

Next, since each of our features has different units and ranges, let's do some preprocessing:

# MinMaxScaler rescales each feature to the [0, 1] range,
# squashing every feature onto one common scale.
# Could you z-score instead? (See the sketch below.)
X_scaled = preprocessing.MinMaxScaler().fit_transform(df[cols])
pd.DataFrame(X_scaled, columns=cols).describe()
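To answer the question in the comment above: yes, you can z-score instead. A hedged alternative sketch (not what this notebook uses) with sklearn's StandardScaler:

# Alternative preprocessing: z-score each feature (mean 0, standard deviation 1)
X_standardized = preprocessing.StandardScaler().fit_transform(df[cols])
pd.DataFrame(X_standardized, columns=cols).describe()

Either scaler works for k-means; the point is that no single feature should dominate the Euclidean distances just because of its units.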

Now that we've formatted our data and understand its structures, we can finally go ahead and cluster.

We're going to set k = 2, given the pattern we saw in the graphs above.

# ?cluster.KMeans  -- run this to pull up the docstring for KMeans

k = 2  # number of clusters
kmeans = cluster.KMeans(n_clusters=k, n_init=10)  # n_init = number of random re-initializations; the best run is kept
kmeans.fit(X_scaled)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)
kmeans.predict(X_scaled)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

We can use the fitted model's attributes to get the cluster labels, the centroid locations, and the inertia:

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_
inertia
12.14368828157972
labels
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
centroids
array([[0.545 , 0.36333333, 0.6620339 , 0.65666667], [0.19611111, 0.59083333, 0.07864407, 0.06 ]])
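As a sanity check (a quick sketch, not part of the original notebook), we can recompute the inertia formula from earlier by hand and confirm it matches what sklearn reports:

# Sum of squared distances from each point to its assigned centroid
manual_inertia = sum(((X_scaled[labels == i] - c)**2).sum() for i, c in enumerate(centroids))
print(manual_inertia)  # should match kmeans.inertia_ up to floating-point error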

And to compute the clusters' silhouette coefficient:

metrics.silhouette_score(X_scaled, labels, metric='euclidean')
0.6294675561906644

...and we're done! You've completed your first clustering analysis.

Let's see how it looks. First, let's put the labels column into our dataframe.

df['label'] = labels
df.head()

Let's plot each cluster in a different color. Seaborn has a 'hue' parameter we can use for this.

cols = df.columns[:-2]
sns.pairplot(df, x_vars=cols, y_vars=cols, hue='label')
<seaborn.axisgrid.PairGrid at 0x1196ebe48>
Image in a Jupyter notebook

For comparison, here's the data colored by name of the plant.

sns.pairplot(df, x_vars=cols, y_vars=cols, hue='Name')
<seaborn.axisgrid.PairGrid at 0x1a219e7668>
Image in a Jupyter notebook

Independent practice

A) Repeat our clustering analysis for the foods nutrients dataset (below). There are no "true" labels for this one!

B) Then go back up and separate our iris observations into different numbers of clusters.

  • How do the inertia and silhouette scores change?

  • What if you don't scale your features?

  • Is there a 'right' k? Why or why not?

Repeat this for the foods nutrients dataset.
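Before moving on to the foods data, here is one hedged way to start the scaling question above (a sketch, not the notebook's solution): refit k-means on the raw, unscaled iris features and compare.

# Hypothetical comparison: k-means on the unscaled iris features
kmeans_raw = cluster.KMeans(n_clusters=2, n_init=10).fit(df[cols].values)
print(kmeans_raw.inertia_)   # not comparable to the scaled run -- inertia depends on the units
print(metrics.silhouette_score(df[cols].values, kmeans_raw.labels_, metric='euclidean'))

The silhouette score is comparable across the two runs; the inertia values are not, because inertia is measured in the (squared) units of the features.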

# http://people.sc.fsu.edu/~jburkardt/datasets/hartigan/file06.txt
import pandas as pd

foods = pd.read_csv('../assets/datasets/nutrients.txt', sep=r'\s+')
foods.head()

k = 2
X_scaled = preprocessing.MinMaxScaler().fit_transform(foods.drop('Name', axis=1))
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(X_scaled)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_

foods['label'] = labels
foods.head()

cols = foods.columns[1:-1]
sns.pairplot(foods, x_vars=cols, y_vars=cols, hue='label');
Image in a Jupyter notebook
foods['cluster'] = kmeans.predict(X_scaled)
foods
# http://people.sc.fsu.edu/~jburkardt/datasets/hartigan/file06.txt
import pandas as pd

foods = pd.read_csv('../assets/datasets/nutrients.txt', sep=r'\s+')
foods.head()

for k in range(1, 10):
    X_scaled = preprocessing.MinMaxScaler().fit_transform(foods.drop('Name', axis=1))
    kmeans = cluster.KMeans(n_clusters=k)
    kmeans.fit(X_scaled)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_
    inertia = kmeans.inertia_
    foods['label'] = labels
    foods.head()
    cols = foods.columns[1:-1]
    print(inertia)
    sns.pairplot(foods, x_vars=cols, y_vars=cols, hue='label');
8.520998875679423
5.069321339929418
3.366621653614521
2.560840907325132
1.948224606433282
1.5306011662906922
1.128824420809651
0.8058956050644697
0.6572453278863741
Image in a Jupyter notebook (nine pairplots, one per value of k)

Further reading

  • The sklearn documentation has a great summary of many other clustering algorithms.

  • DBSCAN is one popular alternative.

  • This PyData talk is a good overview of clustering, different algorithms, and how to think about the quality of your clusters.