GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_09/code/kmeans_clustering-lab - (done).ipynb
Kernel: Python 3

K-Means Clustering with Seeds Data

Authors: Joseph Nelson (DC), Haley Boyan (DC), Sam Stack (DC)


%matplotlib inline
import pandas as pd
import numpy as np
from sklearn import cluster
from sklearn import metrics
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import seaborn as sns

1. Import the data

seeds = pd.read_csv("../assets/datasets/seeds.csv")
seeds.head()

2. Do some EDA of relationships between features.

# Plot the data to see the distributions/relationships
sns.pairplot(seeds, hue='species')
<seaborn.axisgrid.PairGrid at 0x1a197b6240>
Image in a Jupyter notebook
# Check for nulls
seeds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 8 columns):
area               210 non-null float64
perimeter          210 non-null float64
compactness        210 non-null float64
length             210 non-null float64
width              210 non-null float64
asymmetry_coeff    210 non-null float64
groove_length      210 non-null float64
species            210 non-null int64
dtypes: float64(7), int64(1)
memory usage: 13.2 KB
# Summary statistics for each feature
seeds.describe()
# Look at the available columns, including the real species labels.
list(seeds.columns)
['area', 'perimeter', 'compactness', 'length', 'width', 'asymmetry_coeff', 'groove_length', 'species']

Remember, clustering is an unsupervised learning method, so known class labels are not normally available. In this case we can see that perimeter vs. groove_length is a good visualization for viewing the proper classes, and we can use it later to compare the clustering results to the true values.
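For instance, a quick scatter of those two features (a minimal sketch using the column names listed above) shows the three species separating fairly cleanly:

# Scatter perimeter vs. groove_length, colored by the true species label
plt.scatter(seeds['perimeter'], seeds['groove_length'],
            c=seeds['species'], cmap='viridis', alpha=0.7)
plt.xlabel('perimeter')
plt.ylabel('groove_length')
plt.title('True species labels');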

3. Prepare the data for clustering

  1. Remove the species column. We will see if the clusters from K-Means end up like the actual species.

  2. Put the features on the same scale.

# preprocessing was not imported above, so import it here
from sklearn import preprocessing

k = 8
X_scaled = preprocessing.MinMaxScaler().fit_transform(seeds.drop('species', axis=1))
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(X_scaled)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_
seeds['label'] = labels
seeds.head()
cols = seeds.columns[1:-1]
sns.pairplot(seeds, x_vars=cols, y_vars=cols, hue='label');
Image in a Jupyter notebook
# This is the real data, with labels
from sklearn.decomposition import PCA
from sklearn import cluster, datasets, preprocessing, metrics

seeds = pd.read_csv("../assets/datasets/seeds.csv")
X = seeds.drop('species', axis=1)
X = preprocessing.MinMaxScaler().fit_transform(X)
pca = PCA(n_components=2)
X = pca.fit_transform(X)
y = seeds.species
X_df = pd.DataFrame(X)
y_df = pd.DataFrame(y)
# print(X_df.shape, y_df.shape)
new_seeds = pd.concat([X_df, y_df], axis=1)
sns.pairplot(new_seeds, hue='species')
<seaborn.axisgrid.PairGrid at 0x1a1eafb160>
Image in a Jupyter notebook
# This is clustered, i.e. predicted
from sklearn.decomposition import PCA
from sklearn import cluster, datasets, preprocessing, metrics

seeds = pd.read_csv("../assets/datasets/seeds.csv")
X = seeds.drop('species', axis=1)
X = preprocessing.MinMaxScaler().fit_transform(X)
kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(X)
pca = PCA(n_components=2)
X = pca.fit_transform(X)
labels = kmeans.labels_
y = pd.DataFrame(labels, columns=['labels'])
new_seeds = pd.concat([pd.DataFrame(X), y], axis=1)
sns.pairplot(new_seeds, hue='labels')
<seaborn.axisgrid.PairGrid at 0x1a205b2cf8>
Image in a Jupyter notebook

4. Clustering with K-Means

  • Cluster the data into our target groups.

  • We know that there are 3 actual classes. However, in a real situation in which we used clustering we would have no idea. Let's initially try the default K for KMeans (8).

from sklearn.cluster import KMeans

5. Get the labels and centroids for our first clustering model.

# A:
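A minimal sketch of one possible answer; the names kmeans8, labels8, and centroids8 are illustrative, and it assumes X_scaled from step 3 is still in scope:

# Fit the default 8-cluster model and extract its fitted attributes
kmeans8 = KMeans(n_clusters=8)
kmeans8.fit(X_scaled)
labels8 = kmeans8.labels_              # cluster assignment for each seed
centroids8 = kmeans8.cluster_centers_  # one centroid per cluster, in scaled feature space
print(labels8[:10])
print(centroids8.shape)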

6. Compute the silhouette score and visually examine the results of the 8 clusters.

(pairplot with hue)

from sklearn.metrics import silhouette_score
# A:
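One possible sketch, assuming labels8 and X_scaled from the previous step and the seeds DataFrame re-read above:

# Silhouette score for the 8-cluster model (ranges from -1 to 1; higher is better)
print(silhouette_score(X_scaled, labels8))

# Visually examine the 8 clusters with a pairplot colored by cluster label
plotting = seeds.drop('species', axis=1).copy()
plotting['label8'] = labels8
sns.pairplot(plotting, hue='label8');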

7. Repeat steps #4 and #6 with two selected or random K values and compare the results to the k=8 model.

import random
random.randint(1, 25), random.randint(1, 25)
(1, 10)
# A:
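One possible sketch; k=3 and k=10 stand in for the selected/random values, since silhouette_score requires at least 2 clusters and a draw of k=1 like the one above would need to be skipped or re-drawn:

# Fit KMeans for two alternative K values and compare silhouettes to the k=8 model
for k in (3, 10):
    km = KMeans(n_clusters=k)
    km.fit(X_scaled)
    print(k, silhouette_score(X_scaled, km.labels_))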

8. Build a function to find the optimal number of clusters using silhouette score as the criterion.

  1. The function should accept a range of K values and a DataFrame as arguments.

  2. It should return the optimal K value, the associated silhouette score, and the scaling method.

  3. Your function should also consider differently scaled versions of the data:

    • normalize, StandardScaler, MinMaxScaler

Once you have found the optimal K and version of the data, visualize the clusters.

# A:
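A minimal sketch of such a function; the name find_best_k and the (k, score, scaler) return format are illustrative choices, not prescribed by the lab:

from sklearn.preprocessing import normalize, StandardScaler, MinMaxScaler

def find_best_k(df, k_range):
    """Return (best_k, best_silhouette, best_scaling) over a range of Ks
    and several scalings of the feature data."""
    scalings = {
        'normalize': normalize(df),
        'StandardScaler': StandardScaler().fit_transform(df),
        'MinMaxScaler': MinMaxScaler().fit_transform(df),
    }
    best_k, best_score, best_scaling = None, -1.0, None
    for name, X in scalings.items():
        for k in k_range:
            km = KMeans(n_clusters=k).fit(X)
            score = silhouette_score(X, km.labels_)
            if score > best_score:
                best_k, best_score, best_scaling = k, score, name
    return best_k, best_score, best_scaling

best_k, best_score, best_scaling = find_best_k(seeds.drop('species', axis=1), range(2, 11))
print(best_k, best_score, best_scaling)

With the winning K and scaling in hand, refit on that version of the data and visualize the clusters with a pairplot as in step 6.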