GitHub Repository: YStrano/DataScience_GA
Path: blob/master/lessons/lesson_09/code/kmeans_clustering-lab - (done).ipynb
Kernel: Python 3

K-Means Clustering with Seeds Data

Authors: Joseph Nelson (DC), Haley Boyan (DC), Sam Stack (DC)


%matplotlib inline
import pandas as pd
import numpy as np
from sklearn import cluster
from sklearn import metrics
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import seaborn as sns

1. Import the data

seeds = pd.read_csv("../assets/datasets/seeds.csv")
seeds.head()

2. Do some EDA of relationships between features.

# Plot the data to see the distributions/relationships
sns.pairplot(seeds, hue='species')
<seaborn.axisgrid.PairGrid at 0x1a197b6240>
Image in a Jupyter notebook
# Check for nulls
seeds.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 8 columns):
area               210 non-null float64
perimeter          210 non-null float64
compactness        210 non-null float64
length             210 non-null float64
width              210 non-null float64
asymmetry_coeff    210 non-null float64
groove_length      210 non-null float64
species            210 non-null int64
dtypes: float64(7), int64(1)
memory usage: 13.2 KB
# Summary statistics for each feature
seeds.describe()
# Look at the available columns, including the real species labels.
list(seeds.columns)
['area', 'perimeter', 'compactness', 'length', 'width', 'asymmetry_coeff', 'groove_length', 'species']

Remember, clustering is an unsupervised learning method, so known class labels are not normally available. In this case we can see that perimeter vs. groove_length is a good visualization for viewing the proper classes, and we can use it later to compare the clustering results to the true values.
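For instance, a quick scatter of those two features (a minimal sketch using the column names listed above) shows the three species separating fairly cleanly:

# Scatter perimeter vs. groove_length, colored by the true species label
plt.scatter(seeds['perimeter'], seeds['groove_length'],
            c=seeds['species'], cmap='viridis', alpha=0.7)
plt.xlabel('perimeter')
plt.ylabel('groove_length')
plt.title('True species labels');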

3. Prepare the data for clustering

  1. Remove the species column. We will see if the clusters from K-Means end up like the actual species.

  2. Put the features on the same scale.

# preprocessing was not imported above, so import it here
from sklearn import preprocessing

k = 8
X_scaled = preprocessing.MinMaxScaler().fit_transform(seeds.drop('species', axis=1))
kmeans = cluster.KMeans(n_clusters=k)
kmeans.fit(X_scaled)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_
seeds['label'] = labels
seeds.head()
cols = seeds.columns[1:-1]
sns.pairplot(seeds, x_vars=cols, y_vars=cols, hue='label');
Image in a Jupyter notebook
# This is the real data, with labels
from sklearn.decomposition import PCA
from sklearn import cluster, datasets, preprocessing, metrics

seeds = pd.read_csv("../assets/datasets/seeds.csv")
X = seeds.drop('species', axis=1)
X = preprocessing.MinMaxScaler().fit_transform(X)
pca = PCA(n_components=2)
X = pca.fit_transform(X)
y = seeds.species
X_df = pd.DataFrame(X)
y_df = pd.DataFrame(y)
# print(X_df.shape, y_df.shape)
new_seeds = pd.concat([X_df, y_df], axis=1)
sns.pairplot(new_seeds, hue='species')
<seaborn.axisgrid.PairGrid at 0x1a1eafb160>
Image in a Jupyter notebook
# This is clustered, i.e. predicted
from sklearn.decomposition import PCA
from sklearn import cluster, datasets, preprocessing, metrics

seeds = pd.read_csv("../assets/datasets/seeds.csv")
X = seeds.drop('species', axis=1)
X = preprocessing.MinMaxScaler().fit_transform(X)
kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(X)
pca = PCA(n_components=2)
X = pca.fit_transform(X)
labels = kmeans.labels_
y = pd.DataFrame(labels, columns=['labels'])
new_seeds = pd.concat([pd.DataFrame(X), y], axis=1)
sns.pairplot(new_seeds, hue='labels')
<seaborn.axisgrid.PairGrid at 0x1a205b2cf8>
Image in a Jupyter notebook

4. Clustering with K-Means

  • Cluster the data into our target groups.

  • We know that there are 3 actual classes. However, in a real situation in which we used clustering we would have no idea. Let's initially try the default K for KMeans (8).

from sklearn.cluster import KMeans

5. Get the labels and centroids for our first clustering model.

# A:
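A minimal sketch of one possible answer; the names kmeans8, labels8, and centroids8 are illustrative, and it assumes X_scaled from step 3 is still in scope:

# Fit the default 8-cluster model and extract its fitted attributes
kmeans8 = KMeans(n_clusters=8)
kmeans8.fit(X_scaled)
labels8 = kmeans8.labels_              # cluster assignment for each seed
centroids8 = kmeans8.cluster_centers_  # one centroid per cluster, in scaled feature space
print(labels8[:10])
print(centroids8.shape)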

6. Compute the silhouette score and visually examine the results of the 8 clusters.

(pairplot with hue)

from sklearn.metrics import silhouette_score
# A:
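One possible sketch, assuming labels8 and X_scaled from the previous step and the seeds DataFrame re-read above:

# Silhouette score for the 8-cluster model (ranges from -1 to 1; higher is better)
print(silhouette_score(X_scaled, labels8))

# Visually examine the 8 clusters with a pairplot colored by cluster label
plotting = seeds.drop('species', axis=1).copy()
plotting['label8'] = labels8
sns.pairplot(plotting, hue='label8');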

7. Repeat steps #4 and #6 with two selected or random K values and compare the results to the k=8 model.

import random
random.randint(1, 25), random.randint(1, 25)
(1, 10)
# A:
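One possible sketch; k=3 and k=10 stand in for the selected/random values, since silhouette_score requires at least 2 clusters and a draw of k=1 like the one above would need to be skipped or re-drawn:

# Fit KMeans for two alternative K values and compare silhouettes to the k=8 model
for k in (3, 10):
    km = KMeans(n_clusters=k)
    km.fit(X_scaled)
    print(k, silhouette_score(X_scaled, km.labels_))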

8. Build a function to find the optimal number of clusters using silhouette score as the criterion.

  1. The function should accept a range of K values and a DataFrame as arguments.

  2. It should return the optimal K value, the associated silhouette score, and the scaling method.

  3. Your function should also consider differently scaled versions of the data:

    • normalize, StandardScaler, MinMaxScaler

Once you have found the optimal K and version of the data, visualize the clusters.

# A:
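A minimal sketch of such a function; the name find_best_k and the (k, score, scaler) return format are illustrative choices, not prescribed by the lab:

from sklearn.preprocessing import normalize, StandardScaler, MinMaxScaler

def find_best_k(df, k_range):
    """Return (best_k, best_silhouette, best_scaling) over a range of Ks
    and several scalings of the feature data."""
    scalings = {
        'normalize': normalize(df),
        'StandardScaler': StandardScaler().fit_transform(df),
        'MinMaxScaler': MinMaxScaler().fit_transform(df),
    }
    best_k, best_score, best_scaling = None, -1.0, None
    for name, X in scalings.items():
        for k in k_range:
            km = KMeans(n_clusters=k).fit(X)
            score = silhouette_score(X, km.labels_)
            if score > best_score:
                best_k, best_score, best_scaling = k, score, name
    return best_k, best_score, best_scaling

best_k, best_score, best_scaling = find_best_k(seeds.drop('species', axis=1), range(2, 11))
print(best_k, best_score, best_scaling)

With the winning K and scaling in hand, refit on that version of the data and visualize the clusters with a pairplot as in step 6.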