GitHub Repository: rasbt/machine-learning-book
Path: blob/main/ch10/ch10.ipynb
Kernel: Python 3 (ipykernel)

Machine Learning with PyTorch and Scikit-Learn

-- Code Examples

Package version checks

Add folder to path in order to load from the check_packages.py script:

import sys
sys.path.insert(0, '..')

Check recommended package versions:

from python_environment_check import check_packages

d = {
    'numpy': '1.21.2',
    'scipy': '1.7.0',
    'matplotlib': '3.4.3',
    'sklearn': '1.0',
    'pandas': '1.3.2',
}
check_packages(d)
[OK] Your Python version is 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:35:11) [Clang 11.1.0 ]
[OK] numpy 1.22.0
[OK] scipy 1.7.3
[OK] matplotlib 3.5.1
[OK] sklearn 1.0.2
[OK] pandas 1.3.5

Python Machine Learning - Code Examples

Chapter 10 - Working with Unlabeled Data – Clustering Analysis

Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).

%load_ext watermark
%watermark -a "Sebastian Raschka" -u -d -v -p numpy,pandas,matplotlib,scipy,sklearn
Author: Sebastian Raschka

Last updated: 2022-01-06

Python implementation: CPython
Python version       : 3.9.6
IPython version      : 7.30.1

numpy     : 1.22.0
pandas    : 1.3.5
matplotlib: 3.5.1
scipy     : 1.7.3
sklearn   : 1.0.2

The use of watermark is optional. You can install this Jupyter extension via

conda install watermark -c conda-forge

or

pip install watermark

For more information, please see: https://github.com/rasbt/watermark.



Overview



from IPython.display import Image
%matplotlib inline

Grouping objects by similarity using k-means

K-means clustering using scikit-learn

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=150,
                  n_features=2,
                  centers=3,
                  cluster_std=0.5,
                  shuffle=True,
                  random_state=0)
import matplotlib.pyplot as plt

plt.scatter(X[:, 0],
            X[:, 1],
            c='white',
            marker='o',
            edgecolor='black',
            s=50)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid()
plt.tight_layout()
#plt.savefig('figures/10_01.png', dpi=300)
plt.show()
Image in a Jupyter notebook
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3,
            init='random',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)

y_km = km.fit_predict(X)
plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            marker='s', edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            marker='o', edgecolor='black',
            label='Cluster 2')
plt.scatter(X[y_km == 2, 0],
            X[y_km == 2, 1],
            s=50, c='lightblue',
            marker='v', edgecolor='black',
            label='Cluster 3')
plt.scatter(km.cluster_centers_[:, 0],
            km.cluster_centers_[:, 1],
            s=250, marker='*',
            c='red', edgecolor='black',
            label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend(scatterpoints=1)
plt.grid()
plt.tight_layout()
#plt.savefig('figures/10_02.png', dpi=300)
plt.show()
Image in a Jupyter notebook

A smarter way of placing the initial cluster centroids using k-means++

...
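The book text here is elided. As a minimal, hedged sketch (not one of the book's original code cells), scikit-learn exposes the k-means++ initialization via the init='k-means++' argument of KMeans, which the later cells in this notebook also use; seeding the centroids far apart typically lowers the final distortion compared to purely random initialization:

# sketch only (not part of the original notebook): k-means++ initialization
from sklearn.cluster import KMeans

km_pp = KMeans(n_clusters=3,
               init='k-means++',  # seed centroids far apart before running k-means
               n_init=10,
               max_iter=300,
               tol=1e-04,
               random_state=0)
y_km_pp = km_pp.fit_predict(X)
print(f'Distortion with k-means++ init: {km_pp.inertia_:.2f}')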

Hard versus soft clustering

...
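The elided book text contrasts hard assignments (each sample belongs to exactly one cluster, as in k-means) with soft assignments (each sample receives a per-cluster membership weight, as in fuzzy C-means). Purely as an illustration, and not the book's own example, a soft assignment can be obtained from scikit-learn's GaussianMixture via predict_proba:

# illustrative sketch only: GaussianMixture stands in here for a soft-clustering
# algorithm; the book's text discusses fuzzy C-means instead
from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_assignments = gm.predict_proba(X)  # shape (n_samples, 3); each row sums to 1
print(soft_assignments[:3].round(3))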

Using the elbow method to find the optimal number of clusters

print(f'Distortion: {km.inertia_:.2f}')
Distortion: 72.48
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.tight_layout()
#plt.savefig('figures/10_03.png', dpi=300)
plt.show()
Image in a Jupyter notebook

Quantifying the quality of clustering via silhouette plots

import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples

km = KMeans(n_clusters=3,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')

y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('figures/10_04.png', dpi=300)
plt.show()
Image in a Jupyter notebook

Comparison to "bad" clustering:

km = KMeans(n_clusters=2,
            init='k-means++',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0)
y_km = km.fit_predict(X)

plt.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            s=50, c='lightgreen',
            edgecolor='black',
            marker='s',
            label='Cluster 1')
plt.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            s=50, c='orange',
            edgecolor='black',
            marker='o',
            label='Cluster 2')
plt.scatter(km.cluster_centers_[:, 0],
            km.cluster_centers_[:, 1],
            s=250, marker='*',
            c='red',
            label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid()
plt.tight_layout()
#plt.savefig('figures/10_05.png', dpi=300)
plt.show()
Image in a Jupyter notebook
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(X, y_km, metric='euclidean')

y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i) / n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper),
             c_silhouette_vals,
             height=1.0,
             edgecolor='none',
             color=color)
    yticks.append((y_ax_lower + y_ax_upper) / 2.)
    y_ax_lower += len(c_silhouette_vals)

silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg, color="red", linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.tight_layout()
#plt.savefig('figures/10_06.png', dpi=300)
plt.show()
Image in a Jupyter notebook
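To condense the visual comparison into a single number, one can also compute the average silhouette coefficient directly. The following is a small sketch, not part of the original notebook, using sklearn.metrics.silhouette_score; a higher average score for k=3 than for k=2 would be consistent with the two silhouette plots above:

# optional sketch (not in the original notebook): average silhouette score per k
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (2, 3):
    km_k = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    labels_k = km_k.fit_predict(X)
    print(f'k={k}: average silhouette score = {silhouette_score(X, labels_k):.3f}')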


Organizing clusters as a hierarchical tree

Grouping clusters in bottom-up fashion

Image(filename='./figures/10_07.png', width=400)
Image in a Jupyter notebook
import pandas as pd
import numpy as np

np.random.seed(123)

variables = ['X', 'Y', 'Z']
labels = ['ID_0', 'ID_1', 'ID_2', 'ID_3', 'ID_4']

X = np.random.random_sample([5, 3])*10
df = pd.DataFrame(X, columns=variables, index=labels)
df

Performing hierarchical clustering on a distance matrix

from scipy.spatial.distance import pdist, squareform

row_dist = pd.DataFrame(squareform(pdist(df, metric='euclidean')),
                        columns=labels,
                        index=labels)
row_dist

We can either pass a condensed distance matrix (the upper-triangular values) returned by the pdist function, or we can pass the "original" data array and set the metric='euclidean' argument in linkage. However, we should not pass the squareform distance matrix, because that yields different distance values, even though the overall clustering may turn out the same.

# 1. incorrect approach: squareform distance matrix

from scipy.cluster.hierarchy import linkage

row_clusters = linkage(row_dist, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {(i + 1)}'
                    for i in range(row_clusters.shape[0])])
/var/folders/jg/tpqyh1fd5js5wsr1d138k3n40000gn/T/ipykernel_11218/1052818803.py:6: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  row_clusters = linkage(row_dist, method='complete', metric='euclidean')
# 2. correct approach: condensed distance matrix

row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {(i + 1)}'
                    for i in range(row_clusters.shape[0])])
# 3. correct approach: input data matrix

row_clusters = linkage(df.values, method='complete', metric='euclidean')
pd.DataFrame(row_clusters,
             columns=['row label 1', 'row label 2',
                      'distance', 'no. of items in clust.'],
             index=[f'cluster {(i + 1)}'
                    for i in range(row_clusters.shape[0])])
from scipy.cluster.hierarchy import dendrogram

# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])

row_dendr = dendrogram(row_clusters,
                       labels=labels,
                       # make dendrogram black (part 2/2)
                       # color_threshold=np.inf
                       )
plt.tight_layout()
plt.ylabel('Euclidean distance')
#plt.savefig('figures/10_11.png', dpi=300,
#            bbox_inches='tight')
plt.show()
Image in a Jupyter notebook

Attaching dendrograms to a heat map

# plot row dendrogram
fig = plt.figure(figsize=(8, 8), facecolor='white')
axd = fig.add_axes([0.09, 0.1, 0.2, 0.6])

# note: for matplotlib < v1.5.1, please use orientation='right'
row_dendr = dendrogram(row_clusters, orientation='left')

# reorder data with respect to clustering
df_rowclust = df.iloc[row_dendr['leaves'][::-1]]

axd.set_xticks([])
axd.set_yticks([])

# remove axes spines from dendrogram
for i in axd.spines.values():
    i.set_visible(False)

# plot heatmap
axm = fig.add_axes([0.23, 0.1, 0.6, 0.6])  # x-pos, y-pos, width, height
cax = axm.matshow(df_rowclust,
                  interpolation='nearest',
                  cmap='hot_r')
fig.colorbar(cax)
axm.set_xticklabels([''] + list(df_rowclust.columns))
axm.set_yticklabels([''] + list(df_rowclust.index))

#plt.savefig('figures/10_12.png', dpi=300)
plt.show()
/var/folders/jg/tpqyh1fd5js5wsr1d138k3n40000gn/T/ipykernel_11218/2347264065.py:23: UserWarning: FixedFormatter should only be used together with FixedLocator
  axm.set_xticklabels([''] + list(df_rowclust.columns))
/var/folders/jg/tpqyh1fd5js5wsr1d138k3n40000gn/T/ipykernel_11218/2347264065.py:24: UserWarning: FixedFormatter should only be used together with FixedLocator
  axm.set_yticklabels([''] + list(df_rowclust.index))
Image in a Jupyter notebook

Applying agglomerative clustering via scikit-learn

import sklearn  # needed below to read sklearn.__version__
from packaging import version
from sklearn.cluster import AgglomerativeClustering

# scikit-learn 1.2 introduced the metric parameter as the replacement for affinity
if version.parse(sklearn.__version__) > version.parse("1.2"):
    ac = AgglomerativeClustering(n_clusters=3,
                                 metric="euclidean",
                                 linkage="complete")
else:
    ac = AgglomerativeClustering(n_clusters=3,
                                 affinity="euclidean",
                                 linkage="complete")

labels = ac.fit_predict(X)
print(f'Cluster labels: {labels}')
Cluster labels: [1 0 0 2 1]
if version.parse(sklearn.__version__) > version.parse("1.2"):
    ac = AgglomerativeClustering(n_clusters=2,
                                 metric="euclidean",
                                 linkage="complete")
else:
    ac = AgglomerativeClustering(n_clusters=2,
                                 affinity="euclidean",
                                 linkage="complete")

labels = ac.fit_predict(X)
print(f'Cluster labels: {labels}')
Cluster labels: [0 1 1 0 0]


Locating regions of high density via DBSCAN

Image(filename='figures/10_13.png', width=500)
Image in a Jupyter notebook
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200,
                  noise=0.05,
                  random_state=0)

plt.scatter(X[:, 0], X[:, 1])
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.tight_layout()
#plt.savefig('figures/10_14.png', dpi=300)
plt.show()
Image in a Jupyter notebook

K-means and hierarchical clustering:

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

km = KMeans(n_clusters=2,
            random_state=0)
y_km = km.fit_predict(X)

ax1.scatter(X[y_km == 0, 0],
            X[y_km == 0, 1],
            edgecolor='black',
            c='lightblue',
            marker='o',
            s=40,
            label='Cluster 1')
ax1.scatter(X[y_km == 1, 0],
            X[y_km == 1, 1],
            edgecolor='black',
            c='red',
            marker='s',
            s=40,
            label='Cluster 2')
ax1.set_title('K-means clustering')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')

# note: with scikit-learn >= 1.2, use metric='euclidean' instead of affinity='euclidean'
ac = AgglomerativeClustering(n_clusters=2,
                             affinity='euclidean',
                             linkage='complete')
y_ac = ac.fit_predict(X)

ax2.scatter(X[y_ac == 0, 0],
            X[y_ac == 0, 1],
            c='lightblue',
            edgecolor='black',
            marker='o',
            s=40,
            label='Cluster 1')
ax2.scatter(X[y_ac == 1, 0],
            X[y_ac == 1, 1],
            c='red',
            edgecolor='black',
            marker='s',
            s=40,
            label='Cluster 2')
ax2.set_title('Agglomerative clustering')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')

plt.legend()
plt.tight_layout()
#plt.savefig('figures/10_15.png', dpi=300)
plt.show()
Image in a Jupyter notebook

Density-based clustering:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.2,
            min_samples=5,
            metric='euclidean')
y_db = db.fit_predict(X)

plt.scatter(X[y_db == 0, 0],
            X[y_db == 0, 1],
            c='lightblue',
            marker='o',
            s=40,
            edgecolor='black',
            label='Cluster 1')
plt.scatter(X[y_db == 1, 0],
            X[y_db == 1, 1],
            c='red',
            marker='s',
            s=40,
            edgecolor='black',
            label='Cluster 2')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.tight_layout()
#plt.savefig('figures/10_16.png', dpi=300)
plt.show()
Image in a Jupyter notebook


Summary

...


Readers may ignore the next cell.

! python ../.convert_notebook_to_script.py --input ch10.ipynb --output ch10.py
[NbConvertApp] WARNING | Config option `kernel_spec_manager_class` not recognized by `NbConvertApp`.
[NbConvertApp] Converting notebook ch10.ipynb to script
[NbConvertApp] Writing 14620 bytes to ch10.py