GitHub Repository: suyashi29/python-su
Path: blob/master/ML Clustering Analysis/K means and K Means ++.ipynb
Kernel: Python 3 (ipykernel)

K-Means and K-Means++

  • The main difference between K-Means and K-Means++ lies in how the initial centroids (cluster centers) are selected, which significantly affects the quality and stability of the resulting clusters.

Differences between K-Means and K-Means++

| Feature | K-Means | K-Means++ |
| --- | --- | --- |
| Centroid Initialization | Random selection of initial centroids | Initial centroids selected with a probabilistic method that spreads them out |
| Convergence | May converge to a poor local minimum, depending on the initial centroids | More likely to converge to a near-optimal solution, thanks to better initial centroid selection |
| Algorithm Steps | 1. Randomly select initial centroids<br>2. Assign points to the nearest centroid<br>3. Update centroids as the mean of assigned points<br>4. Repeat until the centroids stabilize | 1. Select the first centroid randomly<br>2. Choose each subsequent centroid with probability proportional to its squared distance from the nearest existing centroid<br>3. Assign points and update centroids as in K-Means<br>4. Repeat until the centroids stabilize |
| Performance | Can produce suboptimal clusterings due to random initialization | Typically yields better clusterings and faster convergence |
| Implementation Complexity | Simple to implement | Slightly more complex due to the seeding procedure |
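The probabilistic seeding step can be sketched in a few lines of NumPy. This is a minimal illustration of the idea (often called D² sampling), not scikit-learn's actual implementation; the function name `kmeans_pp_init` and the toy data are made up for the example:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Sketch of k-means++ seeding: pick each new centroid with
    probability proportional to the squared distance from the
    nearest centroid chosen so far (D^2 sampling)."""
    rng = np.random.default_rng(rng)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        # Sample the next centroid proportionally to d^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Tiny 2-D example: 3 seeds from 5 points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
print(kmeans_pp_init(X, 3, rng=0))
```

Because far-away points get higher sampling probability, the seeds tend to land in different clusters, which is exactly what makes the subsequent Lloyd iterations converge faster.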
Random Data

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate random data with 3 clusters
X, y_true = make_blobs(n_samples=900, centers=3, cluster_std=0.50, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.show()
```
(Figure: scatter plot of the 900 generated points forming 3 clusters)

K-Means Implementation

```python
from sklearn.cluster import KMeans
import time

# Standard K-Means with random initialization
kmeans = KMeans(n_clusters=3, init='random', n_init=10, max_iter=300, random_state=0)
start_time = time.time()
kmeans.fit(X)
elapsed_time = time.time() - start_time

# Plot the clustered points and the final centroids
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title(f'K-Means (Random Init)\nTime taken: {elapsed_time:.4f} seconds')
plt.show()
```
(Figure: K-Means clustering with random initialization; centroids marked with red X's, elapsed time shown in the title)

K-Means++ Implementation

```python
# K-Means++ initialization
kmeans_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=0)
start_time_pp = time.time()
kmeans_pp.fit(X)
elapsed_time_pp = time.time() - start_time_pp

# Plot the clustered points and the final centroids
plt.scatter(X[:, 0], X[:, 1], c=kmeans_pp.labels_, s=50, cmap='viridis')
centers_pp = kmeans_pp.cluster_centers_
plt.scatter(centers_pp[:, 0], centers_pp[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title(f'K-Means++\nTime taken: {elapsed_time_pp:.4f} seconds')
plt.show()
```
(Figure: K-Means++ clustering; centroids marked with red X's, elapsed time shown in the title)
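Beyond wall-clock time, the comparison can be quantified with the fitted model's `inertia_` (within-cluster sum of squared distances) and `n_iter_` (iterations of the best run) attributes, which scikit-learn exposes on `KMeans`. A self-contained sketch, with variable names (`km_rand`, `km_pp`) chosen here for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=900, centers=3, cluster_std=0.50, random_state=0)

km_rand = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)
km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)

# inertia_: within-cluster sum of squared distances (lower is better)
# n_iter_: iterations of the best of the n_init runs until convergence
print(f"random    inertia={km_rand.inertia_:.2f}  iters={km_rand.n_iter_}")
print(f"k-means++ inertia={km_pp.inertia_:.2f}  iters={km_pp.n_iter_}")
```

On well-separated blobs like these, both methods usually reach the same final clustering (so the inertias match); K-Means++ tends to show its advantage in iteration count, and more dramatically on harder, overlapping data.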

Comparing the clustering quality and elapsed time of each method above, K-Means++ typically produces better clusters and converges faster; note that a single timing run can be noisy, so averaging over several runs gives a more reliable comparison.

  • Time Taken: The elapsed time printed for each method shows that K-Means++ is typically faster, since the better starting centroids reduce the number of iterations needed to converge.

  • Clustering Quality: The centroids found with K-Means++ are better positioned than those found with the random initialization used by standard K-Means.