
Introduction

Clustering is the task of dividing a population or set of data points into groups such that points in the same group are more similar to each other than to points in other groups.

In simple words, the aim is to segregate groups with similar traits and assign them to clusters.

There are many clustering models out there. In this notebook we present one of the simplest: k-means. Despite its simplicity, k-means is widely used for clustering in many data science applications, and it is especially useful when you need to quickly discover insights from unlabeled data. In this notebook, you will learn how to use k-means for customer segmentation.

Some real-world applications of k-means:

  • Customer segmentation

  • Understanding what the visitors of a website are trying to accomplish

  • Pattern recognition

  • Machine learning

  • Data compression

  • Behavioral segmentation

  • Inventory categorization

  • Sorting sensor measurements

  • Detecting bots and anomalies

  • Computer vision

  • Astronomy

In this notebook we practice k-means clustering with two examples:

  • k-means on a randomly generated dataset

  • Using k-means for customer segmentation

Import libraries

Let's first import the required libraries. Also run %matplotlib inline, since we will be plotting in this section.

import random
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
%matplotlib inline

import warnings
# Ignore warnings (e.g. those related to pandas_profiling)
warnings.filterwarnings('ignore')

k-Means on a randomly generated dataset

  • Let's create our own dataset for this lab!

  • First we need to set a random seed. Use numpy's random.seed() function, with the seed set to 0:

np.random.seed(0)
Next we will be making random clusters of points by using the make_blobs class. The make_blobs class can take in many inputs, but we will be using these specific ones.

Input
  • n_samples: The total number of points equally divided among clusters.
    • Value will be: 5000
  • centers: The number of centers to generate, or the fixed center locations.
    • Value will be: [[4, 4], [-2, -1], [2, -3], [1, 1]]
  • cluster_std: The standard deviation of the clusters.
    • Value will be: 0.9

Output
  • X: Array of shape [n_samples, n_features]. (Feature Matrix)
    • The generated samples.
  • y: Array of shape [n_samples]. (Response Vector)
    • The integer labels for cluster membership of each sample.

X, y = make_blobs(n_samples=5000, centers=[[4, 4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)
plt.scatter(X[:, 0], X[:, 1], marker='.')
<matplotlib.collections.PathCollection at 0x29395a39c10>
Image in a Jupyter notebook

Setting up K-Means

Now that we have our random data, let's set up our k-means clustering.

The KMeans class has many parameters that can be used, but we will be using these three:

  • init: Initialization method of the centroids.
    • Value will be: "k-means++"
    • k-means++ selects initial cluster centers for k-means clustering in a smart way to speed up convergence (see the sketch after this list).
  • n_clusters: The number of clusters to form, as well as the number of centroids to generate.
    • Value will be: 4 (since we have 4 centers)
  • n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
    • Value will be: 12
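For intuition, here is a minimal sketch of the k-means++ seeding idea ("D-squared" sampling): each new center is drawn with probability proportional to the squared distance to the nearest center already chosen, which tends to spread the initial centers out. This is an illustrative re-implementation, not scikit-learn's internal code, and the function name kmeans_pp_init is our own.

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First center: a uniformly random data point.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        # Draw the next center with probability proportional to that distance.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

# Example usage: four seed centers for our blob data.
# kmeans_pp_init(X, 4)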

Initialize KMeans with these parameters, calling the resulting model k_means.

k_means = KMeans(init="k-means++", n_clusters=4, n_init=12)
k_means.fit(X)

k_means_labels = k_means.labels_
k_means_labels
array([0, 0, 3, ..., 3, 1, 1])
k_means_cluster_centers = k_means.cluster_centers_
k_means_cluster_centers

array([[-1.95489462, -1.03564706],
       [ 3.99211079,  3.99540917],
       [ 2.00811148, -3.01440138],
       [ 1.01557176,  1.03442098]])
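As a quick sanity check (a small verification sketch, not part of the original notebook), the model's reported inertia should equal the within-cluster sum of squares recomputed from the labels and centers:

# Recompute the within-cluster sum of squares from labels and centers;
# this should match k_means.inertia_ up to floating-point error.
manual_inertia = np.sum((X - k_means_cluster_centers[k_means_labels]) ** 2)
print(manual_inertia, k_means.inertia_)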

Creating the Visual Plot

So now that we have the random data generated and the KMeans model initialized, let's plot them and see what it looks like!
# Initialize the plot with the specified dimensions.
fig = plt.figure(figsize=(8, 8))

# Produce an array of colors from a color map, one per cluster label.
# We use set(k_means_labels) to get the unique labels.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))

# Create a plot
ax = fig.add_subplot(1, 1, 1)

# Plot the data points and centroids.
# k ranges over 0-3, matching the possible clusters each data point is in.
for k, col in zip(range(len([[4, 4], [-2, -1], [2, -3], [1, 1]])), colors):
    # Boolean mask that is True for the data points in cluster k.
    my_members = (k_means_labels == k)
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]
    # Plot the data points with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='*')
    # Plot the centroid with the same color, but with a darker outline.
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=8)

# Title of the plot
ax.set_title('KMeans')
# Remove x- and y-axis ticks
ax.set_xticks(())
ax.set_yticks(())
# Show the plot
plt.show()
Image in a Jupyter notebook

Three Clusters

# Exercise: cluster the same data into three clusters
k_means3 = KMeans(init="k-means++", n_clusters=3, n_init=12)
k_means3.fit(X)
fig = plt.figure(figsize=(10, 4))
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means3.labels_))))
ax = fig.add_subplot(1, 1, 1)
for k, col in zip(range(len(k_means3.cluster_centers_)), colors):
    my_members = (k_means3.labels_ == k)
    cluster_center = k_means3.cluster_centers_[k]
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='*')
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
            markeredgecolor='k', markersize=6)
Image in a Jupyter notebook

Customer Segmentation with K-Means

Imagine that you have a customer dataset and need to apply customer segmentation to this historical data. Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy, as a business can target these specific groups of customers and allocate marketing resources effectively. For example, one group might contain customers who are high-profit and low-risk, that is, more likely to purchase products or subscribe to a service; a business task is to retain those customers. Another group might include customers from non-profit organizations.

%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Plot styling
import seaborn as sns; sns.set()
# Importing the dataset
c_data = pd.read_excel("Cust_Segmentation.xlsx")
c_data.head(2)
# Descriptive statistics of the dataset
c_data.describe().transpose()
  • The dataset consists of 309 rows. The mean annual income is 245000 and the mean annual spend is 149000.

# Visualizing the data - distplot of Income and Age
plot_income = sns.distplot(c_data["Income"])
plot_spend = sns.distplot(c_data["Age"])
plt.xlabel('Income / Card Debt')
Text(0.5, 0, 'Income / Card Debt')
Image in a Jupyter notebook
# Violin plots of Income and Card Debt
f, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)
v1 = sns.violinplot(data=c_data, x='Income', color="skyblue", ax=axes[0])
v2 = sns.violinplot(data=c_data, x='Card Debt', color="lightgreen", ax=axes[1])
v1.set(xlim=(0, 420))
[(0, 420)]
Image in a Jupyter notebook

Why Clustering?

  • The mathematics behind clustering, in very simple terms, involves minimizing the sum of squared distances between each cluster centroid and its associated data points, the within-cluster sum of squares (WCSS):

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid of cluster $C_k$.

# Use the first two columns of the dataset as the feature matrix
X = c_data.iloc[:, [0, 1]].values
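One caveat worth noting: k-means relies on Euclidean distances, and the two selected columns live on very different scales (the first in the hundreds, the second around 35), so the larger-scale feature dominates the clustering. Below is a minimal sketch of standardizing the features first with scikit-learn's StandardScaler; the rest of this notebook keeps the raw values, as in the original analysis.

from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance so both
# contribute comparably to the distances k-means minimizes.
X_scaled = StandardScaler().fit_transform(X)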
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Image in a Jupyter notebook
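To complement the visual elbow, here is a small heuristic sketch (not in the original notebook) that prints the relative WCSS reduction from each additional cluster; the elbow is roughly where these drops level off:

# Relative drop in WCSS when moving from k to k+1 clusters.
for k, (a, b) in enumerate(zip(wcss, wcss[1:]), start=1):
    print(f"{k} -> {k + 1} clusters: WCSS drops by {100 * (a - b) / a:.1f}%")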
# Model build with two clusters
kmeansmodel = KMeans(n_clusters=2, init='k-means++', random_state=0)
y_kmeans = kmeansmodel.fit_predict(X)
# Centroid values
C = kmeansmodel.cluster_centers_
print(C)
[[213.          34.90823529]
 [638.          35.15058824]]
# Visualizing the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=40, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=40, c='blue', label='Cluster 2')
# plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=40, c='green', label='Cluster 3')
# plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=40, c='cyan', label='Cluster 4')
plt.scatter(C[:, 0], C[:, 1], s=300, c='black', label='Centroids', marker="*")
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Debt')
plt.legend()
plt.show()
Image in a Jupyter notebook

Insights

The plot shows the two clusters found by the model. Based on the centroids printed above, we could interpret them as the following customer segments:

  • Cluster 1: Customers with lower annual income (centroid income ≈ 213)

  • Cluster 2: Customers with higher annual income (centroid income ≈ 638)

The second feature is nearly identical in both clusters (≈ 35 on average), so this split is driven almost entirely by income.

# Model build with three clusters
kmeansmodel = KMeans(n_clusters=3, init='k-means++', random_state=0)
y_kmeans = kmeansmodel.fit_predict(X)
# Centroid values
C = kmeansmodel.cluster_centers_
print(C)
[[709.          34.66784452]
 [142.5         34.76408451]
 [426.          35.65724382]]
# Visualizing the three clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=30, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=30, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=30, c='green', label='Cluster 3')
# plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=30, c='cyan', label='Cluster 4')
# plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s=30, c='magenta', label='Cluster 5')
plt.scatter(C[:, 0], C[:, 1], s=300, c='black', label='Centroids', marker="*")
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending')
plt.legend()
plt.show()
Image in a Jupyter notebook

Insights

  • Cluster 1: High-income customers (centroid income ≈ 709)

  • Cluster 2: Low-income customers (centroid income ≈ 142)

  • Cluster 3: Medium-income customers (centroid income ≈ 426)

As with the two-cluster model, the second feature averages about 35 in every cluster, so the segmentation is driven almost entirely by income; standardizing the features first (as sketched earlier) would let both features influence the result.
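As a follow-up check that is not part of the original notebook, the silhouette score offers another way to compare candidate values of k (higher is better). A short sketch:

from sklearn.metrics import silhouette_score

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")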