K-Means Clustering
Introduction
Clustering is the task of dividing a population or set of data points into groups such that data points in the same group are more similar to each other than to data points in other groups.
In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
There are many clustering models. In this notebook, we present one of the simplest. Despite its simplicity, k-means is widely used for clustering in many data science applications, and it is especially useful when you need to quickly discover insights from unlabeled data. In this notebook, you will learn how to use k-means for customer segmentation.
Some real-world applications of k-means:
Customer segmentation
Understanding what the visitors of a website are trying to accomplish
Pattern recognition
Machine learning
Data compression
Behavioral Segmentation
Inventory Categorization
Sorting Sensor measurements
Detecting bots and anomalies
Computer Vision
Astronomy
In this notebook we practice k-means clustering with two examples:
k-means on a randomly generated dataset
Using k-means for customer segmentation
Import libraries
Let's first import the required libraries. Also run %matplotlib inline, since we will be plotting in this section.
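The notebook's import cell is not shown here, so the following is a minimal sketch of the imports this section appears to need. The exact set is an assumption based on the names used later (numpy, make_blobs, KMeans, and a plotting library).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# In a Jupyter notebook, also run the following magic so plots render inline:
# %matplotlib inline
```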
k-Means on a randomly generated dataset
Let's create our own dataset for this lab! First we need to set up a random seed. Use numpy's random.seed() function, with the seed set to 0.
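As a small illustration of why the seed matters, the sketch below sets the seed, draws a few numbers, resets the seed, and draws again. The variable names are only for illustration.

```python
import numpy as np

np.random.seed(0)                # fix the global NumPy random state
first_draw = np.random.rand(3)   # three pseudo-random numbers

np.random.seed(0)                # reset to the same state
second_draw = np.random.rand(3)  # the same three numbers again

print(np.array_equal(first_draw, second_draw))  # True
```

Because the seed is fixed, anything that draws from NumPy's global random state afterwards produces the same values on every run.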
Next we will make random clusters of points by using the make_blobs function. make_blobs can take many inputs, but we will use these specific ones.
Input
- n_samples: The total number of points equally divided among clusters.
- Value will be: 5000
- centers: The number of centers to generate, or the fixed center locations.
- Value will be: [[4, 4], [-2, -1], [2, -3],[1,1]]
- cluster_std: The standard deviation of the clusters.
- Value will be: 0.9
Output
- X: Array of shape [n_samples, n_features]. (Feature Matrix)
- The generated samples.
- y: Array of shape [n_samples]. (Response Vector)
- The integer labels for cluster membership of each sample.
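With the inputs and outputs listed above, the call might look like the following sketch. It assumes make_blobs is imported from sklearn.datasets and that the seed has been set as described earlier.

```python
import numpy as np
from sklearn.datasets import make_blobs

np.random.seed(0)

# Four blob centers in 2-D, 5000 points in total, each blob with std 0.9.
X, y = make_blobs(
    n_samples=5000,
    centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
    cluster_std=0.9,
)

print(X.shape)  # (5000, 2)
print(y.shape)  # (5000,)
```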
Display the scatter plot of the randomly generated data.
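One possible way to draw the scatter plot is sketched below. The marker style is an arbitrary choice, and the data is regenerated so the snippet runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

np.random.seed(0)
X, y = make_blobs(
    n_samples=5000,
    centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
    cluster_std=0.9,
)

# Plot the first feature against the second, one dot per sample.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], marker=".")
# In a notebook the figure renders automatically; in a script, call plt.show().
```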
Setting up K-Means
Now that we have our random data, let's set up our K-Means clustering. The KMeans class has many parameters that can be used, but we will be using these three:
- init: Initialization method of the centroids.
- Value will be: "k-means++"
- k-means++: Selects initial cluster centers for k-means clustering in a smart way to speed up convergence.
- n_clusters: The number of clusters to form as well as the number of centroids to generate.
- Value will be: 4 (since we have 4 centers)
- n_init: Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
- Value will be: 12
Initialize KMeans with these parameters, and assign the resulting estimator to a variable called k_means.
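A minimal sketch of that initialization, using only the three parameters described above:

```python
from sklearn.cluster import KMeans

# k-means++ seeding, 4 clusters, best of 12 runs by inertia.
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12)
```

Nothing is computed at this point. The estimator only stores its settings until fit is called.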
Now let's fit the KMeans model with the feature matrix we created above, X.
Now let's grab the labels for each point in the model using KMeans' .labels_ attribute and save them as k_means_labels.
We will also get the coordinates of the cluster centers using KMeans' .cluster_centers_ attribute and save them as k_means_cluster_centers.
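The fit and the two attribute lookups can be sketched together as follows. The data is regenerated here so the snippet is self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

np.random.seed(0)
X, y = make_blobs(
    n_samples=5000,
    centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
    cluster_std=0.9,
)

k_means = KMeans(init="k-means++", n_clusters=4, n_init=12)
k_means.fit(X)

# One cluster index per sample, and one 2-D coordinate per cluster.
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_

print(k_means_labels.shape)           # (5000,)
print(k_means_cluster_centers.shape)  # (4, 2)
```

Note that the cluster indices assigned by k-means are arbitrary and need not match the order of the centers passed to make_blobs.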
Creating the Visual Plot
So now that we have the random data generated and the KMeans model initialized, let's plot them and see what it looks like! Please read through the code and comments to understand how to plot the model.
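The plotting cell itself is not shown here, so the following is one way such a plot could be written rather than the notebook's exact code. The colormap, figure size, and marker choices are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

np.random.seed(0)
X, y = make_blobs(
    n_samples=5000,
    centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
    cluster_std=0.9,
)

k_means = KMeans(init="k-means++", n_clusters=4, n_init=12)
k_means.fit(X)
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_

fig, ax = plt.subplots(figsize=(6, 4))

# One color per cluster label, sampled evenly from a colormap.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))

for k, col in zip(range(len(k_means_cluster_centers)), colors):
    # Boolean mask selecting the points assigned to cluster k.
    my_members = k_means_labels == k
    cluster_center = k_means_cluster_centers[k]

    # Draw the cluster's points as small dots in the cluster color.
    ax.plot(X[my_members, 0], X[my_members, 1],
            linestyle="none", marker=".", color=col)

    # Draw the centroid as a larger circle with a dark outline.
    ax.plot(cluster_center[0], cluster_center[1], "o",
            markerfacecolor=col, markeredgecolor="k", markersize=6)

ax.set_title("KMeans")
ax.set_xticks(())  # hide tick marks; the axes are unitless here
ax.set_yticks(())
# In a notebook the figure renders automatically; in a script, call plt.show().
```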
Practice: try to cluster the above dataset into 3 clusters.
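A possible solution sketch for this exercise, reusing the same settings and changing only n_clusters (the name k_means3 is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

np.random.seed(0)
X, y = make_blobs(
    n_samples=5000,
    centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
    cluster_std=0.9,
)

# Ask for 3 clusters even though the data was generated from 4 centers.
k_means3 = KMeans(init="k-means++", n_clusters=3, n_init=12)
k_means3.fit(X)

print(k_means3.cluster_centers_.shape)  # (3, 2)
```

Since the data has four underlying blobs, forcing three clusters means at least one cluster has to absorb points from more than one blob.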
Customer Segmentation with K-Means
Imagine that you have a customer dataset, and you need to apply customer segmentation on this historical data. Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy, as a business can target these specific groups of customers and effectively allocate marketing resources. For example, one group might contain customers who are high-profit and low-risk, that is, more likely to purchase products or subscribe to a service. A business task is to retain those customers. Another group might include customers from non-profit organizations. And so on.
Load Data From CSV File
Pre-processing
As you can see, Address in this dataset is a categorical variable. The k-means algorithm isn't directly applicable to categorical variables because the Euclidean distance function isn't really meaningful for discrete variables. So, let's drop this feature and run clustering.
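Dropping the column could look like the sketch below. The real dataset is loaded from a CSV file that is not shown here, so the DataFrame is a made-up stand-in: apart from Address, the column names and all values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook loads the real data from a CSV file.
cust_df = pd.DataFrame({
    "Age": [41, 47, 33, 29, 47, 40],
    "Edu": [2, 1, 2, 2, 1, 1],
    "Income": [19, 100, 57, 19, 253, 81],
    "Address": ["A01", "A21", "A13", "A09", "A08", "A16"],
})

# Remove the categorical column so only numeric features remain.
df = cust_df.drop("Address", axis=1)

print(df.columns.tolist())  # ['Age', 'Edu', 'Income']
```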
Normalizing over the standard deviation
Now let's normalize the dataset. But why do we need normalization in the first place? Normalization is a statistical method that helps mathematically based algorithms interpret features with different magnitudes and distributions equally. We use StandardScaler() to normalize our dataset, which rescales each feature to zero mean and unit variance.
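To illustrate what StandardScaler does, here is a small sketch on made-up numeric features. The values and the implied column meanings (age, education level, income) are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, education level, income]
features = np.array([
    [41, 2, 19],
    [47, 1, 100],
    [33, 2, 57],
    [29, 2, 19],
    [47, 1, 253],
    [40, 1, 81],
], dtype=float)

# Subtract each column's mean and divide by its standard deviation.
scaled = StandardScaler().fit_transform(features)

print(scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(scaled.std(axis=0).round(6))   # approximately 1 for every column
```

Without this step, a feature with large values such as income would dominate the Euclidean distances that k-means relies on.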
Modeling
In our example, without the k-means algorithm we would have to guess that each customer group has a certain age, income, education, and so on, and then check those guesses with multiple tests and experiments. With k-means clustering, this whole process becomes much easier.
Let's apply k-means on our dataset and take a look at the cluster labels.
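A sketch of this step on made-up customer rows follows. The data, the cluster count of 4, and the random_state argument are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, education level, income]
features = np.array([
    [41, 2, 19], [47, 1, 100], [33, 2, 57], [29, 2, 19], [47, 1, 253],
    [40, 1, 81], [38, 2, 56], [42, 3, 64], [26, 1, 18], [52, 4, 220],
], dtype=float)

scaled = StandardScaler().fit_transform(features)

cluster_num = 4  # assumed number of customer segments
k_means = KMeans(init="k-means++", n_clusters=cluster_num, n_init=12,
                 random_state=0)
k_means.fit(scaled)

labels = k_means.labels_
print(labels)  # one cluster index per customer row
```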
Insights
We assign the labels to each row in the dataframe. We can easily check the centroid values by averaging the features in each cluster.
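Both steps can be sketched with pandas as below. The DataFrame contents, the column name Clus_km, and the cluster count are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the customer data.
df = pd.DataFrame({
    "Age": [41, 47, 33, 29, 47, 40, 38, 42, 26, 52],
    "Edu": [2, 1, 2, 2, 1, 1, 2, 3, 1, 4],
    "Income": [19, 100, 57, 19, 253, 81, 56, 64, 18, 220],
})

scaled = StandardScaler().fit_transform(df.values.astype(float))
k_means = KMeans(init="k-means++", n_clusters=4, n_init=12, random_state=0)
labels = k_means.fit(scaled).labels_

# Attach the cluster index to each customer row.
df["Clus_km"] = labels

# Per-cluster feature means, in the original (unscaled) units.
centroids = df.groupby("Clus_km").mean()
print(centroids)
```

Averaging the original columns gives centroids in readable units (years, income), unlike cluster_centers_, which are expressed in the scaled feature space.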
Now, let's look at the distribution of customers based on their age and income:
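One way to visualize this is a scatter plot of age against income, colored by cluster label. The sketch below uses made-up data, and the styling is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the customer data.
df = pd.DataFrame({
    "Age": [41, 47, 33, 29, 47, 40, 38, 42, 26, 52],
    "Edu": [2, 1, 2, 2, 1, 1, 2, 3, 1, 4],
    "Income": [19, 100, 57, 19, 253, 81, 56, 64, 18, 220],
})

scaled = StandardScaler().fit_transform(df.values.astype(float))
labels = KMeans(init="k-means++", n_clusters=4, n_init=12,
                random_state=0).fit(scaled).labels_

# Color each customer by its cluster index.
fig, ax = plt.subplots()
ax.scatter(df["Age"], df["Income"], c=labels, alpha=0.5)
ax.set_xlabel("Age")
ax.set_ylabel("Income")
# In a notebook the figure renders automatically; in a script, call plt.show().
```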
k-means will partition your customers into mutually exclusive groups, for example, into 4 clusters. The customers in each cluster are similar to each other demographically. Now we can create a profile for each group, considering the common characteristics of each cluster. For example, the 4 clusters can be:
AFFLUENT, EDUCATED AND OLD AGED
MIDDLE AGED AND MIDDLE INCOME
YOUNG AND LOW INCOME
ADULT AND HIGH INCOME