GitHub Repository: suyashi29/python-su
Path: blob/master/ML/Notebook/K-Means.ipynb
Kernel: Python 3

K-means Clustering

Introduction

Have you come across a situation where the Chief Marketing Officer of a company tells you, "Help me understand our customers better so that we can market our products to them more effectively!"

  • If the person had asked me to calculate Lifetime Value (LTV) or the propensity to cross-sell, I wouldn't have blinked. But this question looked very broad to me!

  • You are not looking for specific insights about a phenomenon; you are looking for structures within the data without tying them to a specific outcome.

  • The method of identifying similar groups of data in a data set is called clustering. Entities in each group are more similar to each other than to entities of other groups. In this article, I will take you through the types of clustering, different clustering algorithms, and a comparison between two of the most commonly used clustering methods.

image.png

Overview : What is Clustering?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.

In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

image.png

Let’s understand this with an example.

  1. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business.

  2. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not.

  3. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups.

  4. This is what we call clustering.

Types of Clustering

Clustering can be divided into two subgroups:

  1. Hard Clustering: In hard clustering, each data point either belongs to a cluster completely or does not belong to it at all.
    For example, in the scenario above each customer is put into exactly one of the 10 groups.

  2. Soft Clustering: In soft clustering, instead of putting each data point into exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned.
    For example, in the scenario above each customer is assigned a probability of being in each of the 10 clusters of the retail store (a short sketch contrasting the two follows).

image.png
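To make the distinction concrete, here is a minimal sketch on synthetic data (not part of the original notebook) contrasting the hard assignments produced by K-means with the soft, probabilistic assignments produced by a Gaussian mixture model:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# synthetic 2-D data: two loose blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# hard clustering: every point gets exactly one label
hard_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(hard_labels[:5])

# soft clustering: every point gets a probability for each cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X)[:5].round(3))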

Types of Clustering Algorithms

Since the task of clustering is subjective, the means that can be used for achieving this goal are plenty. Every methodology follows a different set of rules for defining the 'similarity' among data points. In fact, there are more than 100 known clustering algorithms, but only a few are used widely. Let's look at them in detail:

  • Connectivity models: These models are based on the notion that data points closer to each other in data space exhibit more similarity than data points lying farther away. These models can follow two approaches: agglomerative (start with each point in its own cluster and merge) or divisive (start with all points in one cluster and split).
    Examples of these models are the hierarchical clustering algorithm and its variants.

  • Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived by the closeness of a data point to the centroid of the clusters. K-Means clustering algorithm is a popular algorithm that falls into this category. These models run iteratively to find the local optima.

  • Distribution models: These clustering models are based on the notion of how probable is it that all data points in the cluster belong to the same distribution (For example: Normal, Gaussian). These models often suffer from overfitting. A popular example of these models is Expectation-maximization algorithm which uses multivariate normal distributions.

  • Density Models: These models search the data space for areas of varied density of data points. They isolate regions of different density and assign the data points within each region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
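As a rough illustration (an added sketch, not part of the original notebook), each of these four families has a readily available scikit-learn implementation:

from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# Connectivity model -> hierarchical / agglomerative clustering
connectivity_model = AgglomerativeClustering(n_clusters=3)

# Centroid model -> K-means
centroid_model = KMeans(n_clusters=3, random_state=42)

# Distribution model -> Gaussian mixture fitted with expectation-maximization
distribution_model = GaussianMixture(n_components=3, random_state=42)

# Density model -> DBSCAN
density_model = DBSCAN(eps=0.5, min_samples=5)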

Introduction to K-means Algorithm

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups).

The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.

  • The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.

  • Data points are clustered based on feature similarity.

The results of the K-means clustering algorithm are:

  1. The centroids of the K clusters, which can be used to label new data

  2. Labels for the training data (each data point is assigned to a single cluster)

image.png

Business Cases

The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:

  • Behavioral Segmentation

  • Inventory Categorization

  • Sorting sensor measurements

  • Detecting bots and anomalies

  • Computer Vision

  • Astronomy

Algorithm

To start with the K-means algorithm, you first have to randomly initialize K points called the cluster centroids.
K-means is an iterative algorithm that repeats two steps:

1. Cluster Assignment

The algorithm goes through each of the data points and, depending on which centroid is closest, assigns the data point to one of the K clusters.

image.png

image.png

2. Move Centroid

Here, K-means moves the centroids to the average of the points in a cluster. In other words, the algorithm calculates the average of all the points in a cluster and moves the centroid to that average location.

This process is repeated until there is no change in the clusters (or until some other stopping condition is met). The initial centroids are either chosen at random or supplied as specific starting points by the user.

image.png
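To make the two steps concrete, here is a minimal NumPy sketch of the loop (an illustrative implementation, not the scikit-learn estimator used later in this notebook), assuming a 2-D feature array X:

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.RandomState(seed)
    # initialise centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1 - cluster assignment: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2 - move centroid: each centroid becomes the mean of its assigned points
        # (empty clusters are not handled in this simple sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids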

Choosing K

One of the metrics that is commonly used to compare results across different values of 'K' is the mean distance between data points and their cluster centroid.

  • Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.

A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.

image.png
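As a small sketch of both ideas (assuming a prepared numeric feature matrix X), scikit-learn exposes the within-cluster sum of squared distances as inertia_ for the elbow plot, and silhouette_score is one of the alternative validation measures mentioned above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    km = KMeans(n_clusters=k, max_iter=1000, random_state=42).fit(X)
    # inertia_: total within-cluster sum of squared distances (used for the elbow plot)
    # silhouette_score: how well each point sits in its own cluster vs the next best one
    print(k, km.inertia_, silhouette_score(X, km.labels_))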

How good is K-means?

Pros

  • It’s a simple technique to understand. Even with the math behind it!

  • Quite fast for small data-sets.

Cons

  • For large data-sets with a large number of features, it gets computationally expensive.

  • If the data-set is sparse, then we may not get a good-quality clustering.

  • At times, it's difficult to determine the number of clusters for K-Means.

  • It's sensitive to outliers and to the scale of the features, so you should consider scaling your features before implementing K-Means.

  • Since the centroids are randomly initialised, the final clusters may vary from run to run.

Introduction to Hierarchical Clustering

It is a type of connectivity-model clustering, based on the fact that data points that are closer to each other are more similar than data points lying farther away in the data space.

As the name suggests, hierarchical clustering forms a hierarchy of clusters that can be studied by visualising a dendrogram.

image.png

How to measure closeness of points?

image.png

How to calculate distance between two clusters?

  1. Centroid Distance: Euclidean distance between the means of the data points in the two clusters

  2. Minimum Distance: Euclidean distance between the two data points in the two clusters that are closest to each other

  3. Maximum Distance: Euclidean distance between the two data points in the two clusters that are farthest from each other

  • We will focus on centroid distance for now; a short sketch of all three measures follows.
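Here is a small sketch of the three measures for two hypothetical clusters of points A and B, using scipy.spatial.distance for the pairwise Euclidean distances:

import numpy as np
from scipy.spatial.distance import cdist, euclidean

A = np.array([[1.0, 2.0], [2.0, 3.0]])   # cluster 1
B = np.array([[8.0, 8.0], [9.0, 10.0]])  # cluster 2

pairwise = cdist(A, B)  # Euclidean distances between every pair of points

centroid_distance = euclidean(A.mean(axis=0), B.mean(axis=0))
minimum_distance = pairwise.min()   # "single linkage" style distance
maximum_distance = pairwise.max()   # "complete linkage" style distance
print(centroid_distance, minimum_distance, maximum_distance)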

Algorithm Explained

  1. Let there be N data points. Firstly, these N data points are assigned to N different clusters with one data point in each cluster.

  2. Then, the two data points with minimum Euclidean distance between them are merged into a single cluster.

  3. Then, the two clusters with minimum centroid distance between them are merged into a single cluster.

  4. This process is repeated until we are left with a single cluster, hence forming a hierarchy of clusters.

image.png

How many clusters to form?

  1. Visualising the dendrogram: The best choice for the number of clusters is the number of vertical lines that can be cut by a horizontal line that traverses the maximum vertical distance without intersecting a cluster merge. For example, in the case below, the best choice for the number of clusters is 4 (a SciPy sketch for building such a dendrogram follows this list).

  2. Intuition and prior knowledge of the data set.

image.png
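As a minimal sketch (assuming a numeric feature matrix X), SciPy can build and visualise such a dendrogram using centroid linkage, which matches the centroid-distance discussion above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# build the merge hierarchy using centroid distance between clusters
Z = linkage(X, method='centroid')

# visualise the hierarchy; long vertical lines suggest natural cut points
dendrogram(Z)
plt.show()

# cut the tree into, say, 4 flat clusters
labels = fcluster(Z, t=4, criterion='maxclust')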

Good Cluster Analysis

  • Data points within the same cluster share a similar profile: Statistically, check the standard deviation of each input variable in each cluster. A perfect separation is rarely achieved in cluster analysis; hence, even a one-standard-deviation distance between two cluster means is considered a good separation.

  • A well-spread proportion of data points among clusters: There is no standard for this requirement, but a minimum of 5% and a maximum of 35% of the total population can be assumed as a safe range for each cluster. A short pandas sketch of both checks follows.
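Both checks are straightforward to run once cluster labels exist. A sketch, assuming a hypothetical DataFrame df of numeric input variables with a labels column (similar to the one produced later in this notebook):

# spread of data points across clusters (roughly 5%-35% per cluster is a safe range)
print(df['labels'].value_counts(normalize=True))

# per-cluster standard deviation of each input variable:
# small values mean points within a cluster share a similar profile
print(df.groupby('labels').std())

# cluster means, to eyeball the separation between clusters
print(df.groupby('labels').mean())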

Use Case : Women E-Commerce Clothing Reviews

Consumer Segmentation

Consumer segmentation is the practice of dividing a customer base into groups. In each of these groups the individuals are similar to each other in some ways. This practice allows a business to create a group-specific campaign or digital advertisement. This not only increases the effectiveness of the business's marketing efforts, but also helps reduce marketing costs by reducing spend on audience groups that might react differently to a campaign. Segmentation also enables businesses to focus marketing activities on certain platforms based on the characteristics of the segments. Though segmentation is used to create homogeneous groups of individuals, we will use it to understand consumer behaviour in this notebook.

In this script, we look at one of the most popular segmentation algorithms: the K-means clustering algorithm. K-means clustering is an unsupervised learning algorithm that groups data into K sets by using an iterative process. For each cluster, the centroid is chosen in such a way that the distance between the centroid and the data points in the cluster is minimized. Before we carry on, we drop variables that are nominal in nature. We also convert the review text to sentiment scores, as K-means only works with numerical data.

Library Load

Here we will be using the pandas library to do data processing. The sklearn modules also help with necessary scaling of variables.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.mixture import GaussianMixture  # For GMM clustering
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
w_data = pd.read_csv("F:\\ML & Data Visualization\\WomenE.csv")
w_data.dropna(inplace=True)
print(w_data.dtypes)
w_data.head(5)
Unnamed: 0                 int64
Clothing ID                int64
Age                        int64
Title                      object
Review Text                object
Rating                     int64
Recommended IND            int64
Positive Feedback Count    int64
Division Name              object
Department Name            object
Class Name                 object
dtype: object
  • In the next portion of the notebook we remove all the categorical variables.

# remove all the columns that are categorical variables
w_data_k = w_data.drop(['Unnamed: 0', 'Clothing ID', 'Class Name', 'Department Name',
                        'Title', 'Division Name', 'Recommended IND'], axis=1)
# converting review text to sentiment score
from afinn import Afinn
import pickle

afinn = Afinn()
review_data_k_means = w_data_k.copy()  # continue with the numeric columns plus the review text
review_data_k_means['Review Text'] = review_data_k_means['Review Text'].str.lower()
review_data_k_means['sent_score'] = review_data_k_means.apply(
    lambda row: afinn.score(row['Review Text']), axis=1)

sent_score = review_data_k_means['sent_score'].values
with open("sent.txt", "wb") as fp:  # Pickling
    pickle.dump(sent_score, fp)
with open("sent.txt", "rb") as fp:  # Unpickling
    b = pickle.load(fp)
# print(b)

Each product has a text review associated with it. Text can be converted to a numerical value using sentiment scores. One way to do this is to use predefined sentiment lexicons and match them accordingly. For this example, we will use the AFINN lexicon.

As the Afinn library is not available on Kaggle, I have posted this portion of the code in raw format. A negative value denotes negative sentiment and a positive score indicates positive sentiment. The returned scores are stored in a .txt file and loaded back with pickle.load, as seen below.

https://github.com/insaid2018/Term-3/blob/master/Data/CaseStudy/sent.txt?raw=true
File "<ipython-input-12-653ccdd0b004>", line 1 https://github.com/insaid2018/Term-3/blob/master/Data/CaseStudy/sent.txt?raw=true ^ SyntaxError: invalid syntax
## adding the minimum value of sentiment score so as to remove negative sentiment scores
import pickle

with open("C:/Users/Shubham/Downloads/sent.txt", "rb") as fp:  # Unpickling
    b = pickle.load(fp)

review_data_k_means['sent_score'] = b
min_sent = abs(np.min(review_data_k_means['sent_score']))
review_data_k_means['sent_score'] = review_data_k_means['sent_score'] + min_sent
review_data_k_means.drop('Review Text', axis=1, inplace=True)

Before doing any further processing, we need to check how each of our variables is distributed.

sns.set_style('darkgrid')
plt.title('Distribution of Each Column in the Data')
for i, col in enumerate(review_data_k_means.columns):
    plt.figure(i)
    sns.distplot(review_data_k_means[col])
F:\python\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
Image in a Jupyter notebook (distribution plots, one per column)
  • A majority of the variables are not normally distributed. The K-means algorithm does not handle skewed distributions well. To transform each variable towards a normal distribution, we use the log transformation. To ensure that all values are positive, we add 1 to all values.

# a log transform (Box-Cox is another option) can help with the skewed distributions
from scipy.stats import boxcox  # imported for reference; a plain log transform is used below

tmp = review_data_k_means
# adding one to each data variable to make it positive
tmp = tmp + 1
for i in tmp.columns:
    tmp[i] = np.log(tmp[i])

# log-transformed data
review_data_mod = tmp

# checking the distributions after transforming
sns.set_style('darkgrid')
plt.title('Distribution of Each Column in the Data')
for i, col in enumerate(review_data_mod.columns):
    plt.figure(i)
    sns.distplot(review_data_mod[col])

# just take Age and sent_score - the variables that display a nearly normal distribution
# review_data_mod = review_data_mod[['Age', 'sent_score']]
F:\python\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
Image in a Jupyter notebook (distribution plots after the log transformation, one per column)

After the log transformation, the sent_score and Age variables seem to have a normal-like distribution. We will drop the other variables.

As each variable scales differently, it is essential to bring them to a common scale. We use z-score scaling here for this very purpose. The z-score tells us how far each data point is from the mean in terms of standard deviations.
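For reference, the z-score of a value $x$ is computed as

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the variable.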

review_data_mod = review_data_mod[['Age', 'sent_score']]
from scipy import stats
review_data_std = stats.zscore(review_data_mod)
review_data_std = np.array(review_data_std)

Elbow Method

After normalization, we need to choose the optimal number of clusters so as to get a good within-cluster score. To achieve that, we iterate through different K values and plot the total within-cluster distance for each K value. We choose the K value at which there is a sharp drop in total within-cluster distance. This drop often resembles an "elbow".

# This snippet is sourced from https://www.kaggle.com/mariadobreva/k-means-clustering-in-python
# also refer to https://stackoverflow.com/questions/32370543/understanding-score-returned-by-scikit-learn-kmeans/32371258
import pylab as pl

number_of_clusters = range(1, 20)
kmeans = [KMeans(n_clusters=i, max_iter=1000, random_state=42) for i in number_of_clusters]
score = [-1 * kmeans[i].fit(review_data_std).score(review_data_std) for i in range(len(kmeans))]

pl.plot(number_of_clusters, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
Image in a Jupyter notebook

Even though the "elbow" is located at K=3, we will use K=6 to explain the data in more detail.

k_means_test = KMeans(n_clusters=6,max_iter=1000,random_state=42)
-1*k_means_test.fit(review_data_std).score(review_data_std)
9863.234663547353
review_data_k_means['labels'] = k_means_test.labels_
size_of_each_cluster = review_data_k_means.groupby('labels').size().reset_index()
size_of_each_cluster.columns = ['labels', 'number_of_points']
size_of_each_cluster['percentage'] = size_of_each_cluster['number_of_points'] / np.sum(size_of_each_cluster['number_of_points'])
print(size_of_each_cluster)
   labels  number_of_points  percentage
0       0              4925    0.250483
1       1              2311    0.117536
2       2              3482    0.177093
3       3              3366    0.171193
4       4              2660    0.135286
5       5              2918    0.148408
  • Label 0 has 25% of the points. The rest of the labels have nearly equal percentages of data points.

Making sense of each cluster

To make sense of each cluster, we plot a grouped scatter chart with Age and sentiment score variables.

# a look at Age and Sentiment Scores
# we subtract the added absolute value of the minimum sentiment score
review_data_k_means['sent_score'] = review_data_k_means['sent_score'] - min_sent
sns.lmplot('Age', 'sent_score', data=review_data_k_means, hue='labels', fit_reg=False)
plt.show()
Image in a Jupyter notebook
sent_score_labels = review_data_k_means[['sent_score', 'labels']]
sent_score_labels.boxplot(by='labels', figsize=(20, 10))
plt.xticks(rotation=90)
plt.show()

age_labels = review_data_k_means[['labels', 'Age']]
age_labels.boxplot(by='labels', figsize=(20, 10))
plt.xticks(rotation=90)
plt.show()
Image in a Jupyter notebook (box plots of sent_score and Age by cluster label)

From the above box plots we get the following observations.

Labels

  • Label 0 : Middle-aged consumers with fairly positive sentiments

  • Label 1 : Middle-aged consumers with relatively negative reviews

  • Label 2 : Older consumers who had fairly positive reviews

  • Label 3 : Older consumers who had positive reviews

  • Label 4 : Younger consumers who had fairly positive reviews.

  • Label 5 : Fairly young consumers with positive reviews.