GitHub Repository: afnan47/sem7
Path: blob/main/ML/6. KMeans on Sales/KMeans_on_sales.ipynb
Kernel: Python 3.8.6 64-bit

Implement K-Means clustering / hierarchical clustering on the sales_data_sample.csv dataset. Determine the number of clusters using the elbow method.

In [4]:

import pandas as pd
import numpy as np

In [5]:

df = pd.read_csv('./sales_data_sample.csv', encoding='unicode_escape')

In [6]:

df.head()

Out[6]:

   ORDERNUMBER  QUANTITYORDERED  PRICEEACH  ORDERLINENUMBER    SALES  \
         10107               30      95.70                2  2871.00   
         10121               34      81.35                5  2765.90   
         10134               41      94.74                2  3884.34   
         10145               45      83.26                6  3746.70   
         10159               49     100.00               14  5205.27   

            ORDERDATE    STATUS  QTR_ID  MONTH_ID  YEAR_ID  ...  \
    2/24/2003 0:00   Shipped       1         2     2003  ...   
     5/7/2003 0:00   Shipped       2         5     2003  ...   
     7/1/2003 0:00   Shipped       3         7     2003  ...   
    8/25/2003 0:00   Shipped       3         8     2003  ...   
   10/10/2003 0:00   Shipped       4        10     2003  ...   

                       ADDRESSLINE1  ADDRESSLINE2           CITY STATE  \
         897 Long Airport Avenue           NaN            NYC    NY   
              59 rue de l'Abbaye           NaN          Reims   NaN   
   27 rue du Colonel Pierre Avia           NaN          Paris   NaN   
              78934 Hillside Dr.           NaN       Pasadena    CA   
                 7734 Strong St.           NaN  San Francisco    CA   

     POSTALCODE  COUNTRY TERRITORY CONTACTLASTNAME CONTACTFIRSTNAME DEALSIZE  
       10022      USA       NaN              Yu             Kwai    Small  
       51100   France      EMEA         Henriot             Paul    Small  
       75508   France      EMEA        Da Cunha           Daniel   Medium  
       90003      USA       NaN           Young            Julie   Medium  
         NaN      USA       NaN           Brown            Julie   Medium  

[5 rows x 25 columns]

In [7]:

# Call the method: a bare df.info only echoes the bound-method repr of the whole frame
df.info()

In [8]:

#Columns to Remove
to_drop = ['ADDRESSLINE1', 'ADDRESSLINE2', 'STATE', 'POSTALCODE', 'PHONE']
df = df.drop(to_drop, axis=1)

In [9]:

#Check for null values
df.isnull().sum()

Out[9]:

ORDERNUMBER            0
QUANTITYORDERED        0
PRICEEACH              0
ORDERLINENUMBER        0
SALES                  0
ORDERDATE              0
STATUS                 0
QTR_ID                 0
MONTH_ID               0
YEAR_ID                0
PRODUCTLINE            0
MSRP                   0
PRODUCTCODE            0
CUSTOMERNAME           0
CITY                   0
COUNTRY                0
TERRITORY           1074
CONTACTLASTNAME        0
CONTACTFIRSTNAME       0
DEALSIZE               0
dtype: int64

In [10]:

# TERRITORY is the only column with missing values (1074 of 2823 rows)
# It is not used in the analysis below, so we leave it as is

In [11]:

df.dtypes

Out[11]:

ORDERNUMBER           int64
QUANTITYORDERED       int64
PRICEEACH           float64
ORDERLINENUMBER       int64
SALES               float64
ORDERDATE            object
STATUS               object
QTR_ID                int64
MONTH_ID              int64
YEAR_ID               int64
PRODUCTLINE          object
MSRP                  int64
PRODUCTCODE          object
CUSTOMERNAME         object
CITY                 object
COUNTRY              object
TERRITORY            object
CONTACTLASTNAME      object
CONTACTFIRSTNAME     object
DEALSIZE             object
dtype: object

In [12]:

# ORDERDATE is stored as a string (object dtype); convert it to datetime
df['ORDERDATE'] = pd.to_datetime(df['ORDERDATE'])

In [13]:

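# We need to create some features in order to form clusters
# Recency: number of days between the customer's latest order and the snapshot date
#          (one day after the most recent order in the dataset, not today's date)
# Frequency: number of purchases by the customer
#            ('count' counts order lines, so an order with several lines is counted several times)
# MonetaryValue: revenue generated by the customer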
import datetime as dt
snapshot_date = df['ORDERDATE'].max() + dt.timedelta(days = 1)
df_RFM = df.groupby(['CUSTOMERNAME']).agg({
    'ORDERDATE' : lambda x : (snapshot_date - x.max()).days,
    'ORDERNUMBER' : 'count',
    'SALES' : 'sum'
})

#Rename the columns
df_RFM.rename(columns = {
    'ORDERDATE' : 'Recency',
    'ORDERNUMBER' : 'Frequency',
    'SALES' : 'MonetaryValue'
}, inplace=True)

In [14]:

df_RFM.head()

Out[14]:
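
The aggregation above can be checked by hand on a tiny table. The sketch below uses made-up order lines (the customer names, dates and amounts are hypothetical, not taken from sales_data_sample.csv) and adds a `DistinctOrders` column, which is an assumption of this example, to show how `'count'` differs from `'nunique'` when one order has several lines.

```python
import pandas as pd

# Hypothetical order lines: order 1 of customer A has two lines
orders = pd.DataFrame({
    "CUSTOMERNAME": ["A", "A", "A", "B"],
    "ORDERNUMBER": [1, 1, 2, 3],
    "ORDERDATE": pd.to_datetime(
        ["2005-01-01", "2005-01-01", "2005-03-01", "2005-02-01"]
    ),
    "SALES": [100.0, 50.0, 200.0, 75.0],
})

# Same convention as the notebook: one day after the latest order
snapshot_date = orders["ORDERDATE"].max() + pd.Timedelta(days=1)

rfm = orders.groupby("CUSTOMERNAME").agg(
    Recency=("ORDERDATE", lambda d: (snapshot_date - d.max()).days),
    OrderLines=("ORDERNUMBER", "count"),        # what the notebook calls Frequency
    DistinctOrders=("ORDERNUMBER", "nunique"),  # alternative definition of Frequency
    MonetaryValue=("SALES", "sum"),
)
print(rfm)
```

Customer A has three order lines but only two distinct orders, so the choice between the two definitions can change the Frequency feature.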

In [16]:

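# Divide into segments: score each feature by quartile (1 = worst, 4 = best)
# Recency labels are reversed because a smaller recency (a more recent order) is better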
df_RFM['M'] = pd.qcut(df_RFM['MonetaryValue'], q = 4, labels = range(1,5))
df_RFM['R'] = pd.qcut(df_RFM['Recency'], q = 4, labels = list(range(4,0,-1)))
df_RFM['F'] = pd.qcut(df_RFM['Frequency'], q = 4, labels = range(1,5))

df_RFM.head()

Out[16]:
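
To see what the quartile scoring does, here is a small sketch on eight made-up customers (the numbers are illustrative only). It applies `pd.qcut` the same way as the cell above, with reversed labels for recency, and then adds the scores after casting them to integers.

```python
import pandas as pd

# Hypothetical values for eight customers
toy = pd.DataFrame({
    "Recency": [5, 40, 90, 200, 10, 60, 150, 300],
    "MonetaryValue": [100, 200, 300, 400, 500, 600, 700, 800],
})

# Low recency is good, so the labels run from 4 down to 1
toy["R"] = pd.qcut(toy["Recency"], q=4, labels=[4, 3, 2, 1])
# High monetary value is good, so the labels run from 1 up to 4
toy["M"] = pd.qcut(toy["MonetaryValue"], q=4, labels=[1, 2, 3, 4])

# qcut returns categorical columns; cast to int before doing arithmetic
toy["Score"] = toy["R"].astype(int) + toy["M"].astype(int)
print(toy)
```

With eight distinct values and four quantiles, each bin receives exactly two customers, which makes the scores easy to verify by eye.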

In [17]:

# Create another column for the RFM score
# R, M and F are categorical (from pd.qcut), so cast them to int before summing
df_RFM['RFM_Score'] = df_RFM[['R', 'M', 'F']].astype(int).sum(axis=1)
df_RFM.head()

Out[17]:

We create levels for our Customers

RFM Score >= 10 : High Value Customers

RFM Score < 10 and RFM Score >= 6 : Mid Value Customers

RFM Score < 6 : Low Value Customers

In [20]:

def rfm_level(row):
    # Map one customer's RFM score (a single row of df_RFM) to a value segment
    if row['RFM_Score'] >= 10:
        return 'High Value Customer'
    elif row['RFM_Score'] >= 6:
        return 'Mid Value Customer'
    else:
        return 'Low Value Customer'

df_RFM['RFM_Level'] = df_RFM.apply(rfm_level, axis = 1)
df_RFM.head()

Out[20]:
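
The same thresholds can also be applied without a row-wise `apply`. The sketch below is one possible vectorized variant using `np.select` on a handful of made-up scores; it is an alternative illustration, not what the notebook ran.

```python
import numpy as np
import pandas as pd

# Hypothetical RFM scores covering all three segments and both boundaries
scores = pd.Series([3, 5, 6, 9, 10, 12], name="RFM_Score")
s = scores.to_numpy()

# Conditions are checked in order, so the second one only applies when score < 10
levels = np.select(
    [s >= 10, s >= 6],
    ["High Value Customer", "Mid Value Customer"],
    default="Low Value Customer",
)
print(pd.DataFrame({"RFM_Score": scores, "RFM_Level": levels}))
```

Including the boundary values 6 and 10 in the toy input makes it clear which segment each threshold belongs to.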

In [21]:

# Time to perform KMeans
data = df_RFM[['Recency', 'Frequency', 'MonetaryValue']]
data.head()

Out[21]:

In [22]:

# The RFM features are right-skewed; a log transformation reduces the skew
# np.log needs strictly positive values (Recency is at least 1 day by construction)
data_log = np.log(data)
data_log.head()

Out[22]:

In [25]:

# Standardization: rescale each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data_log)
data_normalized = scaler.transform(data_log)
data_normalized = pd.DataFrame(data_normalized, index = data_log.index, columns=data_log.columns)
data_normalized.describe().round(2)

Out[25]:
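
A minimal sketch of what the two preprocessing steps do, on a made-up right-skewed series (powers of two, chosen only because their logarithms are evenly spaced). The standardization is written out by hand with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature
toy = pd.Series([1, 2, 4, 8, 16, 32, 64, 128], dtype=float)

skew_raw = toy.skew()      # clearly positive: long right tail
toy_log = np.log(toy)
skew_log = toy_log.skew()  # close to zero: evenly spaced after the log

# Manual equivalent of StandardScaler for a single column
toy_scaled = (toy_log - toy_log.mean()) / toy_log.std(ddof=0)

print(f"skew before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
print(toy_scaled.round(2).tolist())
```

K-Means relies on Euclidean distances, so putting the three features on a comparable scale keeps MonetaryValue from dominating the clustering.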

In [28]:

#Fit KMeans and use elbow method to choose the number of clusters
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

sse = {}

for k in range(1, 21):
    kmeans = KMeans(n_clusters = k, random_state = 1)
    kmeans.fit(data_normalized)
    sse[k] = kmeans.inertia_

In [31]:

plt.style.use('ggplot')  # set the style before creating the figure so that it takes effect
plt.figure(figsize=(10,6))
plt.title('The Elbow Method')

plt.xlabel('K')
plt.ylabel('SSE')

sns.pointplot(x=list(sse.keys()), y = list(sse.values()))
plt.text(4.5, 60, "Largest Angle", bbox = dict(facecolor = 'lightgreen', alpha = 0.5))
plt.show()

Out[31]:
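
Reading the elbow off a plot is subjective. One simple heuristic, sketched below, is to pick the k where the SSE curve bends the most, measured by the discrete second difference. The helper name `elbow_by_second_difference` and the SSE values are made up for illustration; they are not the values produced by this notebook, and the heuristic assumes consecutive integer values of k.

```python
import numpy as np

def elbow_by_second_difference(sse):
    """Return the k with the largest second difference of the SSE curve.

    `sse` maps consecutive integer k values to their SSE (inertia).
    """
    ks = sorted(sse)
    values = np.array([sse[k] for k in ks], dtype=float)
    # Second difference exists only for interior points: sse[k-1] - 2*sse[k] + sse[k+1]
    curvature = values[:-2] - 2 * values[1:-1] + values[2:]
    # curvature[i] belongs to ks[i + 1]
    return ks[1 + int(np.argmax(curvature))]

# Made-up SSE values with a sharp bend at k = 5
toy_sse = {1: 200, 2: 160, 3: 120, 4: 80, 5: 40, 6: 35, 7: 31, 8: 28}
best_k = elbow_by_second_difference(toy_sse)
print(best_k)
```

On real data the first drop in SSE is often the largest, so this heuristic can favour very small k; it is best treated as a cross-check on the plot rather than a replacement for it.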

In [32]:

# k = 5 looks like a reasonable choice from the elbow plot
kmeans = KMeans(n_clusters=5, random_state=1)
kmeans.fit(data_normalized)
cluster_labels = kmeans.labels_

data_rfm = data.assign(Cluster = cluster_labels)
data_rfm.head()

Out[32]:
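
A common next step is to summarize each cluster by its average Recency, Frequency and MonetaryValue, which helps give the clusters a business meaning. The sketch below does this on a made-up table with the same column names as `data_rfm`; the values and the two cluster labels are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for data_rfm with two clusters
toy_rfm = pd.DataFrame({
    "Recency": [10, 20, 300, 400],
    "Frequency": [50, 30, 5, 3],
    "MonetaryValue": [5000.0, 3000.0, 400.0, 200.0],
    "Cluster": [0, 0, 1, 1],
})

# Mean of each feature per cluster, plus the number of customers in it
summary = toy_rfm.groupby("Cluster").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    MonetaryValue=("MonetaryValue", "mean"),
    Customers=("Recency", "size"),
)
print(summary)
```

In this toy table, cluster 0 holds recent, frequent, high-spending customers and cluster 1 holds lapsed, low-spending ones; the same reading can be applied to the five clusters found above.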
