GitHub Repository: suyashi29/python-su
Path: blob/master/ML Clustering Analysis/Lab 3 Women purchasing pattern using K-Means.ipynb
Kernel: Python 3 (ipykernel)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings(action = 'ignore')
data = pd.read_csv('women.csv')
data.head()
for i in data.iloc[: , 5 : ].columns:
    print(f'{i} : {data[i].nunique()}')
Rating : 5
Recommended IND : 2
Positive Feedback Count : 82
Division Name : 3
Department Name : 6
Class Name : 20

Understanding the Data

sns.countplot(data['Rating']);
Image in a Jupyter notebook
  • Most ratings are 3-5, which suggests the products are generally of good quality

sns.countplot(data['Recommended IND']);
Image in a Jupyter notebook
  • Most of the women recommended the products

sns.countplot(data['Rating'], hue = data['Recommended IND']);
Image in a Jupyter notebook

A high rating generally means the product will be recommended, which the cross-tabulation below can quantify.
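
To put a number on this, one can cross-tabulate rating against the recommendation flag. This is a minimal sketch (not part of the original notebook) that reuses the `data` frame and pandas import from the cells above; the row-normalised shares show how often each rating level leads to a recommendation.

# Share of recommended (1) vs. not recommended (0) reviews within each rating level
rating_vs_rec = pd.crosstab(data['Rating'], data['Recommended IND'], normalize='index')
print(rating_vs_rec.round(2))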

plt.figure(figsize = (25,6))
sns.countplot(data['Positive Feedback Count']);
Image in a Jupyter notebook
drop_positive = data[data['Positive Feedback Count'] > 20].index
data = data.drop(drop_positive, axis = 0).reset_index(drop = True)
plt.figure(figsize = (25,6))
sns.countplot(data['Positive Feedback Count']);
Image in a Jupyter notebook
data.shape
(23096, 11)
plt.figure(figsize = (25,6))
sns.countplot(data['Age']);
Image in a Jupyter notebook
  • Buyers aged 35 to 45 account for the largest share of purchases

  • People in this age group tend to have more disposable income than teenagers or senior citizens.

plt.figure(figsize = (25,6))
sns.barplot(x = data['Age'], y = data['Rating']);
Image in a Jupyter notebook
  • Average rating by each age group

  • The lowest average ratings come from ages 85, 90, 91 & 94

plt.figure(figsize = (28,7))
sns.barplot(x = data['Age'], y = data['Positive Feedback Count']);
Image in a Jupyter notebook
  • Positive feedback count is highest for age 90 but lowest for ages 89 and 18

plt.figure(figsize = (28,7))
sns.countplot(data['Age'], hue = data['Recommended IND']);
Image in a Jupyter notebook
  • We can see here that buying is directly proportional to recommendations; the sketch below puts a number on this by age band
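
A rough way to quantify the observation above is the recommendation rate per age band. The sketch below bins Age with pd.cut (the bin edges and labels are my own choice for illustration, not from the notebook) and averages the 0/1 recommendation flag per band.

# Recommendation rate (mean of the 0/1 flag) per age band -- bin edges chosen for illustration
age_bands = pd.cut(data['Age'], bins=[0, 20, 40, 60, 100],
                   labels=['<=20', '21-40', '41-60', '60+'])
print(data.groupby(age_bands)['Recommended IND'].agg(['count', 'mean']).round(2))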

teens = data[data['Age'] < 20]
youngs = data[(data['Age'] > 20) & (data['Age'] <= 40)]
adult = data[(data['Age'] > 40) & (data['Age'] <= 60)]
senior = data[data['Age'] > 60]

plt.figure(figsize = (25,25))

plt.subplot(4,1,1)
sns.countplot(teens['Age'], hue = data['Rating'])
plt.title('Ratings By Teens')

plt.subplot(4,1,2)
sns.countplot(youngs['Age'], hue = data['Rating'])
plt.title('Ratings By young')

plt.subplot(4,1,3)
sns.countplot(adult['Age'], hue = data['Rating'])
plt.title('Ratings By Adult')

plt.subplot(4,1,4)
sns.countplot(senior['Age'], hue = data['Rating'])
plt.title('Ratings By senior')

plt.legend(loc = 'upper right', title = 'Rating')
plt.show()
Image in a Jupyter notebook
  • Most buyers in every age group give good ratings

data.head()
data['Class Name'].value_counts()
Dresses           6149
Knits             4764
Blouses           3045
Sweaters          1412
Pants             1375
Jeans             1138
Fine gauge        1087
Skirts             940
Jackets            693
Lounge             683
Swim               348
Outerwear          323
Shorts             316
Sleep              227
Legwear            165
Intimates          154
Layering           144
Trend              116
Casual bottoms       2
Chemises             1
Name: Class Name, dtype: int64

Word Cloud

Checking Null Values

data.isna().sum()
Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3761
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Dropping All Null Values

data = data.dropna().reset_index(drop = True)

Word Cloud For Review Text

from wordcloud import WordCloud, STOPWORDS

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the column
for val in data['Review Text']:
    # typecast each value to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'green',
                      stopwords = stopwords,
                      min_font_size = 14).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (18, 12), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook

Word Cloud For Title

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the column
for val in data['Title']:
    # typecast each value to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'yellow',
                      stopwords = stopwords,
                      min_font_size = 12).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (18, 12), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook
plt.figure(figsize = (15, 7))
sns.countplot(data['Class Name'])
plt.xticks(rotation = 40)
plt.show()
Image in a Jupyter notebook
  • Dresses, Knits, and Blouses are the most purchased items

Histograms

num_cols = data.select_dtypes(exclude = 'object')
num_cols = num_cols.drop(['Unnamed: 0','Clothing ID'], axis = 1)
num_cols.hist(figsize = (10,10), color = 'green');
Image in a Jupyter notebook

Boxplots

num_cols.boxplot(figsize = (10,10), color = 'orange');
Image in a Jupyter notebook

Distribution Plots

plt.figure(figsize = (10,10))
for i in enumerate(num_cols.columns):
    plt.subplot(2, 2, i[0] + 1)
    sns.distplot(num_cols[i[1]], kde = True, color = 'yellow')
plt.show()

Individual Boxplots

plt.figure(figsize = (10,10))
for i in enumerate(num_cols.columns):
    plt.subplot(2, 2, i[0] + 1)
    sns.boxplot(num_cols[i[1]], color = 'red')
plt.show()

Correlation Heatmap

plt.figure(figsize=(20, 5))
sns.heatmap(num_cols.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Recommendation and rating show a strong positive correlation, i.e., they rise and fall together; the exact coefficient can be pulled out directly, as sketched below.
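
For the specific pair called out above, the coefficient can be read off the heatmap or computed directly; this one-liner is a small sketch reusing num_cols from the cell above.

# Pearson correlation between Rating and the 0/1 recommendation flag
print(num_cols['Rating'].corr(num_cols['Recommended IND']).round(2))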

Clustering

data.head(1)
new_df = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']]
new_df.head()
from sklearn.preprocessing import LabelEncoder, StandardScaler

Label Encoding Categorical Columns

le = LabelEncoder()
for col in new_df.iloc[:, 4 : ].columns:
    new_df[col] = le.fit_transform(new_df[col])
new_df.head()
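
Note that a single LabelEncoder is re-fitted inside the loop above, so afterwards le.classes_ only reflects the last column encoded (Class Name), which is what this notebook relies on later. If the mapping for every categorical column were needed, one encoder could be kept per column; the dictionary name `encoders` below is my own, the rest follows the same pattern.

# One encoder per categorical column, so each original-name mapping stays recoverable
encoders = {}
for col in ['Division Name', 'Department Name', 'Class Name']:
    encoders[col] = LabelEncoder().fit(data[col])
    # encoders[col].classes_ now holds the original category names for this column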

Scaling Data

scaler = StandardScaler()
scaler.fit(new_df)
scaled_df = pd.DataFrame(scaler.transform(new_df), columns = new_df.columns, index = new_df.index)
scaled_df.head()

K-Means Algorithm

error = []
for k in range(1, 11):
    km = KMeans(n_clusters = k)
    km.fit_predict(scaled_df)
    error.append(km.inertia_)

Elbow Method for Best K value

plt.figure(figsize = (10,6))
plt.plot(list(range(1,11)), error, marker = 'X')
plt.title('Elbow Method')
plt.xlabel('K')
plt.ylabel('Error (inertia)')
plt.show()
Image in a Jupyter notebook
  • The elbow is ambiguous here; it could be at K = 3 or K = 4 (see the sketch below)
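
One way to make the elbow less subjective is to look at how much the inertia falls at each step. This is a rough heuristic sketch (my own addition, not part of the original analysis) using the `error` list computed above.

# Fractional drop in inertia when moving from K-1 to K clusters
inertia = np.array(error)
drops = (inertia[:-1] - inertia[1:]) / inertia[:-1]
for k, d in zip(range(2, 11), drops):
    print(f'K = {k}: inertia falls by {d:.1%} vs K = {k - 1}')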

Silhouette Coefficient

  • One of the metrics used to evaluate the quality of a clustering is silhouette analysis, which can also be applied to other clustering algorithms. The silhouette coefficient ranges between −1 and 1, where a higher value indicates a model with more coherent clusters.

  • The Silhouette Coefficient is calculated from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b − a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
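
The per-sample definition above can be checked directly with scikit-learn's silhouette_samples. This is a minimal sketch (my own addition) for one candidate K, reusing scaled_df and the KMeans import from earlier cells.

from sklearn.metrics import silhouette_samples

# Per-sample silhouette values (b - a) / max(a, b) for a K = 3 clustering
labels_k3 = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)
sample_sil = silhouette_samples(scaled_df, labels_k3)
print(f'mean = {sample_sil.mean():.3f}, min = {sample_sil.min():.3f}, max = {sample_sil.max():.3f}')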


from sklearn.metrics import silhouette_score
sil_score = []
for k in range(2, 11):
    km = KMeans(n_clusters = k)
    pred = km.fit_predict(scaled_df)
    score = silhouette_score(scaled_df, pred)
    sil_score.append(score)
    print(f'Silhouette Score for K = {k} is {score}')
Silhouette Score for K = 2 is 0.13100287934394378
Silhouette Score for K = 3 is 0.22232958285811769
Silhouette Score for K = 4 is 0.18729185374820115
Silhouette Score for K = 5 is 0.2160331797769901
Silhouette Score for K = 6 is 0.2245349961764663
Silhouette Score for K = 7 is 0.17708867969003694
Silhouette Score for K = 8 is 0.22534628990642938
Silhouette Score for K = 9 is 0.21945540768264313
Silhouette Score for K = 10 is 0.20540160664776794

Silhouette Score Plot

plt.figure(figsize = (10,6))
plt.plot(list(range(2,11)), sil_score, marker = 'X')
plt.title('Silhouette Scores')
plt.xlabel('K')
plt.ylabel('Score')
plt.show()
Image in a Jupyter notebook
  • For K = 5 the silhouette score is comparable to K = 3 and clearly higher than K = 4, so together with the elbow plot, 5 is chosen as the number of clusters

KMeans for K = 5

model = KMeans(n_clusters = 5, random_state = 1)
model.fit_predict(scaled_df)
array([4, 2, 1, ..., 0, 4, 2])

Adding Labels to new_df and scaled_df

scaled_df['Labels'] = model.labels_
new_df['Labels'] = model.labels_
new_df.sample(5)
new_df.Labels.value_counts()
2    6038
1    5019
0    3640
4    3302
3    1322
Name: Labels, dtype: int64

Cluster Profiling

new_df2 = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']]
new_df2['Labels'] = new_df['Labels']
new_df2.head()
# Overall level summary
new_df2.describe().T

Cluster means

cluster_means = new_df2.groupby('Labels').mean().reset_index()
cluster_means.style.highlight_max(color="lightgreen", axis=0)
cluster_means.style.highlight_min(color="lightgreen", axis=0)
  • Clusters 0, 1 & 2 give the best ratings and also recommend the products

  • Cluster 4 consists of women giving the lowest ratings and fewest recommendations, and their feedback count is also low

  • Cluster 3 consists of women with satisfactory ratings and recommendations, but a high feedback count

plt.figure(figsize = (20,7))
sns.countplot(x = new_df2['Class Name'], hue = new_df2['Labels'])
plt.xticks(rotation = 30)
plt.legend(loc = 'upper right', title = "Labels")
plt.show()
Image in a Jupyter notebook

Women in Cluster 2 tend to buy Dresses, Pants, Skirts, Jeans and Shorts.

Women in Cluster 1 are more interested in Blouses, Knits, Sweaters, Fine Gauge and Jackets.

Women in Cluster 4 are more attracted to Pants, Lounge, Sweaters, Skirts, Swim, Legwear and Layering.

Cluster 3 is the smallest group and buys mostly Dresses, Pants, Blouses and Knits.

Women in Cluster 0 show an average buying pattern, similar to Cluster 3. These observations are backed with numbers in the sketch below.
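
To back these reading-off-the-plot observations, one can look at each class's share within every cluster. This is a small sketch (my own addition) using the columns already present in new_df2.

# Share of each Class Name within each cluster (rows sum to 1)
class_share = pd.crosstab(new_df2['Labels'], new_df2['Class Name'], normalize='index')
print(class_share.round(2))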

plt.figure(figsize = (28,8))
sns.countplot(x = new_df2['Rating'], hue = new_df2['Labels'])
plt.legend(loc = 'upper right', title = "Labels")
plt.show()
Image in a Jupyter notebook
  • Women in Cluster 3 give ratings of 1, 2 & 3 out of 5, the lowest among all cluster groups

  • The rest of the clusters give good average ratings.

plt.figure(figsize = (28,12))
for i in enumerate(new_df.iloc[:, 0 : 7].columns):
    plt.subplot(2, 4, i[0] + 1)
    sns.boxplot(y = new_df[i[1]], x = new_df['Labels'])
plt.show()
Image in a Jupyter notebook
le.classes_
array(['Blouses', 'Casual bottoms', 'Chemises', 'Dresses', 'Fine gauge', 'Intimates', 'Jackets', 'Jeans', 'Knits', 'Layering', 'Legwear', 'Lounge', 'Outerwear', 'Pants', 'Shorts', 'Skirts', 'Sleep', 'Sweaters', 'Swim', 'Trend'], dtype=object)

Cluster 0

Ages 35 to 50 make up the majority

Giving highest Ratings, Good Recommendations

Buying more Sweaters, Fine Gauge, Intimates, Jackets, Jeans etc.

Cluster 1

Ages 35 to 55

Good Rating + Good Recommendations

Buying mostly Blouses, Casual bottoms, Chemises, Dresses, Fine gauge, Intimates, Jackets & Jeans.

Cluster 2

Ages 35 to 50 make up the majority

Good Rating + Good Recommendations

Buying almost the same products as Cluster 0, but more satisfied.

Cluster 3

Majority aged 47 to 57

Lowest ratings + bad Recommendations

Similar buying patterns to Clusters 0, 1 and 2, but more skewed to the left

Cluster 4

Ages 35 to 50 make up the majority

Good Rating + Good Recommendations

Buying Lounge, Outerwear, Pants, Shorts, Skirts, Sleep, Sweaters, Swim & Trend clothes

Conclusions

  • Women between 35 and 55 years of age are the biggest buyers and also give good reviews and ratings.

  • We should target this age group to increase sales.

  • Women aged 35 to 55 have more money and more purchasing power than younger and older women.