suyashi29

GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Unsupervised Methods/Lab 3 Day 2 .3 Women purchasing pattern using K-Means.ipynb
Kernel: Python 3 (ipykernel)

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings(action = 'ignore')

In [3]:

data = pd.read_csv('women.csv')
data.head()

Out[3]:

In [4]:

for i in data.iloc[: , 5 : ].columns:
    print(f'{i} : {data[i].nunique()}')

Out[4]:

Rating : 5
Recommended IND : 2
Positive Feedback Count : 82
Division Name : 3
Department Name : 6
Class Name : 20

Understanding Data

In [5]:

sns.countplot(x = data['Rating']);

Out[5]:

Most ratings fall between 3 and 5, which suggests that customers generally consider the product quality good.

In [6]:

sns.countplot(x = data['Recommended IND']);

Out[6]:

Most of the women recommended the products

In [7]:

sns.countplot(x = data['Rating'], hue = data['Recommended IND']);

Out[7]:

Products with a good rating are far more likely to be recommended.

In [8]:

plt.figure(figsize = (25,6))
sns.countplot(x = data['Positive Feedback Count']);

Out[8]:

In [9]:

drop_positive = data[data['Positive Feedback Count'] > 20].index

In [10]:

data = data.drop(drop_positive, axis = 0).reset_index(drop = True)

In [11]:

plt.figure(figsize = (25,6))
sns.countplot(x = data['Positive Feedback Count']);

Out[11]:

In [12]:

data.shape

Out[12]:

(23096, 11)

In [13]:

plt.figure(figsize = (25,6))
sns.countplot(x = data['Age']);

Out[13]:

Women aged 35 to 45 are the biggest buyers,
likely because people of this age tend to have more money than teens or senior citizens.

In [15]:

plt.figure(figsize = (25,6))
sns.barplot(x = data['Age'], y = data['Rating']);

Out[15]:

Average rating for each age.
It is lowest for ages 84, 90, 91 and 94.

In [17]:

plt.figure(figsize = (28,7))
sns.barplot(x = data['Age'], y = data['Positive Feedback Count']);

Out[17]:

The average positive feedback count is highest for age 90 and lowest for ages 84 and 94.

In [18]:

plt.figure(figsize = (28,7))
sns.countplot(x = data['Age'], hue = data['Recommended IND']);

Out[18]:

We can see here that the number of recommendations rises and falls with the number of purchases at each age.

In [19]:

teens = data[data['Age'] < 20]
youngs = data[(data['Age'] >= 20) & (data['Age'] <= 40)]  # >= 20 so that age 20 is not dropped
adult = data[(data['Age'] > 40) & (data['Age'] <= 60)]
senior = data[data['Age'] > 60]
plt.figure(figsize = (25,25))
plt.subplot(4,1,1)
sns.countplot(x = 'Age', hue = 'Rating', data = teens)  # hue comes from the same subset as x
plt.title('Ratings By Teens')
plt.subplot(4,1,2)
sns.countplot(x = 'Age', hue = 'Rating', data = youngs)
plt.title('Ratings By Young')
plt.subplot(4,1,3)
sns.countplot(x = 'Age', hue = 'Rating', data = adult)
plt.title('Ratings By Adult')
plt.subplot(4,1,4)
sns.countplot(x = 'Age', hue = 'Rating', data = senior)
plt.title('Ratings By Senior')
plt.legend(loc = 'upper right',title = 'Rating')
plt.show()

Out[19]:

Adult women buy more and give better ratings than younger women.
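The four age groups can also be built in a single step with pandas.cut, which guarantees that every age lands in exactly one group. The sketch below is only an illustration: the ages are made up in place of data['Age'], and the bin edges are an assumption that places age 20 in the young group.

```python
import pandas as pd

# Made-up ages standing in for data['Age']; the real column comes from women.csv.
ages = pd.Series([18, 19, 20, 21, 40, 41, 60, 61, 94], name="Age")

# right=False makes each bin include its left edge:
# [0, 20), [20, 41), [41, 61), [61, inf)
bins = [0, 20, 41, 61, float("inf")]
labels = ["teens", "youngs", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Show each age next to its group, then the size of each group.
print(pd.concat([ages, age_group.rename("Group")], axis=1))
print(age_group.value_counts().sort_index())
```

A grouped column like this can be passed to a single countplot instead of filtering the frame four times.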

In [20]:

data.head()

Out[20]:

In [21]:

data['Class Name'].value_counts()

Out[21]:

Dresses           6149
Knits             4764
Blouses           3045
Sweaters          1412
Pants             1375
Jeans             1138
Fine gauge        1087
Skirts             940
Jackets            693
Lounge             683
Swim               348
Outerwear          323
Shorts             316
Sleep              227
Legwear            165
Intimates          154
Layering           144
Trend              116
Casual bottoms       2
Chemises             1
Name: Class Name, dtype: int64

Word Cloud

Checking Null Values

In [22]:

data.isna().sum()

Out[22]:

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3761
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Dropping All Null Values

In [23]:

data = data.dropna().reset_index(drop = True)

Word Cloud For Review Text

In [24]:

from wordcloud import WordCloud, STOPWORDS

comment_words = ''
stopwords = set(STOPWORDS)
 
# iterate through Column
for val in data['Review Text']:
     
    # typecast each val to string
    val = str(val)
 
    # split the value
    tokens = val.split()
     
    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
     
    comment_words += " ".join(tokens)+" "
 
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='Black',
                stopwords = stopwords,
                min_font_size = 10).generate(comment_words)
 
# plot the WordCloud image                      
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
plt.show()

Out[24]:

Word Cloud For Title

In [25]:

comment_words = ''
stopwords = set(STOPWORDS)
 
# iterate through Column
for val in data['Title']:
     
    # typecast each val to string
    val = str(val)
 
    # split the value
    tokens = val.split()
     
    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
     
    comment_words += " ".join(tokens)+" "
 
wordcloud = WordCloud(width = 800, height = 800,
                background_color ='Black',
                stopwords = stopwords,
                min_font_size = 10).generate(comment_words)
 
# plot the WordCloud image                      
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
 
plt.show()

Out[25]:

In [23]:

plt.figure(figsize = (15, 7))
sns.countplot(x = data['Class Name'])
plt.xticks(rotation = 40)
plt.show()

Out[23]:

Dresses, Knits and Blouses are the most sold items.

Histograms

In [26]:

num_cols = data.select_dtypes(exclude = 'object')
num_cols = num_cols.drop(['Unnamed: 0','Clothing ID'], axis = 1)
num_cols.hist(figsize = (10,10), color = 'green');

Out[26]:

Boxplots

In [27]:

num_cols.boxplot(figsize = (10,10), color = 'orange');

Out[27]:

Distribution Plots

In [28]:

plt.figure(figsize = (10,10))
for idx, col in enumerate(num_cols.columns):
    plt.subplot(2, 2, idx + 1)
    # histplot with kde = True replaces distplot, which is deprecated in recent seaborn
    sns.histplot(x = num_cols[col], kde = True, color = 'yellow')
plt.show()

Out[28]:

Individual Boxplots

In [29]:

plt.figure(figsize = (10,10))
for idx, col in enumerate(num_cols.columns):
    plt.subplot(2, 2, idx + 1)
    sns.boxplot(x = num_cols[col], color = 'red')
plt.show()

Out[29]:

Correlation Heatmap

In [30]:

plt.figure(figsize=(20, 5))
sns.heatmap(num_cols.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Out[30]:

Recommended IND and Rating show a good positive correlation, i.e., higher ratings go together with more recommendations.

Clustering

In [31]:

data.head(1)

Out[31]:

In [32]:

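# .copy() gives an independent frame, so the encoding below does not write into a view of data
new_df = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']].copy()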

In [33]:

new_df.head()

Out[33]:

In [34]:

from sklearn.preprocessing import LabelEncoder, StandardScaler

Label Encoding Categorical Columns

In [35]:

le = LabelEncoder()
for col in new_df.iloc[:,4 : ].columns:
    new_df[col] = le.fit_transform(new_df[col])
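Because the loop reuses a single LabelEncoder, the encoder only remembers the classes of the last column it was fitted on. A minimal sketch of one alternative, keeping one encoder per column so that every mapping can be inspected or reversed later; the `encoders` dictionary and the toy values (including the made-up department names) are assumptions for illustration, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the categorical columns of new_df.
toy = pd.DataFrame({
    "Department Name": ["dept_b", "dept_a", "dept_b"],  # made-up values
    "Class Name": ["Knits", "Dresses", "Blouses"],
})

encoders = {}          # hypothetical name: column -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

print(encoded)
# Each column's mapping stays available, not just the last one fitted.
print(list(encoders["Department Name"].classes_))
print(list(encoders["Class Name"].inverse_transform([0, 1, 2])))
```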

In [36]:

new_df.head()

Out[36]:

Scaling Data

In [37]:

scaler = StandardScaler()
scaler.fit(new_df)
scaled_df = pd.DataFrame(scaler.transform(new_df), columns = new_df.columns, index = new_df.index)
scaled_df.head()

Out[37]:
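StandardScaler rescales every column to zero mean and unit variance, so that no single feature dominates the Euclidean distances K-Means relies on. A small sketch on made-up numbers (not the notebook's data) showing that the transform is the usual z-score computed with the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features, e.g. an age-like and a rating-like column.
X = np.array([[30.0, 5.0],
              [40.0, 3.0],
              [50.0, 1.0]])

# z-score by hand: subtract the column mean, divide by the column std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(scaled)
print(np.allclose(manual, scaled))
```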

K-Means Algorithm

In [38]:

error = []
for k in range(1,11):
    # fixed seed and explicit n_init so the elbow curve is reproducible
    km = KMeans(n_clusters = k, n_init = 10, random_state = 1)
    km.fit(scaled_df)
    error.append(km.inertia_)

Elbow Method for Best K value

In [39]:

plt.figure(figsize = (10,6))
plt.plot(list(range(1,11)), error, marker = 'X')
plt.title('Elbow Method')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()

Out[39]:

The elbow is not clear-cut: both K = 3 and K = 5 look plausible.
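The quantity plotted in the elbow curve is the K-Means inertia: the sum of squared distances from each point to the centre of its assigned cluster. A minimal sketch on synthetic two-blob data (an assumption for illustration, not the review data) that recomputes it by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Look up the centre assigned to each point and sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

Inertia always decreases as K grows, which is why the elbow method looks for the point where the decrease flattens rather than for a minimum.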

Silhouette Coefficient

Silhouette analysis is one of the metrics used to evaluate the quality of a clustering, and it can be applied to other clustering algorithms as well. The silhouette coefficient ranges between −1 and 1, where a higher value indicates a model with more coherent clusters.
The silhouette coefficient is calculated from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of. The score reported for a model is the mean over all samples.
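The formula can be checked by hand on a tiny example. The sketch below uses six made-up points in three clusters of two (so no cluster is a singleton, which is an assumption of the helper) and compares the hand-computed values with scikit-learn's silhouette_samples; the function name `silhouette_of` is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Made-up points: three clusters of two points each.
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [4.0, 0.0], [4.0, 1.0],
              [8.0, 0.0], [8.0, 1.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

def silhouette_of(i):
    """(b - a) / max(a, b) for sample i; assumes its cluster has >= 2 points."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    # a: mean distance to the other members of the same cluster (self excluded)
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == c].mean()
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i) for i in range(len(X))])
print(manual)
print(np.allclose(manual, silhouette_samples(X, labels)))
```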

In [42]:

from sklearn.metrics import silhouette_score

In [43]:

sil_score = []
for k in range(2,11):
    # same seed as the final model so the comparison is reproducible
    km = KMeans(n_clusters = k, n_init = 10, random_state = 1)
    pred = km.fit_predict(scaled_df)
    score = silhouette_score(scaled_df, pred)
    sil_score.append(score)
    print(f'Silhouette Score for K = {k} is {score}')

Out[43]:

Silhouette Score for K = 2 is 0.33015134965171683
Silhouette Score for K = 3 is 0.2156468473109423
Silhouette Score for K = 4 is 0.22674160321245745
Silhouette Score for K = 5 is 0.23862807754027696
Silhouette Score for K = 6 is 0.23448855556074166
Silhouette Score for K = 7 is 0.22695460667906933
Silhouette Score for K = 8 is 0.2196918416761983
Silhouette Score for K = 9 is 0.22024873675997048
Silhouette Score for K = 10 is 0.2275503333131384

Silhouette Score Plot

In [44]:

plt.figure(figsize = (10,6))
plt.plot(list(range(2,11)), sil_score, marker = 'X')
plt.title('Silhouette Scores')
plt.xlabel('K')
plt.ylabel('Score')
plt.show()

Out[44]:

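K = 2 gives the highest silhouette score overall (0.330). Restricting the choice to the elbow candidates, K = 5 (0.239) scores higher than K = 3 (0.216) and K = 4 (0.227), so 5 clusters are used.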

KMeans for K = 5

In [45]:

model = KMeans(n_clusters = 5, random_state = 1)
model.fit_predict(scaled_df)

Out[45]:

array([0, 4, 1, ..., 1, 0, 2])

Adding Labels to new_df and scaled_df

In [46]:

scaled_df['Labels'] = model.labels_
new_df['Labels'] = model.labels_
new_df.sample(5)

Out[46]:

In [47]:

new_df.Labels.value_counts()

Out[47]:

  6760
  5776
  3275
  2174
  1336
Name: Labels, dtype: int64

Cluster Profiling

In [48]:

# .copy() so that adding the Labels column does not write into a view of data
new_df2 = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']].copy()

In [49]:

new_df2['Labels'] = new_df['Labels']
new_df2.head()

Out[49]:

In [50]:

# Overall level summary
new_df2.describe().T

Out[50]:

Cluster means

In [54]:

# numeric_only = True skips the text columns, which have no mean
cluster_means = new_df2.groupby('Labels').mean(numeric_only = True).reset_index()
cluster_means.style.highlight_max(color="lightpink", axis=0)

Out[54]:

In [56]:

cluster_means.style.highlight_min(color="lightgreen", axis=0)

Out[56]:

Clusters 1, 2 and 4 give the best ratings and recommend products most often.
Cluster 0 consists of the women giving the lowest ratings and the fewest recommendations, although its positive feedback count is higher.
Cluster 3 consists of women with satisfactory ratings and recommendations.
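A cluster profile can combine means for the numeric columns with the most frequent value for a text column, since a mean is not defined for category names. A hedged sketch on a made-up frame (the values and the output column names are assumptions for illustration):

```python
import pandas as pd

# Made-up rows standing in for new_df2.
toy = pd.DataFrame({
    "Rating": [5, 4, 1, 2, 5],
    "Recommended IND": [1, 1, 0, 0, 1],
    "Class Name": ["Dresses", "Dresses", "Knits", "Knits", "Blouses"],
    "Labels": [1, 1, 0, 0, 1],
})

# Named aggregation: one output column per (input column, function) pair.
profile = toy.groupby("Labels").agg(
    mean_rating=("Rating", "mean"),
    recommend_rate=("Recommended IND", "mean"),
    top_class=("Class Name", lambda s: s.mode().iloc[0]),  # most frequent class
    size=("Rating", "size"),
)
print(profile)
```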

In [57]:

plt.figure(figsize = (20,7))
sns.countplot(x = new_df2['Class Name'], hue = new_df2['Labels'])
plt.xticks(rotation = 30)
plt.legend(loc = 'upper right', title = "Labels")
plt.show()

Out[57]:

Women in Cluster 2 tend to buy Dresses, Pants, Skirts, Jeans and Shorts.

Women in Cluster 1 are more interested in Blouses, Knits, Sweaters, Fine Gauge and Jackets.

Women in Cluster 4 are more attracted to Pants, Lounge, Sweaters, Skirts, Swim, Legwear and Layering.

Cluster 3 is smaller in number and buys mostly Dresses, Pants, Blouses and Knits.

Women in Cluster 0 show an average buying pattern similar to Cluster 3.

In [58]:

plt.figure(figsize = (28,8))
sns.countplot(x = new_df2['Rating'], hue = new_df2['Labels'])
plt.legend(loc = 'upper right', title = "Labels")
plt.show()

Out[58]:

Women in Cluster 0 give ratings of 1, 2 and 3 out of 5, which is the lowest of all the cluster groups.
The rest of the clusters give a good average rating.

In [59]:

plt.figure(figsize = (28,12))
for i in enumerate(new_df.iloc[:, 0 : 7].columns):
    plt.subplot(2, 4 , i[0] + 1)
    sns.boxplot(y = new_df[i[1]], x = new_df['Labels'])
plt.show()

Out[59]:

In [60]:

le.classes_

Out[60]:

array(['Blouses', 'Casual bottoms', 'Chemises', 'Dresses', 'Fine gauge',
       'Intimates', 'Jackets', 'Jeans', 'Knits', 'Layering', 'Legwear',
       'Lounge', 'Outerwear', 'Pants', 'Shorts', 'Skirts', 'Sleep',
       'Sweaters', 'Swim', 'Trend'], dtype=object)

Cluster 0

The majority are aged between 35 and 50

Lowest ratings and lowest recommendations

Buying more Dresses, Fine Gauge, Intimates, Jackets, Jeans etc.

Cluster 1

Aged 35 to 55

Good ratings and good recommendations

Buying mostly Blouses, Casual bottoms, Chemises, Dresses, Fine gauge, Intimates, Jackets and Jeans.

Cluster 2

The majority are aged 35 to 50

Good ratings and good recommendations

Buying almost the same products as Cluster 0, but more satisfied.

Cluster 3

The majority are aged 47 to 57

Good ratings and good recommendations

Buying patterns similar to Clusters 0, 1 and 2, but more skewed to the left side

Cluster 4

The majority are aged between 35 and 50

Good ratings and good recommendations

Buying Lounge, Outerwear, Pants, Shorts, Skirts, Sleep, Sweaters, Swim and Trend clothes

Conclusions

Women between 35 and 55 years of age are the biggest buyers and also give good reviews and ratings.
We should target this age group to increase sales.
Women aged 35 to 55 likely have more money and more purchasing power than younger and older women.
