GitHub Repository: suyashi29/python-su
Path: blob/master/ML Clustering Analysis/Lab 3 Women purchasing pattern using K-Means.ipynb
Kernel: Python 3 (ipykernel)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings(action = 'ignore')
data = pd.read_csv('women.csv')
data.head()
for i in data.iloc[: , 5 : ].columns:
    print(f'{i} : {data[i].nunique()}')
Rating : 5
Recommended IND : 2
Positive Feedback Count : 82
Division Name : 3
Department Name : 6
Class Name : 20

Understanding the Data

sns.countplot(data['Rating']);
Image in a Jupyter notebook
  • Most ratings are 3-5, which suggests the products are generally of good quality

sns.countplot(data['Recommended IND']);
Image in a Jupyter notebook
  • Most of the women recommended the products

sns.countplot(data['Rating'], hue = data['Recommended IND']);
Image in a Jupyter notebook

A high rating generally means the product will be recommended, which the cross-tabulation below can quantify.
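
To put a number on this, one can cross-tabulate rating against the recommendation flag. This is a minimal sketch (not part of the original notebook) that reuses the `data` frame and pandas import from the cells above; the row-normalised shares show how often each rating level leads to a recommendation.

# Share of recommended (1) vs. not recommended (0) reviews within each rating level
rating_vs_rec = pd.crosstab(data['Rating'], data['Recommended IND'], normalize='index')
print(rating_vs_rec.round(2))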

plt.figure(figsize = (25,6))
sns.countplot(data['Positive Feedback Count']);
Image in a Jupyter notebook
drop_positive = data[data['Positive Feedback Count'] > 20].index
data = data.drop(drop_positive, axis = 0).reset_index(drop = True)
plt.figure(figsize = (25,6))
sns.countplot(data['Positive Feedback Count']);
Image in a Jupyter notebook
data.shape
(23096, 11)
plt.figure(figsize = (25,6))
sns.countplot(data['Age']);
Image in a Jupyter notebook
  • Buyers aged 35 to 45 account for the largest share of purchases

  • People in this age group tend to have more disposable income than teenagers or senior citizens.

plt.figure(figsize = (25,6))
sns.barplot(x = data['Age'], y = data['Rating']);
Image in a Jupyter notebook
  • Average rating by each age group

  • The lowest average ratings come from ages 85, 90, 91 & 94

plt.figure(figsize = (28,7))
sns.barplot(x = data['Age'], y = data['Positive Feedback Count']);
Image in a Jupyter notebook
  • Positive feedback count is highest for age 90 but lowest for ages 89 and 18

plt.figure(figsize = (28,7))
sns.countplot(data['Age'], hue = data['Recommended IND']);
Image in a Jupyter notebook
  • We can see here that buying is directly proportional to recommendations; the sketch below puts a number on this by age band
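
A rough way to quantify the observation above is the recommendation rate per age band. The sketch below bins Age with pd.cut (the bin edges and labels are my own choice for illustration, not from the notebook) and averages the 0/1 recommendation flag per band.

# Recommendation rate (mean of the 0/1 flag) per age band -- bin edges chosen for illustration
age_bands = pd.cut(data['Age'], bins=[0, 20, 40, 60, 100],
                   labels=['<=20', '21-40', '41-60', '60+'])
print(data.groupby(age_bands)['Recommended IND'].agg(['count', 'mean']).round(2))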

teens = data[data['Age'] < 20]
youngs = data[(data['Age'] > 20) & (data['Age'] <= 40)]
adult = data[(data['Age'] > 40) & (data['Age'] <= 60)]
senior = data[data['Age'] > 60]

plt.figure(figsize = (25,25))

plt.subplot(4,1,1)
sns.countplot(teens['Age'], hue = data['Rating'])
plt.title('Ratings By Teens')

plt.subplot(4,1,2)
sns.countplot(youngs['Age'], hue = data['Rating'])
plt.title('Ratings By young')

plt.subplot(4,1,3)
sns.countplot(adult['Age'], hue = data['Rating'])
plt.title('Ratings By Adult')

plt.subplot(4,1,4)
sns.countplot(senior['Age'], hue = data['Rating'])
plt.title('Ratings By senior')

plt.legend(loc = 'upper right', title = 'Rating')
plt.show()
Image in a Jupyter notebook
  • Most buyers in every age group give good ratings

data.head()
data['Class Name'].value_counts()
Dresses           6149
Knits             4764
Blouses           3045
Sweaters          1412
Pants             1375
Jeans             1138
Fine gauge        1087
Skirts             940
Jackets            693
Lounge             683
Swim               348
Outerwear          323
Shorts             316
Sleep              227
Legwear            165
Intimates          154
Layering           144
Trend              116
Casual bottoms       2
Chemises             1
Name: Class Name, dtype: int64

Word Cloud

Checking Null Values

data.isna().sum()
Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3761
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Dropping All Null Values

data = data.dropna().reset_index(drop = True)

Word Cloud For Review Text

from wordcloud import WordCloud, STOPWORDS

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the column
for val in data['Review Text']:
    # typecast each value to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'green',
                      stopwords = stopwords,
                      min_font_size = 14).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (18, 12), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook

Word Cloud For Title

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the column
for val in data['Title']:
    # typecast each value to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'yellow',
                      stopwords = stopwords,
                      min_font_size = 12).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (18, 12), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook
plt.figure(figsize = (15, 7))
sns.countplot(data['Class Name'])
plt.xticks(rotation = 40)
plt.show()
Image in a Jupyter notebook
  • Dresses, Knits, and Blouses are the most purchased items

Histograms

num_cols = data.select_dtypes(exclude = 'object')
num_cols = num_cols.drop(['Unnamed: 0','Clothing ID'], axis = 1)
num_cols.hist(figsize = (10,10), color = 'green');
Image in a Jupyter notebook

Boxplots

num_cols.boxplot(figsize = (10,10), color = 'orange');
Image in a Jupyter notebook

Distribution Plots

plt.figure(figsize = (10,10))
for i in enumerate(num_cols.columns):
    plt.subplot(2, 2, i[0] + 1)
    sns.distplot(num_cols[i[1]], kde = True, color = 'yellow')
plt.show()

Individual Boxplots

plt.figure(figsize = (10,10))
for i in enumerate(num_cols.columns):
    plt.subplot(2, 2, i[0] + 1)
    sns.boxplot(num_cols[i[1]], color = 'red')
plt.show()

Correlation Heatmap

plt.figure(figsize=(20, 5))
sns.heatmap(num_cols.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
  • Recommendation and rating show a strong positive correlation, i.e., they rise and fall together; the exact coefficient can be pulled out directly, as sketched below.
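
For the specific pair called out above, the coefficient can be read off the heatmap or computed directly; this one-liner is a small sketch reusing num_cols from the cell above.

# Pearson correlation between Rating and the 0/1 recommendation flag
print(num_cols['Rating'].corr(num_cols['Recommended IND']).round(2))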

Clustering

data.head(1)
new_df = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']]
new_df.head()
from sklearn.preprocessing import LabelEncoder, StandardScaler

Label Encoding Categorical Columns

le = LabelEncoder()
for col in new_df.iloc[:, 4 : ].columns:
    new_df[col] = le.fit_transform(new_df[col])
new_df.head()
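
Note that a single LabelEncoder is re-fitted inside the loop above, so afterwards le.classes_ only reflects the last column encoded (Class Name), which is what this notebook relies on later. If the mapping for every categorical column were needed, one encoder could be kept per column; the dictionary name `encoders` below is my own, the rest follows the same pattern.

# One encoder per categorical column, so each original-name mapping stays recoverable
encoders = {}
for col in ['Division Name', 'Department Name', 'Class Name']:
    encoders[col] = LabelEncoder().fit(data[col])
    # encoders[col].classes_ now holds the original category names for this column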

Scaling Data

scaler = StandardScaler()
scaler.fit(new_df)
scaled_df = pd.DataFrame(scaler.transform(new_df), columns = new_df.columns, index = new_df.index)
scaled_df.head()

K-Means Algorithm

error = []
for k in range(1, 11):
    km = KMeans(n_clusters = k)
    km.fit_predict(scaled_df)
    error.append(km.inertia_)

Elbow Method for Best K value

plt.figure(figsize = (10,6))
plt.plot(list(range(1,11)), error, marker = 'X')
plt.title('Elbow Method')
plt.xlabel('K')
plt.ylabel('Error (inertia)')
plt.show()
Image in a Jupyter notebook
  • The elbow is ambiguous here; it could be at K = 3 or K = 4 (see the sketch below)
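
One way to make the elbow less subjective is to look at how much the inertia falls at each step. This is a rough heuristic sketch (my own addition, not part of the original analysis) using the `error` list computed above.

# Fractional drop in inertia when moving from K-1 to K clusters
inertia = np.array(error)
drops = (inertia[:-1] - inertia[1:]) / inertia[:-1]
for k, d in zip(range(2, 11), drops):
    print(f'K = {k}: inertia falls by {d:.1%} vs K = {k - 1}')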

Silhouette Coefficient

  • One of the metrics used to evaluate the quality of a clustering is silhouette analysis, which can also be applied to other clustering algorithms. The silhouette coefficient ranges between −1 and 1, where a higher value indicates a model with more coherent clusters.

  • The Silhouette Coefficient is calculated from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b − a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
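
The per-sample definition above can be checked directly with scikit-learn's silhouette_samples. This is a minimal sketch (my own addition) for one candidate K, reusing scaled_df and the KMeans import from earlier cells.

from sklearn.metrics import silhouette_samples

# Per-sample silhouette values (b - a) / max(a, b) for a K = 3 clustering
labels_k3 = KMeans(n_clusters=3, random_state=1).fit_predict(scaled_df)
sample_sil = silhouette_samples(scaled_df, labels_k3)
print(f'mean = {sample_sil.mean():.3f}, min = {sample_sil.min():.3f}, max = {sample_sil.max():.3f}')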


from sklearn.metrics import silhouette_score
sil_score = []
for k in range(2, 11):
    km = KMeans(n_clusters = k)
    pred = km.fit_predict(scaled_df)
    score = silhouette_score(scaled_df, pred)
    sil_score.append(score)
    print(f'Silhouette Score for K = {k} is {score}')
Silhouette Score for K = 2 is 0.13100287934394378
Silhouette Score for K = 3 is 0.22232958285811769
Silhouette Score for K = 4 is 0.18729185374820115
Silhouette Score for K = 5 is 0.2160331797769901
Silhouette Score for K = 6 is 0.2245349961764663
Silhouette Score for K = 7 is 0.17708867969003694
Silhouette Score for K = 8 is 0.22534628990642938
Silhouette Score for K = 9 is 0.21945540768264313
Silhouette Score for K = 10 is 0.20540160664776794

Silhouette Score Plot

plt.figure(figsize = (10,6))
plt.plot(list(range(2,11)), sil_score, marker = 'X')
plt.title('Silhouette Scores')
plt.xlabel('K')
plt.ylabel('Score')
plt.show()
Image in a Jupyter notebook
  • For K = 5 the silhouette score is comparable to K = 3 and clearly higher than K = 4, so together with the elbow plot, 5 is chosen as the number of clusters

KMeans for K = 5

model = KMeans(n_clusters = 5, random_state = 1)
model.fit_predict(scaled_df)
array([4, 2, 1, ..., 0, 4, 2])

Adding Labels to new_df and scaled_df

scaled_df['Labels'] = model.labels_
new_df['Labels'] = model.labels_
new_df.sample(5)
new_df.Labels.value_counts()
2    6038
1    5019
0    3640
4    3302
3    1322
Name: Labels, dtype: int64

Cluster Profiling

new_df2 = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']]
new_df2['Labels'] = new_df['Labels']
new_df2.head()
# Overall level summary
new_df2.describe().T

Cluster means

cluster_means = new_df2.groupby('Labels').mean().reset_index()
cluster_means.style.highlight_max(color="lightgreen", axis=0)
cluster_means.style.highlight_min(color="lightgreen", axis=0)
  • Clusters 0, 1 & 2 give the best ratings and also recommend the products

  • Cluster 4 consists of women giving the lowest ratings and fewest recommendations, and their feedback count is also low

  • Cluster 3 consists of women with satisfactory ratings and recommendations, but a high feedback count

plt.figure(figsize = (20,7))
sns.countplot(x = new_df2['Class Name'], hue = new_df2['Labels'])
plt.xticks(rotation = 30)
plt.legend(loc = 'upper right', title = "Labels")
plt.show()
Image in a Jupyter notebook

Women in Cluster 2 tend to buy Dresses, Pants, Skirts, Jeans and Shorts.

Women in Cluster 1 are more interested in Blouses, Knits, Sweaters, Fine Gauge and Jackets.

Women in Cluster 4 are more attracted to Pants, Lounge, Sweaters, Skirts, Swim, Legwear and Layering.

Cluster 3 is the smallest group and buys mostly Dresses, Pants, Blouses and Knits.

Women in Cluster 0 show an average buying pattern, similar to Cluster 3. These observations are backed with numbers in the sketch below.
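
To back these reading-off-the-plot observations, one can look at each class's share within every cluster. This is a small sketch (my own addition) using the columns already present in new_df2.

# Share of each Class Name within each cluster (rows sum to 1)
class_share = pd.crosstab(new_df2['Labels'], new_df2['Class Name'], normalize='index')
print(class_share.round(2))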

plt.figure(figsize = (28,8))
sns.countplot(x = new_df2['Rating'], hue = new_df2['Labels'])
plt.legend(loc = 'upper right', title = "Labels")
plt.show()
Image in a Jupyter notebook
  • Women in Cluster 3 give ratings of 1, 2 & 3 out of 5, the lowest among all cluster groups

  • The rest of the clusters give good average ratings.

plt.figure(figsize = (28,12))
for i in enumerate(new_df.iloc[:, 0 : 7].columns):
    plt.subplot(2, 4, i[0] + 1)
    sns.boxplot(y = new_df[i[1]], x = new_df['Labels'])
plt.show()
Image in a Jupyter notebook
le.classes_
array(['Blouses', 'Casual bottoms', 'Chemises', 'Dresses', 'Fine gauge', 'Intimates', 'Jackets', 'Jeans', 'Knits', 'Layering', 'Legwear', 'Lounge', 'Outerwear', 'Pants', 'Shorts', 'Skirts', 'Sleep', 'Sweaters', 'Swim', 'Trend'], dtype=object)

Cluster 0

Ages 35 to 50 make up the majority

Giving highest Ratings, Good Recommendations

Buying more Sweaters, Fine Gauge, Intimates, Jackets, Jeans etc.

Cluster 1

Ages 35 to 55

Good Rating + Good Recommendations

Buying mostly Blouses, Casual bottoms, Chemises, Dresses, Fine gauge, Intimates, Jackets & Jeans.

Cluster 2

Ages 35 to 50 make up the majority

Good Rating + Good Recommendations

Buying almost the same products as Cluster 0, but more satisfied.

Cluster 3

Majority aged 47 to 57

Lowest ratings + bad Recommendations

Similar buying patterns to Clusters 0, 1 and 2, but more skewed to the left

Cluster 4

Ages 35 to 50 make up the majority

Good Rating + Good Recommendations

Buying Lounge, Outerwear, Pants, Shorts, Skirts, Sleep, Sweaters, Swim & Trend clothes

Conclusions

  • Women between 35 and 55 years of age are the biggest buyers and also give good reviews and ratings.

  • We should target this age group to increase sales.

  • Women aged 35 to 55 have more money and more purchasing power than younger and older women.