GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Unsupervised Methods/Lab 3 Day 2 .3 Women purchasing pattern using K-Means.ipynb
Kernel: Python 3 (ipykernel)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings(action = 'ignore')
data = pd.read_csv('women.csv')
data.head()
for i in data.iloc[: , 5 : ].columns:
    print(f'{i} : {data[i].nunique()}')
Rating : 5
Recommended IND : 2
Positive Feedback Count : 82
Division Name : 3
Department Name : 6
Class Name : 20

Understanding Data

sns.countplot(x = data['Rating']);
Image in a Jupyter notebook
  • Most ratings fall between 3 and 5, which suggests that product quality is generally good

sns.countplot(x = data['Recommended IND']);
Image in a Jupyter notebook
  • Most of the women recommended the products

sns.countplot(x = data['Rating'], hue = data['Recommended IND']);
Image in a Jupyter notebook

A good rating generally means the product is also recommended

plt.figure(figsize = (25,6))
sns.countplot(x = data['Positive Feedback Count']);
Image in a Jupyter notebook
drop_positive = data[data['Positive Feedback Count'] > 20].index
data = data.drop(drop_positive, axis = 0).reset_index(drop = True)
plt.figure(figsize = (25,6))
sns.countplot(x = data['Positive Feedback Count']);
Image in a Jupyter notebook
data.shape
(23096, 11)
plt.figure(figsize = (25,6))
sns.countplot(x = data['Age']);
Image in a Jupyter notebook
  • Women aged 35 to 45 are the biggest buyers

  • A plausible reason is that people of this age tend to have more money than teens or senior citizens.

plt.figure(figsize = (25,6))
sns.barplot(x = data['Age'], y = data['Rating']);
Image in a Jupyter notebook
  • The plot shows the average rating for each age

  • It is lowest for ages 84, 90, 91, and 94

plt.figure(figsize = (28,7))
sns.barplot(x = data['Age'], y = data['Positive Feedback Count']);
Image in a Jupyter notebook
  • Average positive feedback count is highest for age 90 and lowest for ages 84 and 94

plt.figure(figsize = (28,7))
sns.countplot(x = data['Age'], hue = data['Recommended IND']);
Image in a Jupyter notebook
  • We can see here that buying is directly proportional to recommendations

# Split the reviewers into four age groups; >= 20 keeps age 20 in the 'young' group
teens = data[data['Age'] < 20]
youngs = data[(data['Age'] >= 20) & (data['Age'] <= 40)]
adult = data[(data['Age'] > 40) & (data['Age'] <= 60)]
senior = data[data['Age'] > 60]

plt.figure(figsize = (25,25))

# Take hue from the same subset as x so that both vectors share one index
plt.subplot(4,1,1)
sns.countplot(x = teens['Age'], hue = teens['Rating'])
plt.title('Ratings By Teens')

plt.subplot(4,1,2)
sns.countplot(x = youngs['Age'], hue = youngs['Rating'])
plt.title('Ratings By young')

plt.subplot(4,1,3)
sns.countplot(x = adult['Age'], hue = adult['Rating'])
plt.title('Ratings By Adult')

plt.subplot(4,1,4)
sns.countplot(x = senior['Age'], hue = senior['Rating'])
plt.title('Ratings By senior')

plt.legend(loc = 'upper right',title = 'Rating')
plt.show()
Image in a Jupyter notebook
  • Adult women buy more and give better ratings than younger women.
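
The four age groups used above can also be built with a single `pd.cut` call, which places every age into exactly one group. The following is a minimal sketch on made-up ages; the bin edges and the `age_group` name are assumptions chosen to mirror the groups above, not part of the original notebook.

```python
import pandas as pd

# Hypothetical ages standing in for data['Age'].
ages = pd.Series([18, 19, 20, 35, 40, 41, 60, 61, 83], name="Age")

# Each bin is (low, high]; include_lowest=True also closes the first bin on the left.
bins = [0, 19, 40, 60, 120]
labels = ["teens", "youngs", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# Count how many ages fall into each group, in label order.
group_sizes = ages.groupby(age_group, observed=False).size()
print(group_sizes)
```

Because the bins are contiguous, no age between 0 and 120 can fall between two groups.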

data.head()
data['Class Name'].value_counts()
Dresses           6149
Knits             4764
Blouses           3045
Sweaters          1412
Pants             1375
Jeans             1138
Fine gauge        1087
Skirts             940
Jackets            693
Lounge             683
Swim               348
Outerwear          323
Shorts             316
Sleep              227
Legwear            165
Intimates          154
Layering           144
Trend              116
Casual bottoms       2
Chemises             1
Name: Class Name, dtype: int64

Word Cloud

Checking Null Values

data.isna().sum()
Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3761
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Dropping All Null Values

data = data.dropna().reset_index(drop = True)

Word Cloud For Review Text

from wordcloud import WordCloud, STOPWORDS

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the column
for val in data['Review Text']:
    # typecast each value to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens)+" "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color ='Black',
                      stopwords = stopwords,
                      min_font_size = 10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook

Word Cloud For Title

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the column
for val in data['Title']:
    # typecast each value to string
    val = str(val)
    # split the value into tokens
    tokens = val.split()
    # convert each token to lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()
    comment_words += " ".join(tokens)+" "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color ='Black',
                      stopwords = stopwords,
                      min_font_size = 10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook
plt.figure(figsize = (15, 7))
sns.countplot(x = data['Class Name'])
plt.xticks(rotation = 40)
plt.show()
Image in a Jupyter notebook
  • Dresses, Knits, and Blouses are the most sold items

Histograms

num_cols = data.select_dtypes(exclude = 'object')
num_cols = num_cols.drop(['Unnamed: 0','Clothing ID'], axis = 1)
num_cols.hist(figsize = (10,10), color = 'green');
Image in a Jupyter notebook

Boxplots

num_cols.boxplot(figsize = (10,10), color = 'orange');
Image in a Jupyter notebook

Distribution Plots

plt.figure(figsize = (10,10))
for i in enumerate(num_cols.columns):
    plt.subplot(2,2, i[0] +1)
    sns.distplot(num_cols[i[1]],kde = True, color = 'yellow')
plt.show()
Image in a Jupyter notebook

Individual Boxplots

plt.figure(figsize = (10,10))
for i in enumerate(num_cols.columns):
    plt.subplot(2,2, i[0] +1)
    sns.boxplot(x = num_cols[i[1]], color = 'red')
plt.show()
Image in a Jupyter notebook

Correlation Heatmap

plt.figure(figsize=(20, 5))
sns.heatmap(num_cols.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Image in a Jupyter notebook
  • Recommended IND and Rating show a clear positive correlation: higher ratings tend to come with a recommendation.
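
To make the heatmap reading concrete, the sketch below computes the Pearson correlation between a rating column and a binary recommendation flag on a tiny made-up frame. The values are invented for illustration and are not taken from `women.csv`.

```python
import pandas as pd

# Made-up reviews: high ratings come with a recommendation, low ratings do not.
toy = pd.DataFrame({
    "Rating": [1, 2, 3, 4, 5, 5],
    "Recommended IND": [0, 0, 0, 1, 1, 1],
})

# Series.corr uses the Pearson coefficient by default.
r = toy["Rating"].corr(toy["Recommended IND"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the two columns rise together; it does not by itself say that one is a fixed multiple of the other.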

Clustering

data.head(1)
new_df = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']]
new_df.head()
from sklearn.preprocessing import LabelEncoder, StandardScaler

Label Encoding Categorical Columns

le = LabelEncoder()
for col in new_df.iloc[:,4 : ].columns:
    new_df[col] = le.fit_transform(new_df[col])
new_df.head()
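
Reusing one `LabelEncoder` for every column means that, after the loop, the encoder only remembers the classes of the last column it was fitted on. If the codes of each column need to be mapped back to names later, one option is to keep a separate encoder per column. This is a minimal sketch on a made-up frame; the `encoders` dictionary is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame: the column names mirror the notebook, the rows are made up.
toy = pd.DataFrame({
    "Division Name": ["General", "General Petite", "General"],
    "Class Name": ["Dresses", "Knits", "Blouses"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # classes are sorted alphabetically
    encoders[col] = enc

print(toy)
# Map integer codes back to the original class names.
decoded = encoders["Class Name"].inverse_transform([0, 1, 2])
print(decoded)
```

Note that the integer codes follow alphabetical order, so distances between codes carry no real meaning for these nominal categories; K-Means will still treat them as numeric.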

Scaling Data

scaler = StandardScaler()
scaler.fit(new_df)
scaled_df = pd.DataFrame(scaler.transform(new_df), columns = new_df.columns, index = new_df.index)
scaled_df.head()
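
As a quick check of what the scaling step does, the sketch below compares `StandardScaler` with a manual z-score on a small made-up array. Each column ends up with mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that K-Means relies on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two columns such as Age and Rating.
X = np.array([[20.0, 1.0],
              [40.0, 3.0],
              [60.0, 5.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```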

K-Means Algorithm

error = []
for k in range(1,11):
    km = KMeans(n_clusters = k)
    km.fit_predict(scaled_df)
    error.append(km.inertia_)

Elbow Method for Best K value

plt.figure(figsize = (10,6))
plt.plot(list(range(1,11)), error, marker = 'X')
plt.title('Elbow Method')
plt.xlabel('K')
plt.ylabel('Errors')
plt.show()
Image in a Jupyter notebook
  • The elbow is ambiguous here: both K = 3 and K = 5 look plausible
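
When the elbow is hard to read by eye, one rough aid is to print how much the inertia falls at each step. The sketch below does this on synthetic data with three well-separated blobs; the data, the blob centers, and the `rel_drop` name are assumptions for illustration, so the numbers say nothing about `women.csv`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three compact, well-separated blobs (made up for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

ks = list(range(1, 8))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

# Fractional drop in inertia when going from k to k + 1 clusters.
rel_drop = [(inertias[i] - inertias[i + 1]) / inertias[i]
            for i in range(len(ks) - 1)]
for k, d in zip(ks[:-1], rel_drop):
    print(f"k = {k} -> {k + 1}: inertia falls by {d:.1%}")
```

On this toy data the drop is large up to three clusters and small afterwards. On real data the decline is often gradual, which is why a second criterion such as the silhouette score is useful.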

Silhouette Coefficient

  • Silhouette analysis is one way to evaluate the quality of a clustering, and it can be applied to clustering algorithms other than K-Means as well. The silhouette coefficient ranges between −1 and 1, where a higher value indicates more coherent clusters.

  • The silhouette coefficient is calculated for each sample from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b). The coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of.
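
The definition above can be checked by hand on a tiny example. The sketch below computes (b - a) / max(a, b) for four made-up one-dimensional points and compares the result with scikit-learn's `silhouette_samples`. The helper `silhouette_by_hand` is hypothetical and assumes every cluster contains at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four made-up points in two obvious clusters.
X = np.array([[0.0], [1.0], [5.0], [6.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Silhouette of sample i; assumes each cluster has at least two points."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                      # exclude the sample itself
    a = dists[same].mean()               # mean intra-cluster distance
    b = min(dists[labels == c].mean()    # mean distance to the nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

by_hand = np.array([silhouette_by_hand(X, labels, i) for i in range(len(X))])
print(by_hand)
print(silhouette_samples(X, labels))
```

For the first point, a = 1 and b = 5.5, so the coefficient is 4.5 / 5.5, or about 0.82. The score reported by `silhouette_score` is the mean of these per-sample values.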


from sklearn.metrics import silhouette_score
sil_score = []
for k in range(2,11):
    km = KMeans(n_clusters = k)
    pred = km.fit_predict(scaled_df)
    score = silhouette_score(scaled_df, pred)
    sil_score.append(score)
    print(f'Silhouette Score for K = {k} is {score}')
Silhouette Score for K = 2 is 0.33015134965171683
Silhouette Score for K = 3 is 0.2156468473109423
Silhouette Score for K = 4 is 0.22674160321245745
Silhouette Score for K = 5 is 0.23862807754027696
Silhouette Score for K = 6 is 0.23448855556074166
Silhouette Score for K = 7 is 0.22695460667906933
Silhouette Score for K = 8 is 0.2196918416761983
Silhouette Score for K = 9 is 0.22024873675997048
Silhouette Score for K = 10 is 0.2275503333131384

Silhouette Score Plot

plt.figure(figsize = (10,6))
plt.plot(list(range(2,11)), sil_score, marker = 'X')
plt.title('Silhouette Scores')
plt.xlabel('K')
plt.ylabel('Score')
plt.show()
Image in a Jupyter notebook
  • K = 2 has the highest silhouette score overall (0.330), but among the elbow candidates K = 5 (0.239) scores higher than K = 3 (0.216) or K = 4 (0.227), so 5 clusters are used
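
The choice of K can be made explicit in a few lines. The sketch below uses the silhouette scores printed above, rounded to four decimals; restricting the search to K >= 3 is an assumption that reflects the elbow candidates, not a general rule.

```python
# Silhouette scores from the run above, rounded to four decimals.
scores = {2: 0.3302, 3: 0.2156, 4: 0.2267, 5: 0.2386, 6: 0.2345,
          7: 0.2270, 8: 0.2197, 9: 0.2202, 10: 0.2276}

# K with the highest score over all candidates.
best_overall = max(scores, key=scores.get)

# K with the highest score when at least three clusters are wanted.
best_from_3 = max((k for k in scores if k >= 3), key=scores.get)

print(best_overall, best_from_3)
```

Because `KMeans` was fitted without a fixed `random_state` in that loop, rerunning the notebook may give slightly different scores.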

KMeans for K = 5

model = KMeans(n_clusters = 5, random_state = 1)
model.fit_predict(scaled_df)
array([0, 4, 1, ..., 1, 0, 2])

Adding Labels to new_df and scaled_df

scaled_df['Labels'] = model.labels_
new_df['Labels'] = model.labels_
new_df.sample(5)
new_df.Labels.value_counts()
1    6760
2    5776
0    3275
4    2174
3    1336
Name: Labels, dtype: int64

Cluster Profiling

new_df2 = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']]
new_df2['Labels'] = new_df['Labels']
new_df2.head()
# Overall level summary
new_df2.describe().T

Cluster means

# numeric_only = True skips the text columns, which have no mean
cluster_means = new_df2.groupby('Labels').mean(numeric_only = True).reset_index()
cluster_means.style.highlight_max(color="lightpink", axis=0)
cluster_means.style.highlight_min(color="lightgreen", axis=0)
  • Clusters 1, 2, and 4 give the best ratings and also recommend the products

  • Cluster 0 consists of women who give the lowest ratings and the fewest recommendations, although their positive feedback count is higher

  • Cluster 3 consists of women with satisfactory ratings and recommendations.
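
Cluster profiling of this kind boils down to a group-by on the cluster label. The sketch below shows the pattern on a made-up frame: numeric columns are averaged per cluster, and the text column is summarised by its most frequent value. The rows and the `top_class` name are assumptions for illustration.

```python
import pandas as pd

# Made-up reviews with a cluster label per row.
toy = pd.DataFrame({
    "Rating": [5, 4, 5, 1, 2, 3],
    "Recommended IND": [1, 1, 1, 0, 0, 0],
    "Class Name": ["Knits", "Knits", "Dresses", "Pants", "Pants", "Dresses"],
    "Labels": [1, 1, 1, 0, 0, 0],
})

# Mean of every numeric column per cluster; text columns are skipped.
means = toy.groupby("Labels").mean(numeric_only=True)

# Most frequent class name per cluster.
top_class = toy.groupby("Labels")["Class Name"].agg(lambda s: s.mode().iloc[0])

print(means)
print(top_class)
```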

plt.figure(figsize = (20,7))
sns.countplot(x = new_df2['Class Name'], hue = new_df2['Labels'])
plt.xticks(rotation = 30)
plt.legend(loc = 'upper right', title = "Labels")
plt.show()
Image in a Jupyter notebook

Women in Cluster 2 tend to buy Dresses, Pants, Skirts, Jeans and Shorts.

Women in Cluster 1 are more interested in Blouses, Knits, Sweaters, Fine Gauge and Jackets.

Women in Cluster 4 are more attracted to Pants, Lounge, Sweaters, Skirts, Swim, Legwear and Layering.

Women in Cluster 3 are fewer in number and buy mostly Dresses, Pants, Blouses and Knits.

Women in Cluster 0 show an average buying pattern similar to Cluster 3.

plt.figure(figsize = (28,8))
sns.countplot(x = new_df2['Rating'], hue = new_df2['Labels'])
plt.legend(loc = 'upper right', title = "Labels")
plt.show()
Image in a Jupyter notebook
  • Women in Cluster 0 give ratings of 1, 2, and 3 out of 5, the lowest of all the cluster groups

  • The rest of the clusters give a good average rating.

plt.figure(figsize = (28,12))
for i in enumerate(new_df.iloc[:, 0 : 7].columns):
    plt.subplot(2, 4 , i[0] + 1)
    sns.boxplot(y = new_df[i[1]], x = new_df['Labels'])
plt.show()
Image in a Jupyter notebook
le.classes_
array(['Blouses', 'Casual bottoms', 'Chemises', 'Dresses', 'Fine gauge', 'Intimates', 'Jackets', 'Jeans', 'Knits', 'Layering', 'Legwear', 'Lounge', 'Outerwear', 'Pants', 'Shorts', 'Skirts', 'Sleep', 'Sweaters', 'Swim', 'Trend'], dtype=object)

Cluster 0

Most are between 35 and 50 years old

Lowest ratings and fewest recommendations

Buying more Dresses, Fine Gauge, Intimates, Jackets, Jeans, etc.

Cluster 1

Age 35 - 55

Good Rating + Good Recommendations

Buying mostly Blouses, Casual bottoms, Chemises, Dresses, Fine gauge, Intimates, Jackets & Jeans.

Cluster 2

Age 35 - 50 majority

Good Rating + Good Recommendations

Buying almost the same products as Cluster 0, but more satisfied.

Cluster 3

Majority aged 47 - 57

Good Rating + Good Recommendations

Similar buying patterns to Clusters 0, 1, and 2, but more skewed to the left side

Cluster 4

Most are between 35 and 50 years old

Good Rating + Good Recommendations

Buying Lounge, Outerwear, Pants, Shorts, Skirts, Sleep, Sweaters, Swim & Trend clothes

Conclusions

  • Women between 35 and 55 years of age are the biggest buyers and also give good reviews and ratings.

  • We should target this age group to increase sales.

  • Women aged 35 to 55 likely have more money and more purchasing power than younger and older women.