GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Unsupervised Methods/Day 2 EDA on Women purchasing pattern.ipynb
Kernel: Python 3 (ipykernel)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import warnings

warnings.filterwarnings(action = 'ignore')
data = pd.read_csv('women.csv')
data.head()
# Number of distinct values in each column from 'Rating' onward
for i in data.iloc[:, 5:].columns:
    print(f'{i} : {data[i].nunique()}')
Rating : 5
Recommended IND : 2
Positive Feedback Count : 82
Division Name : 3
Department Name : 6
Class Name : 20

Understanding Data

plt.figure(figsize = (25,6))
# Pass the column as x: recent seaborn versions treat the first positional argument as `data`
sns.countplot(x = data['Rating']);
Image in a Jupyter notebook
  • Most ratings fall between 3 and 5, which suggests buyers consider the product quality good

  • Most of the items are liked by buyers

plt.figure(figsize = (15,6))
sns.countplot(x = data['Recommended IND']);
Image in a Jupyter notebook
  • Most of the women recommended the products

  • A good number of buyers liked the product

  • One possible factor: buyers may be offered a discount for recommending a product

plt.figure(figsize = (25,6))
sns.countplot(x = data['Rating'], hue = data['Recommended IND']);
Image in a Jupyter notebook
  • Most highly rated products are also recommended

plt.figure(figsize = (25,6))
sns.countplot(x = data['Positive Feedback Count']);
Image in a Jupyter notebook
drop_positive = data[data['Positive Feedback Count'] > 20].index
data = data.drop(drop_positive, axis = 0).reset_index(drop = True)
plt.figure(figsize = (25,6))
sns.countplot(x = data['Positive Feedback Count']);
Image in a Jupyter notebook
data.shape
(23096, 11)
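The row filter above can also be written as a single boolean mask. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from women.csv) and assumes the column has no missing values, in which case keeping rows at or below the cutoff is equivalent to dropping rows above it.

```python
import pandas as pd

# Toy frame standing in for the review data (values are made up)
toy = pd.DataFrame({
    "Age": [25, 34, 41, 58, 63],
    "Positive Feedback Count": [0, 3, 45, 20, 122],
})

# Keep rows at or below the cutoff instead of collecting an index to drop
cutoff = 20
filtered = toy[toy["Positive Feedback Count"] <= cutoff].reset_index(drop=True)

print(filtered)
print(f"dropped {len(toy) - len(filtered)} of {len(toy)} rows")
```

Printing how many rows were removed makes it easier to judge whether a cutoff such as 20 is discarding more data than intended.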
plt.figure(figsize = (25,6))
sns.countplot(x = data['Age']);
Image in a Jupyter notebook
  • Buyers aged 30 to 50 form the largest group

  • A plausible reason is that people in this age range tend to have more money to spend than teens or senior citizens

plt.figure(figsize = (25,6))
sns.boxplot(y = data['Age'], x = data['Rating']);
Image in a Jupyter notebook
  • Most products received good or average ratings

  • Lowest for ages 84, 90, 91 and 94

plt.figure(figsize = (28,7))
sns.boxplot(y = data['Age'], x = data['Positive Feedback Count']);
Image in a Jupyter notebook
  • Highest for age 90, lowest for ages 84 and 94

plt.figure(figsize = (28,7))
sns.countplot(x = data['Age'], hue = data['Recommended IND']);
Image in a Jupyter notebook
  • Recommendations outnumber non-recommendations across all age bands

  • The number of purchases and the number of recommendations rise together across ages

# Split buyers into four age bands
teens = data[data['Age'] <= 19]
youngs = data[(data['Age'] > 19) & (data['Age'] <= 40)]
adult = data[(data['Age'] > 40) & (data['Age'] <= 60)]
senior = data[data['Age'] > 60]

plt.figure(figsize = (25,25))

# hue comes from the same subset as x so both series cover the same rows
plt.subplot(4,1,1)
sns.countplot(x = teens['Age'], hue = teens['Rating'])
plt.title('Ratings By Teens')

plt.subplot(4,1,2)
sns.countplot(x = youngs['Age'], hue = youngs['Rating'])
plt.title('Ratings By young')

plt.subplot(4,1,3)
sns.countplot(x = adult['Age'], hue = adult['Rating'])
plt.title('Ratings By Adult')

plt.subplot(4,1,4)
sns.countplot(x = senior['Age'], hue = senior['Rating'])
plt.title('Ratings By senior')

plt.legend(loc = 'upper right', title = 'Rating')
plt.show()
Image in a Jupyter notebook
  • Young women buy more, while adult women give good ratings more often than younger buyers.
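The four age bands can also be assigned in one step with `pd.cut`, which makes it easy to tabulate ratings per band. This is only a sketch on made-up rows: `toy`, `bins`, `band_names` and the `Age Band` column are illustrative names, and the bin edges are chosen to mirror the cutoffs used above (19, 40, 60).

```python
import pandas as pd

# Made-up ages and ratings, chosen to sit on both sides of each cutoff
toy = pd.DataFrame({
    "Age": [18, 19, 20, 40, 41, 60, 61, 83],
    "Rating": [5, 4, 3, 5, 4, 5, 2, 5],
})

# Right-inclusive bins: (0, 19], (19, 40], (40, 60], (60, 200]
bins = [0, 19, 40, 60, 200]
band_names = ["teens", "youngs", "adult", "senior"]
toy["Age Band"] = pd.cut(toy["Age"], bins=bins, labels=band_names)

# Count of reviews and mean rating per band
summary = toy.groupby("Age Band", observed=False)["Rating"].agg(["count", "mean"])
print(summary)
```

A per-band count and mean gives a numeric check on what the stacked count plots suggest visually.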

data.head()
data['Class Name'].value_counts()
Dresses           6149
Knits             4764
Blouses           3045
Sweaters          1412
Pants             1375
Jeans             1138
Fine gauge        1087
Skirts             940
Jackets            693
Lounge             683
Swim               348
Outerwear          323
Shorts             316
Sleep              227
Legwear            165
Intimates          154
Layering           144
Trend              116
Casual bottoms       2
Chemises             1
Name: Class Name, dtype: int64

Word Cloud

Checking Null Values

data.isna().sum()
Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3761
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Dropping All Null Values

data = data.dropna().reset_index(drop = True)

Word Cloud For Review Text

from wordcloud import WordCloud, STOPWORDS

comment_words = ''
stopwords = set(STOPWORDS)

# Iterate through the column and build one lowercase string of all words
for val in data['Review Text']:
    # Cast each value to string, lowercase it and split it into tokens
    tokens = str(val).lower().split()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'Black',
                      stopwords = stopwords,
                      min_font_size = 10).generate(comment_words)

# Plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook

Word Cloud For Title

comment_words = ''
stopwords = set(STOPWORDS)

# Iterate through the column and build one lowercase string of all words
for val in data['Title']:
    # Cast each value to string, lowercase it and split it into tokens
    tokens = str(val).lower().split()
    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'Black',
                      stopwords = stopwords,
                      min_font_size = 10).generate(comment_words)

# Plot the WordCloud image
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
Image in a Jupyter notebook
plt.figure(figsize = (15, 7))
sns.countplot(x = data['Class Name'])
plt.xticks(rotation = 40)
plt.show()
Image in a Jupyter notebook
  • Dresses, Knits and Blouses are the most sold items

Histograms

# Keep numeric columns only, then drop the two identifier columns
num_cols = data.select_dtypes(exclude = 'object')
num_cols = num_cols.drop(['Unnamed: 0','Clothing ID'], axis = 1)
num_cols.hist(figsize = (10,10), color = 'green');
Image in a Jupyter notebook

Boxplots

num_cols.boxplot(figsize = (10,10), color = 'orange');
Image in a Jupyter notebook

Distribution Plots

plt.figure(figsize = (10,10))
for i, col in enumerate(num_cols.columns):
    plt.subplot(2, 2, i + 1)
    # histplot with kde=True is used here in place of the deprecated sns.distplot
    sns.histplot(x = num_cols[col], kde = True, color = 'yellow')
plt.show()
Image in a Jupyter notebook

Individual Boxplots

plt.figure(figsize = (10,10))
for i, col in enumerate(num_cols.columns):
    plt.subplot(2, 2, i + 1)
    sns.boxplot(x = num_cols[col], color = 'red')
plt.show()
Image in a Jupyter notebook

Correlation Heatmap

plt.figure(figsize=(20, 5))
sns.heatmap(num_cols.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Image in a Jupyter notebook
  • Recommendation and rating show a clear positive correlation: higher ratings go together with recommendations.

Clustering

data.head(1)

I am creating a new data frame with all the relevant columns

# .copy() gives an independent frame, so later column assignments do not touch `data`
new_df = data[['Age', 'Rating','Recommended IND', 'Positive Feedback Count', 'Division Name','Department Name','Class Name']].copy()
new_df.head()

Convert Categorical Data to Numerical Using Label Encoder

from sklearn.preprocessing import LabelEncoder, StandardScaler

Label Encoding Categorical Columns

le = LabelEncoder()
# Columns from position 4 onward are the categorical ones
for col in new_df.iloc[:, 4:].columns:
    new_df[col] = le.fit_transform(new_df[col])
new_df.head()
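To see what label encoding does to a column, the sketch below encodes a few made-up department labels and prints the resulting mapping. The frame `toy`, the `encoders` dictionary and the `(encoded)` column name are illustrative. It keeps one fitted encoder per column, which is an assumption beyond the loop above: with a single shared `le`, only the mapping of the last column fitted remains available afterwards.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for one categorical column
toy = pd.DataFrame({"Department Name": ["Tops", "Dresses", "Bottoms", "Tops", "Dresses"]})

encoders = {}  # one fitted encoder per column, so codes can be decoded later
for col in ["Department Name"]:
    le = LabelEncoder()
    toy[col + " (encoded)"] = le.fit_transform(toy[col])
    encoders[col] = le

# classes_ is sorted, so codes follow alphabetical order of the labels
dept_encoder = encoders["Department Name"]
mapping = {label: code for code, label in enumerate(dept_encoder.classes_)}
print(mapping)

# Decode the integer codes back to the original labels
decoded = dept_encoder.inverse_transform(toy["Department Name (encoded)"])
print(list(decoded))
```

Note that the integer codes carry an alphabetical order that has no real meaning for the categories, yet a distance-based method such as K-Means will treat code 2 as farther from code 0 than code 1 is.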

Scaling Data

scaler = StandardScaler()
scaler.fit(new_df)

# Wrap the scaled array back into a DataFrame with the original labels
scaled_df = pd.DataFrame(scaler.transform(new_df),
                         columns = new_df.columns,
                         index = new_df.index)
scaled_df.head()
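As a quick check of what the scaling step does, the sketch below compares `StandardScaler` with the manual formula (x - mean) / std on a small made-up frame. The values and the names `toy`, `scaled` and `manual` are illustrative. `StandardScaler` uses the population standard deviation, hence `ddof=0` in the pandas call.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numeric columns on very different scales
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0], "Rating": [5.0, 4.0, 2.0, 1.0]})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy),
                      columns=toy.columns, index=toy.index)

# Same transformation by hand: subtract the column mean, divide by the population std
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled.to_numpy(), manual.to_numpy()))
print(scaled.mean().round(6).tolist(), scaled.std(ddof=0).round(6).tolist())
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the Euclidean distances that K-Means relies on.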

K-Means Algorithm

Silhouette Coefficient

  • Silhouette analysis is one way to evaluate the quality of a clustering, and it can be applied to clustering algorithms other than K-Means as well. The silhouette coefficient ranges from −1 to 1, where a higher value indicates more coherent, better separated clusters.

  • The Silhouette Coefficient is calculated for each sample from the mean intra-cluster distance (a) and the mean nearest-cluster distance (b), as (b - a) / max(a, b). To clarify, a is the mean distance from the sample to the other points in its own cluster, and b is the mean distance from the sample to the points of the nearest cluster that the sample is not a part of.
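The definition above can be checked by hand and then used to compare different numbers of clusters. This is a minimal sketch on synthetic data from `make_blobs`, not on the notebook's `scaled_df`. The helper `silhouette_for_sample`, the blob centers and the range of k values are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with three well separated groups (a stand-in for scaled features)
X, _ = make_blobs(n_samples=300, centers=[[-6, -6], [0, 6], [6, -6]],
                  cluster_std=1.0, random_state=42)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)


def silhouette_for_sample(X, labels, i):
    """Hypothetical helper: (b - a) / max(a, b) for sample i, Euclidean distance."""
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    if own.sum() == 1:
        return 0.0  # convention for a cluster with a single member
    # a: mean distance to the other members of the same cluster (self excluded)
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)


manual = silhouette_for_sample(X, labels, 0)
library = silhouette_samples(X, labels)[0]
print(np.isclose(manual, library, atol=1e-6))

# Mean silhouette score for several candidate values of k
scores = {}
for k in range(2, 7):
    km_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, km_labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the scores are rarely this clear cut, so the k with the highest mean silhouette is best read as a suggestion to weigh against how interpretable the resulting clusters are.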
