suyashi29
GitHub Repository: suyashi29/python-su
Path: blob/master/Machine Learning Ensemble Methods/3.1 Case Study Smart Retail Customer Segmentation Using Random Forest.ipynb
Kernel: Python 3 (ipykernel)

Random Forest is an ensemble learning algorithm that builds multiple Decision Trees and aggregates their predictions.

Key ideas:

  • Uses Bootstrap Sampling (Bagging)

  • Uses Feature Randomness

  • Aggregates predictions using majority voting

Random Forest helps reduce:

  • Overfitting

  • Variance
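The two mechanisms above can be illustrated with a small NumPy sketch (toy labels and hypothetical tree outputs, not the actual retail data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy labels for 10 training rows (0/1/2 = the three customer segments)
y = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])

# Bootstrap sampling: each tree trains on a sample drawn WITH replacement,
# so each tree sees a slightly different version of the data
bootstrap_idx = rng.integers(0, len(y), size=len(y))
print("Bootstrap sample indices:", bootstrap_idx)

# Majority voting: three hypothetical trees predict a segment for one customer;
# the forest returns the most common vote
tree_predictions = np.array([2, 1, 2])
votes = np.bincount(tree_predictions)
print("Ensemble prediction:", votes.argmax())  # segment 2 wins 2 votes to 1
```

Because each tree overfits a different bootstrap sample, their individual errors partly cancel out when averaged, which is why the ensemble has lower variance than any single tree.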

A large retail chain wants to better understand its customers in order to personalize marketing campaigns and improve conversion rates.

The company collected behavioral data from 1,000 customers, including:

  • Purchase frequency

  • Website engagement

  • Spending behavior

  • Loyalty activity

The objective is to automatically classify customers into three segments:

Class  Segment            Meaning
0      Budget Buyers      Highly price-sensitive customers
1      Regular Customers  Moderate spending and engagement
2      Premium Customers  High spending and high engagement

To solve this, we build a Random Forest classification model that predicts the customer segment based on behavioral features.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
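The next cell assumes a local retail_customer.csv. If that file is unavailable, a comparable synthetic dataset can be generated with make_classification; the feature column names below are assumptions for illustration, not the actual schema of the CSV:

```python
from sklearn.datasets import make_classification
import pandas as pd

# Hypothetical fallback: synthesize 1000 customers in 3 segments so the
# rest of the notebook can run end to end without the original CSV
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, random_state=42
)
retail_customer = pd.DataFrame(
    X_syn,
    columns=["Purchase_Frequency", "Website_Engagement",
             "Annual_Spending", "Loyalty_Score"],
)
retail_customer["Customer_Segment"] = y_syn
print(retail_customer.shape)  # (1000, 5)
```

Note that make_classification produces roughly standardized features, so values such as Annual_Spending ≤ 0 are meaningful thresholds rather than literal dollar amounts.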
retail_customer = pd.read_csv("retail_customer.csv")
retail_customer.head(3)
retail_customer.describe()
## Check class distribution
retail_customer["Customer_Segment"].value_counts().plot(kind="bar")
plt.title("Customer Segment Distribution")
plt.xlabel("Segment")
plt.ylabel("Count")
plt.show()
## Prepare features and target for ML
X = retail_customer.drop("Customer_Segment", axis=1)
y = retail_customer["Customer_Segment"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(
    n_estimators=100, random_state=42
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
importance = rf.feature_importances_
plt.barh(X.columns, importance)
plt.title("Feature Importance")
plt.show()

Manual Gini Index Calculation (Using retail_customer Dataset)

Step 1: Select a Feature for Splitting

Assume the decision tree is considering a split on the Annual_Spending feature (the feature is scaled, so negative values are possible):

  • Left branch: Annual_Spending ≤ 0

  • Right branch: Annual_Spending > 0

Step 2: Count Class Distribution in Each Node

Assume after splitting the 1000 records we get:

Left Node (Annual_Spending ≤ 0)

Segment           Count
Budget_Buyer      180
Regular_Customer   90
Premium_Customer   30

Total = 300

Right Node (Annual_Spending > 0)

Segment           Count
Budget_Buyer      120
Regular_Customer  260
Premium_Customer  320

Total = 700

Step 3: Calculate Probabilities

Left Node

  • P(Budget) = 180 / 300 = 0.6

  • P(Regular) = 90 / 300 = 0.3

  • P(Premium) = 30 / 300 = 0.1

Right Node

  • P(Budget) = 120 / 700 = 0.171

  • P(Regular) = 260 / 700 = 0.371

  • P(Premium) = 320 / 700 = 0.457

Step 4: Apply Gini Formula

Gini = 1 − Σᵢ pᵢ²

where pᵢ is the proportion of class i in the node.

Step 5: Compute Gini for Left Node

Gini_left = 1 − (0.6² + 0.3² + 0.1²)

= 1 − (0.36 + 0.09 + 0.01)

= 1 − 0.46

= 0.54

Step 6: Compute Gini for Right Node

Gini_right = 1 − (0.171² + 0.371² + 0.457²)

= 1 − (0.029 + 0.138 + 0.209)

= 1 − 0.376

= 0.624

Step 7: Compute Weighted Gini for the Split

Gini_split = (n_left / n) × Gini_left + (n_right / n) × Gini_right

= (300 / 1000) × 0.54 + (700 / 1000) × 0.624

= 0.162 + 0.437

= 0.599
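Steps 5–7 can be verified with a few lines of Python:

```python
def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions in a node."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

left = [180, 90, 30]     # Budget, Regular, Premium counts in the left node
right = [120, 260, 320]  # and in the right node

gini_left = gini(left)    # 0.54
gini_right = gini(right)  # ~0.624
n = sum(left) + sum(right)  # 1000 records total
weighted = (sum(left) / n) * gini_left + (sum(right) / n) * gini_right

print(round(gini_left, 3), round(gini_right, 3), round(weighted, 3))
# 0.54 0.624 0.599
```

During training, the tree computes this weighted Gini for every candidate split and keeps the one with the lowest value.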

Step 8: Interpretation

Lower Gini means better separation.

Gini Value  Interpretation
0           Pure node
0–0.3       Very good split
0.3–0.6     Moderate split