Case Study: Smart Retail Customer Segmentation Using Random Forest
Random Forest is an ensemble learning algorithm that builds multiple Decision Trees and aggregates their predictions.
Key ideas:
- Bootstrap sampling (bagging): each tree is trained on a random sample drawn with replacement
- Feature randomness: each split considers only a random subset of the features
- Aggregation: predictions are combined by majority voting

Random Forest helps reduce:
- Overfitting
- Variance
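The three key ideas above can be sketched by hand: train several decision trees on bootstrap samples, restrict each split to a random feature subset, and combine their predictions by majority vote. This is a minimal illustration on synthetic data (the dataset and tree count are assumptions, not the case-study data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 1000 rows, 3 classes (not the real retail dataset)
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

rng = np.random.default_rng(42)
trees = []
for _ in range(25):
    # Bootstrap sampling: draw 1000 row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # Feature randomness: each split considers sqrt(n_features) candidates
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority voting: each column holds the 25 trees' votes for one sample
votes = np.stack([t.predict(X) for t in trees])            # shape (25, 1000)
forest_pred = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), 0, votes)
```

`sklearn.ensemble.RandomForestClassifier` does exactly this internally (plus refinements), so in practice you would use it directly rather than this hand-rolled loop.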
A large retail chain wants to better understand its customers in order to personalize marketing campaigns and improve conversion rates.
The company collected behavioral data from 1000 customers, including purchase frequency, website engagement, spending behavior, and loyalty activity.
The objective is to automatically classify customers into three segments:
| Class | Segment | Meaning |
|---|---|---|
| 0 | Budget Buyers | Highly price-sensitive customers |
| 1 | Regular Customers | Moderate spending and engagement |
| 2 | Premium Customers | High spending and high engagement |
To solve this, we build a Random Forest classification model that predicts the customer segment based on behavioral features.
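A minimal sketch of that model, using synthetic behavioral features in place of the real dataset (the column construction and class separation below are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical stand-in for the 1000-customer dataset; the four columns
# mirror the behavioral features described above
rng = np.random.default_rng(0)
n = 1000
segment = rng.integers(0, 3, size=n)        # 0=Budget, 1=Regular, 2=Premium
X = np.column_stack([
    segment * 2.0 + rng.normal(0, 1, n),    # purchase frequency
    segment * 1.5 + rng.normal(0, 1, n),    # website engagement
    segment * 3.0 + rng.normal(0, 1.5, n),  # annual spending
    segment * 1.0 + rng.normal(0, 1, n),    # loyalty activity
])

X_train, X_test, y_train, y_test = train_test_split(
    X, segment, test_size=0.2, random_state=0, stratify=segment)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```

With real data you would replace the synthetic `X` with the collected behavioral features and tune `n_estimators`, `max_depth`, and `max_features` via cross-validation.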
Manual Gini Index Calculation (Using retail_customer Dataset)
Step 1: Select a Feature for Splitting
Assume the decision tree is considering a split on Annual_Spending with a threshold of 0 (the feature is presumably standardized, so 0 corresponds to average spending):
Annual_Spending ≤ 0
Annual_Spending > 0
Step 2: Count Class Distribution in Each Node
Assume after splitting the 1000 records we get:
Left Node (Annual_Spending ≤ 0)
| Segment | Count |
|---|---|
| Budget_Buyer | 180 |
| Regular_Customer | 90 |
| Premium_Customer | 30 |
Total = 300
Right Node (Annual_Spending > 0)
| Segment | Count |
|---|---|
| Budget_Buyer | 120 |
| Regular_Customer | 260 |
| Premium_Customer | 320 |
Total = 700
Step 3: Calculate Probabilities
Left Node
P(Budget) = 180 / 300 = 0.6
P(Regular) = 90 / 300 = 0.3
P(Premium) = 30 / 300 = 0.1
Right Node
P(Budget) = 120 / 700 = 0.171
P(Regular) = 260 / 700 = 0.371
P(Premium) = 320 / 700 = 0.457
Step 4: Apply Gini Formula
Gini = 1 − Σ pᵢ²
where pᵢ is the proportion of class i in the node, summed over all classes.
Step 5: Compute Gini for Left Node
Gini_left = 1 − (0.6² + 0.3² + 0.1²)
= 1 − (0.36 + 0.09 + 0.01)
= 1 − 0.46
= 0.54
Step 6: Compute Gini for Right Node
Gini_right = 1 − (0.171² + 0.371² + 0.457²)
= 1 − (0.029 + 0.138 + 0.209)
= 1 − 0.376
= 0.624
Step 7: Compute Weighted Gini for the Split
Each node's Gini is weighted by the fraction of records it contains:
Gini_split = (300/1000) × Gini_left + (700/1000) × Gini_right
= 0.3 × 0.54 + 0.7 × 0.624
= 0.162 + 0.437
= 0.599
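The hand calculation in Steps 5–7 can be checked with a few lines of Python, using the node counts from Step 2:

```python
# Verify the manual Gini arithmetic for the Annual_Spending split
def gini(counts):
    """Gini impurity of a node given its per-class record counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

left = gini([180, 90, 30])      # Annual_Spending <= 0 (300 records)
right = gini([120, 260, 320])   # Annual_Spending > 0 (700 records)

# Weight each node's impurity by its share of the 1000 records
weighted = (300 / 1000) * left + (700 / 1000) * right
print(round(left, 3), round(right, 3), round(weighted, 3))
```

Running this reproduces the values above: 0.54 for the left node, ≈ 0.624 for the right node, and ≈ 0.599 for the weighted split.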
Step 8: Interpretation
Lower Gini means better class separation.
| Gini Value | Interpretation |
|---|---|
| 0 | Pure node |
| 0–0.3 | Very good split |
| 0.3–0.6 | Moderate split |
| > 0.6 | Weak split |
With a weighted Gini of about 0.599, the Annual_Spending split is a moderate one: it separates the segments, but both child nodes still contain a mix of classes.