Random Forests in Python – A Simple, Step‑by‑Step Guide
In this notebook we build intuition for Random Forests and then implement them in Python with scikit-learn. We will:
Load or create the data
Explore the data with basic EDA and visualizations
Train a single Decision Tree (our baseline)
Understand the theory and formulas behind Random Forests
Train a Random Forest classifier
Compare performance: tree vs forest
Visualize feature importance and number-of-trees vs accuracy
1. Intuition: From a Single Tree to a Forest 🌳🌳🌳
A Decision Tree is like a flowchart of yes/no questions that splits data into smaller and smaller groups.
Trees are easy to understand, but a single tree can overfit (memorize noise in the training data).
Idea of a Random Forest
A Random Forest builds many decision trees and combines their predictions.
Bootstrap sampling (Bagging)
We create many different training sets by sampling with replacement from the original data.
For each tree $m$, we draw a bootstrap sample $D_m$ from the training data.
Random feature subsets
At each split, instead of looking at all features, each tree only looks at a random subset of features.
Voting (for classification)
Each tree makes a prediction $h_m(x)$.
The Random Forest prediction is the majority vote:
$$\hat{y} = \text{mode}\left( h_1(x), h_2(x), \dots, h_M(x) \right)$$
For regression, we take the average of the tree predictions:
$$\hat{y} = \frac{1}{M} \sum_{m=1}^{M} h_m(x)$$
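As a quick numerical illustration of these two rules (the tree predictions below are made up for the example), here is a minimal NumPy sketch of majority voting and averaging:

```python
import numpy as np

# Hypothetical predictions from M = 5 trees for one sample x
class_votes = np.array([1, 0, 1, 1, 0])            # classification: each tree votes a class label
reg_preds   = np.array([2.3, 1.9, 2.1, 2.6, 2.0])  # regression: each tree predicts a value

# Majority vote (mode): the most frequent class label wins
y_hat_class = np.bincount(class_votes).argmax()    # -> 1

# Regression: simple average of the tree predictions
y_hat_reg = reg_preds.mean()                       # -> 2.18

print(y_hat_class, y_hat_reg)
```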
Why does this help?
Each tree is different (due to random data and random features).
Individual trees may be weak and noisy.
But if many weak models agree, their average is usually strong and stable.
This is called an ensemble method.
2. Generate a Simple Synthetic Dataset
We will create a 2-class classification dataset with a few numerical features using make_classification.
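One way to create such a dataset is the sketch below; the exact parameter values are illustrative, not fixed by the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification

# 2-class dataset with a handful of numerical features
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    random_state=42,
)

# Wrap it in a DataFrame so EDA is easier
feature_names = [f"feature_{i+1}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df["target"] = y
df.head()
```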
3. Quick EDA (Exploratory Data Analysis)
We'll keep it very simple:
View first few rows
Summary statistics
Class balance
Distributions of a couple of features
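A minimal EDA cell along these lines, assuming the `df` DataFrame created above, might look like:

```python
import matplotlib.pyplot as plt

# First few rows and summary statistics
print(df.head())
print(df.describe())

# Class balance: share of samples per class
print(df["target"].value_counts(normalize=True))

# Distributions of two features, split by class
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, ["feature_1", "feature_2"]):
    for label in sorted(df["target"].unique()):
        ax.hist(df.loc[df["target"] == label, col], bins=30, alpha=0.5, label=f"class {label}")
    ax.set_title(col)
    ax.legend()
plt.tight_layout()
plt.show()
```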
4. Train–Test Split
We split the data into:
Training set: for learning the model
Test set: for evaluating on unseen data
We will use an 80–20 split.
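In scikit-learn this is a one-liner; stratifying on the target keeps the class balance similar in both splits (variable names follow the earlier cells):

```python
from sklearn.model_selection import train_test_split

X = df[feature_names]   # features created earlier
y = df["target"]

# 80% train / 20% test, stratified so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```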
5. Baseline: Single Decision Tree
Before we build a Random Forest, let's train one DecisionTreeClassifier.
Gini Impurity (split criterion)
One common way to decide how to split a node is Gini impurity:
$$G = 1 - \sum_{k=1}^{K} p_k^2$$
where $p_k$ is the proportion of class $k$ in the node.
A pure node (only one class) has Gini $G = 0$. Decision Trees choose splits that reduce impurity the most.
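A baseline tree can be trained and evaluated with a few lines; this is a sketch that reuses the split from the previous step and leaves all hyperparameters at their defaults except `random_state` (Gini impurity is already the default criterion):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Single decision tree, default settings
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, tree.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, tree.predict(X_test)))
# A large gap between train and test accuracy is a sign of overfitting
```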
6. Random Forest Classifier
Now we train a RandomForestClassifier.
Key hyperparameters:
n_estimators: number of trees in the forest (more trees → more stable, more compute)
max_depth: maximum depth of each tree (None = full growth)
max_features: number of features to consider at each split
min_samples_split: minimum samples needed to split an internal node
random_state: for reproducibility
We will start with a simple configuration and then inspect performance.
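One possible starting configuration (the values here are illustrative, not prescriptive):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 200 trees, otherwise mostly default settings
forest = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,       # grow each tree fully
    max_features="sqrt",  # random subset of features at each split
    random_state=42,
    n_jobs=-1,            # use all CPU cores
)
forest.fit(X_train, y_train)

print("Train accuracy:", accuracy_score(y_train, forest.predict(X_train)))
print("Test accuracy: ", accuracy_score(y_test, forest.predict(X_test)))
print(classification_report(y_test, forest.predict(X_test)))
```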
7. Feature Importance
Random Forests can give us an estimate of feature importance:
Each split that uses a feature reduces impurity.
We average that reduction over all trees.
Higher values → feature used more often and more effectively for splitting.
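In scikit-learn, these impurity-based importances are available in the fitted model's `feature_importances_` attribute. A quick bar plot, assuming the `forest` model and `feature_names` from the earlier cells, could look like:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Impurity-based importances: one value per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values()

importances.plot(kind="barh", figsize=(6, 4))
plt.xlabel("Mean decrease in impurity")
plt.title("Random Forest feature importances")
plt.tight_layout()
plt.show()
```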
8. How Does the Number of Trees Affect Performance?
One important question: How many trees do we need?
Too few trees → model may be unstable.
More trees → more stable, but takes more time.
We'll train forests with different n_estimators and plot accuracy on train and test sets.
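One way to run this experiment is sketched below; the grid of `n_estimators` values is arbitrary and the split variables come from the earlier cells:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

n_trees_grid = [1, 5, 10, 25, 50, 100, 200, 400]
train_acc, test_acc = [], []

for n in n_trees_grid:
    rf = RandomForestClassifier(n_estimators=n, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, rf.predict(X_train)))
    test_acc.append(accuracy_score(y_test, rf.predict(X_test)))

plt.plot(n_trees_grid, train_acc, marker="o", label="train")
plt.plot(n_trees_grid, test_acc, marker="o", label="test")
plt.xlabel("n_estimators (number of trees)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```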
9. Visualizing the Decision Boundary (Just for Intuition)
To see what the model is doing, we'll:
Take only 2 features (feature_1 and feature_2)
Train a Random Forest on this 2D data
Plot the decision boundary
This is only for intuition – in practice we use all useful features.
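A sketch of such a 2D plot, assuming the `df` DataFrame from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Use only two features so we can plot in 2D
X2 = df[["feature_1", "feature_2"]].values
y2 = df["target"].values

rf2 = RandomForestClassifier(n_estimators=200, random_state=42)
rf2.fit(X2, y2)

# Evaluate the model on a dense grid covering the feature space
x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300), np.linspace(y_min, y_max, 300))
Z = rf2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
plt.scatter(X2[:, 0], X2[:, 1], c=y2, cmap="coolwarm", s=10, edgecolors="k", linewidths=0.2)
plt.xlabel("feature_1")
plt.ylabel("feature_2")
plt.title("Random Forest decision boundary (2 features)")
plt.show()
```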
10. Summary
In this notebook, we:
Built a synthetic classification dataset
Performed simple EDA
Trained a single Decision Tree and saw its performance
Understood the Random Forest idea:
Bootstrap samples
Random feature subsets
Majority voting
Trained a RandomForestClassifier and compared it with a single tree
Looked at feature importances
Studied how number of trees affects accuracy
Visualized a 2D decision boundary for intuition
Random Forests are:
Powerful baseline models
Strong performers even with little tuning
More robust to overfitting than single trees