Decision Tree
A Decision Tree is a supervised learning algorithm used for both classification and regression. It splits the data into branches based on feature values until it reaches a final decision.
Key Components of Decision Trees in Python
Root Node: The starting node of the decision tree, representing the complete dataset.
Branch Nodes: Internal nodes that represent decision points, where the data is split based on a specific attribute.
Leaf Nodes: Terminal nodes that represent the final classification or prediction.
Decision Rules: Rules that govern the splitting of data at each branch node.
Attribute Selection: The process of choosing the most informative attribute for each split.
Splitting Criteria: Metrics such as information gain (based on entropy) or the Gini Index are used to determine the best split.
Assumptions we make while using a Decision Tree
At the beginning, we consider the whole training set as the root.
Attributes are assumed to be categorical when using information gain, and continuous when using the Gini Index.
Records are distributed recursively on the basis of attribute values.
Statistical measures (such as information gain or the Gini Index) are used to decide which attribute becomes the root or an internal node.
How Decision Trees Work
A decision tree repeatedly asks the best question that splits the data into the most “pure” subsets.
Key idea: Choose a feature that results in the highest Information Gain (or lowest Gini impurity).
Important Mathematics Behind Decision Trees
Entropy (Measure of impurity or randomness)
[ H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i) ]
Where:
( S ) = dataset
( c ) = number of classes
( p_i ) = proportion of class i
Entropy range:
0 → perfectly pure
1 → completely mixed (maximum uncertainty for a two-class problem)
Information Gain (IG)
Information Gain tells us how much entropy is reduced after a split:
[ IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) ]
Where:
( A ) = feature
( S_v ) = subset where feature = v
Gini Index (CART Algorithm)
[ Gini(S) = 1 - \sum_{i=1}^{c} p_i^2 ]
Lower Gini ⇒ better purity.
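To make the entropy and Gini formulas concrete, here is a minimal Python sketch (the function names are mine, not from the notebook) that computes both impurity measures from a list of class labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(entropy(["Yes"] * 5 + ["No"] * 5))  # 1.0 -> maximum entropy for two classes
print(gini(["Yes"] * 5 + ["No"] * 5))     # 0.5 -> maximum Gini for two classes
```

For a binary problem entropy tops out at 1 while Gini tops out at 0.5; in both cases, lower means purer.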
Voting Dataset (Small Example)
Let’s use a simple dataset of 10 people:
| Person | Age | Income | Education | Will Vote |
|---|---|---|---|---|
| 1 | Young | High | Graduate | Yes |
| 2 | Young | Medium | Graduate | No |
| 3 | Middle | Low | High School | Yes |
| 4 | Old | High | Graduate | Yes |
| 5 | Middle | Medium | Graduate | No |
| 6 | Old | Low | High School | No |
| 7 | Young | Low | High School | No |
| 8 | Old | Medium | Graduate | Yes |
| 9 | Middle | High | Graduate | Yes |
| 10 | Young | Medium | High School | No |
Goal: Predict whether a new person will vote (Yes/No).
Step 1: Calculate Parent Entropy
Class counts:
Yes = 5
No = 5
[ p(Yes)=\frac{5}{10}=0.5,\quad p(No)=0.5 ]
[ H(S)= -0.5\log_2(0.5) - 0.5\log_2(0.5)=1 ]
So initial entropy = 1 (maximum impurity).
Step 2: Choose Best Feature to Split
Let’s calculate Information Gain for the feature Age.
Split on Age
Young (4 samples)
Yes = 1
No = 3
[ H_{Young} = -\frac{1}{4}\log_2\left(\frac{1}{4}\right) - \frac{3}{4}\log_2\left(\frac{3}{4}\right) = 0.81 ]
Middle (3 samples)
Yes = 2
No = 1
[ H_{Middle} = -\frac{2}{3}\log_2\left(\frac{2}{3}\right) - \frac{1}{3}\log_2\left(\frac{1}{3}\right) = 0.92 ]
Old (3 samples)
Yes = 2
No = 1
[ H_{Old} = 0.92 ]
Weighted Entropy After Split
[ H_{split} = \frac{4}{10}(0.81) + \frac{3}{10}(0.92) + \frac{3}{10}(0.92) = 0.87 ]
Information Gain
[ IG(Age) = 1 - 0.87 = 0.13 ]
Doing the same calculation for Income and Education shows that Income gives the highest Information Gain (≈ 0.40, worked out in the manual-math section below), so it becomes the root node.
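This is easy to verify with a short sketch over the same table; the data list and helper names below are my own, and the entropy() helper is repeated from the earlier sketch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# The 10-person table from above: (Age, Income, Education, Will Vote)
data = [
    ("Young",  "High",   "Graduate",    "Yes"),
    ("Young",  "Medium", "Graduate",    "No"),
    ("Middle", "Low",    "High School", "Yes"),
    ("Old",    "High",   "Graduate",    "Yes"),
    ("Middle", "Medium", "Graduate",    "No"),
    ("Old",    "Low",    "High School", "No"),
    ("Young",  "Low",    "High School", "No"),
    ("Old",    "Medium", "Graduate",    "Yes"),
    ("Middle", "High",   "Graduate",    "Yes"),
    ("Young",  "Medium", "High School", "No"),
]
labels = [row[-1] for row in data]

def information_gain(col):
    """Parent entropy minus the weighted entropy of the subsets formed by one column."""
    subsets = {}
    for row in data:
        subsets.setdefault(row[col], []).append(row[-1])
    weighted = sum(len(s) / len(data) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

for col, name in enumerate(["Age", "Income", "Education"]):
    print(f"IG({name}) = {information_gain(col):.2f}")
# Income comes out highest (about 0.40), so it becomes the root node.
```

(The exact values for Age and Education are ≈ 0.12; the 0.13 above comes from rounding the intermediate entropies to two decimals.)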
Final Decision Tree (ASCII Diagram)
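The diagram cell itself is not reproduced here; a sketch consistent with the Interpretation section below (Income at the root, Age checked only when Income is Medium, and the Low branch shown as a simple majority-vote leaf) would look roughly like this:

```
                 Income?
               /    |    \
            High  Medium  Low
             |      |      |
            Yes    Age?   No (2 of 3)
                 /  |  \
             Young Middle Old
               |    |     |
              No   No    Yes
```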
Interpretation
Decision Trees mimic human decision-making:
First check Income (the best discriminator).
If income is ambiguous (Medium), check Age.
Make the final classification.
Prediction Example
Person:
Age = Middle
Income = Medium
Education = Graduate
Tree Path: Income = Medium → check Age → Middle → Vote = No (in the table, the only Medium-income, Middle-aged person is person 5, who did not vote)
EDA
Visualizations
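The EDA and plotting cells themselves are not shown here; a minimal sketch of the kind of checks that fit this step (building the table as a pandas DataFrame and plotting with seaborn, both assumptions on my part) might be:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The 10-person voting table from the example above.
df = pd.DataFrame({
    "Age":       ["Young", "Young", "Middle", "Old", "Middle",
                  "Old", "Young", "Old", "Middle", "Young"],
    "Income":    ["High", "Medium", "Low", "High", "Medium",
                  "Low", "Low", "Medium", "High", "Medium"],
    "Education": ["Graduate", "Graduate", "High School", "Graduate", "Graduate",
                  "High School", "High School", "Graduate", "Graduate", "High School"],
    "Will Vote": ["Yes", "No", "Yes", "Yes", "No",
                  "No", "No", "Yes", "Yes", "No"],
})

df.info()
print(df["Will Vote"].value_counts())   # class balance: 5 Yes vs 5 No

# How each feature relates to the target.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, col in zip(axes, ["Age", "Income", "Education"]):
    sns.countplot(data=df, x=col, hue="Will Vote", ax=ax)
plt.tight_layout()
plt.show()
```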
Encode Categorical Variables
Decision Tree algorithms can handle categorical splits conceptually, but scikit-learn's implementation requires numeric input, so each categorical column is encoded for consistency.
Mapping example (alphabetical order):
Age: Middle → 0, Old → 1, Young → 2
Will Vote: No → 0, Yes → 1
Income and Education are encoded the same way, in alphabetical order.
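One common way to get exactly this alphabetical mapping is scikit-learn's LabelEncoder; this is an assumption on my part (the notebook could equally use a manual dictionary), and the sketch reuses the df built in the EDA sketch above:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in alphabetical order, e.g.
# Age: Middle -> 0, Old -> 1, Young -> 2 and Will Vote: No -> 0, Yes -> 1.
encoders = {}
df_enc = df.copy()
for col in ["Age", "Income", "Education", "Will Vote"]:
    encoders[col] = LabelEncoder()
    df_enc[col] = encoders[col].fit_transform(df[col])

print(encoders["Age"].classes_)   # ['Middle' 'Old' 'Young']
print(df_enc.head())
```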
Train/Test Split
max_depth: A parameter that controls the maximum depth of the decision tree (the number of edges from the root node to the deepest leaf).
min_samples_split: A parameter that controls the minimum number of samples required to split an internal node. (Both appear in the training sketch below.)
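A sketch of the split and training step under those settings; the 70/30 split, random_state, and max_depth=3 are assumptions on my part (though the lab assignment below does ask you to change max_depth from 3 to 5):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

X = df_enc[["Age", "Income", "Education"]]
y = df_enc["Will Vote"]

# With only 10 rows the split is illustrative rather than statistically meaningful.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = DecisionTreeClassifier(
    criterion="entropy",     # split on information gain, matching the manual math
    max_depth=3,             # maximum number of edges from root to leaf
    min_samples_split=2,     # minimum samples needed to split an internal node
    random_state=42,
)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Draw the fitted tree.
plot_tree(model, feature_names=list(X.columns), class_names=["No", "Yes"], filled=True)
plt.show()
```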
Manual Mathematics (Entropy + Information Gain)
Entropy After Splitting on "Income"
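Working it out by hand on the table above, in the same way as the Age split:
High income (3 samples, all Yes): [ H_{High} = 0 ]
Medium income (4 samples: 1 Yes, 3 No): [ H_{Medium} = -\frac{1}{4}\log_2\left(\frac{1}{4}\right) - \frac{3}{4}\log_2\left(\frac{3}{4}\right) = 0.81 ]
Low income (3 samples: 1 Yes, 2 No): [ H_{Low} = 0.92 ]
[ H_{split} = \frac{3}{10}(0) + \frac{4}{10}(0.81) + \frac{3}{10}(0.92) = 0.60 ]
[ IG(Income) = 1 - 0.60 = 0.40 ]
This is the largest gain of the three features, which is why Income sits at the root of the tree.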
Predict a New Person
Example:
Age = Middle
Income = Medium
Education = Graduate
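A sketch of the prediction call, assuming the LabelEncoder-based encoding and the model from the sketches above:

```python
import pandas as pd

# Encode the new person with the same encoders used for training
# (Age: Middle -> 0, Income: Medium -> 2, Education: Graduate -> 0).
new_person = pd.DataFrame(
    [["Middle", "Medium", "Graduate"]], columns=["Age", "Income", "Education"]
)
for col in new_person.columns:
    new_person[col] = encoders[col].transform(new_person[col])

pred = model.predict(new_person)[0]
print("Will Vote:", "Yes" if pred == 1 else "No")
```

Note that this person has the same Age and Income as person 5 in the table, who did not vote, so a tree built with Income at the root and Age on the Medium branch is expected to answer No.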
============================
LAB ASSIGNMENTS
============================
Calculate the entropy of the "Age" feature manually.
Compute Information Gain for "Age", "Income", and "Education".
Try training the model using criterion="gini". Compare results.
Change max_depth from 3 to 5 and observe: Does accuracy increase? Does the tree overfit?
Visualize the tree again after tuning hyperparameters.
Add 10 more synthetic rows to increase dataset size.
Try predicting voting behaviour for:
Age = Young, Income = High, Education = High School
Age = Old, Income = Low, Education = Graduate
Export the trained model using joblib.
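For the last item, the usual pattern is the one below (the filename is just an example):

```python
import joblib

# Persist the fitted classifier and reload it later without retraining.
joblib.dump(model, "decision_tree_voting.joblib")
loaded_model = joblib.load("decision_tree_voting.joblib")
print(loaded_model.predict(new_person))
```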