Path: blob/main/lab7/lab7-trees_and_ensembles.ipynb
Lab 7: Ensemble Methods
This week we will look at how collections of decision trees (or other machine learning models) can be trained on the same dataset and combined to improve predictive performance. Specifically, we will look at bagging, random forests and boosting, which are all related examples of ensemble methods.
1) Bagging
A combination of models can often perform much better than the average individual model, and sometimes better than the best individual model. Ensemble methods are ways of combining multiple models together. For good performance, the models should be diverse, in order to minimise the expected error of the ensemble.
Bagging ('bootstrap aggregation') is a simple ensemble method that induces diversity by training $M$ models on different samples of the training set (drawn with replacement) and combining their predictions by taking the mean or a majority vote. In outline, the bagging algorithm is:

For $m = 1, \ldots, M$ models:
- Randomly sample $N$ data points with replacement from the training set
- Learn a decision tree (CART algorithm) on this bootstrap sample

The final prediction is found by a majority vote over the $M$ trees.
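A minimal sketch of this training procedure is shown below, assuming the MNIST training data is held in NumPy arrays X_train and y_train (the notebook's actual variable names may differ):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagging(X_train, y_train, num_models=10, sample_size=1000, seed=0):
    """Train num_models decision trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(num_models):
        # Draw sample_size indices with replacement (a bootstrap sample)
        idx = rng.choice(len(X_train), size=sample_size, replace=True)
        tree = DecisionTreeClassifier()
        tree.fit(X_train[idx], y_train[idx])
        models.append(tree)
    return models
```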
1.1) Train a bagging ensemble
Complete the code below to train a bagging ensemble by randomly sampling from the MNIST training dataset to train multiple decision trees.
1.2) Implement bagging prediction
Complete the bagging_predict function and then run the next cell to create predictions from the bagging ensemble based on majority voting of the individual models.
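A possible sketch of the majority vote, assuming models is the list of fitted trees from Section 1.1 and that the class labels can be cast to integers, is:

```python
import numpy as np

def bagging_predict(models, X_test):
    """Combine the ensemble by majority vote over the individual tree predictions."""
    # One row of predictions per model: shape (num_models, num_test_points)
    all_preds = np.stack([m.predict(X_test) for m in models]).astype(int)
    # For each test point, count the votes per class and return the most common label
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(),
                               axis=0, arr=all_preds)
```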
How does the accuracy compare to a single decision tree?
Investigate the effect of changing the sample_size and num_models hyperparameters.
2) Random Forests
With bagging, the base models (individual decision trees) tend to make similar splits on the same features, so their errors are correlated; this reduces the diversity of the ensemble and limits performance.
Random forests improve the diversity of the base models by limiting the number of features considered for each split in the decision tree. We can obtain a random forest by modifying the bagging algorithm above so that each split in each tree considers only a random subset of the features.
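In scikit-learn, this restriction is available through the max_features argument of DecisionTreeClassifier, so only the line that constructs each tree in the bagging loop needs to change; for example:

```python
from sklearn.tree import DecisionTreeClassifier

# Inside the bagging loop, restrict each split to a random subset of features,
# e.g. sqrt(total number of features), turning the bagged trees into a random forest.
tree = DecisionTreeClassifier(max_features="sqrt")
```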
2.1) Implement Random Forest Training
Copy in your code for the bagging procedure and modify it to implement a random forest. The outline code below shows you where to make the modifications.
2.2) Random Forest Prediction
Use the bagging_predict function from Section 1.2 to generate predictions for the random forest and calculate the accuracy.
How does the performance of the random forest compare to bagging and the single model? Can you improve the performance by changing the hyperparameters?
3) Boosting
We can use a decision tree classifier as the base model for the ensemble method known as boosting. Boosting trains base models in sequence so that each new base model addresses the weaknesses of the ensemble built so far. Instead of training each new base model on a random sample, we weight the data points in the training set according to the performance of the previous base models.
AdaBoost (adaptive boosting) is a popular boosting method, where training examples that are misclassified by one of the base classifiers are given greater weight when used to train the next classifier in the sequence. Once all the classifiers have been trained, their predictions are then combined through a weighted majority voting scheme.
The AdaBoost algorithm which you will implement is given below:
Initialise the data weighting coefficients by setting $w_n^{(1)} = \frac{1}{N}$ for $n = 1, \ldots, N$, where $N$ is the number of training examples.

For $m = 1, \ldots, M$ models:
- Fit a classifier $y_m(\mathbf{x})$ to a subset of the training data by minimising the weighted error function (hint: specify the sample_weight when fitting the model using scikit-learn).
- Calculate the weighted error, $\epsilon_m = 1 - \text{weighted accuracy} = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(\mathbf{x}_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$, where $I(\cdot)$ is the indicator function that equals $1$ when the condition is true and $0$ otherwise (hint: the computation of the weighted accuracy is done for you if sample_weight is specified when calling the score function).
- Calculate the model weighting coefficient, $\alpha_m = \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right)$.
- Update the data weighting coefficients, $w_n^{(m+1)} = w_n^{(m)} \exp\left(\alpha_m I(y_m(\mathbf{x}_n) \neq t_n)\right)$, where, again, $I(\cdot)$ is the indicator function.

The final prediction is a weighted combination of the trained base classifiers, each weighted by its coefficient $\alpha_m$.
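A minimal sketch of this training loop is given below. It assumes NumPy arrays X_train and y_train (names not fixed by the notebook) and, for simplicity, fits each tree on the full training set via sample_weight rather than on a subset of size sample_size as the notebook suggests:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_adaboost(X_train, y_train, num_models=10, max_depth=2):
    """AdaBoost with decision-tree base models; returns the models and their alpha weights."""
    N = len(X_train)
    weights = np.ones(N) / N                           # w_n^(1) = 1/N
    models, alphas = [], []
    for m in range(num_models):
        tree = DecisionTreeClassifier(max_depth=max_depth)
        tree.fit(X_train, y_train, sample_weight=weights)
        # Weighted error = 1 - weighted accuracy (score handles the weighting for us)
        epsilon = 1.0 - tree.score(X_train, y_train, sample_weight=weights)
        epsilon = np.clip(epsilon, 1e-10, 1 - 1e-10)   # guard against log(0) / division by zero
        alpha = np.log((1.0 - epsilon) / epsilon)
        # Increase the weight of the points this model misclassified
        misclassified = tree.predict(X_train) != y_train
        weights = weights * np.exp(alpha * misclassified)
        weights = weights / weights.sum()              # renormalise so the weights sum to 1
        models.append(tree)
        alphas.append(alpha)
    return models, alphas
```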
For more information on boosting, see Bishop section 14.3.
3.1) Train an ensemble model using the AdaBoost algorithm
3.2) AdaBoost prediction
Complete the boosting_predict function to produce predictions from the trained models. In addition to the test data and the trained models, the function also takes the list of $\alpha_m$ as an input, which determines the weighting of each individual model on the overall output.
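One possible implementation of the weighted vote, assuming the models and alphas lists returned by the training sketch above and ten integer class labels (as for MNIST), is:

```python
import numpy as np

def boosting_predict(models, alphas, X_test, num_classes=10):
    """Weighted majority vote: each model adds alpha_m to the score of the class it predicts."""
    scores = np.zeros((len(X_test), num_classes))
    for model, alpha in zip(models, alphas):
        preds = model.predict(X_test).astype(int)
        # Add this model's weight to its predicted class for every test point
        scores[np.arange(len(X_test)), preds] += alpha
    # The class with the largest total weighted vote wins
    return scores.argmax(axis=1)
```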
How does performance compare with the other approaches?
Try out different values of sample_size, num_models and the max_depth of the decision tree.
How does training time vary for each approach as you change these ensemble parameters?
Wrap up
In this lab we implemented bagging and then extended it to the random forest and boosting methods. This should give some idea of how these three key ensemble methods are related to one another: a random forest adds random sampling over features, while boosting re-weights the dataset at each iteration to focus on misclassified data points.
References
COMS30035 Machine Learning lecture notes.
Bishop, C. M., Pattern Recognition and Machine Learning (Springer, 2006), Chapter 14.