Gradient Boosting Machine (GBM)
Just like Random Forest and Extra Trees, Gradient Boosting Machine is a type of ensemble tree method; the difference is that it stems from the boosting framework. The idea of boosting is to add one weak classifier to the ensemble at a time, where each newly added weak classifier is trained to improve upon the already trained ensemble. In other words, it pays more attention to examples that are misclassified or have larger errors and focuses on mitigating those errors. Boosting is a general framework that can be applied to any sort of weak learner, although Decision Trees are by far the most commonly used ones: they can easily be turned into weak learners by simply restricting their depth, and they are quite fast to train.
Suppose we are given some dataset $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and the task is to fit a model $F(x)$ to minimize the square loss. After training the model, we discover that it is good but not perfect. There are some mistakes: $F(x_1) \neq y_1$, $F(x_2) \neq y_2$, and so on. Now the question is, how can we improve this model without changing anything inside $F$?
How about we simply add an additional model $h$ (e.g. a regression tree) to the already existing $F$, so the new prediction becomes $F(x) + h(x)$. In other words, we wish to improve upon the existing model so that $F(x_1) + h(x_1) = y_1, \; F(x_2) + h(x_2) = y_2, \; \ldots, \; F(x_n) + h(x_n) = y_n$, or equivalently, we wish to find a new model $h$ such that $h(x_1) = y_1 - F(x_1), \; h(x_2) = y_2 - F(x_2), \; \ldots, \; h(x_n) = y_n - F(x_n)$. The idea is all well and good, but the bad news is that probably no model $h$ (e.g. regression tree) will be able to do this perfectly. Fortunately, the good news is that some $h$ might be able to do this approximately.
The idea is that we fit the model $h$ to the data $(x_1, y_1 - F(x_1)), (x_2, y_2 - F(x_2)), \ldots, (x_n, y_n - F(x_n))$, i.e. using $y_i - F(x_i)$ as the response variable. The intuition for this is: the $y_i - F(x_i)$'s are the residuals. These are the areas where the existing model $F$ does not do well, so now the role of $h$ is to compensate for the shortcomings of the existing model $F$. And if the model after adding the new model, $F + h$, is still unsatisfactory, we will just add another new one.
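As a quick illustration of this residual-fitting idea, here is a minimal sketch (on toy data we make up here, with variable names of our choosing) that fits a second tree to the residuals of a first tree and checks that adding the two improves the fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# toy regression data: y = x^2 plus gaussian noise
rng = np.random.RandomState(42)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=X.shape[0])

# first (weak) model F: a shallow regression tree
F = DecisionTreeRegressor(max_depth=2).fit(X, y)

# second model h: fit to the residuals y - F(x)
residual = y - F.predict(X)
h = DecisionTreeRegressor(max_depth=2).fit(X, residual)

# the combined model F(x) + h(x) fits the training data better than F alone
print(mean_squared_error(y, F.predict(X)))                 # error of F
print(mean_squared_error(y, F.predict(X) + h.predict(X)))  # error of F + h
```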
To make sure we're actually learning the residuals, we'll employ the idea of gradient descent. Say our goal is to minimize $J$, an overall loss function additively calculated from all observations with regard to $F$, a classifier with some parameters. More formally, we're given the formula:

$$J = \sum_i L\big(y_i, F(x_i)\big)$$
Where:
$L\big(y_i, F(x_i)\big)$ is a cost/loss function comparing the response variable's value $y_i$ and the prediction of the model $F(x_i)$ for each observation
Instead of trying to solve it directly, gradient descent is an iterative technique that allows us to approach the solution of an optimization problem. At each step of the algorithm, it will perform the following update:

$$F_{t+1}(x_i) = F_t(x_i) - \eta \, \frac{\partial L\big(y_i, F_t(x_i)\big)}{\partial F_t(x_i)}$$
Where:
$F_t$ is the version of the classifier at step/iteration $t$
$\eta$ is the learning rate, which controls the step size of the learning process
$\partial L\big(y_i, F_t(x_i)\big) / \partial F_t(x_i)$ is the gradient, i.e. the first order partial derivative of the cost function with respect to the classifier's prediction
The formula above actually refers to stochastic gradient descent, as we are only computing the gradient for a single observation, $x_i$.
For example, say we're given the sum of squared errors, a well-known quality indicator for regression models, as our loss function. So now our loss function is defined as $L\big(y_i, F(x_i)\big) = \frac{1}{2}\big(y_i - F(x_i)\big)^2$ (the 1/2 is simply to make the notation cleaner later). Taking the gradient of this loss function we get:

$$\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} = F(x_i) - y_i$$
Tying this back to our original problem, we wish to update our function $F$ at iteration $t$ with a new model $h$:

$$F_{t+1}(x_i) = F_t(x_i) + h(x_i) = F_t(x_i) + \big(y_i - F_t(x_i)\big)$$
As we can see, the formula above is 99% the same as the gradient descent formula, $F_{t+1}(x_i) = F_t(x_i) - \eta \big(F_t(x_i) - y_i\big)$. The only difference is that the learning rate $\eta$ is 1. Thus, we now have an iterative process for constructing an additive model that minimizes our loss function by fitting the residuals.
In practice though, Gradient Boosting Machine is more prone to overfitting, since the weak learner is tasked with optimally fitting the gradient. This means that boosting will select the optimal learner at each stage of the algorithm; although this strategy generates an optimal solution at the current stage, it has the drawbacks of not finding the optimal global model as well as overfitting the training data. A remedy for this greediness is to constrain the learning process by setting the learning rate $\eta$ (also known as shrinkage). In the above algorithm, instead of directly adding a tree's predicted value for a sample to the next iteration's predicted value, we scale it by $\eta$, so that only a fraction of the current predicted value is added to the previous iteration's predicted value. This parameter can take values between 0 and 1 and becomes another tuning parameter for the model. Small values of the learning rate, such as 0.1, tend to work better, but the value of the parameter is inversely proportional to the computation time required to find an optimal model, because more iterations are required.
To sum it all up, the process of training a GBM for regression is:
Initialize a predicted value for each observation (e.g. the original response, or the average response, or a value that minimizes the loss function). This gives our initial "residuals", $r_i = y_i - F_0(x_i)$. They can be called residuals because we're dealing with a regression task, but this quantity is more often referred to as the negative gradient; that terminology makes this part generalize to any loss function we might wish to employ. In short, GBM fits each new tree to the negative gradient of the loss function
For step $t = 1$ to $T$ (the number of iterations that we specify) do:
Fit a regression tree $h_t$ to the training data $(x_i, r_i)$, where we use the residuals $r_i$ as the response variable
Update the model by adding a shrunken version of the newly fitted regression tree, $F_t(x) = F_{t-1}(x) + \eta \, h_t(x)$. Translating this to code, it means we append the new tree to the array of trees we've already stored (see the implementation sketch below)
Update each observation's residual by subtracting the shrunken predicted value from it, $r_i \leftarrow r_i - \eta \, h_t(x_i)$, which is equivalent to recomputing $r_i = y_i - F_t(x_i)$
In the end, our final output boosted model becomes $F_T(x) = F_0(x) + \sum_{t=1}^{T} \eta \, h_t(x)$, where we sum the values that each individual tree gives (times the learning rate)
To hit the notion home, let's consider an example using made-up numbers. Suppose we have 5 observations, with responses 10, 20, 30, 40, 50. The first tree is built and gives predictions of 12, 18, 27, 39, 54 (these predictions are made-up numbers). If our learning rate is $\eta = 0.1$, all trees will have their predictions scaled down by $\eta$, so the first tree will instead "predict" 1.2, 1.8, 2.7, 3.9, 5.4. The response variable passed to the next tree will then have values 8.8, 18.2, 27.3, 36.1, 44.6 (the difference between the true response and the scaled-down prediction). The second round then uses these response values to build another tree, and again the predictions are scaled down by the learning rate $\eta$. So tree 2 predicts, say, 7, 18, 25, 40, 40, which, once scaled, become 0.7, 1.8, 2.5, 4.0, 4.0. As before, the third tree will be passed the difference between the previous tree's response variable and these scaled values (so 8.1, 16.4, 24.8, 32.1, 40.6). We keep iterating this process until we have finished training all the trees (a number that we specify); in the end, the sum of the (scaled) predictions from all trees gives the final prediction.
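A quick sanity check of the arithmetic above (a minimal sketch; the numbers are simply the made-up ones from the example):

```python
import numpy as np

y = np.array([10, 20, 30, 40, 50], dtype=float)
learning_rate = 0.1

# tree 1's raw predictions (made-up numbers from the example)
tree1 = np.array([12, 18, 27, 39, 54], dtype=float)
residual1 = y - learning_rate * tree1
print(residual1)  # ~[ 8.8 18.2 27.3 36.1 44.6]

# tree 2 is fit on residual1; its raw predictions (again made up)
tree2 = np.array([7, 18, 25, 40, 40], dtype=float)
residual2 = residual1 - learning_rate * tree2
print(residual2)  # ~[ 8.1 16.4 24.8 32.1 40.6]
```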
Implementation
Here, we will use the Wine Quality Data Set to test our implementation. This link should download the .csv file. The task is to predict the quality of the wine (a scale of 1 ~ 10) given some of its features.
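A minimal from-scratch sketch of the regression algorithm described above. It assumes the downloaded file is named `winequality-white.csv`, is semicolon-separated and contains a `quality` column (adjust the file name and separator to match your download); the class name `GBMReg` is ours:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


class GBMReg:
    """Bare-bones gradient boosting for regression with squared loss."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        # initialize the prediction with the average response
        self.F0 = y.mean()
        F = np.full(y.shape, self.F0)
        self.trees = []
        for _ in range(self.n_estimators):
            residual = y - F  # negative gradient of the squared loss
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residual)
            # add a shrunken version of the new tree to the ensemble
            F += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        F = np.full(X.shape[0], self.F0)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        return F


# assumed file name/format; the UCI white wine file is semicolon-separated
wine = pd.read_csv('winequality-white.csv', sep=';')
X = wine.drop('quality', axis=1).values
y = wine['quality'].values.astype(float)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

gbm = GBMReg(n_estimators=150, learning_rate=0.1, max_depth=3).fit(X_train, y_train)
print('test MSE:', mean_squared_error(y_test, gbm.predict(X_test)))
```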
Clearly, Gradient Boosting has some similarities to Random Forests and Extra Trees: the final prediction is based on an ensemble of models, and trees are used as the base learner, so all the tuning parameters of the tree model also control the variability of Gradient Boosting. And for interpretability, we can also access the feature importance attribute.
But the way the ensembles are constructed differs substantially between these models. In Random Forests and Extra Trees, all trees are created independently and each tree contributes equally to the final model. The trees in Gradient Boosting, however, depend on past trees and contribute unequally to the final model. Despite these differences, Random Forests, Extra Trees and Gradient Boosting all offer competitive predictive performance (Gradient Boosting often wins when carefully tuned). As for computation time, Gradient Boosting's is often greater than that of Random Forests and Extra Trees, since the latter two can easily be parallelized given that their individual trees are created independently.
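For instance, with scikit-learn's GradientBoostingRegressor the feature importance attribute mentioned above can be inspected as follows (a sketch, reusing the same assumed wine-quality file as before):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# same assumed file name/format as in the sketch above
wine = pd.read_csv('winequality-white.csv', sep=';')
X = wine.drop('quality', axis=1)
y = wine['quality']

sk_gbm = GradientBoostingRegressor(n_estimators=150, learning_rate=0.1, max_depth=3)
sk_gbm.fit(X, y)

# one importance score per input feature; the scores sum to 1
for name, importance in zip(X.columns, sk_gbm.feature_importances_):
    print(name, round(importance, 3))
```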
Classification
Gradient Boosting Machine can also be extended to handle classification tasks. As we'll soon see, even in the classification context, the underlying algorithm is still a regression tree. To adapt the algorithm to classification, we start by defining a new loss function, cross entropy (also known as multinomial deviance), denoted as:

$$L = -\sum_{k=1}^{K} y_k \log p_k$$
The notation above says:
We have a total of $K$ output classes (levels of the categorical response variable), indexed $k = 1, \ldots, K$
$y_k$ is a dummy indicator of the response variable that takes the value 1 if the observation belongs to class $k$ and 0 otherwise
$p_k$ is the predicted probability of the observation belonging to class $k$
So the next question is: how do we get $p_k$?
Softmax
The softmax function takes a $K$-dimensional vector of arbitrary real values $o = (o_1, \ldots, o_K)$ and produces another $K$-dimensional vector with real values in the range (0, 1) that add up to 1. The function's formula can be written as:

$$p_k = \frac{e^{o_k}}{\sum_{j=1}^{K} e^{o_j}}$$
For example, in the following code chunk, we see how the softmax function transforms the 3-element vector (1.0, 2.0, 3.0) into probabilities that sum up to 1, while still preserving the relative size of the original elements.
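A minimal sketch of such a code chunk (the function name is ours):

```python
import numpy as np

def softmax(o):
    """Transform a vector of raw scores into probabilities that sum to 1."""
    exp_o = np.exp(o - np.max(o))  # subtract the max for numerical stability
    return exp_o / exp_o.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))        # ~[0.09  0.245 0.665]
print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # 1.0
```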
Next, we wish to compute the derivative of this function with respect to its input, so we can use it later when computing the derivative of the loss function. To be explicit, we wish to find:

$$\frac{\partial p_k}{\partial o_j}$$
For any arbitrary output $p_k$ and input $o_j$. To do so, we'll be using the quotient rule of derivatives. The rule tells us that for a function $f(x) = \frac{g(x)}{h(x)}$:

$$f'(x) = \frac{g'(x) \, h(x) - g(x) \, h'(x)}{h(x)^2}$$
In our case, we have $g = e^{o_k}$ and $h = \sum_{j=1}^{K} e^{o_j}$.
It's important to notice that no matter which $o_j$ we compute the derivative of $h$ with respect to, the output will always be $e^{o_j}$. However, this is not the case for $g$: its derivative will be $e^{o_j}$ only if $j = k$, because only then does it contain the term $e^{o_j}$. Otherwise, the derivative is simply 0 (because it's simply taking the derivative of a constant).
So going back to using our quotient rule, we start with the $j = k$ case. In the following derivation we'll use the $\Sigma$ (Sigma) sign to represent $\sum_{j=1}^{K} e^{o_j}$, simply to prevent cluttering up the notation.

$$\frac{\partial p_k}{\partial o_j}
= \frac{e^{o_k} \Sigma - e^{o_k} e^{o_j}}{\Sigma^2}
= \frac{e^{o_k}}{\Sigma} \cdot \frac{\Sigma - e^{o_j}}{\Sigma}
= p_k (1 - p_j)
= p_k (1 - p_k)$$
The reason we can perform the operation in the last line is because we're considering the scenario where $j = k$. Similarly, we can do the case where $j \neq k$ (here the derivative of $g = e^{o_k}$ with respect to $o_j$ is 0):

$$\frac{\partial p_k}{\partial o_j}
= \frac{0 \cdot \Sigma - e^{o_k} e^{o_j}}{\Sigma^2}
= -\frac{e^{o_k}}{\Sigma} \cdot \frac{e^{o_j}}{\Sigma}
= -p_k \, p_j$$
Just to sum it up, we now have:

$$\frac{\partial p_k}{\partial o_j} =
\begin{cases}
p_k (1 - p_k) & \text{if } j = k \\
-p_k \, p_j & \text{if } j \neq k
\end{cases}$$
Now, we can tie this back to the original loss function and compute its negative gradient with respect to the raw score $o_k$:

$$-\frac{\partial L}{\partial o_k}
= \frac{\partial}{\partial o_k} \sum_{j=1}^{K} y_j \log p_j
= \sum_{j=1}^{K} \frac{y_j}{p_j} \frac{\partial p_j}{\partial o_k}
= \frac{y_k}{p_k} \, p_k (1 - p_k) - \sum_{j \neq k} \frac{y_j}{p_j} \, p_j p_k
= y_k - p_k \sum_{j=1}^{K} y_j
= y_k - p_k$$
Remember that $\sum_{j=1}^{K} y_j = 1$, as $y$ is a dummy (one-hot) vector with only one non-zero element, namely the one indicating the class the observation belongs to.
After a long journey, we now see that for every class $k$, the negative gradient is the difference between the associated dummy variable $y_k$ and the predicted probability $p_k$ of belonging to that class. This is essentially the "residual" for classification gradient boosting. Given this, we can now implement the algorithm. The overall process of training a regression tree still has not changed; only now we must deal with the dummy variables and fit a regression tree on the negative gradient for each dummy variable.
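As a sanity check on the derivation, the following sketch compares the analytic gradient $\partial L / \partial o_k = p_k - y_k$ against a numerical finite-difference gradient on a single made-up observation:

```python
import numpy as np

def softmax(o):
    exp_o = np.exp(o - np.max(o))
    return exp_o / exp_o.sum()

def cross_entropy(o, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(o)))

# raw scores and a one-hot response for a single observation (made-up numbers)
o = np.array([0.5, -1.2, 2.0])
y_onehot = np.array([0.0, 1.0, 0.0])

# analytic gradient from the derivation: dL/do_k = p_k - y_k
analytic = softmax(o) - y_onehot

# numerical gradient via central differences
eps = 1e-6
numerical = np.array([
    (cross_entropy(o + eps * np.eye(3)[k], y_onehot)
     - cross_entropy(o - eps * np.eye(3)[k], y_onehot)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(analytic, numerical))  # True
```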
Implementation
For the dataset, we'll still use the Wine Quality Data Set that was used for the regression task, except we now treat the quality of the wine (a scale of 1 ~ 10) as categorical instead of numeric.
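A bare-bones sketch of the multi-class version, following the derivation above (one regression tree per class per iteration, fit on the per-class residual $y_k - p_k$). The class name `GBMClf` and the file assumptions are ours, and leaf values are updated with a simple shrunken step rather than the exact Newton update used by some libraries:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def softmax(scores):
    # row-wise softmax over the raw scores, one column per class
    exp_s = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp_s / exp_s.sum(axis=1, keepdims=True)


class GBMClf:
    """Bare-bones gradient boosting for classification with cross entropy loss."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # one-hot encode the response: y_onehot[i, k] = 1 if observation i is class k
        y_onehot = (y[:, None] == self.classes_[None, :]).astype(float)
        scores = np.zeros_like(y_onehot)  # raw scores o, initialized at 0
        self.trees = []
        for _ in range(self.n_estimators):
            proba = softmax(scores)
            residual = y_onehot - proba  # negative gradient y_k - p_k
            stage = []
            for k in range(len(self.classes_)):
                tree = DecisionTreeRegressor(max_depth=self.max_depth)
                tree.fit(X, residual[:, k])
                scores[:, k] += self.learning_rate * tree.predict(X)
                stage.append(tree)
            self.trees.append(stage)
        return self

    def predict(self, X):
        scores = np.zeros((X.shape[0], len(self.classes_)))
        for stage in self.trees:
            for k, tree in enumerate(stage):
                scores[:, k] += self.learning_rate * tree.predict(X)
        return self.classes_[softmax(scores).argmax(axis=1)]


# same assumed file name/format as in the regression section
wine = pd.read_csv('winequality-white.csv', sep=';')
X = wine.drop('quality', axis=1).values
y = wine['quality'].values  # treated as categorical labels this time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

clf = GBMClf(n_estimators=50, learning_rate=0.1, max_depth=3).fit(X_train, y_train)
print('test accuracy:', accuracy_score(y_test, clf.predict(X_test)))
```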
Understanding Model Complexity
In the following section, we generate a sinusoid function plus random Gaussian noise, with 80 training samples (blue points) and 20 test samples (red points).
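A sketch of how such data could be generated (the exact form of the sinusoid, the noise level and the random seeds are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)

def ground_truth(x):
    """An assumed sinusoid ground-truth function."""
    return x * np.sin(x) + np.sin(2 * x)

# 100 points in total: 80 for training (blue), 20 for testing (red)
x = np.sort(rng.uniform(0, 10, size=100))
y = ground_truth(x) + 0.75 * rng.normal(size=100)
X_train, X_test, y_train, y_test = train_test_split(
    x[:, np.newaxis], y, test_size=20, random_state=3)
```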
Recall that in a single regression tree, we can use the max_depth parameter to control how deep to grow the tree; the deeper the tree, the more variance it can explain.
The plot above shows that the decision boundaries made by decision trees are always perpendicular to the $x$ and $y$ axes (due to the fact that they consist of nested if-else statements). Let's see what happens when we use gradient boosting without tuning the parameters (by specifying a fixed max_depth).
Hopefully, it should be clear that compared with decision trees, gradient boosting machines are far more susceptible to overfitting the training data; hence it is common to tune parameters including max_depth, max_features, min_samples_leaf and subsample (explained below) to prevent overfitting from occurring.
The parameter subsample (the technique is called stochastic gradient boosting) borrows some ideas from bagging. What it does is: while iterating through each individual tree-building process, it randomly selects a fraction of the training data. The residuals and models in the remaining steps of the current iteration are then based only on that sample of data. It turns out this simple modification improves the predictive accuracy of boosting while also reducing the required computational resources (of course, this assumes you have enough observations to subsample).
The following section tunes the commonly tuned parameters, finds the best combination and draws the resulting decision boundary. The resulting plot should be self-explanatory.
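A sketch of what that tuning could look like, using scikit-learn's GradientBoostingRegressor and GridSearchCV on the sinusoid data generated above (the grid values below are assumptions, not the tuned values reported here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# regenerate the sinusoid data from the earlier sketch
rng = np.random.RandomState(1)
x = np.sort(rng.uniform(0, 10, size=100))
y = x * np.sin(x) + np.sin(2 * x) + 0.75 * rng.normal(size=100)
X_train, X_test, y_train, y_test = train_test_split(
    x[:, np.newaxis], y, test_size=20, random_state=3)

# grid over the commonly tuned parameters discussed above
param_grid = {
    'max_depth': [1, 2, 3],
    'min_samples_leaf': [1, 3, 5],
    'subsample': [1.0, 0.8, 0.5],
}
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=1)
search = GridSearchCV(gbm, param_grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)

print('best parameters:', search.best_params_)
print('test R^2:', search.best_estimator_.score(X_test, y_test))
```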