XGBoost API Walkthrough
Both xgboost (extreme gradient boosting) and gbm follow the principle of gradient boosting. The name xgboost, though, actually refers to the engineering goal of pushing the limit of computational resources for boosted tree algorithms, which is the reason many people use it. As for the model itself, a more suitable name might be regularized gradient boosting, as it uses a more regularized model formalization to control overfitting.
Preparation
In this toy example, we will be dealing with a binary classification task. We start off by generating a 20-dimensional artificial dataset with 1000 samples, where 8 features hold information, 3 are redundant and 2 are repeated, and then perform a train/test split. The testing data will be useful for validating the performance of our algorithms.
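Here's a minimal sketch of that setup, assuming scikit-learn's make_classification and train_test_split (the seed and test size are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# generate the artificial dataset described above: 1000 samples,
# 20 features, of which 8 are informative, 3 redundant and 2 repeated
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=8,
    n_redundant=3, n_repeated=2, random_state=1234)

# hold out part of the data for validating the models later on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1234, stratify=y)
```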
We can use a decision tree classifier to establish our baseline and see if a more complex model is capable of beating it.
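For instance, assuming the split from above (the max_depth here is just an illustrative choice):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# a single decision tree serves as the baseline, evaluated with the auc score
tree = DecisionTreeClassifier(max_depth=3, random_state=1234)
tree.fit(X_train, y_train)
tree_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
print('decision tree auc:', tree_auc)
```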
XGBoost Basics
We start by training an xgboost model using a fixed set of parameters. For further details on the parameters (using the scikit-learn like API), refer to the XGBoost Documentation: Python API documentation.
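A sketch of what this could look like; the parameter values below are illustrative rather than tuned, and depending on the xgboost version, eval_metric may need to be passed to fit instead of the constructor:

```python
from xgboost import XGBClassifier

# a fixed, untuned set of parameters for the first model
xgb_model = XGBClassifier(
    n_estimators=150, max_depth=3,
    learning_rate=0.1, eval_metric='auc')

# providing eval_set records the auc on both datasets at every boosting round
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False)
```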
We can retrieve the performance of the model on the evaluation datasets and plot it to get insight into the training process. The evals_result_ dictionary stores validation_0 and validation_1 as its first-level keys, corresponding to the order in which the datasets were provided to the eval_set argument. The second-level key is the eval_metric that was provided.
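For example, assuming matplotlib, the recorded scores can be plotted along the following lines:

```python
import matplotlib.pyplot as plt

# evals_result_ maps 'validation_0'/'validation_1' (the eval_set order)
# to a dictionary keyed by the metric name, 'auc' in our case
results = xgb_model.evals_result_
rounds = range(len(results['validation_0']['auc']))

plt.plot(rounds, results['validation_0']['auc'], label='train')
plt.plot(rounds, results['validation_1']['auc'], label='test')
plt.xlabel('boosting round')
plt.ylabel('auc')
plt.legend()
plt.show()
```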
From reviewing our plot, it looks like there is an opportunity to stop the training process at an earlier stage, since the test dataset's auc score stops increasing around 80 estimators. Luckily, xgboost supports this functionality.
Early stopping works by monitoring the performance of the model being trained on a separate validation dataset, and stopping the training procedure once performance on the validation dataset has not improved after a fixed number of training iterations (we can specify this number). This can potentially save us a lot of time by not training a model that no longer improves.
The evaluation measure may be the loss function that is being optimized to train our model (such as logarithmic loss), or an external metric of interest to the problem in general (such as the auc score that we've used above). The full list of performance measures that we can directly specify can be found in the eval_metric section of the XGBoost Doc: Learning Task Parameters.
In addition to specifying an evaluation metric and dataset, to use early stopping we also need to specify the rounds in our callback function. This essentially tells our model to stop the training process if the evaluation dataset's evaluation metric does not improve over this many rounds. Note that if multiple evaluation datasets or multiple evaluation metrics are provided in a list, then early stopping will use the last one in the list.
For example, we can check for no improvement in auc over 5 rounds as follows:
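A sketch using the EarlyStopping callback (in some xgboost versions, callbacks and eval_metric are passed to fit rather than the constructor):

```python
import xgboost as xgb

# stop if the last eval_set entry's auc hasn't improved for 5 rounds;
# save_best=True keeps the model at the best iteration
early_stop = xgb.callback.EarlyStopping(rounds=5, save_best=True)

xgb_model_es = XGBClassifier(
    n_estimators=150, max_depth=3, learning_rate=0.1,
    eval_metric='auc', callbacks=[early_stop])
xgb_model_es.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False)

print('best iteration:', xgb_model_es.best_iteration)
print('xgboost auc:', roc_auc_score(y_test, xgb_model_es.predict_proba(X_test)[:, 1]))
```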
We can see from the result below that this is already better than our original decision tree model.
Side note: apart from using the built-in evaluation metrics, we can also define one ourselves. The evaluation metric should be a function that takes two arguments, y_pred and y_true (they don't have to be named like this). It is assumed that y_true will be a DMatrix object, so we can call its get_label method to access the true labels. As for the return value, the function must return a (str, value) pair, where str is a name for the evaluation metric and value is the value of the evaluation. This evaluation score is always minimized.
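As a sketch, a hand-rolled auc metric could look like the following (the name custom_auc is ours, and we return 1 - auc so that a better model yields a smaller value):

```python
from sklearn.metrics import roc_auc_score

def custom_auc(y_pred, dtrain):
    """Custom evaluation metric: the true labels live in the DMatrix."""
    y_true = dtrain.get_label()
    # the score is minimized, hence 1 - auc instead of the auc itself
    return 'custom_auc', 1.0 - roc_auc_score(y_true, y_pred)
```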
Another example: writing a customized rsquared evaluation metric.
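A possible implementation; again, returning 1 - r² is our choice so that the minimization convention is respected:

```python
import numpy as np

def rsquared(y_pred, dtrain):
    """Custom r-squared evaluation metric for regression tasks."""
    y_true = dtrain.get_label()
    ss_residual = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_residual / ss_total
    # report 1 - r^2 so that a better fit gives a smaller value
    return 'rsquared', 1.0 - r2
```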
Hyperparameter Tuning (Random Search)
Next, since overfitting is a common problem with sophisticated algorithms like gradient boosting, we'll introduce ways to tune the model's hyperparameters to deal with it. If an xgboost model is too complex, we can try:
- Reduce max_depth, the depth of each tree.
- Increase min_child_weight, the minimum sum of observation weights needed in a child (think of it as the number of observations needed in a tree's node).
- Increase gamma, the minimum loss reduction required to make a further partition.
- Increase the regularization parameters, reg_lambda (l2 regularization) and reg_alpha (l1 regularization).
- Add more randomness through the subsample (the fraction of observations to be randomly sampled for fitting each tree) and colsample_bytree (the fraction of columns to be randomly sampled for fitting each tree) parameters.
We'll use a Random Search to tune the model's hyperparameters.
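A sketch with scikit-learn's RandomizedSearchCV over the parameters listed above; the search space, n_iter and cv values are arbitrary illustrative choices:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# randomized search space covering the complexity-related parameters above
param_distributions = {
    'max_depth': randint(2, 8),
    'min_child_weight': randint(1, 10),
    'gamma': uniform(0, 1),
    'reg_lambda': uniform(0, 2),
    'reg_alpha': uniform(0, 2),
    'subsample': uniform(0.5, 0.5),        # samples from [0.5, 1.0]
    'colsample_bytree': uniform(0.5, 0.5)}

search = RandomizedSearchCV(
    XGBClassifier(n_estimators=150, learning_rate=0.1),
    param_distributions=param_distributions,
    n_iter=20, scoring='roc_auc', cv=3, random_state=1234)
search.fit(X_train, y_train)

print('best parameters:', search.best_params_)
print('best cross-validated auc:', search.best_score_)
```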