Path: blob/master/12_gradient_boosting_machines/01_boosting_baseline.ipynb
Adaptive and Gradient Boosting
In this notebook, we demonstrate the use of AdaBoost and gradient boosting, including several state-of-the-art implementations of this very powerful and flexible algorithm that greatly speed up training.
We use the stock return dataset with a few engineered factors created in Chapter 4 on Alpha Factor Research in the notebook feature_engineering.
Update
This notebook now uses sklearn.ensemble.HistGradientBoostingClassifier.
Imports and Settings
Prepare Data
Get source
We use the engineered_features dataset created in Chapter 4, Alpha Factor Research.
Set data store location:
Factorize Categories
Define columns with categorical data:
Integer-encode categorical columns:
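A minimal sketch of the integer-encoding step using pandas; the DataFrame and column names here are hypothetical stand-ins for the engineered-feature data:

```python
import pandas as pd

# Toy frame standing in for the engineered-feature DataFrame; column names are hypothetical.
data = pd.DataFrame({'sector': ['tech', 'energy', 'tech'],
                     'month': [1, 2, 3],
                     'target': [1, 0, 1]})

cat_cols = ['sector', 'month']                      # columns treated as categorical
for col in cat_cols:
    # factorize maps each distinct value to an integer code 0..n-1
    data[col] = pd.factorize(data[col], sort=True)[0]
```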
One-Hot Encoding
Create dummy variables from categorical columns if needed:
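A short sketch of the one-hot step with pd.get_dummies, again using a hypothetical toy frame:

```python
import pandas as pd

df = pd.DataFrame({'sector': ['tech', 'energy', 'tech'],
                   'ret_fwd': [0.02, -0.01, 0.03]})
# Expand the categorical column into 0/1 indicator columns
dummies = pd.get_dummies(df, columns=['sector'], prefix='sector')
```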
Get Holdout Set
Create holdout test set to estimate generalization error after cross-validation:
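One way to carve out such a holdout set is to reserve the most recent periods; the following sketch assumes a panel indexed by (symbol, date) and a one-year holdout window, both of which are illustrative choices:

```python
import numpy as np
import pandas as pd

# Toy panel indexed by (symbol, date), mirroring the shape of the Chapter 4 data.
idx = pd.MultiIndex.from_product(
    [['AAPL', 'MSFT'], pd.date_range('2016-01-01', periods=36, freq='MS')],
    names=['symbol', 'date'])
data = pd.DataFrame({'ret_fwd': np.random.randn(len(idx))}, index=idx)

dates = data.index.get_level_values('date')
cutoff = dates.max() - pd.DateOffset(years=1)        # reserve the final 12 months
train, holdout = data.loc[dates <= cutoff], data.loc[dates > cutoff]
```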
Load Data
The algorithms in this chapter use the dataset generated in Chapter 4 on Alpha Factor Research in the notebook feature_engineering, which needs to be run first.
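A minimal loading sketch; the store path and HDF key below are assumptions, so point both at wherever the Chapter 4 notebook saved its output:

```python
from pathlib import Path
import pandas as pd

# Hypothetical path and key for the Chapter 4 data store.
DATA_STORE = Path('..', 'data', 'assets.h5')
data = pd.read_hdf(DATA_STORE, 'engineered_features')
```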
Cross-Validation Setup
Custom Time Series KFold Generator
The following custom Time Series KFold generator produces splits that respect the temporal order of the data.
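A simplified sketch of such a generator, assuming the feature matrix carries a (symbol, date) MultiIndex; the notebook's own implementation may differ in its details:

```python
import numpy as np

class RollingTimeSeriesSplit:
    """Sketch of a rolling splitter: each fold trains on all observations before
    a given period and tests on that single period."""

    def __init__(self, n_splits=12):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        dates = X.index.get_level_values('date')           # assumes a (symbol, date) index
        test_dates = dates.unique().sort_values()[-self.n_splits:]
        for test_date in test_dates:
            train_idx = np.flatnonzero(dates < test_date)   # all earlier observations
            test_idx = np.flatnonzero(dates == test_date)   # the next period
            yield train_idx, test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
```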
CV Metrics
Define some metrics for use with cross-validation:
The following helper function runs cross-validation for the various algorithms.
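A hypothetical sketch of such a helper, built on sklearn's cross_validate and assuming a splitter like the one above:

```python
from sklearn.model_selection import cross_validate

def run_cv(clf, X, y, cv, scoring=('roc_auc', 'accuracy'), n_jobs=-1):
    """Hypothetical helper: run the rolling-window CV and collect train/test metrics."""
    return cross_validate(clf, X, y,
                          cv=cv,
                          scoring=list(scoring),
                          n_jobs=n_jobs,
                          return_train_score=True)
```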
CV Result Handler Functions
The following helper functions manipulate and plot the cross-validation results to produce the outputs below.
Baseline Classifier
sklearn provides the DummyClassifier that makes predictions using simple rules and is useful as a baseline to compare with the other (real) classifiers we use below.
The stratified rule generates random predictions that match the training set's class distribution.
Unsurprisingly, it produces results near the 0.5 AUC level that corresponds to arbitrary predictions:
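A minimal sketch of such a baseline, assuming we stick with the stratified strategy:

```python
from sklearn.dummy import DummyClassifier

# Baseline that draws random predictions matching the class frequencies seen in training.
dummy_clf = DummyClassifier(strategy='stratified', random_state=42)
```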
RandomForest
For comparison, we train a RandomForestClassifier as presented in Chapter 11 on Decision Trees and Random Forests.
Configure
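A possible configuration sketch; the hyperparameter values are illustrative, not the notebook's tuned settings:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings only; the notebook tunes its own hyperparameters.
rf_clf = RandomForestClassifier(n_estimators=200,
                                max_depth=10,
                                min_samples_leaf=250,   # coarse leaves help on noisy return data
                                n_jobs=-1,
                                random_state=42)
```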
Cross-validate
Plot Results
scikit-learn: AdaBoost
As part of its ensemble module, sklearn provides an AdaBoostClassifier implementation that supports two or more classes. The code examples for this section are in the notebook gbm_baseline, which compares the performance of the various algorithms with the dummy classifier baseline introduced above.
Base Estimator
We first need to define a base_estimator as a template for all ensemble members and then configure the ensemble itself. We'll use the default DecisionTreeClassifier with max_depth=1, that is, a stump with a single split. The complexity of the base_estimator is a key tuning parameter because the appropriate level depends on the nature of the data.
As demonstrated in the previous chapter, changes to max_depth should be combined with appropriate regularization constraints, for example by adjusting min_samples_split:
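A minimal sketch of the base estimator:

```python
from sklearn.tree import DecisionTreeClassifier

# A stump: one split per tree. Deeper base learners capture feature interactions
# but call for stronger regularization, e.g. a higher min_samples_split.
base_estimator = DecisionTreeClassifier(max_depth=1)
```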
AdaBoost Configuration
In the second step, we'll design the ensemble. The n_estimators parameter controls the number of weak learners and the learning_rate determines the contribution of each weak learner, as shown in the following code. By default, weak learners are decision tree stumps:
The main tuning parameters responsible for good results are n_estimators and the complexity of the base estimator, because the depth of the tree controls the extent of the interaction among the features.
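An illustrative configuration sketch; note that scikit-learn 1.2+ renames the base_estimator argument to estimator:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative values; in scikit-learn >= 1.2 pass estimator= instead of base_estimator=.
ada_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                             n_estimators=100,
                             learning_rate=1.0,
                             random_state=42)
```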
Cross-validate
We will cross-validate the AdaBoost ensemble using a custom 12-fold rolling time-series split to predict 1 month ahead for the last 12 months in the sample, using all available prior data for training, as shown in the following code:
Plot Results
scikit-learn: HistGradientBoostingClassifier
The ensemble module of sklearn contains an implementation of gradient boosting trees for regression and classification, both binary and multiclass.
Configure
The following HistGradientBoostingClassifier initialization code illustrates the key tuning parameters that we previously introduced, in addition to those that we are familiar with from looking at standalone decision tree models.
This estimator is much faster than GradientBoostingClassifier for big datasets (n_samples >= 10,000).
It has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child accordingly. If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples.
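A configuration sketch with illustrative values; the number of boosting iterations, the learning rate, and the tree size (max_leaf_nodes / max_depth) are the main levers:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative configuration, not the notebook's tuned settings.
hgb_clf = HistGradientBoostingClassifier(learning_rate=0.1,
                                         max_iter=100,
                                         max_leaf_nodes=31,
                                         max_depth=None,
                                         min_samples_leaf=20,
                                         l2_regularization=0.0,
                                         early_stopping=False,
                                         random_state=42)
```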
Cross-validate
Plot Results
Partial Dependence Plots
Drop the time periods to avoid over-reliance on the in-sample fit.
One-way and two-way partial dependence plots
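A sketch using sklearn's plotting helper; hgb_clf is assumed to be fitted, and X_train and the feature names are hypothetical placeholders for the notebook's feature matrix:

```python
from sklearn.inspection import PartialDependenceDisplay

# Feature names are hypothetical; pass columns from the fitted feature matrix.
# A tuple of two names yields a two-way (contour) partial dependence plot.
PartialDependenceDisplay.from_estimator(hgb_clf,                  # a fitted model
                                        X_train,                  # training features
                                        features=['return_12m',
                                                  ('return_12m', 'return_3m')],
                                        n_jobs=-1)
```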
Two-way partial dependence as 3D plot
XGBoost
See XGBoost docs for details on parameters and usage.
Configure
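An illustrative starting point for the scikit-learn wrapper; the values are assumptions rather than the notebook's tuned parameters:

```python
from xgboost import XGBClassifier

# Illustrative starting point; see the XGBoost docs for the full parameter list.
xgb_clf = XGBClassifier(n_estimators=100,
                        learning_rate=0.1,
                        max_depth=3,
                        subsample=0.8,            # row sampling per tree
                        colsample_bytree=0.8,     # feature sampling per tree
                        objective='binary:logistic',
                        n_jobs=-1,
                        random_state=42)
```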
Cross-validate
Plot Results
Feature Importance
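A brief sketch that plots the importance scores, assuming the fitted xgb_clf and a feature matrix X_train with named columns:

```python
import pandas as pd

# Assumes xgb_clf has been fitted on X_train (hypothetical names).
importance = (pd.Series(xgb_clf.feature_importances_, index=X_train.columns)
              .sort_values())
importance.plot.barh(title='XGBoost feature importance')
```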
LightGBM
See LightGBM docs for details on parameters and usage.
Configure
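An illustrative configuration sketch for the scikit-learn wrapper; since LightGBM grows trees leaf-wise, num_leaves is the primary complexity control rather than max_depth:

```python
from lightgbm import LGBMClassifier

# Illustrative configuration, not the notebook's tuned settings.
lgb_clf = LGBMClassifier(n_estimators=100,
                         learning_rate=0.1,
                         num_leaves=31,
                         max_depth=-1,             # no explicit depth limit
                         min_child_samples=250,    # regularize leaves on noisy data
                         colsample_bytree=0.8,     # feature sampling per tree
                         objective='binary',
                         n_jobs=-1,
                         random_state=42)
```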
Cross-Validate
Using categorical features
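One way to pass categorical information without dummy encoding is to mark the columns as pandas 'category' dtype, which LightGBM detects automatically; the column names and X_train/y_train below are assumptions:

```python
# Assumed column names: mark integer-encoded columns as pandas 'category' dtype so
# LightGBM treats them as categorical without one-hot encoding.
X_train_cat = X_train.astype({'sector': 'category', 'month': 'category'})
lgb_clf.fit(X_train_cat, y_train)
```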
Plot Results
Using dummy variables
Plot results
CatBoost
See CatBoost docs for details on parameters and usage.
CPU
Configure
Cross-Validate
CatBoost requires integer values for categorical variables.
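A configuration sketch with illustrative values; cat_features lists the positions (or names) of the integer-encoded categorical columns, and the indices, X_train, and y_train below are hypothetical:

```python
from catboost import CatBoostClassifier

# Illustrative configuration; adjust cat_features to the actual categorical columns.
cat_clf = CatBoostClassifier(iterations=100,
                             learning_rate=0.1,
                             depth=6,
                             eval_metric='AUC',
                             random_seed=42,
                             verbose=0)
cat_clf.fit(X_train, y_train, cat_features=[0, 1])   # hypothetical column positions
```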
Plot Results
GPU
Naturally, the following requires that you have a GPU.
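A sketch of the same setup trained on the GPU, assuming a single card; devices selects which GPU(s) to use:

```python
from catboost import CatBoostClassifier

# Same illustrative settings, but trained on the GPU.
cat_clf_gpu = CatBoostClassifier(iterations=100,
                                 learning_rate=0.1,
                                 depth=6,
                                 task_type='GPU',
                                 devices='0',
                                 random_seed=42,
                                 verbose=0)
```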