Gradient Boosted Tree Inferencing
Once we train our machine learning model, depending on the use case, we may wish to operationalize it by putting it behind a service for (near) real-time inferencing. We can certainly generate predictions in batch offline, store them in downstream tables or lookup services, and fetch the pre-computed predictions when needed. Although this batch prediction approach might sound easier to implement, and we wouldn't have to worry about the latency issues that come with real-time services, this paradigm does come with its limitations. e.g.
Cold start problem. For a new entity, whether it's a user visiting the website or an item being listed on a marketplace, there will be no pre-computed recommendations available.
Not having access to real-time features. Dynamic features are based on what's happening right now: what a user is watching, what people just liked. Knowing these allows us to generate more accurate or relevant predictions based on the latest information.
Potentially wasted computation/storage. If we generate predictions for every possible user each day, and only 5% of them log in to use our website, then the compute used to generate 95% of our predictions will be wasted.
Translating to a Production Language
It's very common in industry settings to prototype a machine learning model in Python and translate it into other languages such as C++ or Java when it comes to deployment. This usually happens when the core application is written in one of those languages and the application is so time sensitive that we can't afford the cost of calling an external API to fetch the model prediction.
In this section, we'll be looking at how we can achieve this with gradient boosted trees, specifically XGBoost. Different libraries might have different ways of doing this, but the concept should be similar.
Tree Structure
A typical model dump from XGBoost looks like the following:
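For example, a text dump produced by XGBoost's `Booster.dump_model` has the shape sketched below. The feature names come from the diabetes regression example used later in this notebook, but the thresholds and most leaf values here are illustrative placeholders:

```
booster[0]:
0:[bmi<0.00942] yes=1,no=2,missing=1
    1:[s5<0.00447] yes=3,no=4,missing=3
        3:[bp<0.0113] yes=7,no=8,missing=7
            7:leaf=25.84091
            8:leaf=32.65823
        4:leaf=45.01291
    2:leaf=78.92035
booster[1]:
0:[bmi<0.00942] yes=1,no=2,missing=1
    ...
```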
There are three distinct pieces of information:
booster: Gradient boosted tree is an ensemble tree method; each new booster indicates the start of a new tree. The number of boosters in the dump equals the number of trees we specified for the model (e.g. for the sklearn XGBoost API, `n_estimators` controls this) multiplied by the number of distinct classes. For a regression model or a binary classification model, the number of boosters in the model dump will be exactly equal to the number of trees we've specified. Whereas for multi-class classification, say we have 3 classes, tree 0 will contribute to the raw prediction of class 0, tree 1 to class 1, tree 2 to class 2, tree 3 to class 0, and so on.
node: Following the booster is each tree's if-else structure. e.g. for node 0, if the feature `bmi` is less than a threshold, it will branch to node 1, else it will branch to node 2.
leaf: Once we reach a leaf, we can accumulate the response prediction. e.g. node 7 is a leaf, and the prediction for this node is 25.84091.
Raw Prediction
We mentioned that to get the prediction for a given input, we sum up the response predictions from each tree's leaf node. This holds true for regression models, but for other models, we need to perform a transformation on top of the raw prediction to get to the probabilities. e.g. when building a binary classification model, a logistic transformation is needed on top of the raw prediction, whereas for multi-class classification, a softmax transformation is needed.
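As a minimal sketch of these two transformations in plain numpy (the raw scores below are made up for illustration):

```python
import numpy as np

def logistic(margin):
    """Binary classification: squash the summed leaf values into a probability."""
    return 1.0 / (1.0 + np.exp(-margin))

def softmax(margins):
    """Multi-class: one raw score per class, normalized into probabilities."""
    exps = np.exp(margins - np.max(margins))
    return exps / exps.sum()

raw_binary = -0.21 + 0.13                    # hypothetical leaf values from 2 trees
print(logistic(raw_binary))                  # probability of the positive class

raw_multiclass = np.array([1.2, -0.3, 0.4])  # hypothetical per-class raw scores
print(softmax(raw_multiclass))               # probabilities summing to 1
```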
Preparation
All the examples below, be it regression, binary classification, or multi-class classification, follow the same structure (a minimal sketch for each case is given in the sections below).
We load some pre-processed data.
Train a quick XGBoost model.
Dump the raw model to disk.
Generate a sample prediction so we can later verify whether the prediction matches the one produced by the model converted to C++.
Regression
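Following that structure, a minimal regression sketch might look like the following (using scikit-learn's diabetes dataset; the file name is arbitrary):

```python
import xgboost as xgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# 1. load some pre-processed data
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1234)

# 2. train a quick XGBoost model
model = xgb.XGBRegressor(n_estimators=20, max_depth=3)
model.fit(X_train, y_train)

# 3. dump the raw model to disk in the if-else text format shown earlier
model.get_booster().dump_model('regression_model.txt', dump_format='text')

# 4. generate a sample prediction to verify the ported implementation against
print(model.predict(X_test.iloc[:1]))
```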
Binary Classification
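For the binary case, a similar sketch (breast cancer dataset) also lets us confirm that applying the logistic transformation to the raw margin reproduces the library's probabilities:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = xgb.XGBClassifier(n_estimators=20, max_depth=3)
model.fit(X, y)
model.get_booster().dump_model('binary_model.txt', dump_format='text')

# the raw margin is the untransformed sum of leaf values across all trees
margin = model.get_booster().predict(xgb.DMatrix(X[:1]), output_margin=True)
prob = 1.0 / (1.0 + np.exp(-margin))

# should match the library's probability for the positive class
print(prob, model.predict_proba(X[:1])[:, 1])
```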
Multiclass Classification
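And for the multi-class case (iris, 3 classes), a sketch that also illustrates the tree-per-class layout described in the tree structure section:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
model = xgb.XGBClassifier(n_estimators=10, max_depth=3)
model.fit(X, y)
model.get_booster().dump_model('multiclass_model.txt', dump_format='text')

# with 3 classes, the dump contains n_estimators * 3 boosters, assigned to
# classes in round-robin fashion (tree 0 -> class 0, tree 1 -> class 1, ...)
print(len(model.get_booster().get_dump()))  # 30

# one raw score per class, softmax-ed into probabilities
margins = model.get_booster().predict(xgb.DMatrix(X[:1]), output_margin=True)
exps = np.exp(margins - margins.max(axis=1, keepdims=True))
print(exps / exps.sum(axis=1, keepdims=True), model.predict_proba(X[:1]))
```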
C++ Implementation
The rest of the content is about implementing the boosted tree inferencing logic in C++; all the code resides in the gbt_inference folder for those interested. In practice, we don't always have to rely on naive code we've implemented ourselves to solidify our understanding. e.g. the m2cgen (Model 2 Code Generator) project is one of the many projects out there that focus on converting a trained model into native code. If we export our regression model, we can see that the inferencing logic is indeed a bunch of if-else statements followed by a summation at the very end.
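As a rough sketch of how that export might be invoked (assuming m2cgen is installed and exposes the `export_to_c` entry point; check its documentation for the exporters actually available):

```python
import m2cgen
import xgboost as xgb
from sklearn.datasets import load_diabetes

# re-train the quick regression model from the earlier sketch
X, y = load_diabetes(return_X_y=True)
model = xgb.XGBRegressor(n_estimators=20, max_depth=3).fit(X, y)

# the generated source is a chain of nested if/else blocks, one per tree,
# whose results are summed at the end
c_code = m2cgen.export_to_c(model)
print(c_code[:500])
```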
ONNX
Another way of achieving this is through ONNX; quoting directly from its documentation:
ONNX Runtime provides an easy way to run machine learned models with high performance on CPU or GPU without dependencies on the training framework. Machine learning frameworks are usually optimized for batch training rather than for prediction, which is a more common scenario in applications, sites, and services
We'll walk through the process of converting our boosted tree model into the ONNX format and benchmark the inference runtime. Here, we are doing it for a classification model, but the process should be similar for regression based models.
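A sketch of the conversion step, assuming the onnxmltools converter and its `convert_xgboost` / `FloatTensorType` entry points (import paths may differ across versions):

```python
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = xgb.XGBClassifier(n_estimators=20, max_depth=3).fit(X, y)

# declare the input signature: a batch of float32 rows, one column per feature
initial_types = [('input', FloatTensorType([None, X.shape[1]]))]
onnx_model = onnxmltools.convert_xgboost(model, initial_types=initial_types)

with open('model.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())
```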
Upon porting our model to the ONNX format, we can use it for inferencing. This section uses the Python API for benchmarking.
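A minimal sketch of running the converted model with onnxruntime (the input name matches whatever we declared during conversion); timing this call, e.g. with timeit, gives the single-record latency we care about:

```python
import numpy as np
import onnxruntime as rt
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True)

sess = rt.InferenceSession('model.onnx', providers=['CPUExecutionProvider'])
input_name = sess.get_inputs()[0].name

# onnxruntime expects float32 inputs matching the declared tensor type;
# for a classifier, the session typically returns both labels and probabilities
outputs = sess.run(None, {input_name: X[:1].astype(np.float32)})
print(outputs)
```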
Note, at the time of writing this document, the ONNX converter doesn't support categorical variable splits from common boosted tree libraries such as xgboost or lightgbm, so we will have to find other ways of dealing with categorical variables if we wish to leverage ONNX for inferencing.