Deep Learning Model Calibration with Temperature Scaling
In this article, we'll be going over two main things:
Process of finetuning a pre-trained BERT model on a text classification task, more specifically, the Quora Question Pairs challenge.
Process of evaluating model calibration and improving upon calibration error using temperature scaling [2].
Finetuning pre-trained models on downstream tasks has become increasingly popular these days; this notebook documents findings on these models' calibration. Calibration in this context means whether the model's predicted score reflects the true probability. If the reader is not familiar with model calibration 101, there is a separate notebook [nbviewer][html] that covers this topic. Reading up to the "Measuring Calibration" section should suffice.
Tokenizer
We won't be going over the details of the pre-trained tokenizer or model, and will only load pre-trained ones available from the huggingface model repository.
We can feed our tokenizer directly with a pair of sentences.
Decoding the tokenized inputs, we can see that this model's tokenizer adds some special tokens, such as [SEP], which is used to indicate which token belongs to which segment/pair.
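For instance, feeding a question pair through a pre-trained tokenizer might look like the sketch below (the `bert-base-uncased` checkpoint and the example sentences are purely illustrative; the notebook's actual checkpoint may differ):

```python
from transformers import AutoTokenizer

# illustrative checkpoint; substitute the one used in the rest of the notebook
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# tokenizers accept a pair of sentences directly
encoded = tokenizer(
    "How can I learn deep learning?",
    "What is the best way to study deep learning?",
)

# decoding reveals the special tokens the tokenizer inserted: [CLS] ... [SEP] ... [SEP]
print(tokenizer.decode(encoded["input_ids"]))
# token_type_ids marks which tokens belong to which sentence of the pair (0 vs. 1)
print(encoded["token_type_ids"])
```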
This preprocessing step is task specific; if we happen to be using another dataset, this function needs to be modified accordingly.
Model FineTuning
Having preprocessed our raw dataset, for our text classification task we use the AutoModelForSequenceClassification class to load the pre-trained model; the only other argument we need to specify is the number of classes/labels our text classification task has. Upon instantiating this model for the first time, we'll see some warnings telling us that we should fine-tune this model on our downstream task before using it.
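A minimal sketch of this loading step. To keep the example light it uses the small `prajjwal1/bert-tiny` checkpoint, which is an assumption made here for illustration; the notebook itself fine-tunes a full BERT checkpoint:

```python
import torch
from transformers import AutoModelForSequenceClassification

# num_labels=2 matches a binary task such as Quora Question Pairs
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-tiny", num_labels=2
)

# the freshly initialized classification head outputs one logit per label
with torch.no_grad():
    out = model(input_ids=torch.tensor([[101, 102]]))  # [CLS] [SEP]
print(out.logits.shape)  # (batch_size, num_labels)
```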
We can perform all sorts of hyperparameter tuning in the fine-tuning step; here we'll pick some default parameters for illustration purposes.
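As an illustration, such defaults could be expressed via the transformers `TrainingArguments` class; every value below is an assumption for the sketch, not the notebook's actual setting:

```python
from transformers import TrainingArguments

# illustrative hyperparameters only; tune these for the actual task
training_args = TrainingArguments(
    output_dir="bert_qqp",               # hypothetical output directory
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
)
```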
We define some helper functions to generate predictions for our dataset and store the predicted score and label in a pandas DataFrame.
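A sketch of what such a helper might look like. The function name, the batch layout, and the stand-in linear model below are all hypothetical; the real helper would consume the fine-tuned BERT model and its dataloader:

```python
import pandas as pd
import torch
import torch.nn.functional as F

def predictions_to_frame(model, dataloader):
    """Collect each example's predicted score and label into a DataFrame.

    Assumes each batch is a dict with "input" features and "labels",
    and that model(features) returns raw logits.
    """
    model.eval()
    rows = []
    with torch.no_grad():
        for batch in dataloader:
            probs = F.softmax(model(batch["input"]), dim=-1)
            score, pred = probs.max(dim=-1)
            for s, p, y in zip(score.tolist(), pred.tolist(), batch["labels"].tolist()):
                rows.append({"score": s, "predicted": p, "label": y})
    return pd.DataFrame(rows)

# stand-in model and a single synthetic batch, purely for demonstration
torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
batch = {"input": torch.randn(8, 4), "labels": torch.randint(0, 2, (8,))}
df = predictions_to_frame(model, [batch])
print(df.head())
```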
Model Calibration
Temperature Scaling
Temperature Scaling is a post-processing technique proposed to improve upon calibration error, designed specifically for deep learning models. It works by dividing the logits (the output of the layer right before the final softmax layer) by a learned scalar parameter.
$$\hat{q}_i = \max_k \, \sigma_{\text{SM}}(z_i / T)^{(k)}$$

where $z_i$ is the logit vector, $\sigma_{\text{SM}}$ is the softmax function, and $T$ is the learned temperature scaling parameter. We learn this parameter on a validation set, where $T$ is chosen to minimize negative log likelihood. As we can imagine, with $T > 1$, it lowers the predicted score across all classes, making the model less confident about its predictions, but it does not change the model's predicted maximum class.
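A small numeric sketch of this effect, using hypothetical 3-class logits: dividing by a temperature above 1 softens the distribution without changing the argmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# hypothetical logits for a single 3-class prediction
logits = np.array([3.0, 1.0, 0.2])

for T in (1.0, 2.0):
    probs = softmax(logits / T)
    # T = 2 yields a flatter distribution, but the same predicted class
    print(T, probs.round(3), probs.argmax())
```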
The benefit of this approach is mainly twofold:
Unlike many post-processing calibration techniques, temperature scaling can be directly embedded into our deep learning module as a single additional parameter. We can export the model as-is using the standard serialization techniques of that specific deep learning library and perform inference at run time without introducing additional dependencies.
It has been shown by the original paper to provide strong calibration performance compared to other post-processing calibration techniques.
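A minimal sketch of such a module, fit on synthetic over-confident validation logits. The class name and fitting routine are illustrative, loosely following the original paper's recipe of minimizing negative log likelihood with L-BFGS on a validation set:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaling(nn.Module):
    """Single-parameter post-processing layer that divides logits by T."""

    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

    def fit(self, logits, labels, lr=0.01, max_iter=50):
        """Choose T on validation logits/labels by minimizing NLL."""
        optimizer = torch.optim.LBFGS(
            [self.temperature], lr=lr, max_iter=max_iter,
            line_search_fn="strong_wolfe",
        )

        def closure():
            optimizer.zero_grad()
            loss = F.cross_entropy(self(logits), labels)
            loss.backward()
            return loss

        optimizer.step(closure)
        return self

# synthetic over-confident validation set: predictions agree with the true
# label only ~80% of the time, yet the logits are very sharp
torch.manual_seed(0)
true = torch.randint(0, 2, (500,))
noisy = torch.where(torch.rand(500) < 0.8, true, 1 - true)
logits = 5 * (F.one_hot(noisy, num_classes=2).float() - 0.5)

scaler = TemperatureScaling().fit(logits, true)
print(scaler.temperature.item())  # greater than 1 for over-confident logits
```

After fitting, the scaled softmax confidence should sit close to the empirical accuracy (~0.8 here) instead of the original ~0.99.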
Observations:
Based on our calibration plot below, we can see that the predicted scores on this particular dataset are concentrated on the higher end. The original predicted scores already appear fairly well calibrated, and with temperature scaling we were able to improve upon the calibration metrics even further.
A trained temperature scaling parameter larger than 1 indicates that it is indeed shrinking the predicted score, making our model less confident in its predictions.
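For reference, one common calibration metric, expected calibration error (ECE), can be sketched with equal-width bins as below; the binning scheme and the toy scores are illustrative, not the notebook's actual evaluation code:

```python
import numpy as np

def expected_calibration_error(scores, correct, n_bins=10):
    """Equal-width-bin ECE: weighted average of |accuracy - confidence| per bin."""
    scores = np.asarray(scores)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (scores > lo) & (scores <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - scores[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# tiny hypothetical example: high-confidence scores, but only 3 of 4 correct
scores = [0.9, 0.95, 0.85, 0.9]
correct = [1, 1, 1, 0]
print(expected_calibration_error(scores, correct))
```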
There is other work [3] that studies the calibration of state-of-the-art models. Although it is mainly for image-based models, their claim is that model size and amount of pretraining do not fully account for the differences in calibration across models; the primary factor seems to be model architecture. More explicitly, models that rely on attention-based mechanisms are found to be better calibrated than convolution-based ones.
Reference
[1] Blog: Temperature Scaling for Neural Network Calibration
[2] Chuan Guo, Geoff Pleiss, Yu Sun, et al. - On Calibration of Modern Neural Networks (2017)
[3] Matthias Minderer, Josip Djolonga, Rob Romijnders, Frances Hubis, Xiaohua Zhai, Neil Houlsby, Dustin Tran, Mario Lucic - Revisiting the Calibration of Modern Neural Networks (2021)