Copyright 2022 The TensorFlow Compression Authors.
Scalable model compression
Overview
This notebook shows how to compress a model using TensorFlow Compression.
In the example below, we compress the weights of an MNIST classifier to a much smaller size than their floating point representation, while retaining classification accuracy. This is done by a two-step process, based on the paper Scalable Model Compression by Entropy Penalized Reparameterization:
1. Training a "compressible" model with an explicit entropy penalty, which encourages compressibility of the model parameters. The weight on this penalty, $\lambda$, enables continuously controlling the trade-off between the compressed model size and its accuracy.

2. Encoding the compressible model into a compressed model using a coding scheme that is matched with the penalty, meaning that the penalty is a good predictor for model size. This ensures that the method doesn't require multiple iterations of training, compressing, and re-training the model for fine-tuning.
This method is strictly concerned with compressed model size, not with computational complexity. It can be combined with a technique like model pruning to reduce size and complexity.
Example compression results on various models:
Model (dataset) | Model size | Comp. ratio | Top-1 error comp. (uncomp.) |
---|---|---|---|
LeNet300-100 (MNIST) | 8.56 KB | 124x | 1.9% (1.6%) |
LeNet5-Caffe (MNIST) | 2.84 KB | 606x | 1.0% (0.7%) |
VGG-16 (CIFAR-10) | 101 KB | 590x | 10.0% (6.6%) |
ResNet-20-4 (CIFAR-10) | 128 KB | 134x | 8.8% (5.0%) |
ResNet-18 (ImageNet) | 1.97 MB | 24x | 30.0% (30.0%) |
ResNet-50 (ImageNet) | 5.49 MB | 19x | 26.0% (25.0%) |
Applications include:
- Deploying/broadcasting models to edge devices on a large scale, saving bandwidth in transit.
- Communicating global model state to clients in federated learning. The model architecture (number of hidden units, etc.) is unchanged from the initial model, and clients can continue learning on the decompressed model.
- Performing inference on extremely memory-limited clients. During inference, the weights of each layer can be sequentially decompressed, and discarded right after the activations are computed.
Setup
Install TensorFlow Compression via pip.
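For example (the package is published on PyPI as tensorflow-compression; exact version pinning may vary):

```shell
pip install tensorflow-compression
```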
Import library dependencies.
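A minimal set of imports used by the code sketches below (the notebook may additionally import dataset and plotting utilities):

```python
import tensorflow as tf
import tensorflow_compression as tfc
```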
Define and train a basic MNIST classifier
In order to effectively compress dense and convolutional layers, we need to define custom layer classes. These are analogous to the layers under tf.keras.layers, but we will subclass them later to effectively implement Entropy Penalized Reparameterization (EPR). For this purpose, we also add a copy constructor.
First, we define a standard dense layer:
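A minimal sketch of such a layer is shown below; the default activation, the initializers, and the fact that the copy constructor only duplicates the configuration (not the weight values) are illustrative assumptions:

```python
class CustomDense(tf.keras.layers.Layer):
  """Dense layer analogous to tf.keras.layers.Dense, with a copy constructor."""

  def __init__(self, filters, activation="relu", name="custom_dense"):
    super().__init__(name=name)
    self.filters = filters
    self.activation = tf.keras.activations.get(activation)

  @classmethod
  def copy(cls, other, **kwargs):
    # Copy constructor: a new layer with the same configuration.
    return cls(filters=other.filters, activation=other.activation,
               name=other.name, **kwargs)

  def build(self, input_shape):
    self.kernel = self.add_weight(
        name="kernel", shape=(int(input_shape[-1]), self.filters))
    self.bias = self.add_weight(name="bias", shape=(self.filters,))

  def call(self, inputs):
    return self.activation(tf.matmul(inputs, self.kernel) + self.bias)
```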
And similarly, a 2D convolutional layer:
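And a corresponding sketch for the convolutional layer, under the same assumptions ("SAME" padding and a square kernel are illustrative choices):

```python
class CustomConv2D(tf.keras.layers.Layer):
  """2D convolutional layer analogous to tf.keras.layers.Conv2D, with a copy constructor."""

  def __init__(self, filters, kernel_size, strides=1,
               activation="relu", name="custom_conv2d"):
    super().__init__(name=name)
    self.filters = filters
    self.kernel_size = kernel_size
    self.strides = strides
    self.activation = tf.keras.activations.get(activation)

  @classmethod
  def copy(cls, other, **kwargs):
    return cls(filters=other.filters, kernel_size=other.kernel_size,
               strides=other.strides, activation=other.activation,
               name=other.name, **kwargs)

  def build(self, input_shape):
    self.kernel = self.add_weight(
        name="kernel",
        shape=(self.kernel_size, self.kernel_size,
               int(input_shape[-1]), self.filters))
    self.bias = self.add_weight(name="bias", shape=(self.filters,))

  def call(self, inputs):
    outputs = tf.nn.conv2d(inputs, self.kernel,
                           strides=self.strides, padding="SAME")
    return self.activation(tf.nn.bias_add(outputs, self.bias))
```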
Before we continue with model compression, let's check that we can successfully train a regular classifier.
Define the model architecture:
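A small convolutional classifier along these lines is sketched below; the layer sizes and the make_classifier helper (which lets us swap in different layer classes later) are our own choices, not necessarily the notebook's exact architecture:

```python
def make_classifier(dense_cls=CustomDense, conv_cls=CustomConv2D):
  """Builds a small MNIST classifier from the given layer classes."""
  return tf.keras.Sequential([
      conv_cls(20, 5, strides=2, name="conv_1"),
      conv_cls(50, 5, strides=2, name="conv_2"),
      tf.keras.layers.Flatten(),
      dense_cls(500, name="fc_1"),
      dense_cls(10, activation=None, name="fc_2"),  # logits
  ], name="classifier")

classifier = make_classifier()
```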
Load the training data:
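The notebook may load MNIST via TensorFlow Datasets; an equivalent minimal version using tf.keras.datasets is:

```python
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Add a channel dimension and scale pixel values to [0, 1].
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0
```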
Finally, train the model:
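For instance, compiling with Adam and a sparse categorical cross-entropy loss on the logits:

```python
classifier.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
classifier.fit(x_train, y_train, batch_size=128, epochs=5,
               validation_data=(x_test, y_test))
```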
Success! The model trained fine, and reached an accuracy of over 98% on the validation set within 5 epochs.
Train a compressible classifier
Entropy Penalized Reparameterization (EPR) has two main ingredients:
1. Applying a penalty to the model weights during training which corresponds to their entropy under a probabilistic model that is matched with the encoding scheme of the weights. Below, we define a Keras Regularizer which implements this penalty.

2. Reparameterizing the weights, i.e. bringing them into a latent representation which is more compressible (yields a better trade-off between compressibility and model performance). For convolutional kernels, it has been shown that the Fourier domain is a good representation. For other parameters, the below example simply uses scalar quantization (rounding) with a varying quantization step size.
First, define the penalty.
The example below uses a code/probabilistic model implemented in the tfc.PowerLawEntropyModel class, inspired by the paper Optimizing the Communication-Accuracy Trade-off in Federated Learning with Rate-Distortion Theory. The penalty is defined as:

$$ \log \Bigl( \frac{|x|}{\alpha} + 1 \Bigr), $$

where $x$ is one element of the model parameter or its latent representation, and $\alpha$ is a small constant for numerical stability around values of 0.
The penalty is effectively a regularization loss (sometimes called "weight loss"). The fact that it is concave with a cusp at zero encourages weight sparsity. The coding scheme applied for compressing the weights, an Elias gamma code, produces codes of length $1 + 2 \lfloor \log_2 x \rfloor$ bits for a positive integer magnitude $x$ of the element. That is, it is matched to the penalty, and applying the penalty thus minimizes the expected code length.
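A sketch of such a regularizer is below. It assumes that tfc.PowerLawEntropyModel takes the coding rank as its first constructor argument and exposes a penalty() method; the class name PowerLawRegularizer is our own:

```python
class PowerLawRegularizer(tf.keras.regularizers.Regularizer):
  """Keras regularizer wrapping the power-law entropy penalty."""

  def __init__(self, lmbda):
    super().__init__()
    self.lmbda = lmbda

  def __call__(self, variable):
    # Entropy penalty of the variable under the power-law probability model,
    # weighted by lambda.
    em = tfc.PowerLawEntropyModel(variable.shape.rank)
    return self.lmbda * em.penalty(variable)
```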
Second, define subclasses of CustomDense and CustomConv2D which have the following additional functionality:

- They take an instance of the above regularizer and apply it to the kernels and biases during training.
- They define kernel and bias as a @property, which performs quantization with straight-through gradients whenever the variables are accessed. This accurately reflects the computation that is carried out later in the compressed model.
- They define additional log_step variables, which represent the logarithm of the quantization step size. The coarser the quantization, the smaller the model size, but the lower the accuracy. The quantization step sizes are trainable for each model parameter, so that performing optimization on the penalized loss function will determine which quantization step size is best.
The quantization step is defined as follows:
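A minimal sketch, implementing straight-through rounding with tf.custom_gradient (helper names are ours; tfc may ship an equivalent op, but we don't rely on it here):

```python
@tf.custom_gradient
def round_st(x):
  """Rounds to the nearest integer, with a straight-through (identity) gradient."""
  return tf.round(x), lambda dy: dy

def quantize(latent, log_step):
  """Quantizes `latent` with step size exp(log_step), using straight-through gradients."""
  step = tf.exp(log_step)
  return round_st(latent / step) * step
```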
With that, we can define the dense layer:
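A condensed sketch is given below. The class name CompressibleDense, the initial log step size, and adding the penalty via add_loss() inside call() are illustrative choices rather than the notebook's exact implementation:

```python
class CompressibleDense(CustomDense):
  """Dense layer with quantized weights and an entropy penalty on their latents."""

  def __init__(self, regularizer, *args, **kwargs):
    super().__init__(*args, **kwargs)
    self.regularizer = regularizer

  def build(self, input_shape):
    # Latent (continuous) parameters; quantized views are exposed as properties.
    self.kernel_latent = self.add_weight(
        name="kernel_latent", shape=(int(input_shape[-1]), self.filters))
    self.bias_latent = self.add_weight(
        name="bias_latent", shape=(self.filters,))
    # Trainable log quantization step sizes, one per parameter tensor.
    self.kernel_log_step = self.add_weight(
        name="kernel_log_step", shape=(),
        initializer=tf.keras.initializers.Constant(-4.0))
    self.bias_log_step = self.add_weight(
        name="bias_log_step", shape=(),
        initializer=tf.keras.initializers.Constant(-4.0))

  @property
  def kernel(self):
    return quantize(self.kernel_latent, self.kernel_log_step)

  @property
  def bias(self):
    return quantize(self.bias_latent, self.bias_log_step)

  def call(self, inputs):
    # Entropy penalty on the latents, measured in units of the quantization step.
    self.add_loss(self.regularizer(
        self.kernel_latent / tf.exp(self.kernel_log_step)))
    self.add_loss(self.regularizer(
        self.bias_latent / tf.exp(self.bias_log_step)))
    return super().call(inputs)
```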
The convolutional layer is analogous. In addition, the convolution kernel is stored as its real-valued discrete Fourier transform (RDFT) whenever the kernel is set, and the transform is inverted whenever the kernel is used. Since the different frequency components of the kernel tend to be more or less compressible, each of them gets its own quantization step size assigned.
Define the Fourier transform and its inverse as follows:
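One way to implement such a transform pair with tf.signal.rfft2d/irfft2d is sketched below; the function names are ours, and the notebook's version may differ in normalization and in how the real and imaginary parts are packed:

```python
def to_rdft(kernel):
  """Maps a (height, width, in, out) kernel to a real tensor holding its 2D RDFT."""
  # Move the channel dims to the front so the FFT runs over the spatial dims.
  k = tf.transpose(kernel, (2, 3, 0, 1))            # (in, out, h, w)
  f = tf.signal.rfft2d(k)                           # complex, (in, out, h, w // 2 + 1)
  # Store real and imaginary parts side by side as one real-valued tensor.
  return tf.concat([tf.math.real(f), tf.math.imag(f)], axis=-1)

def from_rdft(rdft, kernel_shape):
  """Inverts to_rdft, recovering a kernel of shape `kernel_shape`."""
  height, width = kernel_shape[0], kernel_shape[1]
  half = width // 2 + 1
  f = tf.complex(rdft[..., :half], rdft[..., half:])
  k = tf.signal.irfft2d(f, fft_length=[height, width])  # (in, out, h, w)
  return tf.transpose(k, (2, 3, 0, 1))                  # back to (h, w, in, out)
```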
With that, define the convolutional layer as:
Define a classifier model with the same architecture as above, but using these modified layers:
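Assuming a CompressibleConv2D class defined analogously to CompressibleDense, the compressible classifier can be assembled by injecting the regularizer into the layer constructors (functools.partial and the variable names are our own):

```python
import functools

# Normalize lambda by the number of model parameters, as discussed further below.
regularizer = PowerLawRegularizer(lmbda=2.0 / classifier.count_params())

compressible_classifier = make_classifier(
    dense_cls=functools.partial(CompressibleDense, regularizer),
    conv_cls=functools.partial(CompressibleConv2D, regularizer))
```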
And train the model:
The compressible model has reached a similar accuracy to the plain classifier.
However, the model is not actually compressed yet. To do this, we define another set of subclasses which store the kernels and biases in their compressed form – as a sequence of bits.
Compress the classifier
The subclasses of CustomDense and CustomConv2D defined below convert the weights of a compressible layer into binary strings. In addition, they store the logarithm of the quantization step size at half precision to save space. Whenever the kernel or bias is accessed through the @property, it is decompressed from its string representation and dequantized.
First, define functions to compress and decompress a model parameter:
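A sketch of such helpers is below. It assumes tfc.PowerLawEntropyModel exposes compress() and decompress() methods matched to the penalty; the function names and signatures are illustrative:

```python
def compress_latent(latent, step, name):
  """Compresses a latent tensor into a bit-string variable."""
  em = tfc.PowerLawEntropyModel(latent.shape.rank)
  string = em.compress(latent / step)
  return tf.Variable(string, trainable=False, name=f"{name}_string")

def decompress_latent(string, shape, step, dtype):
  """Recovers the dequantized latent from its compressed representation."""
  em = tfc.PowerLawEntropyModel(len(shape))
  latent = em.decompress(string, shape)
  return tf.cast(latent, dtype) * step
```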
With these, we can define CompressedDense:
The convolutional layer class is analogous to the above.
To turn the compressible model into a compressed one, we can conveniently use the clone_model function. compress_layer converts any compressible layer into a compressed one, and simply passes through any other types of layers (such as Flatten, etc.).
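A sketch of this conversion is below; it assumes that the copy constructors of CompressedDense and CompressedConv2D compress and store the trained weights of the layer they are copied from, and that the compressible layer classes are named as above:

```python
def compress_layer(layer):
  """Swaps compressible layers for compressed ones; passes all others through."""
  if isinstance(layer, CompressibleDense):
    return CompressedDense.copy(layer)
  if isinstance(layer, CompressibleConv2D):
    return CompressedConv2D.copy(layer)
  return type(layer).from_config(layer.get_config())

compressed_classifier = tf.keras.models.clone_model(
    compressible_classifier, clone_function=compress_layer)
```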
Now, let's validate that the compressed model still performs as expected:
The classification accuracy of the compressed model is identical to the one achieved during training!
In addition, the size of the compressed model weights is much smaller than the original model size:
Storing the models on disk requires some overhead for storing the model architecture, function graphs, etc.
Lossless compression methods such as ZIP are good at compressing this type of data, but not the weights themselves. That is why there is still a significant benefit of EPR when counting model size inclusive of that overhead, after also applying ZIP compression:
Regularization effect and size–accuracy trade-off
Above, the hyperparameter $\lambda$ was set to 2 (normalized by the number of parameters in the model). As we increase $\lambda$, the model weights are more and more heavily penalized for compressibility.

For low values of $\lambda$, the penalty can act like a weight regularizer. It actually has a beneficial effect on the generalization performance of the classifier, and can lead to a slightly higher accuracy on the validation dataset:

For higher values of $\lambda$, we see a smaller and smaller model size, but also a gradually diminishing accuracy. To see this, let's train a few models and plot their size vs. accuracy:
The plot should ideally show an elbow-shaped size–accuracy trade-off, but it is normal for accuracy metrics to be somewhat noisy. Depending on initialization, the curve can exhibit some kinks.
Due to the regularization effect, the EPR compressed model is more accurate on the test set than the original model for small values of $\lambda$. The EPR compressed model is also many times smaller, even if we compare the sizes after additional ZIP compression.
Decompress the classifier
CompressedDense and CompressedConv2D decompress their weights on every forward pass. This makes them ideal for memory-limited devices, but the decompression can be computationally expensive, especially for small batch sizes.
To decompress the model once, and use it for further training or inference, we can convert it back into a model using regular or compressible layers. This can be useful in model deployment or federated learning scenarios.
First, converting back into a plain model, we can do inference, and/or continue regular training without a compression penalty:
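One way to sketch this conversion is to rebuild a plain model with the same architecture and copy over the decompressed weight values; the variable names are ours, and decompression happens implicitly when the kernel/bias properties of the compressed layers are read:

```python
# Rebuild a plain classifier (CustomDense / CustomConv2D layers) and create its variables.
plain_classifier = make_classifier()
plain_classifier.build((None, 28, 28, 1))

# Copy the decompressed weights layer by layer; layers without weights (Flatten) are skipped.
for plain, compressed in zip(plain_classifier.layers, compressed_classifier.layers):
  if hasattr(compressed, "kernel"):
    plain.kernel.assign(compressed.kernel)
    plain.bias.assign(compressed.bias)

# Continue regular training for one epoch, without the compression penalty.
plain_classifier.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
plain_classifier.fit(x_train, y_train, batch_size=128, epochs=1,
                     validation_data=(x_test, y_test))
```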
Note that the validation accuracy drops after training for an additional epoch, since the training is done without regularization.
Alternatively, we can convert the model back into a "compressible" one, for inference and/or further training with a compression penalty:
Here, the accuracy improves after training for an additional epoch.