Path: blob/master/deep_learning/multi_label/product_quantization.ipynb
1480 views
Product Quantization for Model Compression
Product Quantization or often times PQ for short is an extremely popular algorithm for compressing vectors/embeddings and performing approximate nearest neighborhood search. In this example, we'll be using fasttext to illustrate the concept. The library and the data preparation step has already been introduced in another documentation, hence we won't be spending too much on them.
Data Preparation and Model
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
Product Quantization from Scratch
The main goal behind compression is that after training our model, the model size is too large for deployment. e.g. when deploying on edge devices where we have limited memory at hand or would require using a bigger machine which ultimately incur more infrastructure costs.
In this situation, product quantization is a compression technique for compressing embeddings. It has been shown that it significantly reduce the size of the model without hurting the performance by much. And without further a due, we'll dive right into how this technique works.
Learning Code Book
Our first step is to split the a dimensional vectors/matrix into distinct subvectors , of dimension , where should ideally be multiple of .
As an example, say our original embedding has a dimension of 1000x80, we can split it into 4 subvectors, each having a dimension of 1000x20.
Next, we'll run a k-means clustering algorithm on each subvectors. Upon computing the cluster centroid for each of cluster and subvectors, we would have on our hand, cluster centroids, each of size . In other words, reducing the dimension from into
So we'll apply k-means on our 1000x20 subvectors. If is 10, we would end up having 10 cluster centroids each of size 20. Since we have 4 subvectors, we would have 4 of those.
Information theory nomenclature is often times used to explain these concepts. Where:
The cluster centroids is referred to as codebook.
The cluster index is called a code, a reproduction value, a reconstruction value.
We'll be using scipy vector quantization module's k-means clustering implementation in the example code.
Encode
The cluster centroids for each of the subvectors represents the average/common pattern of each subvectors, and what we're going to do is to replace the original vector with the cluster centroid of each subvectors. The effect of doing so is instead of storing the original floating values for every record in our dataset, we'll be replacing it with the closest cluster centroid id within each subvectors.
i.e. At the end, we'll be "compressing" our original vector of size into a size of
We can calculate the potential size/memory savings if we were to go from storing the original vector into storing the encoded codes and the code book.
Instead of directly compressing the original embedding, we can also learn the codebooks and compress all new incoming embeddings on the fly.
Computing Query Distance
We can also compute nearest neighbors using the compressed vectors.
To do so, we'll be computing the distance between each subspace of the query with the cluster centroid of each subspace, giving us a distance table, where each one denotes the squared Euclidean distance between the subvector of the query and the code/cluster centroid for that subvector.
Then assuming for original vector is already encoded in advance, we can lookup the distances for each cluster centroid and add them up.
The approach illustrated here is more of a naive approach as it still involves calculating the distances to all the vectors, which can still be inefficient for large (number of data points). We won't be discussing how to speed up the nearest neighborhood search process for product quantization as this documentation is more focused on the compression aspect of it.
Fasttext Product Quantization
Fasttext comes with built-in capabilities for doing model compression using product quantization. We'll experiment with different options/parameter and measure the model performance and model size. i.e. compression ratio v.s. model performance dip.
The next couple of code chunks defines the functions to measure the model performance and model file size.
For this part of the experiment, we'll tweak the parameter, dimension of subvector, dsub
. Remember that this is one of main parameter that controls the tradeoff between the compression ratio and amount of distortion (deviation from the original vector).
We can visualize the table results. Our main observation is that setting dsub
to 2 seems to give the most memory reduction while preserving most of the model's performance.
The .quantize
method also provide other options that we did not use here such as:
Whether to quantize the output matrix,
qout
Whether to retrain the model's vector after doing the quantization,
retrain
.
Here we also list down the other compression tidbits from the original paper.
Quantize the Norm fasttext has a parameter
qnorm
to normalize of the vector and also quantize that norm. This often times leads to a lesser drop in accuracy.Retrain after Quantization: This suggests a bottom-up learning strategy where we first quantize the input matrix, then retrain and quantize the output matrix (the input matrix being frozen).
Vocabulary Pruning: Upon training the model, we can remove features that do not play a large role in the model. For each document, we verify if it is already covered by a retained feature and, if not, we add the feature with the highest norm to our set of retained features. If the number of features is below some user specified threshold, we add the features with the highest norm that have not yet been picked.
Choice of subvectors: We observe in practice that using subvectors, i.e., half of the components of the embeddings, works well in practice. Using less subquantizers significantly decreases the performance for a small memory gain.
Product Quantization Recap
To wrap up, we'll do a recap of product quantization, this time using a bit more notation.
The idea behind product quantization is to compress our original matrix into compact codes, where the comparison of the compact codes approximates the comparison in the original space. In the process of doing so, we are essentially retaining the most useful information within our matrix while discarding the less-relevant ones.
We have an original matrix , where . The input matrix will first be split into distinct sub-matrix, , , each of dimension , where should ideally be multiple of .
Then a product quantizer function, , is defined as a concatenation of sub-quantizer.
Where each sub-quantizer, , is usually a low complexity quantizer/clustering algorithm, such as k-means. In other words, each sub-quantizer learns a sub-codebook/sub-cluster centroids comprises of centroids each of size . Then the quantizer would map an input into its respective code/centroid under each sub-matrix and the final representation would be the concatenation of centroids.
Typically, the number of cluster centroids, , is set to 256 so that each code/centroid can be represented by 8 bits.