Path: blob/master/deep_learning/contrastive/clip/clip.ipynb
CLIP (Contrastive Language-Image Pre-training)
Self-supervision, a.k.a. pre-training, has been all the rage lately. The core idea is that supervised learning, the main workhorse in machine learning applications, requires labeled data. In real-world scenarios, obtaining large amounts of labeled data can be very expensive, not to mention we might need to continuously annotate ground truth for new data so our system can adapt to newer information. One of the benefits of self-supervised learning is to reduce the amount of labeling required. Given large amounts of unlabeled data at hand, we create proxy tasks from the data itself and pre-train our model on them; these tasks essentially turn an unsupervised learning problem into a supervised one. By warming up our models this way, the hope is that we can then achieve competitive results on the downstream applications we actually care about by fine-tuning on a smaller set of labeled data.
In this document, we'll go over one popular vision-language pre-training method called CLIP (Contrastive Language-Image Pre-training) [6] [8]. The de-facto approach to many vision tasks in this deep learning era is to start from pre-trained visual representations, typically trained via supervised learning on an image classification dataset such as ImageNet. CLIP demonstrated that a pre-training task of predicting which caption goes with which image via a contrastive loss is an efficient and scalable way to learn SOTA image representations. Directly quoting from the original paper:
The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.
Personally, one of the most interesting things about this method is its multi-modal nature, i.e. there are pure text pre-training models such as BERT and pure image pre-training methods such as SimCLR, but using natural language supervision to guide image representation learning is definitely quite unique.
Dataset
We'll be using the publicly available Flickr30k dataset from Kaggle [7] as our example dataset; other dataset choices with reasonable sizes are Flickr8k and MS-COCO Captions.
The next few code chunks perform the usual steps of reading in our dataset, creating a train/validation split, and reading in some sample records for manual inspection. One important thing to note about this dataset is that each image is paired with multiple captions, 5 to be exact.
We also construct our dataset and dataloader. For data preprocessing, we'll use the dataset's with_transform method. This applies the transformation only when examples are accessed, which can be thought of as a lazy version of the map method.
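A minimal sketch of this lazy preprocessing, assuming a Hugging Face `datasets` object with `image` and `caption` columns (column names and the tokenizer/transform choices here are assumptions for illustration):

```python
from transformers import AutoTokenizer
from torchvision import transforms

# assumed tokenizer for the text encoder; swap in whichever text model you use
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def preprocess(batch):
    # tokenize captions and convert raw PIL images into tensors
    encoded = tokenizer(batch["caption"], padding="max_length", truncation=True, max_length=77)
    encoded["pixel_values"] = [image_transform(img.convert("RGB")) for img in batch["image"]]
    return encoded

# the transform only runs when rows are accessed, i.e. a lazy version of .map
dataset = dataset.with_transform(preprocess)
```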
CLIP Model
The CLIP model comprises three components [6] [8]: an image encoder, a text encoder and a projection head (absorbed inside the encoder block in the diagram).

During training we feed our batches of text and images through their respective encoders. Given a batch of $N$ (image, text) pairs, CLIP is trained to predict which of the $N \times N$ possible (image, text) pairings across the batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the (cosine) similarity of the $N$ real image and text embedding pairs in the batch while minimizing the cosine similarity of the incorrect image and text embedding pairings. This is commonly referred to as the InfoNCE loss.

$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\text{sim}(x_i, y_i) / \tau\big)}{\sum_{j=1}^{N} \exp\big(\text{sim}(x_i, y_j) / \tau\big)}
$$

Where $\text{sim}(\cdot, \cdot)$ is a similarity function such as cosine similarity or dot product that aims to align embeddings of similar pairs, $x_i$ and $y_j$ are the image and text embeddings, and $\tau$ is a temperature parameter. This loss can then be treated as a multi-class classification task using cross entropy loss, with the difference that the number of classes here is the number of candidates within the batch (one positive and $N - 1$ negatives) instead of a fixed set of distinct classes we wish to predict in a standard multi-class classification problem.
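A minimal PyTorch sketch of this symmetric contrastive loss, assuming `image_embeddings` and `text_embeddings` are already L2-normalized projections of shape `(batch_size, dim)`:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    # pairwise cosine similarities (embeddings assumed L2-normalized), scaled by temperature
    logits = image_embeddings @ text_embeddings.T / temperature

    # the matching pairs sit on the diagonal, so the "label" for row i is simply i
    labels = torch.arange(len(logits), device=logits.device)

    # symmetric cross entropy over image->text and text->image directions
    loss_i2t = F.cross_entropy(logits, labels)
    loss_t2i = F.cross_entropy(logits.T, labels)
    return (loss_i2t + loss_t2i) / 2
```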
We load the image encoder and text encoder from Hugging Face transformers. Here we will be implementing the CLIP components ourselves, but Hugging Face transformers also has a CLIP implementation which can serve as a reference. The projection head is responsible for taking both image and text encodings and embedding them into the same dimensional space.
Feel free to experiment with different image or text encoders, as well as projection head.
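One possible projection head, sketched below under the common pattern of a linear projection followed by a GELU MLP, residual connection and layer norm; the exact architecture and dimensions are free choices, not prescribed by CLIP itself:

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Projects encoder outputs into a shared embedding space."""

    def __init__(self, embedding_dim, projection_dim=256, dropout=0.1):
        super().__init__()
        self.projection = nn.Linear(embedding_dim, projection_dim)
        self.gelu = nn.GELU()
        self.fc = nn.Linear(projection_dim, projection_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(projection_dim)

    def forward(self, x):
        projected = self.projection(x)
        x = self.gelu(projected)
        x = self.fc(x)
        x = self.dropout(x)
        x = x + projected  # residual connection
        return self.layer_norm(x)
```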
Some other key learnings from the work include:
Data: One key ingredient of pre-training is large scale data. CLIP collected a new dataset comprised of 400 million image-text pairs from the public internet.
Objective: Choosing a proxy task that is training efficient was also key to scaling up learning image representations via natural language supervision. As illustrated in this document, CLIP chose a two-tower contrastive learning approach of aligning which text as a whole is paired with which image, instead of a predictive objective such as predicting the exact words of a caption, or a generative model.
Encoder: We can always experiment with different encoder architectures. The authors reported a 3x gain in compute efficiency by adopting a vision transformer over a standard ResNet for the image encoder, and found the model is less sensitive to the text encoder's capacity. They also reported using a higher 336 pixel resolution for images.
Training Recipe:
An important thing to note is that their contrastive loss uses a very large minibatch size of 32,768, and the calculation of embedding similarities is sharded across individual GPUs.
Their largest Vision Transformer took 12 days on 256 V100 GPUs.
They trained the CLIP model completely from scratch without initializing the image or text encoder with pre-trained weights.
Zero Shot Capabilities: Given that CLIP leverages natural language supervision, it enables far stronger generalization and zero-shot capabilities. e.g. Given a task of classifying photos of objects, we can check for each image whether CLIP predicts the caption "a photo of a dog" or "a photo of a car", etc. is more likely to be paired with it (depicted in the diagram below; a minimal sketch follows it). We can imagine swapping out the dog and car part of the prompt with any other class, making this applicable to potentially arbitrary classification tasks. Caveat: this may require trial and error "prompt engineering" to work well, and it still generalizes poorly to images not covered in its pre-training dataset.
Transfer Learning: CLIP's vision encoder, which is trained on noisy image-text pairs from the web, also offers very solid fine-tuning performance on image classification tasks with the right choice of hyperparameters [11]:
Smaller learning rate.
Exponential moving average: keeping a moving average of all model parameters' weights.
Layer-wise learning rate decay: setting different learning rates for each backbone layer. Top layers have a higher learning rate to adapt to new tasks, while bottom layers have a smaller learning rate so the strong features learned from pre-training are preserved.
Data Augmentation: Removing strong random augmentations such as mixup and cutmix.

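Below is a minimal sketch of how such zero-shot classification could look with a trained CLIP model; the `encode_image` / `encode_text` methods and the class names are assumptions for illustration, not a prescribed API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, image_tensor, class_names):
    # build one natural language prompt per candidate class
    prompts = [f"a photo of a {name}" for name in class_names]
    text_inputs = tokenizer(prompts, padding=True, return_tensors="pt")

    # embed the image and every prompt into the shared space (methods assumed on our model)
    image_emb = F.normalize(model.encode_image(image_tensor.unsqueeze(0)), dim=-1)
    text_emb = F.normalize(model.encode_text(**text_inputs), dim=-1)

    # the class whose prompt embedding is closest to the image embedding wins
    similarities = (image_emb @ text_emb.T).squeeze(0)
    return class_names[similarities.argmax().item()]

# usage: zero_shot_classify(clip_model, tokenizer, image, ["dog", "car", "cat"])
```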
Apart from CLIP, we'll also use this opportunity to introduce LiT, a potentially more efficient way of training image-text models with contrastive learning, as well as ViT, the image encoder that we'll be using.
LiT
Locked-image text Tuning (LiT) [9] finds that applying contrastive learning with a locked/frozen pre-trained image model and an unlocked text model works extremely well. The core idea is to teach a text model to read out good representations from a pre-trained image model; a minimal sketch of this locking appears after the benefits listed below.
Image pre-trained models are typically trained on semi-manually labeled images such as ImageNet-21k, which offers high quality data. This approach, however, has a limitation: it's confined to a pre-defined set of categories, restricting the model's generalization capability. In contrast, contrastive learning is often trained on image and text pairs that are loosely aligned from the web. This circumvents the need for manual labeling, and allows for learning richer visual concepts that go beyond the categories defined in a classification label space. Initializing contrastive pre-training with an image model that has been pre-trained on a cleaner, semi-manually labeled dataset aims to provide the best of both worlds: strong image representations from pre-training, plus flexible zero-shot transfer to new tasks via contrastive learning.
Reduced compute and data requirements. LiT re-uses existing pre-trained image encoders, amortizing the computation and data resources needed to achieve solid performance. Quoting the authors, LiT models trained on 24 million publicly available image-text pairs can rival the zero-shot classification performance of previous models trained on 400 million image-text pairs from private sources.
A locked/frozen image encoder leads to faster training and a smaller memory footprint, enabling larger batch sizes and hence improving the model's performance in a contrastive learning setting.
Generalization. Locking the image tower improves generalization, as it produces a text model that is well aligned to an already strong and general image representation, as opposed to an image-text model that is well aligned but specialized to the dataset used for alignment.
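A minimal sketch of the locked-image setup, assuming `image_encoder` is a pre-trained torch module; only the text tower and projection heads would then receive gradient updates:

```python
import torch

def lock_image_tower(image_encoder):
    # freeze all image encoder parameters so only the text tower (and projection heads) train
    for param in image_encoder.parameters():
        param.requires_grad = False
    # keep frozen layers (e.g. batch norm / dropout) in inference mode during training
    image_encoder.eval()
    return image_encoder

# later, only pass trainable parameters to the optimizer
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4
# )
```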
ViT
The Transformer/BERT style model was originally proposed in the natural language domain, and quickly became the de facto standard model architecture there. Its reach into the computer vision field came later, when vision transformers (ViT) [10] showed that a pure transformer applied to a sequence of image patches is capable of achieving remarkable results on computer vision tasks. We'll elaborate upon its architecture and performance.
Architecture:

The main modification ViT made was how images are fed to a Transformer. In the natural language domain, we first tokenize input text before feeding the token ids through the transformer module; for images, we instead convert an image into square, non-overlapping patches, each of which gets turned into a vector/patch embedding. In the architecture diagram above, this is referred to as linear projection, and in practice these patch embeddings are often generated via a 2D convolutional layer. e.g. Given a 224x224 pixel image and a 16x16 patch size, we end up with a sequence of 196 flattened 16x16 image patches. This is why public pre-trained models, e.g. google/vit-base-patch16-224-in21k, carry names such as patch16-224 indicating the patch size as well as the image resolution on which they were pre-trained. Another example is ViT-B/16, indicating a base model trained with a 16x16 input patch size. The reason behind this patching is that directly applying a transformer's self attention to an image would require each pixel to attend to every other pixel; given self attention's quadratic cost, this does not scale to realistic input sizes. After this preprocessing, a special [CLS] token is added to the beginning of the patch embeddings, which can be used as the embedding for downstream tasks, along with a learnable position embedding. Both of these are similar to BERT.
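A minimal sketch of producing such patch embeddings with a single Conv2d whose kernel size and stride both equal the patch size; dimensions follow the ViT-B/16 example above, other values are assumptions:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turns an image into a sequence of flattened patch embeddings."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2  # 224/16 -> 14x14 = 196 patches
        # a conv with kernel_size == stride == patch_size is equivalent to splitting the
        # image into non-overlapping patches and applying a shared linear projection
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        x = self.proj(x)                       # (batch, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (batch, 196, embed_dim)

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```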
Performance:

The main takeaway from this figure is that compared to Convolutional Neural Networks (CNN), ViT benefits more from larger-scale pre-training data. By scaling from ImageNet 2012 (1k classes, 1.3M images) to ImageNet-21k (21k classes, 14M images), and further to the private JFT dataset (18k classes, 300M images), larger ViT models start to dominate the other variants from a performance standpoint. This result is commonly attributed to fundamental differences in how these two types of models process visual information: compared to the CNNs widely used for vision tasks, ViT lacks useful inductive biases.
CNN:
CNNs are designed specifically for processing grid-structured data, such as images. They have a strong locality bias which assumes pixels close to each other in the input image are more related and share information. This is why CNNs use convolutional layers that slide small filters (kernels) over input images to capture local patterns.
CNNs leverage the translation equivariance property, which means that if a feature (e.g., an edge or a texture) is important in one part of an image, it is likely to be important in other parts as well. This bias is essential for image recognition tasks.
ViT
ViT aims to preserve the original transformer module without modification to the core self attention operation. The only places where image-specific inductive biases are introduced are when projecting the image to patch embeddings, and when fine-tuning on higher resolution images (2D interpolation of the pre-trained position embeddings).
Transformers were originally designed for sequential data such as text. By treating an image as a sequence of image patches and processing them through the self attention mechanism, ViT is more flexible in how it captures patterns within images.
The experiment result reinforces our intuition that convolutional inductive bias is useful for smaller datasets, but for larger ones, learning the relevant patterns directly from global context is sufficient, even beneficial.
Note, unlike BERT, which relied on self-supervised pre-training via masked language modeling (predicting masked tokens), the original ViT is still based on supervised pre-training.
Other notable learnings at the time include:
Unlike in the NLP domain, where self-supervised pre-training is employed, the best result in the original ViT work was still obtained via supervised pre-training.
Compared to pre-training, we can use a higher image resolution during fine-tuning. When doing so, 2D interpolation is needed to adjust the positional embeddings, as sketched below.
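A minimal sketch of that 2D interpolation, assuming a pre-trained position embedding of shape `(1, 1 + num_patches, dim)` where the first entry belongs to the [CLS] token:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid=14, new_grid=24):
    # pos_embed: (1, 1 + old_grid**2, dim); keep the [CLS] position embedding as-is
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]

    # reshape the flat patch positions back into their 2D grid and interpolate
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

    return torch.cat([cls_pos, patch_pos], dim=1)

# e.g. adapting a 224px (14x14 patches) checkpoint to 384px (24x24 patches) fine-tuning
new_pos = resize_pos_embed(torch.randn(1, 1 + 14 * 14, 768), old_grid=14, new_grid=24)
print(new_pos.shape)  # torch.Size([1, 577, 768])
```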
Implementation
Upon confirming that both our image and text encoders return embeddings of the same shape, we can now assemble the CLIP model.
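A minimal sketch of how the pieces could fit together, assuming hypothetical `image_encoder` / `text_encoder` modules that each return a pooled hidden state, plus the `ProjectionHead` and `clip_contrastive_loss` sketched earlier:

```python
import torch.nn as nn
import torch.nn.functional as F

class CLIPModel(nn.Module):
    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, projection_dim=256):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.image_projection = ProjectionHead(image_dim, projection_dim)
        self.text_projection = ProjectionHead(text_dim, projection_dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        # encode each modality, project into the shared space, and L2-normalize
        image_features = self.image_encoder(pixel_values)
        text_features = self.text_encoder(input_ids, attention_mask)
        image_emb = F.normalize(self.image_projection(image_features), dim=-1)
        text_emb = F.normalize(self.text_projection(text_features), dim=-1)
        # symmetric InfoNCE loss from the earlier sketch
        return clip_contrastive_loss(image_emb, text_emb)
```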
Evaluation
For evaluation, we'll conduct ad-hoc qualitative analysis by feeding our model a piece of input text and showing the top-k retrieved images, as well as perform quantitative evaluation by computing retrieval recall@k and letting actual numbers speak to our model's quality. For retrieving top-k results, we'll be using faiss to compute exact cosine similarity between text and image embeddings.
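A minimal sketch of this retrieval evaluation, assuming `image_embeddings` and `text_embeddings` are float32 numpy arrays where row i of each corresponds to the same ground-truth pair; inner product on L2-normalized vectors equals cosine similarity:

```python
import faiss
import numpy as np

def recall_at_k(text_embeddings, image_embeddings, k=5):
    # normalize in place so that inner product == cosine similarity
    faiss.normalize_L2(image_embeddings)
    faiss.normalize_L2(text_embeddings)

    # exact (non-approximate) inner product index over all image embeddings
    index = faiss.IndexFlatIP(image_embeddings.shape[1])
    index.add(image_embeddings)

    # for every caption, retrieve the k closest images
    _, retrieved = index.search(text_embeddings, k)

    # a hit if the paired image (same row id) shows up in the top-k list
    ground_truth = np.arange(len(text_embeddings))[:, None]
    return float((retrieved == ground_truth).any(axis=1).mean())
```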
The default setting for this notebook is to use the frozen/locked image encoder setting from LiT. A quick experiment showed this out-performed un-freezing. The original LiT work also mentions that locking the image encoder provides benefits even when performing contrastive learning on very large image-text pair datasets (e.g. 4 billion images), though the caveat is that this claim is for fine tuning on an image classification task rather than a cross-modal retrieval task.
Reference
[1] GitHub: Simple CLIP
[2] GitHub: mlfoundations open_clip
[3] GitHub: openai CLIP
[4] Implementing CLIP with PyTorch Lightning
[5] OpenAI's CLIP is the most important advancement in computer vision this year
[6] Blog: CLIP: Connecting Text and Images
[7] Kaggle: Flickr Image dataset
[8] Alec Radford, Jong Wook Kim, et al. - Learning Transferable Visual Models From Natural Language Supervision (2021)
[9] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Lucas Beyer, et al. - LiT: Zero-Shot Transfer with Locked-image text Tuning (2021)
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Neil Houlsby, et al. - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)
[11] Xiaoyi Dong, et al. - CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet (2022)