Natural language image search with a Dual Encoder
Author: Khalid Salama
Date created: 2021/01/30
Last modified: 2021/01/30
Description: Implementation of a dual encoder model for retrieving images that match natural language queries.
Introduction
This example demonstrates how to build a dual encoder (also known as two-tower) neural network model to search for images using natural language. The model is inspired by the CLIP approach, introduced by Alec Radford et al. The idea is to train a vision encoder and a text encoder jointly to project the representations of images and their captions into the same embedding space, such that the caption embeddings are located near the embeddings of the images they describe.
This example requires TensorFlow 2.4 or higher. In addition, TensorFlow Hub and TensorFlow Text are required for the BERT model, and TensorFlow Addons is required for the AdamW optimizer. These libraries can be installed using the following command:
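A typical install command for these dependencies (the package names are the standard PyPI ones; exact versions may need pinning to match your TensorFlow release):

```shell
pip install -q -U tensorflow-hub tensorflow-text tensorflow-addons
```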
Setup
Prepare the data
We will use the MS-COCO dataset to train our dual encoder model. MS-COCO contains over 82,000 images, each of which has at least 5 different caption annotations. The dataset is usually used for image captioning tasks, but we can repurpose the image-caption pairs to train our dual encoder model for image search.
Download and extract the data
First, let's download the dataset, which consists of two compressed folders: one containing the images, and the other containing the associated image captions. Note that the compressed images folder is 13GB in size.
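A minimal sketch of the download step using `tf.keras.utils.get_file`; the URLs point at the standard MS-COCO 2014 release, and the directory names are illustrative:

```python
import os
import tensorflow as tf

root_dir = "datasets"
annotations_dir = os.path.join(root_dir, "annotations")
images_dir = os.path.join(root_dir, "train2014")

# Download and extract the caption annotations.
if not os.path.exists(annotations_dir):
    tf.keras.utils.get_file(
        "captions.zip",
        "http://images.cocodataset.org/annotations/annotations_trainval2014.zip",
        cache_dir=".",
        cache_subdir=root_dir,
        extract=True,
    )

# Download and extract the training images (~13GB).
if not os.path.exists(images_dir):
    tf.keras.utils.get_file(
        "train2014.zip",
        "http://images.cocodataset.org/zips/train2014.zip",
        cache_dir=".",
        cache_subdir=root_dir,
        extract=True,
    )
```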
Process and save the data to TFRecord files
You can change the `sample_size` parameter to control how many image-caption pairs will be used for training the dual encoder model. In this example we set `train_size` to 30,000 images, which is about 35% of the dataset. We use 2 captions for each image, thus producing 60,000 image-caption pairs. The size of the training set affects the quality of the produced encoders, but more examples also mean longer training time.
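A rough sketch of how the image-caption pairs could be serialized to TFRecord files; the feature names, the `image_path_to_caption` mapping, and the two-captions-per-image selection mirror the description above but are illustrative:

```python
import tensorflow as tf

def create_example(image_path, caption):
    # Store the image path and one caption as a single serialized example.
    features = tf.train.Features(
        feature={
            "caption": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[caption.encode()])
            ),
            "image_path": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image_path.encode()])
            ),
        }
    )
    return tf.train.Example(features=features)


def write_tfrecords(file_name, image_paths, image_path_to_caption, captions_per_image=2):
    # Expand each image into (image_path, caption) pairs.
    caption_list = []
    image_path_list = []
    for image_path in image_paths:
        captions = image_path_to_caption[image_path][:captions_per_image]
        caption_list.extend(captions)
        image_path_list.extend([image_path] * len(captions))

    # Serialize the pairs into a single TFRecord file.
    with tf.io.TFRecordWriter(file_name) as writer:
        for image_path, caption in zip(image_path_list, caption_list):
            example = create_example(image_path, caption)
            writer.write(example.SerializeToString())
```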
Create `tf.data.Dataset` for training and evaluation
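A sketch of how the TFRecord files could be read back into a `tf.data.Dataset`; the feature names match the writer sketch above, and the 299x299 image size (Xception's default input resolution) is an assumption:

```python
import tensorflow as tf

feature_description = {
    "caption": tf.io.FixedLenFeature([], tf.string),
    "image_path": tf.io.FixedLenFeature([], tf.string),
}


def read_example(example):
    # Parse one serialized example and load its image from disk.
    features = tf.io.parse_single_example(example, feature_description)
    raw_image = tf.io.read_file(features["image_path"])
    features["image"] = tf.image.resize(
        tf.image.decode_jpeg(raw_image, channels=3), size=(299, 299)
    )
    return features


def get_dataset(file_pattern, batch_size):
    # Build a shuffled, batched, prefetched pipeline over the TFRecord files.
    return (
        tf.data.TFRecordDataset(tf.data.Dataset.list_files(file_pattern))
        .map(read_example, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
```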
Implement the projection head
The projection head is used to transform the image and the text embeddings to the same embedding space with the same dimensionality.
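One way the projection head could look is a stack of residual blocks with GELU activations, dropout, and layer normalization; this is a sketch, and the exact layer choices are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def project_embeddings(embeddings, num_projection_layers, projection_dims, dropout_rate):
    # Project the input embeddings into the shared space.
    projected_embeddings = layers.Dense(units=projection_dims)(embeddings)
    for _ in range(num_projection_layers):
        # Residual block: GELU -> Dense -> Dropout -> skip connection -> LayerNorm.
        x = tf.nn.gelu(projected_embeddings)
        x = layers.Dense(projection_dims)(x)
        x = layers.Dropout(dropout_rate)(x)
        x = layers.Add()([projected_embeddings, x])
        projected_embeddings = layers.LayerNormalization()(x)
    return projected_embeddings
```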
Implement the vision encoder
In this example, we use Xception from Keras Applications as the base for the vision encoder.
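A sketch of the vision encoder, reusing the `project_embeddings` function from the previous sketch; the 299x299 input size and average pooling are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def create_vision_encoder(num_projection_layers, projection_dims, dropout_rate, trainable=False):
    # Load the pre-trained Xception model as the base image encoder.
    xception = keras.applications.Xception(
        include_top=False, weights="imagenet", pooling="avg"
    )
    # Freeze (or unfreeze) the base encoder.
    xception.trainable = trainable
    # Preprocess the input image and generate its embedding.
    inputs = layers.Input(shape=(299, 299, 3), name="image_input")
    preprocessed = keras.applications.xception.preprocess_input(inputs)
    embeddings = xception(preprocessed)
    # Project the embedding into the shared space.
    outputs = project_embeddings(
        embeddings, num_projection_layers, projection_dims, dropout_rate
    )
    return keras.Model(inputs, outputs, name="vision_encoder")
```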
Implement the text encoder
We use BERT from TensorFlow Hub as the text encoder.
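A sketch of the text encoder; the specific TensorFlow Hub handles for the BERT preprocessor and a small BERT encoder are assumptions, and any compatible pair should work. It reuses `project_embeddings` from the projection-head sketch:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # Registers the ops needed by the BERT preprocessor.
from tensorflow import keras
from tensorflow.keras import layers

def create_text_encoder(num_projection_layers, projection_dims, dropout_rate, trainable=False):
    # Load the BERT preprocessing module.
    preprocess = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/2",
        name="text_preprocessing",
    )
    # Load a pre-trained small BERT model as the base text encoder.
    bert = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1",
        name="bert",
    )
    bert.trainable = trainable
    # Tokenize the caption and generate its embedding.
    inputs = layers.Input(shape=(), dtype=tf.string, name="text_input")
    bert_inputs = preprocess(inputs)
    embeddings = bert(bert_inputs)["pooled_output"]
    # Project the embedding into the shared space.
    outputs = project_embeddings(
        embeddings, num_projection_layers, projection_dims, dropout_rate
    )
    return keras.Model(inputs, outputs, name="text_encoder")
```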
Implement the dual encoder
To calculate the loss, we compute the pairwise dot-product similarity between each `caption_i` and `image_j` in the batch as the predictions. The target similarity between `caption_i` and `image_j` is computed as the average of the (dot-product similarity between `caption_i` and `caption_j`) and (the dot-product similarity between `image_i` and `image_j`). Then, we use crossentropy to compute the loss between the targets and the predictions.
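A sketch of the loss described above, written as a standalone function; the batch shapes and the default `temperature` value are assumptions:

```python
import tensorflow as tf
from tensorflow import keras

def compute_loss(caption_embeddings, image_embeddings, temperature=0.05):
    # logits[i][j] is the dot-product similarity between caption_i and image_j.
    logits = (
        tf.matmul(caption_embeddings, image_embeddings, transpose_b=True) / temperature
    )
    # Target similarity: the average of the caption-to-caption and
    # image-to-image similarities.
    images_similarity = tf.matmul(image_embeddings, image_embeddings, transpose_b=True)
    captions_similarity = tf.matmul(caption_embeddings, caption_embeddings, transpose_b=True)
    targets = keras.activations.softmax(
        (captions_similarity + images_similarity) / (2 * temperature)
    )
    # Symmetric crossentropy: captions -> images and images -> captions.
    captions_loss = keras.losses.categorical_crossentropy(
        y_true=targets, y_pred=logits, from_logits=True
    )
    images_loss = keras.losses.categorical_crossentropy(
        y_true=tf.transpose(targets), y_pred=tf.transpose(logits), from_logits=True
    )
    return (captions_loss + images_loss) / 2
```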
Train the dual encoder model
In this experiment, we freeze the base encoders for text and images, and make only the projection head trainable.
Note that training the model with 60,000 image-caption pairs, with a batch size of 256, takes around 12 minutes per epoch using a V100 GPU accelerator. If 2 GPUs are available, the epoch takes around 8 minutes.
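Below is a rough sketch of how the training might be wired up. The `DualEncoder` wrapper (a `keras.Model` subclass whose `train_step` uses the loss above), the `get_dataset` helper from the earlier sketch, the TFRecord file patterns, and the hyperparameters are all assumptions for illustration:

```python
import tensorflow_addons as tfa
from tensorflow import keras

num_epochs = 5
batch_size = 256

# Wrap the two encoders and compile with the AdamW optimizer.
dual_encoder = DualEncoder(text_encoder, vision_encoder, temperature=0.05)
dual_encoder.compile(
    optimizer=tfa.optimizers.AdamW(learning_rate=0.001, weight_decay=0.001)
)

# Hypothetical TFRecord file patterns produced by the data preparation step.
train_dataset = get_dataset("tfrecords/train-*.tfrecord", batch_size)
valid_dataset = get_dataset("tfrecords/valid-*.tfrecord", batch_size)

history = dual_encoder.fit(
    train_dataset,
    epochs=num_epochs,
    validation_data=valid_dataset,
    callbacks=[
        keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=3),
        keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        ),
    ],
)
```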
Plotting the training loss:
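A minimal plotting sketch, assuming `history` comes from the `fit()` call sketched above:

```python
import matplotlib.pyplot as plt

plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.legend(["train", "valid"], loc="upper right")
plt.show()
```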
Search for images using natural language queries
We can then retrieve images corresponding to natural language queries via the following steps:

1. Generate embeddings for the images by feeding them into the `vision_encoder`.
2. Feed the natural language query to the `text_encoder` to generate a query embedding.
3. Compute the similarity between the query embedding and the image embeddings in the index to retrieve the indices of the top matches.
4. Look up the paths of the top matching images to display them.
Note that, after training the `dual_encoder`, only the fine-tuned `vision_encoder` and `text_encoder` models will be used, while the `dual_encoder` model will be discarded.
Generate embeddings for the images
We load the images and feed them into the `vision_encoder` to generate their embeddings. In large-scale systems, this step is performed using a parallel data processing framework, such as Apache Spark or Apache Beam. Generating the image embeddings may take several minutes.
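A sketch of batched embedding generation; `image_paths` (the list of image files), the `read_image` helper, and the batch size are assumptions:

```python
import tensorflow as tf

def read_image(image_path):
    # Load and resize an image to the encoder's expected input size.
    raw = tf.io.read_file(image_path)
    return tf.image.resize(tf.image.decode_jpeg(raw, channels=3), size=(299, 299))

batch_size = 256
images_dataset = (
    tf.data.Dataset.from_tensor_slices(image_paths)
    .map(read_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(batch_size)
)
# vision_encoder comes from the training step above.
image_embeddings = vision_encoder.predict(images_dataset, verbose=1)
print(f"Image embeddings shape: {image_embeddings.shape}.")
```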
Retrieve relevant images
In this example, we use exact matching by computing the dot-product similarity between the input query embedding and the image embeddings, and retrieve the top k matches. However, approximate similarity matching, using frameworks like ScaNN, Annoy, or Faiss, is preferred in real-time use cases to scale with a large number of images.
Set the `query` variable to the type of images you want to search for. Try things like: 'a plate of healthy food', 'a woman wearing a hat is walking down a sidewalk', 'a bird sits near to the water', or 'wild animals are standing in a field'.
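A sketch of exact-match retrieval; the `find_matches` helper is illustrative and assumes the `text_encoder`, `image_embeddings`, and `image_paths` objects produced earlier:

```python
import tensorflow as tf

def find_matches(image_embeddings, queries, k=9, normalize=True):
    # Embed the natural language queries.
    query_embedding = text_encoder(tf.convert_to_tensor(queries))
    # Optionally normalize so the dot product behaves like cosine similarity.
    if normalize:
        image_embeddings = tf.math.l2_normalize(image_embeddings, axis=1)
        query_embedding = tf.math.l2_normalize(query_embedding, axis=1)
    # Compute query-image similarities and keep the indices of the top k.
    dot_similarity = tf.matmul(query_embedding, image_embeddings, transpose_b=True)
    _, indices = tf.math.top_k(dot_similarity, k)
    indices = indices.numpy()
    # Map the indices back to image paths.
    return [[image_paths[idx] for idx in row] for row in indices]

query = "a plate of healthy food"
matches = find_matches(image_embeddings, [query], k=9)[0]
```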
Evaluate the retrieval quality
To evaluate the dual encoder model, we use the captions as queries. We use the out-of-training-sample images and captions to evaluate the retrieval quality, using top k accuracy. A true prediction is counted if, for a given caption, its associated image is retrieved within the top k matches.
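A sketch of how top-k accuracy could be computed, assuming each caption's ground-truth image path is known and reusing the `find_matches` helper from the retrieval sketch:

```python
import numpy as np

def compute_top_k_accuracy(captions, ground_truth_paths, k=100, batch_size=256):
    hits = 0
    num_batches = int(np.ceil(len(captions) / batch_size))
    for idx in range(num_batches):
        start = idx * batch_size
        end = start + batch_size
        batch_captions = captions[start:end]
        batch_truth = ground_truth_paths[start:end]
        # Retrieve the top-k image paths for each caption in the batch.
        results = find_matches(image_embeddings, batch_captions, k=k)
        # Count a hit when the caption's own image appears among its matches.
        hits += sum(truth in matches for truth, matches in zip(batch_truth, results))
    return hits / len(captions)
```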
Final remarks
You can obtain better results by increasing the size of the training sample, training for more epochs, exploring other base encoders for images and text, setting the base encoders to be trainable, and tuning the hyperparameters, especially the `temperature` for the softmax used in the loss computation.