Image classification with Vision Transformer
Author: Khalid Salama
Date created: 2021/01/18
Last modified: 2021/01/18
Description: Implementing the Vision Transformer (ViT) model for image classification.
Introduction
This example implements the Vision Transformer (ViT) model by Alexey Dosovitskiy et al. for image classification, and demonstrates it on the CIFAR-100 dataset. The ViT model applies the Transformer architecture with self-attention to sequences of image patches, without using convolution layers.
Setup
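The original code cells are not included in this extract. A minimal setup sketch, assuming a TensorFlow/Keras environment, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
```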
Prepare the data
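A sketch of the data-loading step, assuming the CIFAR-100 dataset shipped with Keras:

```python
num_classes = 100
input_shape = (32, 32, 3)

# CIFAR-100: 50,000 training and 10,000 test images of shape 32x32x3.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

print(f"x_train shape: {x_train.shape} - y_train shape: {y_train.shape}")
print(f"x_test shape: {x_test.shape} - y_test shape: {y_test.shape}")
```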
Configure the hyperparameters
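The exact values are not shown in this extract; an illustrative configuration, consistent with the patch-based setup described below, could be:

```python
learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
num_epochs = 100
image_size = 72  # images are resized to this size before patching
patch_size = 6  # size of the square patches extracted from each image
num_patches = (image_size // patch_size) ** 2
projection_dim = 64
num_heads = 4
transformer_units = [projection_dim * 2, projection_dim]  # Transformer MLP sizes
transformer_layers = 8
mlp_head_units = [2048, 1024]  # dense layer sizes in the classifier head
```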
Use data augmentation
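A sketch of a Keras preprocessing pipeline, assuming the data and hyperparameters above:

```python
data_augmentation = keras.Sequential(
    [
        layers.Normalization(),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)
# Compute the mean and variance of the training data for normalization.
data_augmentation.layers[0].adapt(x_train)
```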
Implement multilayer perceptron (MLP)
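A small helper that stacks Dense, GELU, and Dropout layers; a possible sketch:

```python
def mlp(x, hidden_units, dropout_rate):
    # Dense + GELU + Dropout stack, used inside the Transformer blocks
    # and in the classifier head.
    for units in hidden_units:
        x = layers.Dense(units, activation=tf.nn.gelu)(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
```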
Implement patch creation as a layer
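One way to implement this, assuming `tf.image.extract_patches` for splitting each image into non-overlapping patches:

```python
class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        # Extract non-overlapping patch_size x patch_size patches.
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the spatial grid of patches into a sequence.
        patches = tf.reshape(patches, [batch_size, -1, patch_dims])
        return patches
```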
Let's display patches for a sample image
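A possible visualization with matplotlib, assuming the `Patches` layer and hyperparameters defined above:

```python
plt.figure(figsize=(4, 4))
image = x_train[np.random.choice(range(x_train.shape[0]))]
plt.imshow(image.astype("uint8"))
plt.axis("off")

resized_image = tf.image.resize(
    tf.convert_to_tensor([image]), size=(image_size, image_size)
)
patches = Patches(patch_size)(resized_image)
print(f"Image size: {image_size} x {image_size}")
print(f"Patch size: {patch_size} x {patch_size}")
print(f"Patches per image: {patches.shape[1]}")

# Draw the patches on a square grid.
n = int(np.sqrt(patches.shape[1]))
plt.figure(figsize=(4, 4))
for i, patch in enumerate(patches[0]):
    ax = plt.subplot(n, n, i + 1)
    patch_img = tf.reshape(patch, (patch_size, patch_size, 3))
    plt.imshow(patch_img.numpy().astype("uint8"))
    plt.axis("off")
```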
Implement the patch encoding layer
The PatchEncoder layer linearly transforms a patch by projecting it into a vector of size projection_dim. In addition, it adds a learnable position embedding to the projected vector.
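A sketch of such a layer, assuming the `Patches` output and the `projection_dim` defined above:

```python
class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        # Linear projection of the flattened patches.
        self.projection = layers.Dense(units=projection_dim)
        # Learnable position embedding, one vector per patch position.
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        # Project each patch and add its position embedding.
        return self.projection(patch) + self.position_embedding(positions)
```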
Build the ViT model
The ViT model consists of multiple Transformer blocks, which use the layers.MultiHeadAttention layer as a self-attention mechanism applied to the sequence of patches. The Transformer blocks produce a [batch_size, num_patches, projection_dim] tensor, which is processed via a classifier head with softmax to produce the final class probabilities.
Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten() and used as the image representation input to the classifier head. Note that the layers.GlobalAveragePooling1D layer could be used instead to aggregate the outputs of the Transformer block, especially when the number of patches and the projection dimensions are large.
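Putting the pieces together, a sketch of the model-building function following the description above (layer normalization, multi-head self-attention, skip connections, MLP blocks, and a flattened representation feeding the classifier head):

```python
def create_vit_classifier():
    inputs = layers.Input(shape=input_shape)
    # Augment the images, split them into patches, and encode the patches.
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    # Stack of Transformer blocks.
    for _ in range(transformer_layers):
        # Layer normalization 1 + multi-head self-attention.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        # Skip connection 1.
        x2 = layers.Add()([attention_output, encoded_patches])
        # Layer normalization 2 + MLP.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        # Skip connection 2.
        encoded_patches = layers.Add()([x3, x2])

    # Flatten the [batch_size, num_patches, projection_dim] tensor
    # into the image representation.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    # Classifier head with softmax over the 100 CIFAR-100 classes.
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    outputs = layers.Dense(num_classes, activation="softmax")(features)
    return keras.Model(inputs=inputs, outputs=outputs)
```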
Compile, train, and evaluate the model
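A sketch of the training step, assuming a Keras version that provides keras.optimizers.AdamW and the softmax output defined above:

```python
def run_experiment(model):
    optimizer = keras.optimizers.AdamW(
        learning_rate=learning_rate, weight_decay=weight_decay
    )
    model.compile(
        optimizer=optimizer,
        loss=keras.losses.SparseCategoricalCrossentropy(),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )
    history = model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
    )
    _, accuracy, top_5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {round(accuracy * 100, 2)}%")
    print(f"Test top-5 accuracy: {round(top_5_accuracy * 100, 2)}%")
    return history


vit_classifier = create_vit_classifier()
history = run_experiment(vit_classifier)
```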
After 100 epochs, the ViT model achieves around 55% accuracy and 82% top-5 accuracy on the test data. These are not competitive results on the CIFAR-100 dataset, as a ResNet50V2 trained from scratch on the same data can achieve 67% accuracy.
Note that the state-of-the-art results reported in the paper are achieved by pre-training the ViT model on the JFT-300M dataset and then fine-tuning it on the target dataset. To improve the model quality without pre-training, you can try training the model for more epochs, using a larger number of Transformer layers, resizing the input images, changing the patch size, or increasing the projection dimensions. In addition, as mentioned in the paper, the quality of the model is affected not only by architecture choices, but also by parameters such as the learning rate schedule, optimizer, and weight decay. In practice, it's recommended to fine-tune a ViT model that was pre-trained on a large, high-resolution dataset.