Multiclass semantic segmentation using DeepLabV3+
Author: Soumik Rakshit
Date created: 2021/08/31
Last modified: 2024/01/05
Description: Implement the DeepLabV3+ architecture for multi-class semantic segmentation.
Introduction
Semantic segmentation, whose goal is to assign semantic labels to every pixel in an image, is an essential computer vision task. In this example, we implement the DeepLabV3+ model for multi-class semantic segmentation, a fully-convolutional architecture that performs well on semantic segmentation benchmarks.
Downloading the data
We will use the Crowd Instance-level Human Parsing Dataset for training our model. The Crowd Instance-level Human Parsing (CIHP) dataset has 38,280 diverse human images. Each image in CIHP is labeled with pixel-wise annotations for 20 categories, as well as instance-level identification. This dataset can be used for the "human part segmentation" task.
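For reference, here is a minimal sketch of one way to fetch and unpack the archive with keras.utils.get_file. The URL below is a placeholder, not the dataset's actual location, and the extracted layout may differ from your copy:

```python
import keras

# NOTE: placeholder URL -- point this at wherever you host the CIHP archive.
data_dir = keras.utils.get_file(
    "instance-level-human-parsing.zip",
    origin="https://example.com/instance-level_human_parsing.zip",
    extract=True,
)
```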
Creating a TensorFlow Dataset
Training on the entire CIHP dataset of 38,280 images would take a long time, so in this example we use a smaller subset of 200 images to train our model.
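A sketch of the input pipeline, under the assumption that the extracted archive contains Images/ and Category_ids/ subdirectories; the DATA_DIR path, the batch size, and the normalization to [-1, 1] are our choices, not prescribed by the dataset:

```python
import os
from glob import glob

import tensorflow as tf

IMAGE_SIZE = 512
BATCH_SIZE = 4
NUM_CLASSES = 20
DATA_DIR = "instance-level_human_parsing/Training"  # assumed layout
NUM_TRAIN_IMAGES = 200
NUM_VAL_IMAGES = 50

all_images = sorted(glob(os.path.join(DATA_DIR, "Images/*.jpg")))
all_masks = sorted(glob(os.path.join(DATA_DIR, "Category_ids/*.png")))
train_images = all_images[:NUM_TRAIN_IMAGES]
train_masks = all_masks[:NUM_TRAIN_IMAGES]
val_images = all_images[NUM_TRAIN_IMAGES : NUM_TRAIN_IMAGES + NUM_VAL_IMAGES]
val_masks = all_masks[NUM_TRAIN_IMAGES : NUM_TRAIN_IMAGES + NUM_VAL_IMAGES]


def read_image(path, mask=False):
    image = tf.io.read_file(path)
    if mask:
        # Masks hold integer class ids: decode a single channel and resize
        # with nearest-neighbour so no new ids are interpolated into being.
        image = tf.image.decode_png(image, channels=1)
        image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE], method="nearest")
    else:
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, [IMAGE_SIZE, IMAGE_SIZE])
        image = image / 127.5 - 1  # scale pixels to [-1, 1]
    return image


def load_data(image_path, mask_path):
    return read_image(image_path), read_image(mask_path, mask=True)


def data_generator(image_paths, mask_paths):
    dataset = tf.data.Dataset.from_tensor_slices((image_paths, mask_paths))
    dataset = dataset.map(load_data, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.batch(BATCH_SIZE, drop_remainder=True)


train_dataset = data_generator(train_images, train_masks)
val_dataset = data_generator(val_images, val_masks)
```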
Building the DeepLabV3+ model
DeepLabv3+ extends DeepLabv3 by adding an encoder-decoder structure. The encoder module processes multiscale contextual information by applying dilated convolution at multiple scales, while the decoder module refines the segmentation results along object boundaries.
Dilated convolution: With dilated convolution, as we go deeper in the network we can keep the stride constant while enlarging the field-of-view, without increasing the number of parameters or the amount of computation. It also yields larger output feature maps, which is useful for semantic segmentation.
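To make this concrete: a 3×3 kernel with dilation_rate=2 covers a 5×5 region of its input while keeping the same nine weights. In Keras, dilation is just a constructor argument:

```python
from keras import layers

# A 3x3 kernel with dilation_rate=2 samples inputs two pixels apart,
# covering a 5x5 receptive field at no extra parameter cost.
dilated_conv = layers.Conv2D(
    filters=256, kernel_size=3, dilation_rate=2, padding="same", use_bias=False
)
```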
The motivation for using Dilated Spatial Pyramid Pooling is that, as the sampling rate becomes larger, the number of valid filter weights (i.e., weights applied to the valid feature region rather than to padded zeros) becomes smaller; probing the feature map at several dilation rates in parallel counteracts this, as in the sketch below.
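Here is a sketch of a Dilated Spatial Pyramid Pooling block in this spirit: parallel dilated convolutions at several rates plus image-level average pooling, concatenated and fused with a 1×1 convolution. The rates (1, 6, 12, 18) follow the usual ASPP configuration and are an assumption on our part:

```python
import keras
from keras import layers


def convolution_block(block_input, num_filters=256, kernel_size=3, dilation_rate=1):
    # Conv -> BatchNorm -> ReLU, the basic unit reused throughout the model.
    x = layers.Conv2D(
        num_filters,
        kernel_size=kernel_size,
        dilation_rate=dilation_rate,
        padding="same",
        use_bias=False,
        kernel_initializer=keras.initializers.HeNormal(),
    )(block_input)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)


def dilated_spatial_pyramid_pooling(dspp_input):
    dims = dspp_input.shape
    # Image-level features: pool over the whole map, project, upsample back.
    x = layers.AveragePooling2D(pool_size=(dims[1], dims[2]))(dspp_input)
    x = convolution_block(x, kernel_size=1)
    out_pool = layers.UpSampling2D(
        size=(dims[1], dims[2]), interpolation="bilinear"
    )(x)

    # Parallel branches with increasing dilation rates.
    out_1 = convolution_block(dspp_input, kernel_size=1, dilation_rate=1)
    out_6 = convolution_block(dspp_input, kernel_size=3, dilation_rate=6)
    out_12 = convolution_block(dspp_input, kernel_size=3, dilation_rate=12)
    out_18 = convolution_block(dspp_input, kernel_size=3, dilation_rate=18)

    x = layers.Concatenate(axis=-1)([out_pool, out_1, out_6, out_12, out_18])
    return convolution_block(x, kernel_size=1)
```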
The encoder features are first bilinearly upsampled by a factor of 4, and then concatenated with the corresponding low-level features from the network backbone that have the same spatial resolution. For this example, we use a ResNet50 pretrained on ImageNet as the backbone model, and we use the low-level features from the conv4_block6_2_relu block of the backbone.
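Putting the pieces together, here is a sketch of how the full model can be assembled from the helpers above. conv4_block6_2_relu (stride 16) feeds the DSPP encoder; for the stride-4 decoder branch we tap conv2_block3_2_relu, an assumption on our part that matches the spatial resolutions involved:

```python
def DeeplabV3Plus(image_size, num_classes):
    model_input = keras.Input(shape=(image_size, image_size, 3))
    # ImageNet-pretrained ResNet50 backbone, without the classification head.
    resnet50 = keras.applications.ResNet50(
        weights="imagenet", include_top=False, input_tensor=model_input
    )
    # Encoder: DSPP on top of the conv4_block6_2_relu feature map (stride 16).
    x = resnet50.get_layer("conv4_block6_2_relu").output
    x = dilated_spatial_pyramid_pooling(x)

    # Decoder: upsample encoder features by 4x ...
    input_a = layers.UpSampling2D(
        size=(image_size // 4 // x.shape[1], image_size // 4 // x.shape[2]),
        interpolation="bilinear",
    )(x)
    # ... and fuse them with stride-4 low-level backbone features.
    input_b = resnet50.get_layer("conv2_block3_2_relu").output
    input_b = convolution_block(input_b, num_filters=48, kernel_size=1)

    x = layers.Concatenate(axis=-1)([input_a, input_b])
    x = convolution_block(x)
    x = convolution_block(x)
    # Final 4x upsampling back to input resolution, then per-pixel logits.
    x = layers.UpSampling2D(
        size=(image_size // x.shape[1], image_size // x.shape[2]),
        interpolation="bilinear",
    )(x)
    model_output = layers.Conv2D(num_classes, kernel_size=1, padding="same")(x)
    return keras.Model(inputs=model_input, outputs=model_output)


model = DeeplabV3Plus(image_size=IMAGE_SIZE, num_classes=NUM_CLASSES)
```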
Training
We train the model using sparse categorical crossentropy as the loss function, and Adam as the optimizer.
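A sketch of the corresponding training setup; from_logits=True because the model's final layer emits raw logits, and the learning rate and epoch count are illustrative choices:

```python
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
history = model.fit(train_dataset, validation_data=val_dataset, epochs=25)
```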
Inference using Colormap Overlay
The raw predictions from the model represent a one-hot encoded tensor of shape (N, 512, 512, 20), where each of the 20 channels is a binary mask corresponding to a predicted label. To visualize the results, we plot them as RGB segmentation masks, where each pixel is represented by a unique color corresponding to the predicted label. We can find the color corresponding to each label in the human_colormap.mat file provided as part of the dataset. We also plot an overlay of the RGB segmentation mask on the input image, as this helps us identify the different categories present in the image more intuitively.
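A sketch of the visualization step, assuming the colormap is read with scipy.io.loadmat and the overlay is blended with OpenCV's cv2.addWeighted; the file path, the ×100 scaling, and the 0.35/0.65 blend weights are assumptions:

```python
import cv2
import numpy as np
from scipy.io import loadmat

# Assumed path inside the extracted dataset; the .mat stores fractional
# RGB values, scaled here to 8-bit intensities.
colormap = loadmat("instance-level_human_parsing/human_colormap.mat")["colormap"]
colormap = (colormap * 100).astype(np.uint8)


def infer(model, image_tensor):
    # Collapse the 20 per-class channels into a single integer label map.
    predictions = model.predict(np.expand_dims(image_tensor, axis=0))
    return np.argmax(np.squeeze(predictions), axis=-1)


def decode_segmentation_mask(mask, colormap, n_classes):
    # Paint each predicted label with its colormap entry.
    rgb = np.zeros((*mask.shape, 3), dtype=np.uint8)
    for label in range(n_classes):
        rgb[mask == label] = colormap[label]
    return rgb


def get_overlay(image, colored_mask):
    # `image` is a float array in [-1, 1]: undo the normalization,
    # then blend the colored mask over it.
    image = ((image + 1) * 127.5).astype(np.uint8)
    return cv2.addWeighted(image, 0.35, colored_mask, 0.65, 0)
```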
Inference on Train Images
Inference on Validation Images
You can use the trained model hosted on Hugging Face Hub and try the demo on Hugging Face Spaces.