Object Detection with KerasHub
Authors: Sachin Prasad, Siva Sravana Kumar Neeli
Date created: 2026/03/27
Last modified: 2026/03/27
Description: RetinaNet Object Detection: Training, Fine-tuning, and Inference.

Introduction
Object detection is a crucial computer vision task that goes beyond simple image classification. It requires models to not only identify the types of objects present in an image but also pinpoint their locations using bounding boxes. This dual requirement of classification and localization makes object detection a more complex and powerful tool. Object detection models are broadly classified into two categories: "two-stage" and "single-stage" detectors. Two-stage detectors often achieve higher accuracy by first proposing regions of interest and then classifying them. However, this approach can be computationally expensive. Single-stage detectors, on the other hand, aim for speed by directly predicting object classes and bounding boxes in a single pass.
In this tutorial, we'll be diving into RetinaNet, a powerful object detection model known for its speed and precision. RetinaNet is a single-stage detector, a design choice that allows it to be remarkably efficient. Its impressive performance stems from two key architectural innovations:
Feature Pyramid Network (FPN): FPN equips RetinaNet with the ability to seamlessly detect objects of all scales, from distant, tiny instances to large, prominent ones.
Focal Loss: This ingenious loss function tackles the common challenge of imbalanced data by focusing the model's learning on the most crucial and challenging object examples, leading to enhanced accuracy without compromising speed.
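To make the focal loss idea concrete, here is a minimal pure-Python sketch of the binary focal loss from the RetinaNet paper, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The function names are illustrative, and the defaults alpha = 0.25 and gamma = 2 are the values recommended in the paper; they are assumptions here and may not match the loss configuration KerasHub uses.

```python
import math


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for a single prediction.

    p: predicted probability of the positive class (0 < p < 1).
    y: ground-truth label, 1 (object) or 0 (background).
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain binary cross-entropy, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


# An easy background anchor (confidently correct) versus a hard one.
easy = binary_focal_loss(0.01, 0)
hard = binary_focal_loss(0.90, 0)
print(f"easy: focal={easy:.6f}  ce={cross_entropy(0.01, 0):.6f}")
print(f"hard: focal={hard:.6f}  ce={cross_entropy(0.90, 0):.6f}")
```

The easy example contributes almost nothing to the focal loss, while the hard example keeps most of its cross-entropy weight. This is how the many easy background anchors are prevented from dominating training.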

Setup and Imports
Let's install the dependencies and import the necessary modules.
To run this tutorial, you will need to install the following packages:
keras-hub
keras
opencv-python
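Before running the rest of the tutorial, it can help to confirm that the packages are importable. The sketch below uses only the standard library; the helper name `missing_packages` and the mapping are hypothetical, and the import names (`keras_hub`, `keras`, `cv2`) are the usual ones for these pip packages.

```python
import importlib.util

# pip name -> import name (opencv-python is imported as cv2).
REQUIRED = {"keras-hub": "keras_hub", "keras": "keras", "opencv-python": "cv2"}


def missing_packages(required=REQUIRED):
    """Return the pip names of required packages that cannot be imported."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```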
Load the dataset
Let's load the training data. Here, we load both the VOC 2007 and 2012 datasets and split them into training and validation sets.
460032000/460032000 ━━━━━━━━━━━━━━━━━━━━ 16s 0us/step
I0000 00:00:1775002057.754358 2381 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38482 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0
Downloading data from http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
1999639040/1999639040 ━━━━━━━━━━━━━━━━━━━━ 65s 0us/step
Downloading data from http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
451020800/451020800 ━━━━━━━━━━━━━━━━━━━━ 15s 0us/step
Preprocessor: "retina_net_object_detector_preprocessor"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                                                  ┃                                   Config ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ retina_net_image_converter (RetinaNetImageConverter)          │                   Image size: (800, 800) │
└───────────────────────────────────────────────────────────────┴──────────────────────────────────────────┘
Model: "retina_net_object_detector"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)                  ┃ Output Shape              ┃         Param # ┃ Connected to               ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ images (InputLayer)           │ (None, None, None, 3)     │               0 │ -                          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ retina_net_backbone           │ [(None, None, None, 256), │      27,429,824 │ images[0][0]               │
│ (RetinaNetBackbone)           │  (None, None, None, 256), │                 │                            │
│                               │  (None, None, None, 256), │                 │                            │
│                               │  (None, None, None, 256), │                 │                            │
│                               │  (None, None, None, 256)] │                 │                            │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_head (PredictionHead)     │ (None, None, None, 36)    │       2,443,300 │ retina_net_backbone[0][0], │
│                               │                           │                 │ retina_net_backbone[0][1], │
│                               │                           │                 │ retina_net_backbone[0][2], │
│                               │                           │                 │ retina_net_backbone[0][3], │
│                               │                           │                 │ retina_net_backbone[0][4]  │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ classification_head           │ (None, None, None, 819)   │       4,248,115 │ retina_net_backbone[0][0], │
│ (PredictionHead)              │                           │                 │ retina_net_backbone[0][1], │
│                               │                           │                 │ retina_net_backbone[0][2], │
│                               │                           │                 │ retina_net_backbone[0][3], │
│                               │                           │                 │ retina_net_backbone[0][4]  │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_pred_P3 (Reshape)         │ (None, None, 4)           │               0 │ box_head[0][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_pred_P4 (Reshape)         │ (None, None, 4)           │               0 │ box_head[1][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_pred_P5 (Reshape)         │ (None, None, 4)           │               0 │ box_head[2][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_pred_P6 (Reshape)         │ (None, None, 4)           │               0 │ box_head[3][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ box_pred_P7 (Reshape)         │ (None, None, 4)           │               0 │ box_head[4][0]             │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ cls_pred_P3 (Reshape)         │ (None, None, 91)          │               0 │ classification_head[0][0]  │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ cls_pred_P4 (Reshape)         │ (None, None, 91)          │               0 │ classification_head[1][0]  │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ cls_pred_P5 (Reshape)         │ (None, None, 91)          │               0 │ classification_head[2][0]  │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ cls_pred_P6 (Reshape)         │ (None, None, 91)          │               0 │ classification_head[3][0]  │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ cls_pred_P7 (Reshape)         │ (None, None, 91)          │               0 │ classification_head[4][0]  │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ bbox_regression (Concatenate) │ (None, None, 4)           │               0 │ box_pred_P3[0][0],         │
│                               │                           │                 │ box_pred_P4[0][0],         │
│                               │                           │                 │ box_pred_P5[0][0],         │
│                               │                           │                 │ box_pred_P6[0][0],         │
│                               │                           │                 │ box_pred_P7[0][0]          │
├───────────────────────────────┼───────────────────────────┼─────────────────┼────────────────────────────┤
│ cls_logits (Concatenate)      │ (None, None, 91)          │               0 │ cls_pred_P3[0][0],         │
│                               │                           │                 │ cls_pred_P4[0][0],         │
│                               │                           │                 │ cls_pred_P5[0][0],         │
│                               │                           │                 │ cls_pred_P6[0][0],         │
│                               │                           │                 │ cls_pred_P7[0][0]          │
└───────────────────────────────┴───────────────────────────┴─────────────────┴────────────────────────────┘
Total params: 34,121,239 (130.16 MB)
Trainable params: 34,068,119 (129.96 MB)
Non-trainable params: 53,120 (207.50 KB)
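The numbers in this summary can be cross-checked with a little arithmetic. The sketch below assumes float32 weights (4 bytes per parameter) and 9 anchors per feature-map location (3 scales x 3 aspect ratios, the standard RetinaNet setting); neither is stated in the summary, but both are consistent with the reported sizes and with the 36- and 819-channel head outputs.

```python
BYTES_PER_PARAM = 4  # assumption: float32 weights

total_params = 34_121_239
trainable_params = 34_068_119
non_trainable_params = total_params - trainable_params


def to_mib(num_params):
    """Size in MiB (which Keras labels "MB") of float32 parameters."""
    return num_params * BYTES_PER_PARAM / 1024**2


print(f"total:         {to_mib(total_params):.2f} MB")
print(f"trainable:     {to_mib(trainable_params):.2f} MB")
print(f"non-trainable: {non_trainable_params * BYTES_PER_PARAM / 1024:.2f} KB")

# Head output channels = anchors per location x values per anchor.
anchors_per_location = 9  # assumption: 3 scales x 3 aspect ratios
num_classes = 91  # from the cls_pred_* shapes in the summary
box_channels = anchors_per_location * 4  # 4 box coordinates per anchor
cls_channels = anchors_per_location * num_classes  # one logit per class
print(box_channels, cls_channels)
```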
Preprocessing Layers
Let's define the following preprocessing layers:
Resizing Layer: Resizes the image and maintains the aspect ratio by applying padding when pad_to_aspect_ratio=True. It also sets the default bounding box format for representing the data.
Max Bounding Box Layer: Limits the maximum number of bounding boxes per image.
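The effect of these two layers on bounding boxes can be sketched in plain Python. This is an illustration of the idea, not the Keras implementation: the helper names are hypothetical, boxes are assumed to be absolute (ymin, xmin, ymax, xmax), padding is assumed to be split evenly between the two sides, and missing boxes are assumed to be filled with -1.

```python
def resize_with_padding(height, width, target_height, target_width):
    """Scale factor and padding that fit an image into the target size
    without changing its aspect ratio."""
    scale = min(target_height / height, target_width / width)
    new_h, new_w = round(height * scale), round(width * scale)
    # Assumption: leftover space is split evenly between the two sides.
    pad_top = (target_height - new_h) // 2
    pad_left = (target_width - new_w) // 2
    return scale, pad_top, pad_left


def transform_box(box, scale, pad_top, pad_left):
    """Map an absolute (ymin, xmin, ymax, xmax) box into the resized image."""
    ymin, xmin, ymax, xmax = box
    return (
        ymin * scale + pad_top,
        xmin * scale + pad_left,
        ymax * scale + pad_top,
        xmax * scale + pad_left,
    )


def limit_boxes(boxes, labels, max_number, fill_value=-1):
    """Truncate or pad so every image carries exactly max_number boxes."""
    boxes = list(boxes[:max_number])
    labels = list(labels[:max_number])
    missing = max_number - len(boxes)
    boxes += [(fill_value,) * 4] * missing
    labels += [fill_value] * missing
    return boxes, labels


# A 375x500 image resized into an 800x800 canvas.
scale, pad_top, pad_left = resize_with_padding(375, 500, 800, 800)
box = transform_box((100, 50, 300, 450), scale, pad_top, pad_left)
boxes, labels = limit_boxes([box], [11], max_number=3)
print(scale, pad_top, pad_left)
print(boxes, labels)
```

Fixing the number of boxes per image gives every sample the same shape, which is what allows samples to be stacked into batches.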
Predict and Visualize
Next, let's load an image, run it through the preprocessing pipeline defined in the preprocessing layers step, and visualize the predictions from our object detector.
1/1 ━━━━━━━━━━━━━━━━━━━━ 8s 8s/step
Now concatenate both 2007 and 2012 VOC data
Load the eval data
Let's visualize a batch of training data

Decode TFDS records to a tuple for KerasHub
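As a rough illustration of this decoding step, the sketch below converts a TFDS-style record into an `(image, bounding_boxes)` tuple, where `bounding_boxes` is a dictionary with `"boxes"` and `"labels"` keys. It uses nested Python lists instead of tensors so that it runs without any dependencies. The function name is hypothetical, and it assumes the record stores boxes as relative (ymin, xmin, ymax, xmax) values in [0, 1] and that the target format is absolute yxyx.

```python
def decode_record(record):
    """Turn a TFDS-style VOC record into an (image, bounding_boxes) tuple.

    Assumes record["image"] is a nested list shaped (height, width, 3) and
    record["objects"]["bbox"] holds relative (ymin, xmin, ymax, xmax) boxes.
    """
    image = record["image"]
    height, width = len(image), len(image[0])
    # Convert relative coordinates to absolute pixel coordinates.
    boxes = [
        (ymin * height, xmin * width, ymax * height, xmax * width)
        for ymin, xmin, ymax, xmax in record["objects"]["bbox"]
    ]
    labels = list(record["objects"]["label"])
    return image, {"boxes": boxes, "labels": labels}


# A tiny fake record: a 4x6 black image containing one object.
fake_record = {
    "image": [[[0, 0, 0] for _ in range(6)] for _ in range(4)],
    "objects": {"bbox": [(0.25, 0.5, 0.75, 1.0)], "label": [14]},
}
image, bounding_boxes = decode_record(fake_record)
print(bounding_boxes)
```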
Configure RetinaNet Model
Configure the model with backbone, num_classes and preprocessor. Use callbacks for recording logs and saving checkpoints.
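The checkpoint file names that appear in the training logs (for example `0001-0.32.weights.h5`) follow from a file path template with named fields, which Keras's `ModelCheckpoint` callback fills in with the epoch number and logged metrics. The template below is inferred from those file names and is an assumption; the helper `checkpoint_path` is hypothetical and only mimics the formatting step.

```python
# Hypothetical template, inferred from the file names in the training logs.
CHECKPOINT_TEMPLATE = "fine_tuning/weights/{epoch:04d}-{val_loss:.2f}.weights.h5"


def checkpoint_path(epoch, val_loss, template=CHECKPOINT_TEMPLATE):
    """Fill in the template the way a checkpoint callback would."""
    return template.format(epoch=epoch, val_loss=val_loss)


print(checkpoint_path(1, 0.31972))
print(checkpoint_path(5, 0.17988))
```

Embedding the validation loss in the file name makes it easy to pick the best checkpoint later without re-reading the logs.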
Load backbone weights and preprocessor config
Let's initialize the backbone from the "retinanet_resnet50_fpn_coco" pretrained weights and reuse the preprocessor configuration of the same preset. We then define a RetinaNet object detector with this backbone and preprocessor, setting num_classes to 20 to match the object categories in Pascal VOC. Finally, we compile the model using Mean Absolute Error (MAE) as the box loss.
Train the model
Now that the object detector model is compiled, let's train it using the training and validation data we created earlier. For demonstration purposes, we have used a small number of epochs. You can increase the number of epochs to achieve better results.
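Note: The logs below come from a run on an NVIDIA A100 GPU (as reported in the device log in the dataset loading output), where each epoch takes roughly 8 to 9 minutes. Training for 5 epochs on a T4 GPU takes approximately 7 hours.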
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 110ms/step - bbox_regression_loss: 1.1873 - cls_logits_loss: 95.7444 - loss: 96.9318
Epoch 1: val_loss improved from None to 0.31972, saving model to fine_tuning/weights/0001-0.32.weights.h5
Epoch 1: finished saving model to fine_tuning/weights/0001-0.32.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 534s 119ms/step - bbox_regression_loss: 0.4609 - cls_logits_loss: 13.6850 - loss: 14.1459 - val_bbox_regression_loss: 0.1833 - val_cls_logits_loss: 0.1364 - val_loss: 0.3197
Epoch 2/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 110ms/step - bbox_regression_loss: 0.1946 - cls_logits_loss: 0.1243 - loss: 0.3189
Epoch 2: val_loss improved from 0.31972 to 0.25071, saving model to fine_tuning/weights/0002-0.25.weights.h5
Epoch 2: finished saving model to fine_tuning/weights/0002-0.25.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 491s 119ms/step - bbox_regression_loss: 0.1863 - cls_logits_loss: 0.1163 - loss: 0.3026 - val_bbox_regression_loss: 0.1518 - val_cls_logits_loss: 0.0989 - val_loss: 0.2507
Epoch 3/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 111ms/step - bbox_regression_loss: 0.1741 - cls_logits_loss: 0.0943 - loss: 0.2684
Epoch 3: val_loss improved from 0.25071 to 0.20826, saving model to fine_tuning/weights/0003-0.21.weights.h5
Epoch 3: finished saving model to fine_tuning/weights/0003-0.21.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 495s 120ms/step - bbox_regression_loss: 0.1695 - cls_logits_loss: 0.0902 - loss: 0.2597 - val_bbox_regression_loss: 0.1298 - val_cls_logits_loss: 0.0784 - val_loss: 0.2083
Epoch 4/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 110ms/step - bbox_regression_loss: 0.1553 - cls_logits_loss: 0.0727 - loss: 0.2280
Epoch 4: val_loss improved from 0.20826 to 0.20306, saving model to fine_tuning/weights/0004-0.20.weights.h5
Epoch 4: finished saving model to fine_tuning/weights/0004-0.20.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 490s 118ms/step - bbox_regression_loss: 0.1486 - cls_logits_loss: 0.0701 - loss: 0.2187 - val_bbox_regression_loss: 0.1437 - val_cls_logits_loss: 0.0593 - val_loss: 0.2031
Epoch 5/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 111ms/step - bbox_regression_loss: 0.1297 - cls_logits_loss: 0.0566 - loss: 0.1863
Epoch 5: val_loss improved from 0.20306 to 0.17988, saving model to fine_tuning/weights/0005-0.18.weights.h5
Epoch 5: finished saving model to fine_tuning/weights/0005-0.18.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 492s 119ms/step - bbox_regression_loss: 0.1269 - cls_logits_loss: 0.0547 - loss: 0.1817 - val_bbox_regression_loss: 0.1297 - val_cls_logits_loss: 0.0501 - val_loss: 0.1799
<keras.src.callbacks.history.History at 0x7f1cbb73a910>
Plot the predictions

Custom training object detector
Additionally, you can customize the object detector by modifying the image converter, selecting a different image encoder, etc.
Image Converter
The RetinaNetImageConverter class prepares images for use with the RetinaNet object detection model. Here's what it does:
Scaling and Offsetting
ImageNet Normalization
Resizing
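The first two steps can be sketched for a single RGB pixel as follows. The mean and standard deviation are the standard ImageNet channel statistics; the function name and the default `scale=1/255` and `offset=0.0` are assumptions for illustration and may not match the values stored in a given preset. Resizing is omitted because it operates on whole images rather than individual pixels.

```python
# Standard ImageNet channel statistics for RGB values scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def convert_pixel(rgb, scale=1 / 255, offset=0.0):
    """Apply scaling/offsetting, then ImageNet normalization, to one pixel."""
    scaled = [channel * scale + offset for channel in rgb]
    return [
        (value - mean) / std
        for value, mean, std in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)
    ]


print(convert_pixel((0, 0, 0)))  # black pixel
print(convert_pixel((255, 255, 255)))  # white pixel
```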
Image Encoder and RetinaNet Backbone
The image encoder, while typically initialized with pre-trained weights (e.g., from ImageNet), can also be instantiated without them. This results in the image encoder (and, consequently, the entire object detection network built upon it) having randomly initialized weights.
Here we load a pre-trained ResNet50 model. This will serve as the base for extracting image features.
We then build the RetinaNet Feature Pyramid Network (FPN) on top of the ResNet50 backbone. The FPN creates multi-scale feature maps for better object detection at different sizes.
Note: use_p5: If True, the P5 output of the feature pyramid is used as input to the additional convolutional layers that create the coarser feature maps (P6, P7). If False, the feature map from the last layer of the image encoder, which is the direct input from which P5 is built, is used instead. Defaults to False.
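To get a feel for the multi-scale feature maps, the sketch below computes the spatial size of each pyramid level and the resulting number of anchor boxes for a square input. It assumes that level L has stride 2**L, that odd sizes round up (as with "same" padding), and that each location has 9 anchors; the levels P3 to P7 match the layer names in the model summary, while the helper names are hypothetical.

```python
import math


def pyramid_shapes(image_size, min_level=3, max_level=7):
    """Feature-map side length per pyramid level for a square input.

    Assumes level L has stride 2**L and that odd sizes round up.
    """
    return {
        f"P{level}": math.ceil(image_size / 2**level)
        for level in range(min_level, max_level + 1)
    }


def total_anchors(shapes, anchors_per_location=9):
    """Number of anchor boxes across all levels of the pyramid."""
    return sum(side * side for side in shapes.values()) * anchors_per_location


shapes = pyramid_shapes(800)
print(shapes)
print(total_anchors(shapes))
```

The fine levels (P3, P4) contribute most of the anchors and cover small objects, while the coarse levels (P6, P7) add only a few anchors that cover large objects.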
Train and visualize RetinaNet model
Note: For demonstration purposes, we train the model for only 5 epochs. In a real scenario, you would train for many more epochs (often hundreds) to achieve good results.
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 109ms/step - bbox_regression_loss: 0.2777 - cls_logits_loss: 5.8220 - loss: 6.0997
Epoch 1: val_loss improved from None to 0.28498, saving model to custom_training/weights/0001-0.28.weights.h5
Epoch 1: finished saving model to custom_training/weights/0001-0.28.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 521s 118ms/step - bbox_regression_loss: 0.2125 - cls_logits_loss: 0.8302 - loss: 1.0427 - val_bbox_regression_loss: 0.1502 - val_cls_logits_loss: 0.1348 - val_loss: 0.2850
Epoch 2/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 109ms/step - bbox_regression_loss: 0.1528 - cls_logits_loss: 0.1169 - loss: 0.2697
Epoch 2: val_loss improved from 0.28498 to 0.25430, saving model to custom_training/weights/0002-0.25.weights.h5
Epoch 2: finished saving model to custom_training/weights/0002-0.25.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 486s 118ms/step - bbox_regression_loss: 0.1453 - cls_logits_loss: 0.1176 - loss: 0.2629 - val_bbox_regression_loss: 0.1315 - val_cls_logits_loss: 0.1228 - val_loss: 0.2543
Epoch 3/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 109ms/step - bbox_regression_loss: 0.1255 - cls_logits_loss: 0.0995 - loss: 0.2250
Epoch 3: val_loss improved from 0.25430 to 0.22651, saving model to custom_training/weights/0003-0.23.weights.h5
Epoch 3: finished saving model to custom_training/weights/0003-0.23.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 485s 117ms/step - bbox_regression_loss: 0.1215 - cls_logits_loss: 0.0987 - loss: 0.2202 - val_bbox_regression_loss: 0.1270 - val_cls_logits_loss: 0.0995 - val_loss: 0.2265
Epoch 4/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 109ms/step - bbox_regression_loss: 0.1095 - cls_logits_loss: 0.0803 - loss: 0.1898
Epoch 4: val_loss improved from 0.22651 to 0.18972, saving model to custom_training/weights/0004-0.19.weights.h5
Epoch 4: finished saving model to custom_training/weights/0004-0.19.weights.h5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 485s 117ms/step - bbox_regression_loss: 0.1071 - cls_logits_loss: 0.0801 - loss: 0.1872 - val_bbox_regression_loss: 0.1058 - val_cls_logits_loss: 0.0839 - val_loss: 0.1897
Epoch 5/5
4137/4137 ━━━━━━━━━━━━━━━━━━━━ 0s 109ms/step - bbox_regression_loss: 0.0978 - cls_logits_loss: 0.0663 - loss: 0.1641
Epoch 5: val_loss did not improve from 0.18972
1/1 ━━━━━━━━━━━━━━━━━━━━ 7s 7s/step