Handwriting recognition
Authors: A_K_Nain, Sayak Paul
Date created: 2021/08/16
Last modified: 2024/09/01
Description: Training a handwriting recognition model with variable-length sequences.
Introduction
This example shows how the Captcha OCR example can be extended to the IAM Dataset, which has variable-length ground-truth targets. Each sample in the dataset is an image of some handwritten text, and its corresponding target is the string present in the image. The IAM Dataset is widely used across many OCR benchmarks, so we hope this example can serve as a good starting point for building OCR systems.
Data collection
Preview how the dataset is organized. Lines starting with "#" are just metadata.
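As a rough sketch of what parsing the manifest might look like (the `base_path` extraction directory is an assumption; in IAM's words.txt, the second field of each entry flags segmentation errors):

```python
# Collect usable entries from the IAM words.txt manifest.
base_path = "data"  # assumed location of the extracted archive

words_list = []
for line in open(f"{base_path}/words.txt", "r"):
    if line.startswith("#"):  # skip metadata/comment lines
        continue
    if line.split(" ")[1] != "err":  # skip entries flagged as segmentation errors
        words_list.append(line.strip())

print(f"Total usable samples: {len(words_list)}")
```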
Imports
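The exact import cell is omitted here; a plausible set for this example, assuming TensorFlow's bundled Keras, would be:

```python
import os

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import StringLookup

# Fix the seeds for reproducibility.
np.random.seed(42)
tf.random.set_seed(42)
```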
Dataset splitting
We will split the dataset into three subsets with a 90:5:5 ratio (train:validation:test).
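A minimal sketch of that split over the shuffled `words_list` gathered above:

```python
# 90:5:5 split: 90% train, then the remainder halved into validation/test.
np.random.shuffle(words_list)

split_idx = int(0.9 * len(words_list))
train_samples = words_list[:split_idx]
remainder = words_list[split_idx:]

val_idx = int(0.5 * len(remainder))
validation_samples = remainder[:val_idx]
test_samples = remainder[val_idx:]

assert len(words_list) == (
    len(train_samples) + len(validation_samples) + len(test_samples)
)
```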
Data input pipeline
We start building our data input pipeline by first preparing the image paths.
Then we prepare the ground-truth labels.
Now we clean the validation and the test labels as well.
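A sketch of these steps, deriving image paths from the standard IAM directory layout and keeping the transcription (the last whitespace-separated field) as the cleaned label:

```python
def get_image_paths_and_labels(samples):
    paths, labels = [], []
    for file_line in samples:
        line_split = file_line.strip().split(" ")
        # IAM ids like "a01-000u-00-00" live at words/a01/a01-000u/<id>.png.
        image_name = line_split[0]
        part1 = image_name.split("-")[0]
        part2 = "-".join(image_name.split("-")[:2])
        img_path = os.path.join(base_path, "words", part1, part2, image_name + ".png")
        if os.path.exists(img_path):
            paths.append(img_path)
            labels.append(line_split[-1])  # the transcription field
    return paths, labels


train_img_paths, train_labels = get_image_paths_and_labels(train_samples)
validation_img_paths, validation_labels = get_image_paths_and_labels(validation_samples)
test_img_paths, test_labels = get_image_paths_and_labels(test_samples)
```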
Building the character vocabulary
Keras provides different preprocessing layers to deal with different modalities of data. This guide provides a comprehensive introduction. Our example involves preprocessing labels at the character level. This means that if there are two labels, e.g. "cat" and "dog", then our character vocabulary should be {a, c, d, g, o, t} (without any special tokens). We use the StringLookup layer for this purpose.
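A sketch of the vocabulary construction, with an inverse lookup kept around for decoding predictions later:

```python
# Build the character vocabulary from the training labels only.
characters = sorted(set(char for label in train_labels for char in label))
max_len = max(len(label) for label in train_labels)

# Map characters to integer ids, and ids back to characters.
char_to_num = StringLookup(vocabulary=characters, mask_token=None)
num_to_char = StringLookup(
    vocabulary=char_to_num.get_vocabulary(), mask_token=None, invert=True
)
```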
Resizing images without distortion
Instead of square images, many OCR models work with rectangular images. This will become clearer in a moment when we visualize a few samples from the dataset. While aspect-unaware resizing of square images does not introduce a significant amount of distortion, this is not the case for rectangular images. But resizing images to a uniform size is a requirement for mini-batching. So we need to perform our resizing such that the following criteria are met:
Aspect ratio is preserved.
Content of the images is not affected.
If we just went with plain resizing, the images would look like this:
Notice how this resizing would have introduced unnecessary stretching.
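One way to meet both criteria is to resize with the aspect ratio preserved and then pad the remainder, roughly like this (the final transpose-and-flip makes the width axis the time axis for the recurrent part of the model):

```python
def distortion_free_resize(image, img_size):
    w, h = img_size
    image = tf.image.resize(image, size=(h, w), preserve_aspect_ratio=True)

    # Amount of padding needed to reach exactly (h, w).
    pad_height = h - tf.shape(image)[0]
    pad_width = w - tf.shape(image)[1]

    # Split the padding as evenly as possible on both sides.
    pad_height_top = pad_height // 2 + pad_height % 2
    pad_height_bottom = pad_height // 2
    pad_width_left = pad_width // 2 + pad_width % 2
    pad_width_right = pad_width // 2

    image = tf.pad(
        image,
        paddings=[
            [pad_height_top, pad_height_bottom],
            [pad_width_left, pad_width_right],
            [0, 0],
        ],
    )

    # Transpose + flip so that the time dimension of the downstream RNN
    # corresponds to the image width.
    image = tf.transpose(image, perm=[1, 0, 2])
    image = tf.image.flip_left_right(image)
    return image
```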
Putting the utilities together
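A sketch of the combined preprocessing: read and rescale each image, and pad every label to `max_len` with a sentinel token (`padding_token` and the image dimensions below are illustrative choices):

```python
padding_token = 99  # illustrative sentinel; must lie outside the vocabulary ids
image_width = 128
image_height = 32


def preprocess_image(image_path, img_size=(image_width, image_height)):
    image = tf.io.read_file(image_path)
    image = tf.image.decode_png(image, 1)  # grayscale
    image = distortion_free_resize(image, img_size)
    image = tf.cast(image, tf.float32) / 255.0
    return image


def vectorize_label(label):
    label = char_to_num(tf.strings.unicode_split(label, input_encoding="UTF-8"))
    pad_amount = max_len - tf.shape(label)[0]
    label = tf.pad(label, paddings=[[0, pad_amount]], constant_values=padding_token)
    return label


def process_images_labels(image_path, label):
    return {"image": preprocess_image(image_path), "label": vectorize_label(label)}
```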
Prepare tf.data.Dataset objects
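Then the datasets themselves, batched and prefetched:

```python
batch_size = 64
AUTOTUNE = tf.data.AUTOTUNE


def prepare_dataset(image_paths, labels):
    dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)).map(
        process_images_labels, num_parallel_calls=AUTOTUNE
    )
    return dataset.batch(batch_size).cache().prefetch(AUTOTUNE)


train_ds = prepare_dataset(train_img_paths, train_labels)
validation_ds = prepare_dataset(validation_img_paths, validation_labels)
test_ds = prepare_dataset(test_img_paths, test_labels)
```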
Visualize a few samples
You will notice that the content of the original image is kept as faithful as possible and that it has been padded accordingly.
Model
Our model will use the CTC loss as an endpoint layer. For a detailed understanding of the CTC loss, refer to this post.
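One way to wire this up, following the Captcha OCR example, is a layer that registers the CTC loss via add_loss and passes the softmax predictions through unchanged. This sketch assumes tf.keras, where keras.backend.ctc_batch_cost is available; newer Keras versions compute the loss with tf.nn.ctc_loss instead:

```python
class CTCLayer(keras.layers.Layer):
    def __init__(self, name=None):
        super().__init__(name=name)
        self.loss_fn = keras.backend.ctc_batch_cost

    def call(self, y_true, y_pred):
        batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
        input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
        label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

        input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
        label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

        # Add the CTC loss to the layer; return the predictions unchanged.
        self.add_loss(self.loss_fn(y_true, y_pred, input_length, label_length))
        return y_pred
```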
Evaluation metric
Edit Distance is the most widely used metric for evaluating OCR models. In this section, we will implement it and use it as a callback to monitor our model.
We first segregate the validation images and their labels for convenience.
Now, we create a callback to monitor the edit distances.
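A sketch of the segregation and the callback; `prediction_model` (defined in the next section) maps images straight to softmax outputs:

```python
# Materialize the validation batches once, for use inside the callback.
validation_images = []
validation_labels_batches = []
for batch in validation_ds:
    validation_images.append(batch["image"])
    validation_labels_batches.append(batch["label"])


def calculate_edit_distance(labels, predictions):
    sparse_labels = tf.cast(tf.sparse.from_dense(labels), dtype=tf.int64)

    # Greedy CTC decoding of the model output.
    input_len = np.ones(predictions.shape[0]) * predictions.shape[1]
    decoded = keras.backend.ctc_decode(
        predictions, input_length=input_len, greedy=True
    )[0][0][:, :max_len]
    sparse_predictions = tf.cast(tf.sparse.from_dense(decoded), dtype=tf.int64)

    return tf.reduce_mean(
        tf.edit_distance(sparse_predictions, sparse_labels, normalize=False)
    )


class EditDistanceCallback(keras.callbacks.Callback):
    def __init__(self, pred_model):
        super().__init__()
        self.prediction_model = pred_model

    def on_epoch_end(self, epoch, logs=None):
        edit_distances = [
            calculate_edit_distance(
                labels, self.prediction_model.predict(images)
            ).numpy()
            for images, labels in zip(validation_images, validation_labels_batches)
        ]
        print(f"Mean edit distance for epoch {epoch + 1}: {np.mean(edit_distances):.4f}")
```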
Training
Now we are ready to kick off model training.
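A sketch of the training step; the layer names "image" and "dense2" are assumptions about the model definition (its input and final softmax layer, respectively):

```python
epochs = 10  # see the note under "Inference": ~50 epochs for better results

# `model` is the CTC-endpoint model from the Model section.
model.compile(optimizer=keras.optimizers.Adam())

# A loss-free view of the model for metrics and inference: image in, softmax out.
prediction_model = keras.models.Model(
    model.get_layer(name="image").input,
    model.get_layer(name="dense2").output,
)
edit_distance_callback = EditDistanceCallback(prediction_model)

history = model.fit(
    train_ds,
    validation_data=validation_ds,
    epochs=epochs,
    callbacks=[edit_distance_callback],
)
```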
Inference
To get better results, the model should be trained for at least 50 epochs.
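Decoding mirrors the metric above: greedy CTC decoding followed by mapping ids back to characters, sketched as:

```python
def decode_batch_predictions(pred):
    input_len = np.ones(pred.shape[0]) * pred.shape[1]
    # Greedy search; for more complex tasks, beam search may work better.
    results = keras.backend.ctc_decode(pred, input_length=input_len, greedy=True)[0][0][
        :, :max_len
    ]
    output_text = []
    for res in results:
        # Drop the -1 padding, then map ids back to characters.
        res = tf.gather(res, tf.where(tf.math.not_equal(res, -1)))
        res = tf.strings.reduce_join(num_to_char(res)).numpy().decode("utf-8")
        output_text.append(res)
    return output_text


for batch in test_ds.take(1):
    preds = prediction_model.predict(batch["image"])
    print(decode_batch_predictions(preds))
```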
Final remarks
The prediction_model is fully compatible with TensorFlow Lite. If you are interested, you can use it inside a mobile application. You may find this notebook to be useful in this regard.
Not all the training examples are perfectly aligned, as observed in this example. This can hurt model performance for complex sequences. To that end, we can leverage Spatial Transformer Networks (Jaderberg et al.), which can help the model learn affine transformations that maximize its performance.