Path: blob/main/examples/single_country_retrain.ipynb
In this example, we will retrain a pre-trained model to maximize its performance for specific countries (e.g. the UK or Canada).
Retrain a Model
First, to retrain our supervised model, we need parsed address examples, as shown in the following figure. Fortunately, we have access to a public dataset of such parsed examples: the Structured Multinational Address Dataset.

For our example, we will focus on UK addresses, since we want to parse addresses only from the UK. So let's first download the dataset directly from its public repository using Deepparse's `download_from_public_repository` function.
The dataset archive is a zip of subdirectories in which each country's data is compressed into an LZMA file (a more aggressive compression algorithm than zip's default). The dataset's public repository offers a script to decompress these LZMA-compressed files; we will reuse its basic idea in the next code cell (the script itself also handles CLI parameters).
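The core of that script boils down to unzipping the archive and decompressing every per-country LZMA file. A minimal standard-library sketch, assuming each country file is a pickled list of tagged addresses compressed with LZMA (the file layout is inferred from the description above):

```python
import lzma
import pickle
import zipfile
from pathlib import Path

def extract_dataset(archive_path: str, output_dir: str) -> None:
    """Unzip the dataset archive, then decompress every .lzma file found.

    Assumes each country's file is a pickled list of parsed addresses
    compressed with LZMA, as in the repository's decompression script.
    """
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(output_dir)

    for lzma_file in Path(output_dir).rglob("*.lzma"):
        with lzma.open(lzma_file, "rb") as f:
            data = pickle.load(f)
        # Re-save as a plain pickle next to the compressed original.
        with open(lzma_file.with_suffix(".p"), "wb") as f:
            pickle.dump(data, f)
```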
Now, let's import our train and test datasets into memory to retrain our parser model.
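Before wiring the data into the parser, it helps to see the expected record layout. A minimal sketch, assuming each pickled split is a list of (address, tag list) pairs with one tag per token; the sample address below is made up for illustration, and the tag names follow Deepparse's tag set:

```python
import pickle

# A parsed example pairs a lowercase address with one tag per token.
# This record is invented for illustration.
sample = (
    "27 old gloucester street london wc1n 3ax",
    ["StreetNumber", "StreetName", "StreetName", "StreetName",
     "Municipality", "PostalCode", "PostalCode"],
)

def load_split(path):
    """Load one pickled split (e.g. the UK train or test file) into memory."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Sanity check: one tag per whitespace token.
assert len(sample[0].split()) == len(sample[1])
```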
We will use the FastText one as our base pre-trained model, since it is faster to retrain.
But first, let's see what the performance is before retraining.
Epoch: 1/5 Train steps: 2500 Val steps: 625 57.82s loss: 0.096870 accuracy: 99.663765 val_loss: 0.105059 val_accuracy: 99.660023
Epoch 1: val_loss improved from inf to 0.10506, saving file to ./uk_faster_retrain/checkpoint_epoch_1.ckpt
Epoch: 2/5 Train steps: 2500 Val steps: 625 58.84s loss: 0.092458 accuracy: 99.677379 val_loss: 0.103238 val_accuracy: 99.672032
Epoch 2: val_loss improved from 0.10506 to 0.10324, saving file to ./uk_faster_retrain/checkpoint_epoch_2.ckpt
Epoch: 3/5 Train steps: 2500 Val steps: 625 58.43s loss: 0.090964 accuracy: 99.683519 val_loss: 0.103035 val_accuracy: 99.673781
Epoch 3: val_loss improved from 0.10324 to 0.10304, saving file to ./uk_faster_retrain/checkpoint_epoch_3.ckpt
Epoch: 4/5 Train steps: 2500 Val steps: 625 58.37s loss: 0.089921 accuracy: 99.685827 val_loss: 0.103027 val_accuracy: 99.673781
Epoch 4: val_loss improved from 0.10304 to 0.10303, saving file to ./uk_faster_retrain/checkpoint_epoch_4.ckpt
Epoch: 5/5 Train steps: 2500 Val steps: 625 58.31s loss: 0.090967 accuracy: 99.684051 val_loss: 0.103027 val_accuracy: 99.673781
Epoch 5: val_loss improved from 0.10303 to 0.10303, saving file to ./uk_faster_retrain/checkpoint_epoch_5.ckpt
Restoring data from ./uk_faster_retrain/checkpoint_epoch_5.ckpt
Running test
Test steps: 57 1.74s test_loss: 0.120875 test_accuracy: 99.575062
To further improve performance, we could train for longer, increase the training dataset size (currently 100,000 addresses), or rework the Seq2Seq hidden sizes. See the retrain interface documentation for all the training parameters.