Path: blob/master/deep_learning/seq2seq/huggingface_torch_transformer.ipynb
Machine Translation with Huggingface Transformer
In this article we'll be leveraging Huggingface's Transformers library for our machine translation task. The library provides thousands of pretrained models that we can use on our tasks. Apart from that, we'll also take a look at how to use its pre-built tokenizers and model architectures to train a model from scratch.
Data Preprocessing
We'll be using the Multi30k dataset to demonstrate the transformer model on a machine translation task. This German-to-English dataset contains around 29K training pairs, a moderately sized dataset that lets us get results without waiting too long. We'll start off by downloading the raw dataset and extracting it. Feel free to swap this step with any other machine translation dataset.
We print out the contents of the data directory and some sample data.
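A minimal sketch of this download-and-extract step, assuming the archive lives at a URL you supply (the URL below is a placeholder, not the official download location):

```python
# Download and extract the raw Multi30k archive, then peek at the files.
# DATA_URL is a placeholder; point it at wherever the dataset archive lives.
import os
import tarfile
import urllib.request

data_dir = "multi30k"
DATA_URL = "https://example.com/multi30k/training.tar.gz"  # hypothetical location

os.makedirs(data_dir, exist_ok=True)
archive_path = os.path.join(data_dir, "training.tar.gz")
if not os.path.exists(archive_path):
    urllib.request.urlretrieve(DATA_URL, archive_path)

with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(data_dir)

# inspect the extracted files and a couple of sample sentences
print(os.listdir(data_dir))
with open(os.path.join(data_dir, "train.de")) as f:
    print([next(f).strip() for _ in range(2)])
```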
The original dataset splits the source and the target language into two separate files (e.g. train.de and train.en are the training data for German and English respectively). This format is useful when we wish to train a tokenizer on top of the source or target language, as we'll soon see.
On the other hand, having the source and target pairs together in a single file makes it easier to load them in batches for training or evaluating our machine translation model. We'll create the paired dataset and then load it. For this loading step, it will be helpful to have some basic understanding of Huggingface's datasets library.
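A sketch of how this pairing and loading could look, assuming the train.de/train.en style layout above; create_translation_data is a helper name introduced here for illustration, and the split file names should be adjusted to your own layout:

```python
# Merge each source/target file pair, line by line, into one jsonl file,
# then load the jsonl files into a DatasetDict.
import json
from datasets import load_dataset

def create_translation_data(source_path, target_path, output_path,
                            source_lang="de", target_lang="en"):
    with open(source_path) as f_source, open(target_path) as f_target, open(output_path, "w") as f_out:
        for source_line, target_line in zip(f_source, f_target):
            record = {"translation": {source_lang: source_line.strip(),
                                      target_lang: target_line.strip()}}
            f_out.write(json.dumps(record, ensure_ascii=False) + "\n")

for split in ["train", "val", "test"]:
    create_translation_data(f"{data_dir}/{split}.de",
                            f"{data_dir}/{split}.en",
                            f"{data_dir}/{split}.json")

dataset_dict = load_dataset(
    "json",
    data_files={"train": f"{data_dir}/train.json",
                "validation": f"{data_dir}/val.json",
                "test": f"{data_dir}/test.json"},
)
```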
We can access each split, and each record/pair within it, with the following syntax.
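For instance, with the dataset dict loaded above:

```python
# Each split behaves like a list of records; a record here is a dict with a
# "translation" key holding the language pair (matching the jsonl layout above).
print(dataset_dict["train"])
print(dataset_dict["train"][0])
```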
Pretrained Model
To get started, we'll use MarianMT, a pretrained translation model.
The first thing we'll do is load the pre-trained tokenizer using the from_pretrained syntax. This ensures we get the tokenizer and vocabulary corresponding to the model architecture for this specific checkpoint.
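As a sketch, using the Helsinki-NLP German-to-English MarianMT checkpoint (any compatible checkpoint works):

```python
# Load the tokenizer that matches the MarianMT checkpoint we'll be using.
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-de-en"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```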
We can pass a single record, or a list of records, to Huggingface's tokenizer. Depending on the model, we might see different keys in the returned dictionary. For example, here we have:
- input_ids: The tokenizer converted our raw input text into numerical ids.
- attention_mask: Mask to avoid performing attention on padded token ids. As we haven't yet performed the padding step, the numbers are all showing up as 1, indicating they are not masked.
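For example, tokenizing one raw German sentence from our training split:

```python
# Passing one raw sentence (or a list of them) through the tokenizer returns a
# dictionary; for this model it contains input_ids and attention_mask.
sample = dataset_dict["train"][0]["translation"]["de"]
encoded = tokenizer(sample)
print(encoded)
```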
We can apply the tokenizer to our entire raw dataset, so this preprocessing is a one-time process. By passing a function to our dataset dict's map method, the same tokenizing step will be applied to all the splits in our data.
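A sketch of this preprocessing step; batch_tokenize_fn and the max length values are choices made here for illustration, and the text_target argument assumes a reasonably recent transformers release (older releases use the tokenizer.as_target_tokenizer() context manager instead):

```python
# Tokenize every split up front with DatasetDict.map.
max_source_length = 128
max_target_length = 128

def batch_tokenize_fn(examples):
    sources = [example["de"] for example in examples["translation"]]
    targets = [example["en"] for example in examples["translation"]]
    model_inputs = tokenizer(sources, max_length=max_source_length, truncation=True)
    labels = tokenizer(text_target=targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset_dict = dataset_dict.map(batch_tokenize_fn, batched=True)
```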
Having prepared our dataset, we'll load the pre-trained model. Similar to the tokenizer, we can use the .from_pretrained method and specify a valid Huggingface model.
We can directly use this model to generate translations and eyeball the results.
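Putting the two steps together, roughly:

```python
# Load the matching pretrained model and eyeball a few generated translations.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

sentences = [record["translation"]["de"] for record in dataset_dict["test"].select(range(3))]
batch = tokenizer(sentences, padding=True, return_tensors="pt")
generated = model.generate(**batch, max_length=max_target_length)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```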
Training Model From Scratch
The next section shows the steps for training the model parameters from scratch. Instead of directly instantiating the model with the .from_pretrained method, we use the .from_config method, where we specify the configuration for a particular model architecture. The configuration itself is created with .from_pretrained while updating some of its hyperparameters; here we opted for a smaller model for faster iteration.
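A sketch of this configuration step; the shrunken hyperparameter values below are illustrative rather than tuned:

```python
# Build a smaller, randomly initialized model from the checkpoint's configuration.
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained(
    model_checkpoint,
    d_model=256,
    encoder_layers=3,
    decoder_layers=3,
    encoder_attention_heads=4,
    decoder_attention_heads=4,
    encoder_ffn_dim=512,
    decoder_ffn_dim=512,
)
model = AutoModelForSeq2SeqLM.from_config(config)
```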
The Huggingface library offers pre-built functionality that saves us from writing the training logic from scratch. This step can be swapped out with other higher-level trainer packages, or we can even implement our own logic. We set up the following:
- Seq2SeqTrainingArguments: a class that contains all the attributes to customize the training. At the bare minimum, it requires one folder name, which will be used to save model checkpoints.
- DataCollatorForSeq2Seq: a helper class provided to batch our examples; this is where the padding logic resides.
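A minimal setup might look like the following, with the argument values chosen purely for illustration:

```python
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="translation_from_scratch",   # where checkpoints are saved
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    learning_rate=5e-4,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_total_limit=1,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_dict["train"],
    eval_dataset=tokenized_dataset_dict["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
```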
We can take a look at the batched examples; understanding this output is beneficial if we wish to customize the data collator later.
- attention_mask: padded tokens are masked out with 0.
- input_ids: input ids are padded with the padding special token.
- labels: by default, -100 is automatically ignored by PyTorch loss functions, hence we use that particular id when padding our labels.
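For example, collating two tokenized training examples by hand:

```python
# Collate a couple of tokenized training examples to inspect the padded batch.
features = [
    {k: tokenized_dataset_dict["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
print(batch["input_ids"])
print(batch["attention_mask"])
print(batch["labels"])
```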
Similar to what we did before, we can use this model to generate translations and eyeball the results.
Training Tokenizer and Model From Scratch
From our raw pairs, we need to use or train a tokenizer to convert them into numerical indices. Here we'll be training our tokenizer from scratch using Huggingface's tokenizers library. Feel free to swap this step out for other tokenization procedures; what's important is to leave room for special tokens such as the beginning-of-sentence token, the end-of-sentence token, the unknown token, and the padding token that pads sentence batches to equal length.
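A sketch of training a byte-pair-encoding tokenizer with the tokenizers library, assuming the special token names below; a single shared vocabulary is trained on both languages here, though per-language tokenizers are also an option:

```python
# Train a shared BPE vocabulary on the raw German and English training files,
# registering the special tokens up front so they receive fixed ids.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

special_tokens = ["<unk>", "<pad>", "<s>", "</s>"]

custom_tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
custom_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
bpe_trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=special_tokens)
custom_tokenizer.train([f"{data_dir}/train.de", f"{data_dir}/train.en"], bpe_trainer)

print(custom_tokenizer.get_vocab_size())
print(custom_tokenizer.token_to_id("<pad>"), custom_tokenizer.token_to_id("</s>"))
```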
We'll perform this tokenization step for our entire dataset up front, so we do as little preprocessing as possible while feeding the data to our model. Note that we do not perform the padding step at this stage.
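Roughly, using the custom tokenizer trained above:

```python
# Convert every pair to token ids with the custom tokenizer; we append the
# end-of-sentence id but deliberately skip padding at this stage.
eos_id = custom_tokenizer.token_to_id("</s>")

def batch_encode_fn(examples):
    input_ids = [custom_tokenizer.encode(pair["de"]).ids + [eos_id] for pair in examples["translation"]]
    labels = [custom_tokenizer.encode(pair["en"]).ids + [eos_id] for pair in examples["translation"]]
    return {"input_ids": input_ids, "labels": labels}

custom_tokenized_dataset_dict = dataset_dict.map(batch_encode_fn, batched=True)
print(custom_tokenized_dataset_dict["train"][0])
```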
Given the custom tokenizer, we also write a custom data collator class that does the padding for inputs and labels.
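A minimal sketch of such a collator, assuming the &lt;pad&gt; token trained above and the -100 label-padding convention mentioned earlier:

```python
import torch

pad_id = custom_tokenizer.token_to_id("<pad>")

class CustomDataCollator:
    """Pad input_ids with the pad id, pad labels with -100, and build the attention mask.

    decoder_input_ids are not created here; Marian-style models in transformers
    derive them by shifting the labels when decoder_input_ids isn't provided.
    """

    def __call__(self, features):
        max_input_len = max(len(f["input_ids"]) for f in features)
        max_label_len = max(len(f["labels"]) for f in features)

        input_ids, attention_mask, labels = [], [], []
        for f in features:
            n_pad = max_input_len - len(f["input_ids"])
            input_ids.append(f["input_ids"] + [pad_id] * n_pad)
            attention_mask.append([1] * len(f["input_ids"]) + [0] * n_pad)
            labels.append(f["labels"] + [-100] * (max_label_len - len(f["labels"])))

        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```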
Given that we are using our own tokenizer instead of the pre-trained one, we need to update a couple of other parameters in our config. The one worth pointing out is that this model starts generating with pad_token_id, which is why decoder_start_token_id is set to the same value as pad_token_id.
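For instance, re-using the MarianMT configuration as the base and overriding the vocabulary-related fields (the special token names follow the tokenizer trained above):

```python
# Rebuild the config so the vocabulary and special token ids match the custom
# tokenizer; combine with the smaller-architecture overrides from the previous
# section if desired.
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained(
    model_checkpoint,
    vocab_size=custom_tokenizer.get_vocab_size(),
    decoder_vocab_size=custom_tokenizer.get_vocab_size(),  # present on newer Marian configs
    pad_token_id=custom_tokenizer.token_to_id("<pad>"),
    bos_token_id=custom_tokenizer.token_to_id("<s>"),
    eos_token_id=custom_tokenizer.token_to_id("</s>"),
    # this model starts generating from the pad token
    decoder_start_token_id=custom_tokenizer.token_to_id("<pad>"),
)
model = AutoModelForSeq2SeqLM.from_config(config)
```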
The rest of the model training code is the same as in the previous section.
We confirm that saving and loading the model gives us identical predictions.
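A quick sanity check along those lines:

```python
# Save the trained model, reload it, and check that generations match.
import torch

save_dir = "translation_scratch_model"
model.save_pretrained(save_dir)
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(save_dir).to(model.device)

features = [
    {"input_ids": r["input_ids"], "labels": r["labels"]}
    for r in custom_tokenized_dataset_dict["test"].select(range(4))
]
batch = CustomDataCollator()(features)
input_ids = batch["input_ids"].to(model.device)
attention_mask = batch["attention_mask"].to(model.device)

model.eval()
loaded_model.eval()
with torch.no_grad():
    original = model.generate(input_ids, attention_mask=attention_mask)
    reloaded = loaded_model.generate(input_ids, attention_mask=attention_mask)
print(torch.equal(original, reloaded))
```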
As the last step, we'll write an inference function that performs batch scoring on a given dataset. Here we generate the predictions and save them in a pandas dataframe along with the source and the target.
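A sketch of such a batch-scoring helper, built on the custom collator and tokenizer from above; generate_translation is a name introduced here for illustration:

```python
# Generate translations for a whole split and collect source/target/prediction
# triples in a pandas DataFrame.
import pandas as pd
import torch

def generate_translation(model, tokenized_dataset, batch_size=32, max_length=128):
    model.eval()
    collator = CustomDataCollator()
    rows = []
    for start in range(0, len(tokenized_dataset), batch_size):
        end = min(start + batch_size, len(tokenized_dataset))
        records = tokenized_dataset.select(range(start, end))
        features = [{"input_ids": r["input_ids"], "labels": r["labels"]} for r in records]
        batch = collator(features)
        with torch.no_grad():
            generated = model.generate(
                batch["input_ids"].to(model.device),
                attention_mask=batch["attention_mask"].to(model.device),
                max_length=max_length,
            )
        for record, output_ids in zip(records, generated):
            rows.append({
                "source": record["translation"]["de"],
                "target": record["translation"]["en"],
                "prediction": custom_tokenizer.decode(output_ids.tolist(), skip_special_tokens=True),
            })
    return pd.DataFrame(rows)

df_predictions = generate_translation(model, custom_tokenized_dataset_dict["test"])
df_predictions.head()
```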