Transformer
A Seq2Seq based machine translation system usually comprises two main components: an encoder that encodes the source sentence into context vectors and a decoder that decodes those context vectors into the target sentence. The transformer model is no different in this regard. Its growing popularity at the time of writing is primarily due to its self attention layers and parallel computation.
Previous RNN based encoders and decoders have a constraint of sequential computation. A hidden state at time $t$ in a recurrent layer has only seen the token at time $t$ and all the tokens before it. Even though this gives us the benefit of modeling long dependencies, it hinders training speed as we can't process the next time step until we finish processing the current one. The transformer model aims to mitigate this issue by relying solely on the attention mechanism, where each context vector produced by the model has seen all tokens at all positions within the input sequence. In other words, instead of compressing the entire source sentence $X = (x_1, \dots, x_n)$ into a single context vector $z$, it produces a sequence of context vectors $Z = (z_1, \dots, z_n)$ in one parallel computation. We'll get to the details of the attention mechanism, self attention, that's used throughout the transformer model in later sections. One important thing to note here is that the breakthrough of this model is not the invention of the attention mechanism, as this concept existed well before. The highlight is that we can build a highly performant model with the attention mechanism in isolation, i.e. without recurrent (RNN) or convolutional (CNN) neural networks in the mix.
In this article, we will be implementing the Transformer module from the famous Attention Is All You Need paper [9]. This implementation's structure is largely based on [1], with the primary difference that we'll be using Huggingface's datasets instead of torchtext for data loading, as well as showcasing how to implement the Transformer module leveraging PyTorch's built-in Transformer encoder and decoder blocks.
Data Preprocessing
We'll be using the Multi30k dataset to demonstrate using the transformer model in a machine translation task. This German to English training dataset's size is around 29K sentence pairs. We'll start off by downloading the raw dataset and extracting it. Feel free to swap this step with any other machine translation dataset. If the original link for these datasets fails to load, use this alternative Google Drive link.
We print out the content in the data directory and some sample data.
The original dataset splits the source and the target language into two separate files (e.g. train.de and train.en are the training datasets for German and English respectively). This format is useful when we wish to train a tokenizer on top of the source or target language, as we'll soon see.
On the other hand, having the source and target pair together in one single file makes it easier to load them in batches for training or evaluating our machine translation model. We'll create the paired dataset and load it. For this step, it will be helpful to have some basic understanding of Huggingface's datasets library.
We can access each split, and each record/pair, with the following syntax.
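As a rough sketch (not the notebook's exact code), creating and loading the paired dataset could look like the following with Huggingface's datasets; the JSON file names and column names are assumptions:

```python
from datasets import load_dataset

# assumed file names for the paired dataset, where each line is a JSON
# record such as {"de": "...", "en": "..."}
data_files = {"train": "train.json", "validation": "val.json", "test": "test.json"}
dataset_dict = load_dataset("json", data_files=data_files)

# access a split, then a single record/pair by index
print(dataset_dict["train"][0])
```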
From our raw pairs, we need to use or train a tokenizer to convert them into numerical indices. Here we'll be training our tokenizer from scratch using Huggingface's tokenizers library. Feel free to swap this step out with other tokenization procedures; what's important is to leave room for special tokens such as the init token that represents the start of a sentence, the end of sentence token, and the padding token that pads sentence batches to equal length.
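As a hedged illustration, a word-level tokenizer with the special tokens mentioned above could be trained like this; the special token strings, min_frequency and the choice of a word-level model are assumptions (the notebook may use a different setup such as BPE):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# reserve ids for the special tokens: unknown, padding, start and end of sentence
special_tokens = ["<unk>", "<pad>", "<sos>", "<eos>"]

tokenizer = Tokenizer(WordLevel(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(min_frequency=2, special_tokens=special_tokens)

# train on the raw source language file; the same is done for the target language
tokenizer.train(files=["train.de"], trainer=trainer)

# convert a raw sentence into token ids
ids = tokenizer.encode("zwei männer stehen am herd .").ids
```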
We'll perform this tokenization step for our entire dataset up front, so we can do as little preprocessing as possible while feeding data to the model. Note that we do not perform the padding step at this stage.
The final step of our data preprocessing is to prepare the DataLoader, which produces batches of tokenized ids for our model. The customized collate function performs the batching as well as the padding.
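One possible shape for such a collate function, assuming each example stores its token ids under hypothetical de_ids / en_ids keys and that pad_idx holds the <pad> token's id:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_fn(batch, pad_idx=1):
    # batch is a list of tokenized examples; pad source and target ids
    # to the longest sequence in the batch
    src = [torch.tensor(example["de_ids"]) for example in batch]
    trg = [torch.tensor(example["en_ids"]) for example in batch]
    src = pad_sequence(src, batch_first=True, padding_value=pad_idx)
    trg = pad_sequence(trg, batch_first=True, padding_value=pad_idx)
    return src, trg

train_loader = DataLoader(dataset_dict["train"], batch_size=128,
                          shuffle=True, collate_fn=collate_fn)
```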
Model Architecture From Scratch
Having prepared the data, we can now start implementing Transformer model's architecture, which looks like the following:

Position Wise Embedding
First, input tokens are passed through a standard embedding layer. Next, as the entire sentence is fed into the model in one go, by default it has no idea about the tokens' order within the sequence. We cope with this by using a second embedding layer, positional embedding. This is an embedding layer where our input is not the token id but the token's position within the sequence. If we configure our position embedding to have a "vocabulary" size of 100, this means our model can accept sentences up to 100 tokens long.
The original Transformer implementation from the Attention is All You Need paper does not learn positional embeddings. Instead it uses a fixed static positional encoding. Modern Transformer architectures, like BERT, use positional embeddings, hence, we have decided to use them in these tutorials. Feel free to check out other tutorials [7] [8] to read more about positional encoding used in the original Transformer model.
Next, the token and positional embeddings are combined using an elementwise sum, giving us a single vector that contains information on both the token and its position within the sequence. Before they are summed, the token embeddings are multiplied by a scaling factor $\sqrt{d_{\text{model}}}$, where $d_{\text{model}}$ is the hidden dimension size, hid_dim. This supposedly reduces variance in the embeddings, and without this scaling factor it becomes difficult to train the model reliably. Dropout is then applied to the combined embeddings.
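A minimal sketch of this combined token-plus-position embedding; the hid_dim, max_length and dropout values are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class PositionWiseEmbedding(nn.Module):
    def __init__(self, vocab_size, hid_dim=256, max_length=100, dropout=0.1):
        super().__init__()
        self.tok_embedding = nn.Embedding(vocab_size, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(hid_dim)

    def forward(self, tokens):
        # tokens: [batch_size, seq_len]
        batch_size, seq_len = tokens.shape
        pos = torch.arange(seq_len, device=tokens.device).unsqueeze(0).repeat(batch_size, 1)
        # scale the token embeddings, add the positional embeddings, apply dropout
        return self.dropout(self.tok_embedding(tokens) * self.scale + self.pos_embedding(pos))
```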
The combined embeddings are then passed through the encoder layers to get our context vectors $Z$. Before jumping straight into the encoder layers, we'll introduce some of the core building blocks behind them.
Multi Head Attention Layer
One of the key concepts introduced by the Transformer model is the multi-head attention layer.

The purpose behind an attention mechanism is to relate inputs from different parts of the sequence. The attention operation is comprised of queries, keys and values. It might be helpful to look at these terms from an information retrieval perspective: every time we issue a query to a search engine, the engine matches it with some keys (title, description) and retrieves the associated values (content).
To be specific, the Transformer model uses scaled dot-product attention, where the query is matched against the keys to produce attention weights, which are then used to compute a weighted sum of the values:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here $Q = XW^Q$, $K = XW^K$, $V = XW^V$, where $X$ is our input matrix and $W^Q$, $W^K$, $W^V$ are linear layers for the query, key and value. $d_k$ is the head dimension, head_dim, which we will explain shortly. In essence, we are multiplying our input matrix by 3 different weight matrices. We first perform a dot product between the query and the key, scale it by $\sqrt{d_k}$, and apply a softmax to calculate the attention weights, which measure the correlation between tokens; finally we take a dot product with the value to get the weighted sum. Scaling is done to prevent the results of the dot product from growing too large and causing the gradients to become too small.
Multi-head attention extends the single attention mechanism so we can potentially pay attention to different concepts that exist at different sequence positions. If you are familiar with convolutional neural networks, this trick is very similar to introducing multiple filters so each can learn a different aspect of the input. Instead of doing a single attention operation, the queries, keys and values have their hid_dim split into $h$ heads, each of size $d_k = \text{hid\_dim} / h$, and the scaled dot-product attention is calculated over all heads in parallel. After this computation, we re-combine the heads back into hid_dim shape. By reducing the dimensionality of each head/concept, the total computational cost stays similar to a full-dimension single-head attention.
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Here $W^O$ is the linear layer applied at the end of the multi-head attention layer.
In the implementation below, we carry out the multi-head attention in parallel using batched matrix multiplication as opposed to a for loop. While calculating the attention weights, we also introduce the capability of applying a mask so the model does not pay attention to irrelevant tokens. We'll elaborate more on this in later sections.
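The sketch below is one way to implement this under the assumptions just described (batch-first tensors of shape [batch_size, seq_len, hid_dim] and a 0/1 mask); it is meant as an illustration rather than the notebook's exact code:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout=0.1):
        super().__init__()
        assert hid_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, query, key, value, mask=None):
        # query / key / value: [batch_size, seq_len, hid_dim]
        batch_size = query.shape[0]
        # project, then split hid_dim into n_heads heads of size head_dim
        q = self.fc_q(query).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        k = self.fc_k(key).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        v = self.fc_v(value).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)

        # scaled dot-product attention, computed for all heads in parallel
        energy = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        if mask is not None:
            # positions where the mask is 0 should receive ~zero attention
            energy = energy.masked_fill(mask == 0, -1e10)
        attention = torch.softmax(energy, dim=-1)
        x = torch.matmul(self.dropout(attention), v)

        # re-combine the heads back into hid_dim
        x = x.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.n_heads * self.head_dim)
        return self.fc_o(x), attention
```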
Position Wise Feed Forward Layer
Another building block is the position wise feed forward layer, which consists of two linear transformations. These transformations are identical across different positions, i.e. while feed forward layers are typically applied to a tensor of shape (batch_size, hidden_dim), here they operate directly on a tensor of shape (batch_size, seq_len, hidden_dim).
The input is transformed from hid_dim to pf_dim, where pf_dim is usually a lot larger than hid_dim. An activation function is then applied before the result is transformed back into a hid_dim representation.
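A small sketch of this layer, assuming the same batch-first tensor layout as above:

```python
import torch
import torch.nn as nn

class PositionwiseFeedforward(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout=0.1):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)  # expand hid_dim -> pf_dim
        self.fc_2 = nn.Linear(pf_dim, hid_dim)  # project back pf_dim -> hid_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [batch_size, seq_len, hid_dim]
        return self.fc_2(self.dropout(torch.relu(self.fc_1(x))))
```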
Encoder
We'll now put our building blocks together to form the encoder.

We first pass the source sentence through the position wise embedding layer; this is then followed by N (configurable) encoder layers, the "meat" of modern transformer based architectures. The main role of our encoder is to update our embeddings so that they capture contextual information about the text sequence, e.g. the representation of the word "bank" will become more "financial establishment" like and less "land along the river" like if words such as money and investment are close to it.
Inside the encoder layer, we start with the multi-head attention layer, perform dropout on its output, apply a residual connection and pass it through a layer normalization layer. This is followed by a position-wise feed forward layer, after which we again apply dropout, a residual connection and layer normalization to get the output, which is then fed into the next layer. This sounds like a mouthful, but the code below should clarify things a bit. Things worth noting:
Parameters are not shared between layers.
The multi-head attention layer is used by the encoder layer to attend to the source sentence, i.e. it is calculating and applying attention over itself rather than another sequence, hence we call it self attention. This layer is the only one that propagates information along the sequence; the other layers operate on each individual token in isolation.
The gist behind layer normalization is that it normalizes the values across the hidden dimension, so that at each position the features have a mean of 0 and a standard deviation of 1. This trick, along with residual connections, makes it easier to train neural networks with a larger number of layers, like the Transformer.
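A sketch of a single encoder layer wiring these pieces together; it reuses the MultiHeadAttention and PositionwiseFeedforward sketches from above, and the hyperparameters are assumptions:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(hid_dim, n_heads, dropout)
        self.feed_forward = PositionwiseFeedforward(hid_dim, pf_dim, dropout)
        self.attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        # src: [batch_size, src_len, hid_dim]
        # self attention -> dropout -> residual connection -> layer norm
        attn_out, _ = self.self_attention(src, src, src, src_mask)
        src = self.attn_layer_norm(src + self.dropout(attn_out))
        # feed forward -> dropout -> residual connection -> layer norm
        ff_out = self.feed_forward(src)
        return self.ff_layer_norm(src + self.dropout(ff_out))
```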
Decoder
Now comes the decoder part:

The decoder's main goal is to take our source sentence's encoded representation, $Z$, and convert it into predicted tokens in the target sentence, $\hat{Y}$. We then compare these with the actual tokens in the target sentence, $Y$, to calculate our loss and update our parameters to improve our predictions.
The decoder layer contains similar building blocks to the encoder layer, except it now has two multi-head attention layers, self_attention and encoder_attention.
The former performs self attention on our target sentence's embedding representation to generate a decoder representation. In the encoder/decoder attention layer, the decoder's intermediate representation provides the queries, whereas the keys and values come from the encoder's output.
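Analogously, a sketch of a decoder layer under the same assumptions, again reusing the MultiHeadAttention and PositionwiseFeedforward sketches:

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(hid_dim, n_heads, dropout)
        self.encoder_attention = MultiHeadAttention(hid_dim, n_heads, dropout)
        self.feed_forward = PositionwiseFeedforward(hid_dim, pf_dim, dropout)
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, enc_src, trg_mask, src_mask):
        # masked self attention over the target sequence
        attn_out, _ = self.self_attention(trg, trg, trg, trg_mask)
        trg = self.self_attn_layer_norm(trg + self.dropout(attn_out))
        # encoder attention: queries come from the decoder, while keys and
        # values come from the encoder's output enc_src
        attn_out, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        trg = self.enc_attn_layer_norm(trg + self.dropout(attn_out))
        # position wise feed forward
        ff_out = self.feed_forward(trg)
        return self.ff_layer_norm(trg + self.dropout(ff_out)), attention
```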
Seq2Seq
Now that we have our encoder and decoder, the final part is to have a Seq2Seq module that encapsulates the two. In this module, we'll also handle masking.
The source mask is created by checking where our source sequence is not equal to the <pad> token. It is 1 where the token is not a <pad> token and 0 where it is. This is used in our encoder layers' multi-head attention mechanisms, where we want our model to not pay any attention to <pad> tokens, which contain no useful information.
The target mask is a bit more involved. First, we create a mask for the <pad> tokens, as we did for the source mask. Next, we create a "subsequent" mask, trg_sub_mask, using torch.tril. This creates a lower triangular matrix where the elements above the diagonal are zero and the elements on and below the diagonal are set to whatever the input tensor is. In this case, the input tensor is filled with ones, meaning our trg_sub_mask will look something like this (for a target with 5 tokens):

$$\begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}$$
This shows what each target token (row) is allowed to look at (column). Our first target token has a mask of [1, 0, 0, 0, 0], which means it can only look at the first target token, whereas the second target token has a mask of [1, 1, 0, 0, 0], which means it can look at both the first and second target tokens, and so on.
The "subsequent" mask is then logically anded with the padding mask, this combines the two masks ensuring both the subsequent tokens and the padding tokens cannot be attended to. For example if the last two tokens were <pad> tokens the final target mask would look like:
These masks are fed into the model along with the source and target sentences to get our predicted target output.
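A sketch of how these two masks might be built, shaped so they broadcast against attention energies of shape [batch_size, n_heads, query_len, key_len]; the helper names and exact shapes are assumptions:

```python
import torch

def make_src_mask(src, pad_idx):
    # [batch_size, 1, 1, src_len]: 1 where the token is not <pad>
    return (src != pad_idx).unsqueeze(1).unsqueeze(2)

def make_trg_mask(trg, pad_idx):
    # padding mask: [batch_size, 1, 1, trg_len]
    trg_pad_mask = (trg != pad_idx).unsqueeze(1).unsqueeze(2)
    # subsequent mask: [trg_len, trg_len] lower triangular matrix of ones
    trg_len = trg.shape[1]
    trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device=trg.device)).bool()
    # combine with a logical AND -> broadcasts to [batch_size, 1, trg_len, trg_len]
    return trg_pad_mask & trg_sub_mask
```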
Side Note: Here we introduce some other terminology that we might come across. The need to create a subsequent mask is very common in autoregressive models, where the task is to predict the next token in the sequence (e.g. language models). By introducing this masking, we are making the self attention block causal. Different implementations or libraries might have different ways of specifying this masking, but the core idea is to prevent the model from "cheating" by copying the tokens that come after the one it's currently processing.
Model Training
The training loop also requires a bit of explanation.
We want our model to predict the <eos> token but not have it be an input into our model, hence we slice the <eos> token off the end of our target sequence.
We then calculate our loss using the original target tensor with the <sos> token sliced off the front, retaining the <eos> token.
All in all, our model receives the target sequence up to but excluding the last token, whereas the ground truth is the target sequence from the second token onward.
The evaluation loop is similar to the training loop, just without updating the model's parameters.
While defining our loss function, we also ensure we ignore the loss calculated over the <pad> tokens.
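Putting these pieces together, a single training step might look roughly like the following; model (a Seq2Seq module wrapping the encoder and decoder), train_loader and pad_idx are assumed from the earlier sketches, and the optimizer settings are illustrative:

```python
import torch.nn as nn
import torch.optim as optim

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)  # skip loss on <pad> tokens
optimizer = optim.Adam(model.parameters(), lr=5e-4)

for src, trg in train_loader:
    optimizer.zero_grad()
    # feed the target without its last token (<eos> is never an input) ...
    output = model(src, trg[:, :-1])
    # ... and compare against the target shifted one step ahead,
    # i.e. without the leading <sos> token
    output = output.reshape(-1, output.shape[-1])
    labels = trg[:, 1:].reshape(-1)
    loss = criterion(output, labels)
    loss.backward()
    optimizer.step()
```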
Model Evaluation
Transformer Module
Instead of resorting to our own Transformer encoder and decoder implementation, PyTorch's nn module already comes with pre-built ones. The major difference is that they expect a different format for the padding and subsequent masks.
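A hedged sketch of wiring up these built-in blocks; the hyperparameters and dummy inputs are illustrative, and in practice the embeddings would come from a layer like the PositionWiseEmbedding sketch above:

```python
import torch
import torch.nn as nn

# illustrative hyperparameters and dummy inputs; in practice src / trg_input
# come from the DataLoader and src_emb / trg_emb from an embedding layer
hid_dim, n_heads, pf_dim, n_layers, pad_idx = 256, 8, 512, 3, 1
src = torch.randint(0, 100, (64, 23))        # [batch_size, src_len]
trg_input = torch.randint(0, 100, (64, 17))  # target with the last token sliced off
src_emb = torch.randn(64, 23, hid_dim)
trg_emb = torch.randn(64, 17, hid_dim)

encoder_layer = nn.TransformerEncoderLayer(d_model=hid_dim, nhead=n_heads,
                                           dim_feedforward=pf_dim, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
decoder_layer = nn.TransformerDecoderLayer(d_model=hid_dim, nhead=n_heads,
                                           dim_feedforward=pf_dim, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

# key padding masks are True at <pad> positions (opposite of our 0/1 masks),
# and the subsequent mask is additive: -inf above the diagonal, 0 elsewhere
# (recent PyTorch versions also provide nn.Transformer.generate_square_subsequent_mask)
src_key_padding_mask = (src == pad_idx)
trg_key_padding_mask = (trg_input == pad_idx)
trg_len = trg_input.shape[1]
trg_sub_mask = torch.triu(torch.full((trg_len, trg_len), float("-inf")), diagonal=1)

enc_src = encoder(src_emb, src_key_padding_mask=src_key_padding_mask)
output = decoder(trg_emb, enc_src, tgt_mask=trg_sub_mask,
                 tgt_key_padding_mask=trg_key_padding_mask,
                 memory_key_padding_mask=src_key_padding_mask)
```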
In this notebook, we delved into the implementation of Transformer models. Although originally proposed for solving NLP tasks like machine translation, this module or building block is also gaining popularity in other fields such as computer vision [2].
Reference
[1] Jupyter Notebook: Attention is All You Need
[2] Jupyter Notebook: Tutorial 6: Transformers and Multi-Head Attention
[3] Colab: Simple PyTorch Transformer Example with Greedy Decoding
[4] Blog: Transformers from scratch
[5] Blog: Making Pytorch Transformer Twice as Fast on Sequence Generation
[6] Blog: How Transformers work in deep learning and NLP: an intuitive introduction
[7] PyTorch Documentation: Sequence to sequence modeling with nn.Transformer and Torchtext
[8] The Annotated Transformer
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin - Attention is All you Need (2017)