Seq2Seq With Attention
Seq2Seq framework involves a family of encoders and decoders, where the encoder encodes a source sequence into a fixed-length vector from which the decoder picks up and aims to correctly generate the target sequence. The vanilla version of this type of architecture looks something along the lines of:

The RNN encoder has an input sequence $x_1, x_2, \ldots, x_T$. We denote the encoder states by $h_1, h_2, \ldots, h_T$. The encoder outputs a single output vector $c$ which is passed as input to the decoder. Like the encoder, the decoder is also a single-layered RNN; we denote the decoder states by $s_1, s_2, \ldots$ and the network's output by $y_1, y_2, \ldots$. A problem with this vanilla architecture lies in the fact that the decoder needs to represent the entire input sequence as a single vector $c$, which can cause information loss. In other words, the fixed-length context vector $c$ is hypothesized to be the bottleneck in this framework.
The attention mechanism that we'll be introducing here extends this approach by allowing the model to soft search for parts of the source sequence that are relevant to predicting the target sequence, which looks like the following:

The attention mechanism is located between the encoder and the decoder. Its input is composed of the encoder's output vectors $h_1, h_2, \ldots, h_T$ and the states of the decoder $s_0, s_1, \ldots$; the attention's output is a sequence of vectors called context vectors, denoted by $c_1, c_2, \ldots$.
These context vectors enable the decoder to focus on certain parts of the input when predicting its output. Each context vector $c_t$ is a weighted sum of the encoder's output vectors $h_1, h_2, \ldots, h_T$, where each vector $h_i$ contains information about the whole input sequence with a strong focus on the parts surrounding the $i$-th vector of the input sequence. The vectors $h_i$ are scaled by weights $\alpha_{t,i}$ capturing the degree of relevance of input $x_i$ to the output at time $t$, $y_t$. The context vectors are calculated by:

$$c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$$
The attention weights are learned using an additional fully-connected network, denoted by $a$, whose input consists of the decoder's previous hidden state $s_{t-1}$ and the encoder's output $h_i$. Its computation can be more formally defined by:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}$$

Where:

$$e_{t,i} = a(s_{t-1}, h_i)$$
As can be seen in the above image, the fully-connected network receives the concatenation of vectors $[s_{t-1}, h_i]$ as input at time step $t$. The network has a single fully-connected layer; the outputs of the layer, denoted by $e_{t,i}$, are passed through a softmax function computing the attention weights, which lie in $[0, 1]$.
Note that we are using the same fully-connected network for all the concatenated pairs $[s_{t-1}, h_1], [s_{t-1}, h_2], \ldots, [s_{t-1}, h_T]$, meaning there is a single network learning the attention weights.

To re-emphasize, the attention weight $\alpha_{t,i}$ reflects the importance of $h_i$ with respect to the previous hidden state $s_{t-1}$ in deciding the next state $s_t$ and generating $y_t$. A large attention weight $\alpha_{t,i}$ causes the RNN to focus on input $x_i$ (represented by the encoder's output $h_i$) when predicting the output $y_t$.
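As a minimal sketch of the computation described above, the snippet below scores each encoder output against the decoder's previous hidden state with a single fully-connected layer, normalizes the scores with a softmax, and forms the context vector as a weighted sum. The dimensions and variable names here are made up purely for illustration; they are not the ones used later in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy dimensions, chosen only for illustration
hidden_dim, src_len = 4, 3
encoder_outputs = torch.randn(src_len, hidden_dim)    # h_1, ..., h_T
decoder_hidden = torch.randn(hidden_dim)              # s_{t-1}

# the same fully-connected network a(.) scores every [s_{t-1}, h_i] pair
attn = nn.Linear(hidden_dim * 2, 1)
energies = torch.cat([
    attn(torch.cat([decoder_hidden, h]))              # e_{t,i} = a(s_{t-1}, h_i)
    for h in encoder_outputs
])
weights = F.softmax(energies, dim=0)                  # alpha_{t,i}
context = (weights.unsqueeze(1) * encoder_outputs).sum(dim=0)  # c_t

print(weights.sum())   # -> 1.0, the attention weights sum to 1
print(context.shape)   # torch.Size([4]), same size as a single encoder output
```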
We can talk through an iteration of the algorithm to see how it all ties together.

The first computation performed is the computation of the vectors $h_1, h_2, \ldots, h_T$ by the encoder. These are then used as inputs to the attention mechanism. This is where the decoder is first involved, by inputting its initial state vector $s_0$ (note that for this initial state of the decoder, we oftentimes use the hidden state from the encoder), and we have the first attention input sequence $[s_0, h_1], [s_0, h_2], \ldots, [s_0, h_T]$.

The attention mechanism picks up these inputs and computes the first set of attention weights $\alpha_{1,1}, \alpha_{1,2}, \ldots, \alpha_{1,T}$, enabling the computation of the first context vector $c_1$. The decoder now uses $c_1$ to generate the first output $y_1$. This process then repeats itself until we've generated all the outputs.
Data Preparation
This part is pretty much identical to that of the vanilla seq2seq, hence explanation is omitted.
Model Implementation
The following sections are heavily "borrowed" from the wonderful tutorial on this topic listed below.
Some personal preference modifications have been made.
Encoder
Like other seq2seq-like architectures, we first need to specify an encoder. Here we'll be using a bidirectional GRU layer. With a bidirectional layer, we have a forward layer scanning the sentence from left to right (shown below in green), and a backward layer scanning the sentence from right to left (yellow). From the coding perspective, we need to set bidirectional=True in the GRU layer's arguments.

More formally, we now have:

$$h_t^\rightarrow = \text{GRU}^\rightarrow(e(x_t^\rightarrow), h_{t-1}^\rightarrow)$$

$$h_t^\leftarrow = \text{GRU}^\leftarrow(e(x_t^\leftarrow), h_{t-1}^\leftarrow)$$

Where $x_t^\rightarrow$ is the input sequence read from left to right and $x_t^\leftarrow$ is the same sequence read from right to left.
As before, we only pass an embedded input to our GRU layer. We'll get two context vectors, one from the forward layer after it has seen the final word in the sentence, $z^\rightarrow = h_T^\rightarrow$, and one from the backward layer after it has seen the first word in the sentence, $z^\leftarrow = h_T^\leftarrow$.
As we'll be using a bidirectional layer, the next section is devoted to helping us understand what the output looks like before we implement the actual encoder that we'll be using. The shape of the output is explicitly printed out to make it easier to comprehend. Here, we're using a GRU layer, which can be replaced with an LSTM layer; the latter is similar, but returns an additional cell state variable that has the same size as the hidden state.
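As a rough sketch of that inspection (the sequence length, batch size and embedding size here are arbitrary; the hidden size of 512 simply mirrors the numbers discussed below):

```python
import torch
import torch.nn as nn

# single-layer bidirectional GRU with illustrative sizes
seq_len, batch_size, embed_dim, hidden_dim = 5, 2, 256, 512
gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True)

embedded = torch.randn(seq_len, batch_size, embed_dim)
output, hidden = gru(embedded)
print(output.shape)  # torch.Size([5, 2, 1024]) -> hidden_dim * num_directions
print(hidden.shape)  # torch.Size([2, 2, 512])  -> num_directions, batch, hidden_dim
```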
Notice that output's last dimension is 1024, which is the hidden dimension (512) multiplied by the number of directions (2). Whereas the hidden's first dimension is 2, representing the number of directions (2).
The returned output of the bidirectional RNN at timestep $t$ is the output after feeding input $x_t$ to both the normal and the reverse RNN unit at timestep $t$, where the normal RNN has seen inputs $x_1, \ldots, x_t$ and the reverse RNN has seen inputs $x_T, \ldots, x_t$ (with $T$ being the length of the sequence).
The returned hidden state of the bidirectional RNN is the hidden state after the whole sequence is consumed. For the normal RNN it's after timestep $T$; for the reverse RNN it's after timestep 1.
The following diagram can also come in handy when visualizing the difference between output and hidden.

In the diagram, $t$ denotes the timestep and $l$ denotes the layer number.
output comprises all the hidden states in the last layer ("last" depth-wise, not time-wise).
($h_n$, $c_n$) comprise the hidden states after the last timestep, $t = n$, so we could potentially feed them into another LSTM layer.
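Extending the same sanity check to a 2-layer bidirectional LSTM (sizes again purely illustrative):

```python
import torch
import torch.nn as nn

# 2-layer bidirectional LSTM with illustrative sizes
seq_len, batch_size, embed_dim, hidden_dim = 5, 2, 256, 512
lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, bidirectional=True)

embedded = torch.randn(seq_len, batch_size, embed_dim)
output, (hidden, cell) = lstm(embedded)
print(output.shape)  # torch.Size([5, 2, 1024]) -> still hidden_dim * num_directions
print(hidden.shape)  # torch.Size([4, 2, 512])  -> num_layers * num_directions
print(cell.shape)    # torch.Size([4, 2, 512])
```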
Notice now the first dimension of the hidden/cell state becomes 4, which represents the number of layers (2) multiplied by the number of directions (2). The order of the hidden states is stacked as [forward_1, backward_1, forward_2, backward_2, ...].
We'll need some final touches for our actual encoder. As our encoder's hidden state will be used as the decoder's initial hidden state, we need to make sure they have the same shape. In our example, the decoder is not bidirectional, and only needs a single context vector, $z$, to use as its initial hidden state, $s_0$, and we currently have two, a forward and a backward one ($z^\rightarrow = h_T^\rightarrow$ and $z^\leftarrow = h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function.
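Putting these pieces together, a minimal sketch of the encoder might look like the following. The argument names are illustrative and dropout is omitted; the notebook's actual implementation may differ in details.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Sketch of the encoder: a bidirectional GRU whose final forward/backward
    hidden states are concatenated, passed through a linear layer and a tanh
    to form the decoder's initial hidden state."""

    def __init__(self, input_dim, embed_dim, enc_hidden_dim, dec_hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.gru = nn.GRU(embed_dim, enc_hidden_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hidden_dim * 2, dec_hidden_dim)

    def forward(self, src):
        # src: [src_len, batch_size]
        embedded = self.embedding(src)        # [src_len, batch, embed_dim]
        outputs, hidden = self.gru(embedded)  # outputs: [src_len, batch, enc_hidden_dim * 2]
        # hidden[-2] is the last forward state, hidden[-1] the last backward state
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2], hidden[-1]), dim=1)))
        return outputs, hidden                # hidden: [batch, dec_hidden_dim]
```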
Attention
The next part is the highlight. The attention layer will take in the previous hidden state of the decoder, $s_{t-1}$, and all of the stacked forward and backward hidden states from the encoder, $H$. The output will be an attention vector, $a_t$, that is the length of the source sentence; each element of this vector will be a floating point number between 0 and 1, and the entire vector sums up to 1.
Intuitively, this layer takes in what we've decoded so far, $s_{t-1}$, and all of what we have encoded, $H$, to produce a vector, $a_t$, that represents which words in the source sentence we should pay the most attention to in order to correctly predict the next thing in the target sequence, $y_t$.
Graphically, this looks something like below. For the very first attention vector, we use the encoder's hidden state as the initial hidden state from the decoder. The green/yellow blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

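A sketch of what this attention module could look like is shown below. It follows the additive-attention formulation from the referenced tutorial (an extra tanh and a learned scoring vector); the exact parameterization used in the notebook may differ slightly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Sketch of the attention layer: score each encoder output against the
    decoder's previous hidden state, then normalize with a softmax so the
    weights sum to 1 over the source sentence."""

    def __init__(self, enc_hidden_dim, dec_hidden_dim):
        super().__init__()
        self.attn = nn.Linear(enc_hidden_dim * 2 + dec_hidden_dim, dec_hidden_dim)
        self.v = nn.Linear(dec_hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden: [batch, dec_hidden_dim]
        # encoder_outputs: [src_len, batch, enc_hidden_dim * 2]
        src_len = encoder_outputs.shape[0]
        # repeat the decoder state so it can be paired with every encoder output
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)   # [batch, src_len, dec_hidden_dim]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)   # [batch, src_len, enc_hidden_dim * 2]
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim=2)))
        attention = self.v(energy).squeeze(2)                # [batch, src_len]
        return F.softmax(attention, dim=1)
```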
Decoder
Now comes the decoder. Within the decoder, we first use the attention layer that we've created in the previous section to compute the attention weights; these tell us, for each word in the source sentence, how much attention the model should pay to it when generating the current target output in the sequence. Along with the outputs from the encoder, this gives us the context vector. Finally, the decoder takes the embedded input along with the context vector to generate the target output in the sequence.
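A minimal sketch of such a decoder, reusing the Attention module sketched above, could look like this; again, the names and layer sizes are illustrative rather than the notebook's exact implementation.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: compute attention weights, build the context
    vector with a batched matrix multiply, then feed the embedded input
    together with the context vector to the GRU."""

    def __init__(self, output_dim, embed_dim, enc_hidden_dim, dec_hidden_dim, attention):
        super().__init__()
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.gru = nn.GRU(enc_hidden_dim * 2 + embed_dim, dec_hidden_dim)
        self.fc_out = nn.Linear(enc_hidden_dim * 2 + dec_hidden_dim + embed_dim, output_dim)

    def forward(self, trg_token, hidden, encoder_outputs):
        # trg_token: [batch], hidden: [batch, dec_hidden_dim]
        embedded = self.embedding(trg_token.unsqueeze(0))         # [1, batch, embed_dim]
        a = self.attention(hidden, encoder_outputs).unsqueeze(1)  # [batch, 1, src_len]
        context = torch.bmm(a, encoder_outputs.permute(1, 0, 2))  # [batch, 1, enc_hidden_dim * 2]
        context = context.permute(1, 0, 2)                        # [1, batch, enc_hidden_dim * 2]
        rnn_input = torch.cat((embedded, context), dim=2)
        output, hidden = self.gru(rnn_input, hidden.unsqueeze(0))
        prediction = self.fc_out(torch.cat((output, context, embedded), dim=2).squeeze(0))
        return prediction, hidden.squeeze(0)                      # prediction: [batch, output_dim]
```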
Seq2Seq
This part is about putting the encoder and decoder together and is nearly identical to the vanilla seq2seq framework, hence the explanation is omitted.
Training Seq2Seq
We've done the hard work of defining our seq2seq module. The final touch is to specify the training/evaluation loop.
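As a rough sketch of that loop, assuming a torchtext-style iterator whose batches expose .src and .trg tensors of shape [seq_len, batch_size], and a criterion/optimizer defined elsewhere, one training epoch could look like:

```python
import torch

def train_epoch(model, iterator, optimizer, criterion, clip=1.0):
    """Sketch of one training epoch for the seq2seq model assembled above."""
    model.train()
    epoch_loss = 0.0
    for batch in iterator:
        optimizer.zero_grad()
        output = model(batch.src, batch.trg)   # [trg_len, batch, output_dim]
        output_dim = output.shape[-1]
        # skip the <sos> token and flatten so the loss can be computed per token
        output = output[1:].view(-1, output_dim)
        trg = batch.trg[1:].view(-1)
        loss = criterion(output, trg)
        loss.backward()
        # guard against exploding gradients, common practice for RNNs
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)
```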
Evaluating Seq2Seq
Here, we pick a random example in our dataset and print out the original source and target sentence. Then we take a look at the "predicted" target sentence generated by the model.
Summary
Upon implementing the attention mechanism, we were able to achieve a better evaluation score on the test set, while even using fewer parameters. As mentioned in the original paper:
We extended the basic encoder–decoder by letting a model (soft)search for a set of input words. This frees the model from having to encode the whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word. This has a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences.
Note that another interesting thing that we're capable of doing, but wasn't done here, is to visualize the attention weights to see, for a given translation, where the model is focusing.
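As a sketch of how that could be done, assuming we've collected the attention weights for a single translation into a [trg_len, src_len] matrix while decoding:

```python
import matplotlib.pyplot as plt

def plot_attention(src_tokens, trg_tokens, attention):
    """Visualize attention weights for one translation as a heatmap;
    `attention` is assumed to be a [trg_len, src_len] array of weights."""
    fig, ax = plt.subplots()
    ax.matshow(attention, cmap='bone')
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45)
    ax.set_yticks(range(len(trg_tokens)))
    ax.set_yticklabels(trg_tokens)
    ax.set_xlabel('source')
    ax.set_ylabel('prediction')
    plt.show()
```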