Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
ethen8181
GitHub Repository: ethen8181/machine-learning
Path: blob/master/deep_learning/seq2seq/1_torch_seq2seq_intro.ipynb
2587 views
Kernel: Python 3
# code for loading the format for the notebook import os # path : store the current path to convert back to it later path = os.getcwd() os.chdir(os.path.join('..', '..', 'notebook_format')) from formats import load_style load_style(css_style='custom2.css', plot_style=False)
os.chdir(path) # 1. magic for inline plot # 2. magic to print version # 3. magic so that the notebook will reload external python modules # 4. magic to enable retina (high resolution) plots # https://gist.github.com/minrk/3301035 %matplotlib inline %load_ext watermark %load_ext autoreload %autoreload 2 %config InlineBackend.figure_format='retina' import os import math import time import spacy import torch import random import numpy as np import torch.nn as nn import torch.optim as optim from typing import List from torchtext.datasets import Multi30k from torchtext.data import Field, BucketIterator %watermark -a 'Ethen' -d -t -v -p numpy,torch,torchtext,spacy
Ethen 2020-01-07 21:44:32 CPython 3.6.4 IPython 7.9.0 numpy 1.16.5 torch 1.3.1 torchtext 0.4.0 spacy 2.1.6

Seq2Seq

Seq2Seq (Sequence to Sequence) is a many to many network where two neural networks, one encoder and one decoder work together to transform one sequence to another. The core highlight of this method is having no restrictions on the length of the source and target sequence. At a high-level, the way it works is:

  • The encoder network condenses an input sequence into a vector, this vector is a smaller dimensional representation and is often referred to as the context/thought vector. This thought vector is served as an abstract representation for the entire input sequence.

  • The decoder network takes in that thought vector and unfolds that vector into the output sequence.

The main use case includes:

  • chatbots

  • text summarization

  • speech recognition

  • image captioning

  • machine translation

In this notebook, we'll be implementing the seq2seq model ourselves using Pytorch and use it in the context of German to English translations.

Seq2Seq Introduction

The following sections are heavily "borrowed" from the wonderful tutorial on this topic listed below.

Some personal preference modifications have been made.

The above image shows an example translation. The input/source sentence, "guten morgen", is input into the encoder (green) one word at a time. We also append a start of sequence (<sos>) and end of sequence (<eos>) token to the start and end of sentence, respectively. At each time-step, the input to the encoder is both the current word, xtx_t, as well as the hidden state from the previous time-step, ht1h_{t-1}, and the encoder outputs a new hidden state hth_t. We can think of the hidden state as a vector representation of the sentence so far. The can be represented as a function of both of xtx_t and ht1h_{t-1}:

ht=Encoder(xt,ht1)h_t = \text{Encoder}(x_t, h_{t-1})

We're using the term encoder loosely here, in practice, it can be any type of architecture, the most common ones being RNN-type network such as LSTM (Long Short-Term Memory) or a GRU (Gated Recurrent Unit).

Here, we have X={x1,x2,...,xT}X = \{x_1, x_2, ..., x_T\}, where x1=<sos>,x2=gutenx_1 = \text{<sos>}, x_2 = \text{guten}, etc. The initial hidden state, h0h_0, is usually either initialized to zeros or a learned parameter.

Once the final word, xTx_T, has been passed into the encoder, we use the final hidden state, hTh_T, as the context vector, i.e. hT=zh_T = z. This is a vector representation of the entire source sentence.

Now we have our context vector, zz, we can start decoding it to get the target sentence, "good morning". Again, we append the start and end of sequence tokens to the target sentence. At each time-step, the input to the decoder (blue) is the current word, yty_t, as well as the hidden state from the previous time-step, st1s_{t-1}, where the initial decoder hidden state, s0s_0, is the context vector, s0=z=hTs_0 = z = h_T, i.e. the initial decoder hidden state is the final encoder hidden state. Thus, similar to the encoder, we can represent the decoder as:

st=Decoder(yt,st1)s_t = \text{Decoder}(y_t, s_{t-1})

In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use sts_t to predict (by passing it through a Linear layer, shown in purple) what we think is the next word in the sequence, y^t\hat{y}_t.

y^t=f(st)\hat{y}_t = f(s_t)

The words in the decoder are always generated one after another, with one per time-step. We always use <sos> for the first input to the decoder, y1y_1, but for subsequent inputs, yt>1y_{t>1}, we will sometimes use the actual, ground truth next word in the sequence, yty_t and sometimes use the word predicted by our decoder, y^t1\hat{y}_{t-1}. This is called teacher forcing, which we'll later see in action.

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference (i.e. real world usage) it is common to keep generating words until the model outputs an <eos> token or after a certain amount of words have been generated.

Once we have our predicted target sentence, Y^={y^1,y^2,...,y^T}\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}, we compare it against our actual target sentence, Y={y1,y2,...,yT}Y = \{ y_1, y_2, ..., y_T \}, to calculate our loss. We then use this loss to update all of the parameters in our model.

Data Preparation

We'll be coding up the models in PyTorch and using TorchText to help us do all of the pre-processing required. We'll also be using spaCy to assist in the tokenization of the data. We will introduce the functionalities some these libraries along the way as well.

SEED = 2222 random.seed(SEED) torch.manual_seed(SEED)
<torch._C.Generator at 0x123280170>

The next two code chunks:

  • Downloads the spacy model for the German and English language.

  • Create the tokenizer functions, which will take in the sentence as the input and return the sentence as a list of tokens. These functions can then be passed to torchtext.

# !python -m spacy download de # !python -m spacy download en
# the link below contains explanation of how spacy's tokenization works # https://spacy.io/usage/spacy-101#annotations-token spacy_de = spacy.load('de_core_news_sm') spacy_en = spacy.load('en_core_web_sm') def tokenize_de(text: str) -> List[str]: return [tok.text for tok in spacy_de.tokenizer(text)][::-1] def tokenize_en(text: str) -> List[str]: return [tok.text for tok in spacy_en.tokenizer(text)]
text = "I don't like apple." tokenize_en(text)
['I', 'do', "n't", 'like', 'apple', '.']

The tokenizer is language specific, e.g. it knows that in the English language don't should be tokenized into do not (n't).

Another thing to note is that the order of the source sentence is reversed during the tokenization process. The rationale behind things comes from the original seq2seq paper where they identified that this trick improved the result of their model.

Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. By reversing the source sentence, the first few words in the source sentence now becomes very close to the first few words in the target sentence, thus the model would have lesser issue establishing communication between the source and target sentence. Although, the average distance between words in the source and target language remains the same during this process, however, it was shown that the model learned much better even on later parts of the sentence.

Declaring Fields

Moving on, we will begin leveraging torchtext's functionality. The first once is Field, which is where we specify how we wish to preprocess our text data for a certain field.

Here, we set the tokenize argument to the correct tokenization function for the source and target field, with German being the source field and English being the target field. The field also appends the "start of sequence" and "end of sequence" tokens via the init_token and eos_token arguments, and converts all words to lowercase. The docstring of the Field object is pretty well-written, please refer to it to see other arguments that it takes in.

source = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True) target = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

Constructing Dataset

We've defined the logic of processing our raw text data, now we need to tell the fields what data it should work on. This is where Dataset comes in. The dataset we'll be using is the Multi30k dataset. This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. Torchtext comes with a capability for us to download and load the training, validation and test data.

exts specifies which languages to use as the source and target (source goes first) and fields specifies which field to use for the source and target.

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(source, target)) print(f"Number of training examples: {len(train_data.examples)}") print(f"Number of validation examples: {len(valid_data.examples)}") print(f"Number of testing examples: {len(test_data.examples)}")
Number of training examples: 29000 Number of validation examples: 1014 Number of testing examples: 1000

Upon loading the dataset, we can indexed and iterate over the Dataset like a normal list. Each element in the dataset bundles the attributes of a single record for us. We can index our dataset like a list and then access the .src and .trg attribute to take a look at the tokenized source and target sentence.

# equivalent, albeit more verbiage train_data.examples[0].src train_data[0].src
['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei']
train_data[0].trg
['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']

The next missing piece is to build the vocabulary for the source and target languages. That way we can convert our tokenized tokens into integers so that they can be fed into downstream models. Constructing the vocabulary and word to integer mapping is done by calling the build_vocab method of a Field on a dataset. This adds the vocab attribute to the field.

The vocabularies of the source and target languages are distinct. Using the min_freq argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an <unk> (unknown) token (we can customize this in the Field earlier if we like).

It is important to note that our vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into our model, giving us artificially inflated validation/test scores.

source.build_vocab(train_data, min_freq=2) target.build_vocab(train_data, min_freq=2) print(f"Unique tokens in source (de) vocabulary: {len(source.vocab)}") print(f"Unique tokens in target (en) vocabulary: {len(target.vocab)}")
Unique tokens in source (de) vocabulary: 7855 Unique tokens in target (en) vocabulary: 5893

Constructing Iterator

The final step of preparing the data is to create the iterators. Very similar to DataLoader in the standard pytorch package, Iterator in torchtext converts our data into batches, so that they can be fed into the model. These can be iterated on to return a batch of data which will have a src and trg attribute (PyTorch tensors containing a batch of numericalized source and target sentences). Numericalized is just a fancy way of saying they have been converted from a sequence of tokens to a sequence of corresponding indices, where the mapping between the tokens and indices comes from the learned vocabulary.

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, the same with the target sentences. Luckily, torchtext iterators handle this for us! BucketIterator is a extremely useful torchtext feature. It automatically shuffles and buckets the input sequences into sequences of similar length, this minimizes the amount of padding that we need to perform.

BATCH_SIZE = 128 # pytorch boilerplate that determines whether a GPU is present or not, # this determines whether our dataset or model can to moved to a GPU device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # create batches out of the dataset and sends them to the appropriate device train_iterator, valid_iterator, test_iterator = BucketIterator.splits( (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)
# pretend that we're iterating over the iterator and print out the print element test_batch = next(iter(test_iterator)) test_batch
[torchtext.data.batch.Batch of size 128 from MULTI30K] [.src]:[torch.LongTensor of size 10x128] [.trg]:[torch.LongTensor of size 14x128]
test_batch.src
tensor([[ 2, 2, 2, ..., 2, 2, 2], [ 4, 4, 4, ..., 4, 714, 4], [ 123, 91, 3449, ..., 669, 12, 1643], ..., [6788, 41, 26, ..., 1, 1, 1], [ 18, 105, 5, ..., 1, 1, 1], [ 3, 3, 3, ..., 1, 1, 1]])

We can list out the first batch, we see each element of the iterator is a Batch object, similar to element of a Dataset, we can access the fields via its attributes. The next important thing to note that it is of size [sentence length, batch size], and the longest sentence in the first batch of the source language has a length of 10.

Seq2Seq Implementation

# adjustable parameters INPUT_DIM = len(source.vocab) OUTPUT_DIM = len(target.vocab) ENC_EMB_DIM = 256 DEC_EMB_DIM = 256 HID_DIM = 512 N_LAYERS = 2 ENC_DROPOUT = 0.5 DEC_DROPOUT = 0.5

To define our seq2seq model, we first specify the encoder and decoder separately.

Encoder Module

class Encoder(nn.Module): """ Input : - source batch Layer : source batch -> Embedding -> LSTM Output : - LSTM hidden state - LSTM cell state Parmeters --------- input_dim : int Input dimension, should equal to the source vocab size. emb_dim : int Embedding layer's dimension. hid_dim : int LSTM Hidden/Cell state's dimension. n_layers : int Number of LSTM layers. dropout : float Dropout for the LSTM layer. """ def __init__(self, input_dim: int, emb_dim: int, hid_dim: int, n_layers: int, dropout: float): super().__init__() self.emb_dim = emb_dim self.hid_dim = hid_dim self.input_dim = input_dim self.n_layers = n_layers self.dropout = dropout self.embedding = nn.Embedding(input_dim, emb_dim) self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout) def forward(self, src_batch: torch.LongTensor): """ Parameters ---------- src_batch : 2d torch.LongTensor Batched tokenized source sentence of shape [sent len, batch size]. Returns ------- hidden, cell : 3d torch.LongTensor Hidden and cell state of the LSTM layer. Each state's shape [n layers * n directions, batch size, hidden dim] """ embedded = self.embedding(src_batch) # [sent len, batch size, emb dim] outputs, (hidden, cell) = self.rnn(embedded) # outputs -> [sent len, batch size, hidden dim * n directions] return hidden, cell
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT).to(device) hidden, cell = encoder(test_batch.src) hidden.shape, cell.shape
(torch.Size([2, 128, 512]), torch.Size([2, 128, 512]))

Decoder Module

The decoder accept a batch of input tokens, previous hidden states and previous cell states. Note that in the decoder module, we are only decoding one token at a time, the input tokens will always have a sequence length of 1. This is different from the encoder module where we encode the entire source sentence all at once.

class Decoder(nn.Module): """ Input : - first token in the target batch - LSTM hidden state from the encoder - LSTM cell state from the encoder Layer : target batch -> Embedding -- | encoder hidden state ------|--> LSTM -> Linear | encoder cell state ------- Output : - prediction - LSTM hidden state - LSTM cell state Parmeters --------- output : int Output dimension, should equal to the target vocab size. emb_dim : int Embedding layer's dimension. hid_dim : int LSTM Hidden/Cell state's dimension. n_layers : int Number of LSTM layers. dropout : float Dropout for the LSTM layer. """ def __init__(self, output_dim: int, emb_dim: int, hid_dim: int, n_layers: int, dropout: float): super().__init__() self.emb_dim = emb_dim self.hid_dim = hid_dim self.output_dim = output_dim self.n_layers = n_layers self.dropout = dropout self.embedding = nn.Embedding(output_dim, emb_dim) self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout) self.out = nn.Linear(hid_dim, output_dim) def forward(self, trg: torch.LongTensor, hidden: torch.FloatTensor, cell: torch.FloatTensor): """ Parameters ---------- trg : 1d torch.LongTensor Batched tokenized source sentence of shape [batch size]. hidden, cell : 3d torch.FloatTensor Hidden and cell state of the LSTM layer. Each state's shape [n layers * n directions, batch size, hidden dim] Returns ------- prediction : 2d torch.LongTensor For each token in the batch, the predicted target vobulary. Shape [batch size, output dim] hidden, cell : 3d torch.FloatTensor Hidden and cell state of the LSTM layer. Each state's shape [n layers * n directions, batch size, hidden dim] """ # [1, batch size, emb dim], the 1 serves as sent len embedded = self.embedding(trg.unsqueeze(0)) outputs, (hidden, cell) = self.rnn(embedded, (hidden, cell)) prediction = self.out(outputs.squeeze(0)) return prediction, hidden, cell
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT).to(device) # notice that we are not passing the entire the .trg prediction, hidden, cell = decoder(test_batch.trg[0], hidden, cell) prediction.shape, hidden.shape, cell.shape
(torch.Size([128, 5893]), torch.Size([2, 128, 512]), torch.Size([2, 128, 512]))

Seq2Seq Module

For the final part of the implementation, we'll implement the seq2seq model. This will handle:

  • receiving the input/source sentence

  • using the encoder to produce the context vectors

  • using the decoder to produce the predicted output/target sentence

The Seq2Seq model takes in an Encoder, Decoder, and a device (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the Encoder and Decoder. This is not always the case, as we do not necessarily need the same number of layers or the same hidden dimension sizes in a sequence-to-sequence model. However, if we do have a different number of layers we will need to make decisions about how this is handled. For example, if our encoder has 2 layers and our decoder only has 1, how is this handled? Do we average the two context vectors output by the decoder? Do we pass both through a linear layer? Do we only use the context vector from the highest layer? etc.

Our forward method takes the source sentence, target sentence and a teacher-forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded. With probability equal to the teaching forcing ratio (teacher_forcing_ratio) we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. However, with probability 1 - teacher_forcing_ratio, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence. Note that the teacher forcing ratio is only done during training and should be shut off during evaluation.

The first thing we do in the forward method is to create an outputs tensor that will store all of our predictions, Y^\hat{Y}.

We then feed the input/source sentence, XX/src, into the encoder and receive our final hidden and cell states.

The first input to the decoder is the start of sequence (<sos>) token. As our trg tensor already has the <sos> token appended (all the way back when we defined the init_token in our target field) we get our y1y_1 by slicing into it. We know how long our target sentences should be (max_len), so we loop that many times. During each iteration of the loop, we:

  • pass the input, previous hidden and previous cell states (yt,st1,ct1y_t, s_{t-1}, c_{t-1}) into the decoder

  • receive a prediction, next hidden state and next cell state (y^t+1,st,ct\hat{y}_{t+1}, s_{t}, c_{t}) from the decoder

  • place our prediction, y^t+1\hat{y}_{t+1}/output in our tensor of predictions, Y^\hat{Y}/outputs

  • decide if we are going to "teacher force" or not

    • if we do, the next input is the ground-truth next token in the sequence, yt+1y_{t+1}/trg[t]

    • if we don't, the next input is the predicted next token in the sequence, y^t+1\hat{y}_{t+1}/top1, which we get by doing an argmax over the output tensor

Once we've made all of our predictions, we return our tensor full of predictions, Y^\hat{Y}/outputs.

Note: our decoder loop starts at 1, not 0. This means the 0th element of our outputs tensor remains all zeros. So our trg and outputs look something like:

trg=[<sos>,y1,y2,y3,<eos>]outputs=[0,y^1,y^2,y^3,<eos>]\begin{align*} \text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\ \text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>] \end{align*}

Later on when we calculate the loss, we cut off the first element of each tensor to get:

trg=[y1,y2,y3,<eos>]outputs=[y^1,y^2,y^3,<eos>]\begin{align*} \text{trg} = [&y_1, y_2, y_3, <eos>]\\ \text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>] \end{align*}

All of this should make more sense after we look at the code in the next few section. Feel free to check out the discussion in these two github issues for some more context with this topic. issue-45 and issue-46

class Seq2Seq(nn.Module): def __init__(self, encoder: Encoder, decoder: Decoder, device: torch.device): super().__init__() self.encoder = encoder self.decoder = decoder self.device = device assert encoder.hid_dim == decoder.hid_dim, \ 'Hidden dimensions of encoder and decoder must be equal!' assert encoder.n_layers == decoder.n_layers, \ 'Encoder and decoder must have equal number of layers!' def forward(self, src_batch: torch.LongTensor, trg_batch: torch.LongTensor, teacher_forcing_ratio: float=0.5): max_len, batch_size = trg_batch.shape trg_vocab_size = self.decoder.output_dim # tensor to store decoder's output outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device) # last hidden & cell state of the encoder is used as the decoder's initial hidden state hidden, cell = self.encoder(src_batch) trg = trg_batch[0] for i in range(1, max_len): prediction, hidden, cell = self.decoder(trg, hidden, cell) outputs[i] = prediction if random.random() < teacher_forcing_ratio: trg = trg_batch[i] else: trg = prediction.argmax(1) return outputs
# note that this implementation assumes that the size of the hidden layer, # and the number of layer are the same between the encoder and decoder encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT) decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT) seq2seq = Seq2Seq(encoder, decoder, device).to(device) seq2seq
Seq2Seq( (encoder): Encoder( (embedding): Embedding(7855, 256) (rnn): LSTM(256, 512, num_layers=2, dropout=0.5) ) (decoder): Decoder( (embedding): Embedding(5893, 256) (rnn): LSTM(256, 512, num_layers=2, dropout=0.5) (out): Linear(in_features=512, out_features=5893, bias=True) ) )
outputs = seq2seq(test_batch.src, test_batch.trg) outputs.shape
torch.Size([14, 128, 5893])
def count_parameters(model): return sum(p.numel() for p in model.parameters() if p.requires_grad) print(f'The model has {count_parameters(seq2seq):,} trainable parameters')
The model has 13,899,013 trainable parameters

Training Seq2Seq

We've done the hard work of defining our seq2seq module. The final touch is to specify the training/evaluation loop.

optimizer = optim.Adam(seq2seq.parameters()) # ignore the padding index when calculating the loss PAD_IDX = target.vocab.stoi['<pad>'] criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
def train(seq2seq, iterator, optimizer, criterion): seq2seq.train() epoch_loss = 0 for batch in iterator: optimizer.zero_grad() outputs = seq2seq(batch.src, batch.trg) # 1. as mentioned in the seq2seq section, we will # cut off the first element when performing the evaluation # 2. the loss function only works on 2d inputs # with 1d targets we need to flatten each of them outputs_flatten = outputs[1:].view(-1, outputs.shape[-1]) trg_flatten = batch.trg[1:].view(-1) loss = criterion(outputs_flatten, trg_flatten) loss.backward() optimizer.step() epoch_loss += loss.item() return epoch_loss / len(iterator)
def evaluate(seq2seq, iterator, criterion): seq2seq.eval() epoch_loss = 0 with torch.no_grad(): for batch in iterator: # turn off teacher forcing outputs = seq2seq(batch.src, batch.trg, teacher_forcing_ratio=0) # trg = [trg sent len, batch size] # output = [trg sent len, batch size, output dim] outputs_flatten = outputs[1:].view(-1, outputs.shape[-1]) trg_flatten = batch.trg[1:].view(-1) loss = criterion(outputs_flatten, trg_flatten) epoch_loss += loss.item() return epoch_loss / len(iterator)
def epoch_time(start_time, end_time): elapsed_time = end_time - start_time elapsed_mins = int(elapsed_time / 60) elapsed_secs = int(elapsed_time - (elapsed_mins * 60)) return elapsed_mins, elapsed_secs
N_EPOCHS = 20 best_valid_loss = float('inf') for epoch in range(N_EPOCHS): start_time = time.time() train_loss = train(seq2seq, train_iterator, optimizer, criterion) valid_loss = evaluate(seq2seq, valid_iterator, criterion) end_time = time.time() epoch_mins, epoch_secs = epoch_time(start_time, end_time) if valid_loss < best_valid_loss: best_valid_loss = valid_loss torch.save(seq2seq.state_dict(), 'tut1-model.pt') # it's easier to see a change in perplexity between epoch as it's an exponential # of the loss, hence the scale of the measure is much bigger print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s') print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}') print(f'\t Val. Loss: {valid_loss:.3f} | Val. PPL: {math.exp(valid_loss):7.3f}')
Epoch: 01 | Time: 1m 12s Train Loss: 5.023 | Train PPL: 151.870 Val. Loss: 4.904 | Val. PPL: 134.856 Epoch: 02 | Time: 1m 12s Train Loss: 4.396 | Train PPL: 81.134 Val. Loss: 4.651 | Val. PPL: 104.687 Epoch: 03 | Time: 1m 12s Train Loss: 4.076 | Train PPL: 58.924 Val. Loss: 4.411 | Val. PPL: 82.381 Epoch: 04 | Time: 1m 12s Train Loss: 3.811 | Train PPL: 45.217 Val. Loss: 4.314 | Val. PPL: 74.703 Epoch: 05 | Time: 1m 12s Train Loss: 3.569 | Train PPL: 35.482 Val. Loss: 4.014 | Val. PPL: 55.342 Epoch: 06 | Time: 1m 12s Train Loss: 3.355 | Train PPL: 28.659 Val. Loss: 3.933 | Val. PPL: 51.046 Epoch: 07 | Time: 1m 12s Train Loss: 3.187 | Train PPL: 24.222 Val. Loss: 3.811 | Val. PPL: 45.207 Epoch: 08 | Time: 1m 12s Train Loss: 3.028 | Train PPL: 20.662 Val. Loss: 3.810 | Val. PPL: 45.140 Epoch: 09 | Time: 1m 12s Train Loss: 2.863 | Train PPL: 17.513 Val. Loss: 3.709 | Val. PPL: 40.809 Epoch: 10 | Time: 1m 12s Train Loss: 2.751 | Train PPL: 15.661 Val. Loss: 3.755 | Val. PPL: 42.746 Epoch: 11 | Time: 1m 12s Train Loss: 2.615 | Train PPL: 13.666 Val. Loss: 3.727 | Val. PPL: 41.568 Epoch: 12 | Time: 1m 12s Train Loss: 2.481 | Train PPL: 11.959 Val. Loss: 3.692 | Val. PPL: 40.135 Epoch: 13 | Time: 1m 12s Train Loss: 2.389 | Train PPL: 10.898 Val. Loss: 3.734 | Val. PPL: 41.846 Epoch: 14 | Time: 1m 12s Train Loss: 2.281 | Train PPL: 9.791 Val. Loss: 3.748 | Val. PPL: 42.419 Epoch: 15 | Time: 1m 12s Train Loss: 2.179 | Train PPL: 8.838 Val. Loss: 3.722 | Val. PPL: 41.360 Epoch: 16 | Time: 1m 12s Train Loss: 2.082 | Train PPL: 8.019 Val. Loss: 3.798 | Val. PPL: 44.629 Epoch: 17 | Time: 1m 12s Train Loss: 2.017 | Train PPL: 7.514 Val. Loss: 3.731 | Val. PPL: 41.717 Epoch: 18 | Time: 1m 12s Train Loss: 1.912 | Train PPL: 6.767 Val. Loss: 3.791 | Val. PPL: 44.289 Epoch: 19 | Time: 1m 11s Train Loss: 1.839 | Train PPL: 6.292 Val. Loss: 3.789 | Val. PPL: 44.197 Epoch: 20 | Time: 1m 11s Train Loss: 1.758 | Train PPL: 5.802 Val. Loss: 3.880 | Val. PPL: 48.423

Evaluating Seq2Seq

seq2seq.load_state_dict(torch.load('tut1-model.pt')) test_loss = evaluate(seq2seq, test_iterator, criterion) print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')
| Test Loss: 3.650 | Test PPL: 38.477 |

Here, we pick a random example in our dataset, print out the original source and target sentence. Then takes a look at whether the "predicted" target sentence generated by the model.

example_idx = 0 example = train_data.examples[example_idx] print('source sentence: ', ' '.join(example.src)) print('target sentence: ', ' '.join(example.trg))
source sentence: . büsche vieler nähe der in freien im sind männer weiße junge zwei target sentence: two young , white males are outside near many bushes .
src_tensor = source.process([example.src]).to(device) trg_tensor = target.process([example.trg]).to(device) print(trg_tensor.shape) seq2seq.eval() with torch.no_grad(): outputs = seq2seq(src_tensor, trg_tensor, teacher_forcing_ratio=0) outputs.shape
torch.Size([13, 1])
torch.Size([13, 1, 5893])
output_idx = outputs[1:].squeeze(1).argmax(1) ' '.join([target.vocab.itos[idx] for idx in output_idx])
'two young men in large large are near some sort . <eos>'

Summary

In this document:

  • We took a stab at implementing a vanilla version of the seq2seq model, and train it on a German to English translation.

  • Implemented the trick introduced by the original seq2seq paper where they reverse the order of the tokens in the source sentence.

There are a lot of other tricks/ideas that are mentioned in the original paper and worth exploring. e.g.

  • A LSTM with 4 layers was chosen.

  • Beam Search was also used to decode the sentence.

  • Instead of only relying on log-loss or perplexity, another evaluation metric that they used to evaluate the quality of their translation.

Reference