Path: blob/master/examples/nlp/neural_machine_translation_with_keras_hub.py
"""
Title: English-to-Spanish translation with KerasHub
Author: [Abheesht Sharma](https://github.com/abheesht17/)
Date created: 2022/05/26
Last modified: 2024/04/30
Description: Use KerasHub to train a sequence-to-sequence Transformer model on the machine translation task.
Accelerator: GPU
"""

"""
## Introduction

KerasHub provides building blocks for NLP (model layers, tokenizers, metrics, etc.) and
makes it convenient to construct NLP pipelines.

In this example, we'll use KerasHub layers to build an encoder-decoder Transformer
model, and train it on the English-to-Spanish machine translation task.

This example is based on the
[English-to-Spanish NMT
example](https://keras.io/examples/nlp/neural_machine_translation_with_transformer/)
by [fchollet](https://twitter.com/fchollet). The original example is more low-level
and implements layers from scratch, whereas this example uses KerasHub to show
some more advanced approaches, such as subword tokenization and using metrics
to compute the quality of generated translations.

You'll learn how to:

- Tokenize text using `keras_hub.tokenizers.WordPieceTokenizer`.
- Implement a sequence-to-sequence Transformer model using KerasHub's
`keras_hub.layers.TransformerEncoder`, `keras_hub.layers.TransformerDecoder` and
`keras_hub.layers.TokenAndPositionEmbedding` layers, and train it.
- Use `keras_hub.samplers` to generate translations of unseen input sentences
using the greedy decoding strategy!

Don't worry if you aren't familiar with KerasHub. This tutorial will start with
the basics. Let's dive right in!
"""

"""
## Setup

Before we start implementing the pipeline, let's import all the libraries we need.
"""

"""shell
pip install -q --upgrade rouge-score
pip install -q --upgrade keras-hub
"""

import keras_hub
import pathlib
import random

import keras
from keras import ops

import tensorflow.data as tf_data

"""
Let's also define our parameters/hyperparameters.
"""

BATCH_SIZE = 64
EPOCHS = 1  # This should be at least 10 for convergence
MAX_SEQUENCE_LENGTH = 40
ENG_VOCAB_SIZE = 15000
SPA_VOCAB_SIZE = 15000

EMBED_DIM = 256
INTERMEDIATE_DIM = 2048
NUM_HEADS = 8

"""
## Downloading the data

We'll be working with an English-to-Spanish translation dataset
provided by [Anki](https://www.manythings.org/anki/).
Let's download it:
"""

text_file = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
text_file = pathlib.Path(text_file) / "spa-eng" / "spa.txt"

"""
## Parsing the data

Each line contains an English sentence and its corresponding Spanish sentence.
The English sentence is the *source sequence* and the Spanish one is the *target sequence*.
Before adding the text to a list, we convert it to lowercase.
"""

with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    eng = eng.lower()
    spa = spa.lower()
    text_pairs.append((eng, spa))

"""
Here's what our sentence pairs look like:
"""

for _ in range(5):
    print(random.choice(text_pairs))

"""
Now, let's split the sentence pairs into a training set, a validation set,
and a test set.
"""

random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples : num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples :]

print(f"{len(text_pairs)} total pairs")
print(f"{len(train_pairs)} training pairs")
print(f"{len(val_pairs)} validation pairs")
print(f"{len(test_pairs)} test pairs")


"""
## Tokenizing the data

We'll define two tokenizers - one for the source language (English), and the other
for the target language (Spanish). We'll be using
`keras_hub.tokenizers.WordPieceTokenizer` to tokenize the text.
`keras_hub.tokenizers.WordPieceTokenizer` takes a WordPiece vocabulary
and has functions for tokenizing the text, and detokenizing sequences of tokens.

Before we define the two tokenizers, we first need to train them on the dataset
we have. The WordPiece tokenization algorithm is a subword tokenization algorithm;
training it on a corpus gives us a vocabulary of subwords. A subword tokenizer
is a compromise between word tokenizers (word tokenizers need very large
vocabularies for good coverage of input words), and character tokenizers
(characters don't really encode meaning like words do). Luckily, KerasHub
makes it very simple to train WordPiece on a corpus with the
`keras_hub.tokenizers.compute_word_piece_vocabulary` utility.
"""


def train_word_piece(text_samples, vocab_size, reserved_tokens):
    word_piece_ds = tf_data.Dataset.from_tensor_slices(text_samples)
    vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab
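

"""
Before training the real vocabularies, here is a tiny, self-contained illustration of
the subword idea. It is separate from the pipeline, and the vocabulary entries below
are made up purely for this sketch: words that are not vocabulary entries get split
into known subword pieces instead of being collapsed into `"[UNK]"`.
"""

toy_vocab = ["[PAD]", "[UNK]", "walk", "run", "jump", "##ing", "##ed"]
toy_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=toy_vocab, dtype="string"
)
# "walking" and "jumped" are not vocabulary entries, but their pieces are, so they
# are split into known subwords ("walk" + "##ing", "jump" + "##ed") rather than
# being mapped to "[UNK]".
print(toy_tokenizer.tokenize(["walking", "jumped"]))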

"""
Every vocabulary has a few special, reserved tokens. We have four such tokens:

- `"[PAD]"` - Padding token. Padding tokens are appended to the input sequence
when its length is shorter than the maximum sequence length.
- `"[UNK]"` - Unknown token.
- `"[START]"` - Token that marks the start of the input sequence.
- `"[END]"` - Token that marks the end of the input sequence.
"""

reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

eng_samples = [text_pair[0] for text_pair in train_pairs]
eng_vocab = train_word_piece(eng_samples, ENG_VOCAB_SIZE, reserved_tokens)

spa_samples = [text_pair[1] for text_pair in train_pairs]
spa_vocab = train_word_piece(spa_samples, SPA_VOCAB_SIZE, reserved_tokens)

"""
Let's see some tokens!
"""

print("English Tokens: ", eng_vocab[100:110])
print("Spanish Tokens: ", spa_vocab[100:110])

"""
Now, let's define the tokenizers. We will configure the tokenizers with the
vocabularies trained above.
"""

eng_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=eng_vocab, lowercase=False
)
spa_tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=spa_vocab, lowercase=False
)

"""
Let's try and tokenize a sample from our dataset! To verify whether the text has
been tokenized correctly, we can also detokenize the list of tokens back to the
original text.
"""

eng_input_ex = text_pairs[0][0]
eng_tokens_ex = eng_tokenizer.tokenize(eng_input_ex)
print("English sentence: ", eng_input_ex)
print("Tokens: ", eng_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    eng_tokenizer.detokenize(eng_tokens_ex),
)

print()

spa_input_ex = text_pairs[0][1]
spa_tokens_ex = spa_tokenizer.tokenize(spa_input_ex)
print("Spanish sentence: ", spa_input_ex)
print("Tokens: ", spa_tokens_ex)
print(
    "Recovered text after detokenizing: ",
    spa_tokenizer.detokenize(spa_tokens_ex),
)

"""
## Format datasets

Next, we'll format our datasets.

At each training step, the model will seek to predict target words N+1 (and beyond)
using the source sentence and the target words 0 to N.

As such, the training dataset will yield a tuple `(inputs, targets)`, where:

- `inputs` is a dictionary with the keys `encoder_inputs` and `decoder_inputs`.
`encoder_inputs` is the tokenized source sentence and `decoder_inputs` is the target
sentence "so far",
that is to say, the words 0 to N used to predict word N+1 (and beyond) in the target
sentence.
- `targets` is the target sentence offset by one step:
it provides the next words in the target sentence -- what the model will try to predict.

We will add special tokens, `"[START]"` and `"[END]"`, to the input Spanish
sentence after tokenizing the text. We will also pad the input to a fixed length.
This can be easily done using `keras_hub.layers.StartEndPacker`.
"""
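
"""
To make the offset concrete, here is a toy illustration with made-up tokens (plain
Python lists, not part of the pipeline). It mirrors the `spa[:, :-1]` and `spa[:, 1:]`
slicing in `preprocess_batch` below: the decoder input is the sequence "so far", and
the target at each position is simply the next token.
"""

toy_spa = ["[START]", "yo", "corro", "[END]", "[PAD]"]
print("decoder_inputs:", toy_spa[:-1])  # everything except the last token
print("targets:       ", toy_spa[1:])  # everything shifted one step to the left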


def preprocess_batch(eng, spa):
    eng = eng_tokenizer(eng)
    spa = spa_tokenizer(spa)

    # Pad `eng` to `MAX_SEQUENCE_LENGTH`.
    eng_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH,
        pad_value=eng_tokenizer.token_to_id("[PAD]"),
    )
    eng = eng_start_end_packer(eng)

    # Add special tokens (`"[START]"` and `"[END]"`) to `spa` and pad it as well.
    spa_start_end_packer = keras_hub.layers.StartEndPacker(
        sequence_length=MAX_SEQUENCE_LENGTH + 1,
        start_value=spa_tokenizer.token_to_id("[START]"),
        end_value=spa_tokenizer.token_to_id("[END]"),
        pad_value=spa_tokenizer.token_to_id("[PAD]"),
    )
    spa = spa_start_end_packer(spa)

    return (
        {
            "encoder_inputs": eng,
            "decoder_inputs": spa[:, :-1],
        },
        spa[:, 1:],
    )


def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf_data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(BATCH_SIZE)
    dataset = dataset.map(preprocess_batch, num_parallel_calls=tf_data.AUTOTUNE)
    return dataset.shuffle(2048).prefetch(16).cache()


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

"""
Let's take a quick look at the sequence shapes
(we have batches of 64 pairs, and all sequences are 40 steps long):
"""

for inputs, targets in train_ds.take(1):
    print(f'inputs["encoder_inputs"].shape: {inputs["encoder_inputs"].shape}')
    print(f'inputs["decoder_inputs"].shape: {inputs["decoder_inputs"].shape}')
    print(f"targets.shape: {targets.shape}")


"""
## Building the model

Now, let's move on to the exciting part - defining our model!
We first need an embedding layer, i.e., a vector for every token in our input sequence.
This embedding layer can be initialised randomly. We also need a positional
embedding layer which encodes the word order in the sequence. The convention is
to add these two embeddings. KerasHub has a `keras_hub.layers.TokenAndPositionEmbedding`
layer which does all of the above steps for us.

Our sequence-to-sequence Transformer consists of a `keras_hub.layers.TransformerEncoder`
layer and a `keras_hub.layers.TransformerDecoder` layer chained together.

The source sequence will be passed to `keras_hub.layers.TransformerEncoder`, which
will produce a new representation of it. This new representation will then be passed
to the `keras_hub.layers.TransformerDecoder`, together with the target sequence
so far (target words 0 to N). The `keras_hub.layers.TransformerDecoder` will
then seek to predict the next words in the target sequence (N+1 and beyond).

A key detail that makes this possible is causal masking.
The `keras_hub.layers.TransformerDecoder` sees the entire sequence at once, and
thus we must make sure that it only uses information from target tokens 0 to N
when predicting token N+1 (otherwise, it could use information from the future,
which would result in a model that cannot be used at inference time). Causal masking
is enabled by default in `keras_hub.layers.TransformerDecoder`.

We also need to mask the padding tokens (`"[PAD]"`). For this, we can set the
`mask_zero` argument of the `keras_hub.layers.TokenAndPositionEmbedding` layer
to True. This will then be propagated to all subsequent layers.
"""
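
"""
As a quick aside, the causal mask is just a lower-triangular matrix: position `i` is
allowed to attend to positions `0` through `i` only. The decoder builds and applies
this mask internally, so the sketch below is purely illustrative:
"""

# Row i is the attention mask for query position i: 1 = "may attend", 0 = "masked out".
print(ops.tril(ops.ones((4, 4))))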

# Encoder
encoder_inputs = keras.Input(shape=(None,), name="encoder_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=ENG_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(encoder_inputs)

encoder_outputs = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
encoder = keras.Model(encoder_inputs, encoder_outputs)


# Decoder
decoder_inputs = keras.Input(shape=(None,), name="decoder_inputs")
encoded_seq_inputs = keras.Input(shape=(None, EMBED_DIM), name="decoder_state_inputs")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=SPA_VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
)(decoder_inputs)

x = keras_hub.layers.TransformerDecoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(decoder_sequence=x, encoder_sequence=encoded_seq_inputs)
x = keras.layers.Dropout(0.5)(x)
decoder_outputs = keras.layers.Dense(SPA_VOCAB_SIZE, activation="softmax")(x)
decoder = keras.Model(
    [
        decoder_inputs,
        encoded_seq_inputs,
    ],
    decoder_outputs,
)
decoder_outputs = decoder([decoder_inputs, encoder_outputs])

transformer = keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs,
    name="transformer",
)

"""
## Training our model

We'll use accuracy as a quick way to monitor training progress on the validation data.
Note that machine translation typically uses BLEU scores as well as other metrics,
rather than accuracy. However, in order to use metrics like ROUGE, BLEU, etc., we
will have to decode the probabilities and generate the text. Text generation is
computationally expensive, and performing this during training is not recommended.

Here we only train for 1 epoch, but to get the model to actually converge
you should train for at least 10 epochs.
"""

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)

"""
## Decoding test sentences (qualitative analysis)

Finally, let's demonstrate how to translate brand new English sentences.
We simply feed into the model the tokenized English sentence
as well as the target token `"[START]"`. The model outputs probabilities of the
next token. We then repeatedly generate the next token conditioned on the
tokens generated so far, until we hit the token `"[END]"`.

For decoding, we will use the `keras_hub.samplers` module from
KerasHub. Greedy decoding is a text decoding method which outputs the most
likely next token at each time step, i.e., the token with the highest probability.
"""
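
"""
In other words, greedy decoding simply takes the argmax of the predicted distribution
at every step. Below is a minimal sketch with made-up probabilities (the real loop is
handled by `keras_hub.samplers.GreedySampler` inside `decode_sequences`); other
samplers from `keras_hub.samplers`, such as `TopPSampler` or `TopKSampler`, follow the
same calling convention and could be dropped in instead.
"""

toy_probs = ops.convert_to_tensor([[0.1, 0.7, 0.2]])  # fake next-token probabilities
print(ops.argmax(toy_probs, axis=-1))  # greedy decoding picks token id 1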


def decode_sequences(input_sentences):
    batch_size = 1

    # Tokenize the encoder input.
    encoder_input_tokens = ops.convert_to_tensor(
        eng_tokenizer(input_sentences), sparse=False, ragged=False
    )
    if ops.shape(encoder_input_tokens)[1] < MAX_SEQUENCE_LENGTH:
        pads = ops.zeros(
            (1, MAX_SEQUENCE_LENGTH - ops.shape(encoder_input_tokens)[1]),
            dtype=encoder_input_tokens.dtype,
        )
        encoder_input_tokens = ops.concatenate([encoder_input_tokens, pads], 1)

    # Define a function that outputs the next token's probability given the
    # input sequence.
    def next(prompt, cache, index):
        logits = transformer([encoder_input_tokens, prompt])[:, index - 1, :]
        # Ignore hidden states for now; only needed for contrastive search.
        hidden_states = None
        return logits, hidden_states, cache

    # Build a prompt of length 40 with a start token and padding tokens.
    length = 40
    start = ops.full((batch_size, 1), spa_tokenizer.token_to_id("[START]"))
    pad = ops.full((batch_size, length - 1), spa_tokenizer.token_to_id("[PAD]"))
    prompt = ops.concatenate((start, pad), axis=-1)

    generated_tokens = keras_hub.samplers.GreedySampler()(
        next,
        prompt,
        stop_token_ids=[spa_tokenizer.token_to_id("[END]")],
        index=1,  # Start sampling after start token.
    )
    generated_sentences = spa_tokenizer.detokenize(generated_tokens)
    return generated_sentences


test_eng_texts = [pair[0] for pair in test_pairs]
for i in range(2):
    input_sentence = random.choice(test_eng_texts)
    translated = decode_sequences([input_sentence])[0]
    translated = (
        translated.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )
    print(f"** Example {i} **")
    print(input_sentence)
    print(translated)
    print()

"""
## Evaluating our model (quantitative analysis)

There are many metrics which are used for text generation tasks. Here, to
evaluate translations generated by our model, let's compute the ROUGE-1 and
ROUGE-2 scores. Essentially, ROUGE-N is a score based on the number of common
n-grams between the reference text and the generated text. ROUGE-1 and ROUGE-2
use the number of common unigrams and bigrams, respectively.

We will calculate the score over 30 test samples (since decoding is an
expensive process).
"""
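
"""
As a quick sanity check of the metric itself, here is a toy example with made-up
sentences (independent of the model): five of the six unigrams match, so the ROUGE-1
precision, recall and F1 score should all come out around 0.83.
"""

toy_rouge = keras_hub.metrics.RougeN(order=1)
toy_rouge("the cat sat on a mat", "the cat sat on a rug")  # reference, then candidate
print(toy_rouge.result())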

rouge_1 = keras_hub.metrics.RougeN(order=1)
rouge_2 = keras_hub.metrics.RougeN(order=2)

for test_pair in test_pairs[:30]:
    input_sentence = test_pair[0]
    reference_sentence = test_pair[1]

    translated_sentence = decode_sequences([input_sentence])[0]
    translated_sentence = (
        translated_sentence.replace("[PAD]", "")
        .replace("[START]", "")
        .replace("[END]", "")
        .strip()
    )

    rouge_1(reference_sentence, translated_sentence)
    rouge_2(reference_sentence, translated_sentence)

print("ROUGE-1 Score: ", rouge_1.result())
print("ROUGE-2 Score: ", rouge_2.result())

"""
After 10 epochs, the scores are as follows:

|               | **ROUGE-1** | **ROUGE-2** |
|:-------------:|:-----------:|:-----------:|
| **Precision** |    0.568    |    0.374    |
|  **Recall**   |    0.615    |    0.394    |
|  **F1 Score** |    0.579    |    0.381    |
"""

"""
## Relevant Chapters from Deep Learning with Python
- [Chapter 15: Language models and the Transformer](https://deeplearningwithpython.io/chapters/chapter15_language-models-and-the-transformer)
- [Chapter 16: Text generation](https://deeplearningwithpython.io/chapters/chapter16_text-generation)
"""