"""
Title: Working with preprocessing layers
Authors: Francois Chollet, Mark Omernick
Date created: 2020/07/25
Last modified: 2021/04/23
Description: Overview of how to leverage preprocessing layers to create end-to-end models.
Accelerator: GPU
"""
"""
## Keras preprocessing
The Keras preprocessing layers API allows developers to build Keras-native input
processing pipelines. These input processing pipelines can be used as independent
preprocessing code in non-Keras workflows, combined directly with Keras models, and
exported as part of a Keras SavedModel.
With Keras preprocessing layers, you can build and export models that are truly
end-to-end: models that accept raw images or raw structured data as input; models that
handle feature normalization or feature value indexing on their own.
"""
"""
## Available preprocessing
### Text preprocessing
- `tf.keras.layers.TextVectorization`: turns raw strings into an encoded
representation that can be read by an `Embedding` layer or `Dense` layer.
### Numerical features preprocessing
- `tf.keras.layers.Normalization`: performs feature-wise normalization of
input features.
- `tf.keras.layers.Discretization`: turns continuous numerical features
into integer categorical features.
### Categorical features preprocessing
- `tf.keras.layers.CategoryEncoding`: turns integer categorical features
into one-hot, multi-hot, or count dense representations.
- `tf.keras.layers.Hashing`: performs categorical feature hashing, also known as
the "hashing trick".
- `tf.keras.layers.StringLookup`: turns string categorical values into an encoded
representation that can be read by an `Embedding` layer or `Dense` layer.
- `tf.keras.layers.IntegerLookup`: turns integer categorical values into an
encoded representation that can be read by an `Embedding` layer or `Dense`
layer.
### Image preprocessing
These layers are for standardizing the inputs of an image model.
- `tf.keras.layers.Resizing`: resizes a batch of images to a target size.
- `tf.keras.layers.Rescaling`: rescales and offsets the values of a batch of
images (e.g. going from inputs in the `[0, 255]` range to inputs in the `[0, 1]`
range).
- `tf.keras.layers.CenterCrop`: returns a center crop of a batch of images.
### Image data augmentation
These layers apply random augmentation transforms to a batch of images. They
are only active during training.
- `tf.keras.layers.RandomCrop`
- `tf.keras.layers.RandomFlip`
- `tf.keras.layers.RandomTranslation`
- `tf.keras.layers.RandomRotation`
- `tf.keras.layers.RandomZoom`
- `tf.keras.layers.RandomContrast`
"""
"""
## The `adapt()` method
Some preprocessing layers have an internal state that can be computed based on
a sample of the training data. The list of stateful preprocessing layers is:
- `TextVectorization`: holds a mapping between string tokens and integer indices
- `StringLookup` and `IntegerLookup`: hold a mapping between input values and integer
indices.
- `Normalization`: holds the mean and standard deviation of the features.
- `Discretization`: holds information about value bucket boundaries.
Crucially, these layers are **non-trainable**. Their state is not set during training; it
must be set **before training**, either by initializing them from a precomputed constant,
or by "adapting" them on data.
You set the state of a preprocessing layer by exposing it to training data, via the
`adapt()` method:
"""
import numpy as np
import tensorflow as tf
import keras
from keras import layers
data = np.array(
[
[0.1, 0.2, 0.3],
[0.8, 0.9, 1.0],
[1.5, 1.6, 1.7],
]
)
layer = layers.Normalization()
layer.adapt(data)
normalized_data = layer(data)
print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))
"""
The `adapt()` method takes either a Numpy array or a `tf.data.Dataset` object. In the
case of `StringLookup` and `TextVectorization`, you can also pass a list of strings:
"""
data = [
"ξεῖν᾽, ἦ τοι μὲν ὄνειροι ἀμήχανοι ἀκριτόμυθοι",
"γίγνοντ᾽, οὐδέ τι πάντα τελείεται ἀνθρώποισι.",
"δοιαὶ γάρ τε πύλαι ἀμενηνῶν εἰσὶν ὀνείρων:",
"αἱ μὲν γὰρ κεράεσσι τετεύχαται, αἱ δ᾽ ἐλέφαντι:",
"τῶν οἳ μέν κ᾽ ἔλθωσι διὰ πριστοῦ ἐλέφαντος,",
"οἵ ῥ᾽ ἐλεφαίρονται, ἔπε᾽ ἀκράαντα φέροντες:",
"οἱ δὲ διὰ ξεστῶν κεράων ἔλθωσι θύραζε,",
"οἵ ῥ᾽ ἔτυμα κραίνουσι, βροτῶν ὅτε κέν τις ἴδηται.",
]
layer = layers.TextVectorization()
layer.adapt(data)
vectorized_text = layer(data)
print(vectorized_text)
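"""
As mentioned above, `adapt()` also accepts a `tf.data.Dataset`, which is convenient when
the adaptation sample doesn't fit in memory. A minimal sketch, reusing the list of
strings above as a batched dataset:
"""
adapt_dataset = tf.data.Dataset.from_tensor_slices(data).batch(4)
layer = layers.TextVectorization()
layer.adapt(adapt_dataset)
print(layer(["ὄνειροι ἀμήχανοι"]))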
"""
In addition, adaptable layers always expose an option to directly set state via
constructor arguments or weight assignment. If the intended state values are known at
layer construction time, or are calculated outside of the `adapt()` call, they can be set
without relying on the layer's internal computation. For instance, if external vocabulary
files for the `TextVectorization`, `StringLookup`, or `IntegerLookup` layers already
exist, those can be loaded directly into the lookup tables by passing a path to the
vocabulary file in the layer's constructor arguments.
Here's an example where you instantiate a `StringLookup` layer with precomputed vocabulary:
"""
vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = layers.StringLookup(vocabulary=vocab)
vectorized_data = layer(data)
print(vectorized_data)
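"""
A vocabulary can also be supplied as a file path: a plain text file with one token per
line is passed via the `vocabulary` argument, which is how you would plug in a vocabulary
precomputed outside of Keras. A minimal sketch, writing a hypothetical `vocab.txt` on the
fly for illustration:
"""
import os

vocab_path = "vocab.txt"
with open(vocab_path, "w") as f:
    f.write("\n".join(["a", "b", "c", "d"]))

layer = layers.StringLookup(vocabulary=vocab_path)
print(layer(tf.constant([["a", "c", "d"], ["d", "z", "b"]])))

os.remove(vocab_path)  # clean up the temporary file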
"""
## Preprocessing data before the model or inside the model
There are two ways you could be using preprocessing layers:
**Option 1:** Make them part of the model, like this:
```python
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = rest_of_the_model(x)
model = keras.Model(inputs, outputs)
```
With this option, preprocessing will happen on device, synchronously with the rest of the
model execution, meaning that it will benefit from GPU acceleration.
If you're training on a GPU, this is the best option for the `Normalization` layer, and for
all image preprocessing and data augmentation layers.
**Option 2:** Apply them to your `tf.data.Dataset`, so as to obtain a dataset that yields
batches of preprocessed data, like this:
```python
dataset = dataset.map(lambda x, y: (preprocessing_layer(x), y))
```
With this option, your preprocessing will happen on a CPU, asynchronously, and will be
buffered before going into the model.
In addition, if you call `dataset.prefetch(tf.data.AUTOTUNE)` on your dataset,
the preprocessing will happen efficiently in parallel with training:
```python
dataset = dataset.map(lambda x, y: (preprocessing_layer(x), y))
dataset = dataset.prefetch(tf.data.AUTOTUNE)
model.fit(dataset, ...)
```
This is the best option for `TextVectorization`, and all structured data preprocessing
layers. It can also be a good option if you're training on a CPU and you use image preprocessing
layers.
Note that the `TextVectorization` layer can only be executed on a CPU, as it is mostly a
dictionary lookup operation. Therefore, if you are training your model on a GPU or a TPU,
you should put the `TextVectorization` layer in the `tf.data` pipeline to get the best performance.
**When running on a TPU, you should always place preprocessing layers in the `tf.data` pipeline**
(with the exception of `Normalization` and `Rescaling`, which run fine on a TPU and are commonly
used as the first layer in an image model).
"""
"""
## Benefits of doing preprocessing inside the model at inference time
Even if you go with option 2, you may later want to export an inference-only end-to-end
model that will include the preprocessing layers. The key benefit to doing this is that
**it makes your model portable** and it **helps reduce the
[training/serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew)**.
When all data preprocessing is part of the model, other people can load and use your
model without having to be aware of how each feature is expected to be encoded &
normalized. Your inference model will be able to process raw images or raw structured
data, and will not require users of the model to be aware of the details of e.g. the
tokenization scheme used for text, the indexing scheme used for categorical features,
whether image pixel values are normalized to `[-1, +1]` or to `[0, 1]`, etc. This is
especially powerful if you're exporting
your model to another runtime, such as TensorFlow.js: you won't have to
reimplement your preprocessing pipeline in JavaScript.
If you initially put your preprocessing layers in your `tf.data` pipeline,
you can export an inference model that packages the preprocessing.
Simply instantiate a new model that chains
your preprocessing layers and your training model:
```python
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = training_model(x)
inference_model = keras.Model(inputs, outputs)
```
"""
"""
## Preprocessing during multi-worker training
Preprocessing layers are compatible with the
[tf.distribute](https://www.tensorflow.org/api_docs/python/tf/distribute) API
for running training across multiple machines.
In general, preprocessing layers should be placed inside a `tf.distribute.Strategy.scope()`
and called either inside or before the model as discussed above.
```python
with strategy.scope():
    inputs = keras.Input(shape=input_shape)
    preprocessing_layer = tf.keras.layers.Hashing(10)
    dense_layer = tf.keras.layers.Dense(16)
```
For more details, refer to the _Data preprocessing_ section
of the [Distributed input](https://www.tensorflow.org/tutorials/distribute/input)
tutorial.
"""
"""
## Quick recipes
### Image data augmentation
Note that image data augmentation layers are only active during training (similarly to
the `Dropout` layer).
"""
data_augmentation = keras.Sequential(
[
layers.RandomFlip("horizontal"),
layers.RandomRotation(0.1),
layers.RandomZoom(0.1),
]
)
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
input_shape = x_train.shape[1:]
classes = 10
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.batch(16).map(lambda x, y: (data_augmentation(x), y))
inputs = keras.Input(shape=input_shape)
x = layers.Rescaling(1.0 / 255)(inputs)
outputs = keras.applications.ResNet50(
weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.fit(train_dataset, steps_per_epoch=5)
"""
You can see a similar setup in action in the example
[image classification from scratch](https://keras.io/examples/vision/image_classification_from_scratch/).
"""
"""
### Normalizing numerical features
"""
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10
normalizer = layers.Normalization()
normalizer.adapt(x_train)
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)
"""
### Encoding string categorical features via one-hot encoding
"""
data = tf.constant([["a"], ["b"], ["c"], ["b"], ["c"], ["a"]])
lookup = layers.StringLookup(output_mode="one_hot")
lookup.adapt(data)
test_data = tf.constant([["a"], ["b"], ["c"], ["d"], ["e"], [""]])
encoded_data = lookup(test_data)
print(encoded_data)
"""
Note that, here, index 0 is reserved for out-of-vocabulary values
(values that were not seen during `adapt()`).
You can see the `StringLookup` in action in the
[Structured data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/)
example.
"""
"""
### Encoding integer categorical features via one-hot encoding
"""
data = tf.constant([[10], [20], [20], [10], [30], [0]])
lookup = layers.IntegerLookup(output_mode="one_hot")
lookup.adapt(data)
test_data = tf.constant([[10], [10], [20], [50], [60], [0]])
encoded_data = lookup(test_data)
print(encoded_data)
"""
Note that index 0 is reserved for missing values (which you should specify as the value
0), and index 1 is reserved for out-of-vocabulary values (values that were not seen
during `adapt()`). You can configure this by using the `mask_token` and `oov_token`
constructor arguments of `IntegerLookup`.
You can see the `IntegerLookup` in action in the example
[structured data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/).
"""
"""
### Applying the hashing trick to an integer categorical feature
If you have a categorical feature that can take many different values (on the order of
10,000 distinct values or more), where each value only appears a few times in the data,
it becomes impractical and ineffective to index and one-hot encode the feature values.
Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector
of fixed size. This keeps the size of the feature space manageable, and removes the need
for explicit indexing.
"""
data = np.random.randint(0, 100000, size=(10000, 1))
hasher = layers.Hashing(num_bins=64, salt=1337)
encoder = layers.CategoryEncoding(num_tokens=64, output_mode="multi_hot")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)
"""
### Encoding text as a sequence of token indices
This is how you should preprocess text to be passed to an `Embedding` layer.
"""
adapt_data = tf.constant(
[
"The Brain is wider than the Sky",
"For put them side by side",
"The one the other will contain",
"With ease and You beside",
]
)
text_vectorizer = layers.TextVectorization(output_mode="int")
text_vectorizer.adapt(adapt_data)
print(
"Encoded text:\n",
text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(input_dim=text_vectorizer.vocabulary_size(), output_dim=16)(inputs)
x = layers.GRU(8)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
train_dataset = tf.data.Dataset.from_tensor_slices(
(["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)
"""
You can see the `TextVectorization` layer in action, combined with an `Embedding` layer,
in the example
[text classification from scratch](https://keras.io/examples/nlp/text_classification_from_scratch/).
Note that when training such a model, for best performance, you should always
use the `TextVectorization` layer as part of the input pipeline.
"""
"""
### Encoding text as a dense matrix of N-grams with multi-hot encoding
This is how you should preprocess text to be passed to a `Dense` layer.
"""
adapt_data = tf.constant(
[
"The Brain is wider than the Sky",
"For put them side by side",
"The one the other will contain",
"With ease and You beside",
]
)
text_vectorizer = layers.TextVectorization(output_mode="multi_hot", ngrams=2)
text_vectorizer.adapt(adapt_data)
print(
"Encoded text:\n",
text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)
train_dataset = tf.data.Dataset.from_tensor_slices(
(["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)
"""
### Encoding text as a dense matrix of N-grams with TF-IDF weighting
This is an alternative way of preprocessing text before passing it to a `Dense` layer.
"""
adapt_data = tf.constant(
[
"The Brain is wider than the Sky",
"For put them side by side",
"The one the other will contain",
"With ease and You beside",
]
)
text_vectorizer = layers.TextVectorization(output_mode="tf-idf", ngrams=2)
text_vectorizer.adapt(adapt_data)
print(
"Encoded text:\n",
text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)
train_dataset = tf.data.Dataset.from_tensor_slices(
(["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)
"""
## Important gotchas
### Working with lookup layers with very large vocabularies
You may find yourself working with a very large vocabulary in a `TextVectorization`, a `StringLookup` layer,
or an `IntegerLookup` layer. Typically, a vocabulary larger than 500MB would be considered "very large".
In such a case, for best performance, you should avoid using `adapt()`.
Instead, pre-compute your vocabulary in advance
(you could use Apache Beam or TF Transform for this)
and store it in a file. Then load the vocabulary into the layer at construction
time by passing the file path as the `vocabulary` argument.
### Using lookup layers on a TPU pod or with `ParameterServerStrategy`
There is an outstanding issue that causes performance to degrade when using
a `TextVectorization`, `StringLookup`, or `IntegerLookup` layer while
training on a TPU pod or on multiple machines via `ParameterServerStrategy`.
This is slated to be fixed in TensorFlow 2.7.
"""