"""
Title: Working with preprocessing layers
Authors: Francois Chollet, Mark Omernick
Date created: 2020/07/25
Last modified: 2021/04/23
Description: Overview of how to leverage preprocessing layers to create end-to-end models.
Accelerator: GPU
"""

"""
## Keras preprocessing

The Keras preprocessing layers API allows developers to build Keras-native input
processing pipelines. These input processing pipelines can be used as independent
preprocessing code in non-Keras workflows, combined directly with Keras models, and
exported as part of a Keras SavedModel.

With Keras preprocessing layers, you can build and export models that are truly
end-to-end: models that accept raw images or raw structured data as input; models that
handle feature normalization or feature value indexing on their own.
"""

"""
## Available preprocessing

### Text preprocessing

- `tf.keras.layers.TextVectorization`: turns raw strings into an encoded
representation that can be read by an `Embedding` layer or `Dense` layer.

### Numerical features preprocessing

- `tf.keras.layers.Normalization`: performs feature-wise normalization of
input features.
- `tf.keras.layers.Discretization`: turns continuous numerical features
into integer categorical features.

### Categorical features preprocessing

- `tf.keras.layers.CategoryEncoding`: turns integer categorical features
into one-hot, multi-hot, or count dense representations.
- `tf.keras.layers.Hashing`: performs categorical feature hashing, also known as
the "hashing trick".
- `tf.keras.layers.StringLookup`: turns string categorical values into an encoded
representation that can be read by an `Embedding` layer or `Dense` layer.
- `tf.keras.layers.IntegerLookup`: turns integer categorical values into an
encoded representation that can be read by an `Embedding` layer or `Dense`
layer.


### Image preprocessing

These layers are for standardizing the inputs of an image model.

- `tf.keras.layers.Resizing`: resizes a batch of images to a target size.
- `tf.keras.layers.Rescaling`: rescales and offsets the values of a batch of
images (e.g. going from inputs in the `[0, 255]` range to inputs in the `[0, 1]`
range).
- `tf.keras.layers.CenterCrop`: returns a center crop of a batch of images.

### Image data augmentation

These layers apply random augmentation transforms to a batch of images. They
are only active during training.

- `tf.keras.layers.RandomCrop`
- `tf.keras.layers.RandomFlip`
- `tf.keras.layers.RandomTranslation`
- `tf.keras.layers.RandomRotation`
- `tf.keras.layers.RandomZoom`
- `tf.keras.layers.RandomContrast`

"""

"""
## The `adapt()` method

Some preprocessing layers have an internal state that can be computed based on
a sample of the training data. The list of stateful preprocessing layers is:

- `TextVectorization`: holds a mapping between string tokens and integer indices.
- `StringLookup` and `IntegerLookup`: hold a mapping between input values and integer
indices.
- `Normalization`: holds the mean and standard deviation of the features.
- `Discretization`: holds information about value bucket boundaries.

Crucially, these layers are **non-trainable**. Their state is not set during training; it
must be set **before training**, either by initializing them from a precomputed constant,
or by "adapting" them on data.

You set the state of a preprocessing layer by exposing it to training data, via the
`adapt()` method:

"""

import numpy as np
import tensorflow as tf
import keras
from keras import layers

data = np.array(
    [
        [0.1, 0.2, 0.3],
        [0.8, 0.9, 1.0],
        [1.5, 1.6, 1.7],
    ]
)
layer = layers.Normalization()
layer.adapt(data)
normalized_data = layer(data)

print("Features mean: %.2f" % (normalized_data.numpy().mean()))
print("Features std: %.2f" % (normalized_data.numpy().std()))

"""
The `adapt()` method takes either a NumPy array or a `tf.data.Dataset` object. In the
case of `StringLookup` and `TextVectorization`, you can also pass a list of strings:
"""

data = [
    "ξεῖν᾽, ἦ τοι μὲν ὄνειροι ἀμήχανοι ἀκριτόμυθοι",
    "γίγνοντ᾽, οὐδέ τι πάντα τελείεται ἀνθρώποισι.",
    "δοιαὶ γάρ τε πύλαι ἀμενηνῶν εἰσὶν ὀνείρων:",
    "αἱ μὲν γὰρ κεράεσσι τετεύχαται, αἱ δ᾽ ἐλέφαντι:",
    "τῶν οἳ μέν κ᾽ ἔλθωσι διὰ πριστοῦ ἐλέφαντος,",
    "οἵ ῥ᾽ ἐλεφαίρονται, ἔπε᾽ ἀκράαντα φέροντες:",
    "οἱ δὲ διὰ ξεστῶν κεράων ἔλθωσι θύραζε,",
    "οἵ ῥ᾽ ἔτυμα κραίνουσι, βροτῶν ὅτε κέν τις ἴδηται.",
]
layer = layers.TextVectorization()
layer.adapt(data)
vectorized_text = layer(data)
print(vectorized_text)

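
"""
`adapt()` works the same way with a batched `tf.data.Dataset`, which is convenient when
the data does not fit in memory. A minimal sketch (with made-up values):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stream samples through a tf.data pipeline instead of materializing a single array.
dataset = tf.data.Dataset.from_tensor_slices(
    np.array([[0.1, 0.2, 0.3], [0.8, 0.9, 1.0], [1.5, 1.6, 1.7]])
).batch(2)
layer = layers.Normalization()
layer.adapt(dataset)  # the mean/variance state is computed by iterating over the dataset
```
"""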
"""
136
In addition, adaptable layers always expose an option to directly set state via
137
constructor arguments or weight assignment. If the intended state values are known at
138
layer construction time, or are calculated outside of the `adapt()` call, they can be set
139
without relying on the layer's internal computation. For instance, if external vocabulary
140
files for the `TextVectorization`, `StringLookup`, or `IntegerLookup` layers already
141
exist, those can be loaded directly into the lookup tables by passing a path to the
142
vocabulary file in the layer's constructor arguments.
143
144
Here's an example where you instantiate a `StringLookup` layer with precomputed vocabulary:
145
"""

vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "z", "b"]])
layer = layers.StringLookup(vocabulary=vocab)
vectorized_data = layer(data)
print(vectorized_data)

"""
154
## Preprocessing data before the model or inside the model
155
156
There are two ways you could be using preprocessing layers:
157
158
**Option 1:** Make them part of the model, like this:
159
160
```python
161
inputs = keras.Input(shape=input_shape)
162
x = preprocessing_layer(inputs)
163
outputs = rest_of_the_model(x)
164
model = keras.Model(inputs, outputs)
165
```
166
167
With this option, preprocessing will happen on device, synchronously with the rest of the
168
model execution, meaning that it will benefit from GPU acceleration.
169
If you're training on a GPU, this is the best option for the `Normalization` layer, and for
170
all image preprocessing and data augmentation layers.
171
172
**Option 2:** apply it to your `tf.data.Dataset`, so as to obtain a dataset that yields
173
batches of preprocessed data, like this:
174
175
```python
176
dataset = dataset.map(lambda x, y: (preprocessing_layer(x), y))
177
```
178
179
With this option, your preprocessing will happen on a CPU, asynchronously, and will be
180
buffered before going into the model.
181
In addition, if you call `dataset.prefetch(tf.data.AUTOTUNE)` on your dataset,
182
the preprocessing will happen efficiently in parallel with training:
183
184
```python
185
dataset = dataset.map(lambda x, y: (preprocessing_layer(x), y))
186
dataset = dataset.prefetch(tf.data.AUTOTUNE)
187
model.fit(dataset, ...)
188
```
189
190
This is the best option for `TextVectorization`, and all structured data preprocessing
191
layers. It can also be a good option if you're training on a CPU and you use image preprocessing
192
layers.
193
194
Note that the `TextVectorization` layer can only be executed on a CPU, as it is mostly a
195
dictionary lookup operation. Therefore, if you are training your model on a GPU or a TPU,
196
you should put the `TextVectorization` layer in the `tf.data` pipeline to get the best performance.
197
198
**When running on a TPU, you should always place preprocessing layers in the `tf.data` pipeline**
199
(with the exception of `Normalization` and `Rescaling`, which run fine on a TPU and are commonly
200
used as the first layer in an image model).
201
"""

"""
## Benefits of doing preprocessing inside the model at inference time

Even if you go with option 2, you may later want to export an inference-only end-to-end
model that will include the preprocessing layers. The key benefit to doing this is that
**it makes your model portable** and it **helps reduce the
[training/serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew)**.

When all data preprocessing is part of the model, other people can load and use your
model without having to be aware of how each feature is expected to be encoded &
normalized. Your inference model will be able to process raw images or raw structured
data, and will not require users of the model to be aware of the details of e.g. the
tokenization scheme used for text, the indexing scheme used for categorical features,
whether image pixel values are normalized to `[-1, +1]` or to `[0, 1]`, etc. This is
especially powerful if you're exporting
your model to another runtime, such as TensorFlow.js: you won't have to
reimplement your preprocessing pipeline in JavaScript.

If you initially put your preprocessing layers in your `tf.data` pipeline,
you can export an inference model that packages the preprocessing.
Simply instantiate a new model that chains
your preprocessing layers and your training model:

```python
inputs = keras.Input(shape=input_shape)
x = preprocessing_layer(inputs)
outputs = training_model(x)
inference_model = keras.Model(inputs, outputs)
```
"""

"""
## Preprocessing during multi-worker training

Preprocessing layers are compatible with the
[tf.distribute](https://www.tensorflow.org/api_docs/python/tf/distribute) API
for running training across multiple machines.

In general, preprocessing layers should be placed inside a `tf.distribute.Strategy.scope()`
and called either inside or before the model as discussed above.

```python
with strategy.scope():
    inputs = keras.Input(shape=input_shape)
    preprocessing_layer = tf.keras.layers.Hashing(10)
    dense_layer = tf.keras.layers.Dense(16)
```

For more details, refer to the _Data preprocessing_ section
of the [Distributed input](https://www.tensorflow.org/tutorials/distribute/input)
tutorial.
"""

"""
## Quick recipes

### Image data augmentation

Note that image data augmentation layers are only active during training (similarly to
the `Dropout` layer).
"""

from tensorflow import keras
from tensorflow.keras import layers

# Create a data augmentation stage with horizontal flipping, rotations, zooms
data_augmentation = keras.Sequential(
    [
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ]
)

# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
input_shape = x_train.shape[1:]
classes = 10

# Create a tf.data pipeline of augmented images (and their labels)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.batch(16).map(lambda x, y: (data_augmentation(x), y))


# Create a model and train it on the augmented image data
inputs = keras.Input(shape=input_shape)
x = layers.Rescaling(1.0 / 255)(inputs)  # Rescale inputs
outputs = keras.applications.ResNet50(  # Add the rest of the model
    weights=None, input_shape=input_shape, classes=classes
)(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.fit(train_dataset, steps_per_epoch=5)

"""
You can see a similar setup in action in the example
[image classification from scratch](https://keras.io/examples/vision/image_classification_from_scratch/).
"""

"""
### Normalizing numerical features
"""

# Load some data
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.reshape((len(x_train), -1))
input_shape = x_train.shape[1:]
classes = 10

# Create a Normalization layer and set its internal state using the training data
normalizer = layers.Normalization()
normalizer.adapt(x_train)

# Create a model that includes the normalization layer
inputs = keras.Input(shape=input_shape)
x = normalizer(inputs)
outputs = layers.Dense(classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Train the model
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train)

"""
327
### Encoding string categorical features via one-hot encoding
328
"""
329
330
# Define some toy data
331
data = tf.constant([["a"], ["b"], ["c"], ["b"], ["c"], ["a"]])
332
333
# Use StringLookup to build an index of the feature values and encode output.
334
lookup = layers.StringLookup(output_mode="one_hot")
335
lookup.adapt(data)
336
337
# Convert new test data (which includes unknown feature values)
338
test_data = tf.constant([["a"], ["b"], ["c"], ["d"], ["e"], [""]])
339
encoded_data = lookup(test_data)
340
print(encoded_data)
341
342
"""
343
Note that, here, index 0 is reserved for out-of-vocabulary values
344
(values that were not seen during `adapt()`).
345
346
You can see the `StringLookup` in action in the
347
[Structured data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/)
348
example.
349
"""
350
351
"""
352
### Encoding integer categorical features via one-hot encoding
353
"""

# Define some toy data
data = tf.constant([[10], [20], [20], [10], [30], [0]])

# Use IntegerLookup to build an index of the feature values and encode output.
lookup = layers.IntegerLookup(output_mode="one_hot")
lookup.adapt(data)

# Convert new test data (which includes unknown feature values)
test_data = tf.constant([[10], [10], [20], [50], [60], [0]])
encoded_data = lookup(test_data)
print(encoded_data)

"""
368
Note that index 0 is reserved for missing values (which you should specify as the value
369
0), and index 1 is reserved for out-of-vocabulary values (values that were not seen
370
during `adapt()`). You can configure this by using the `mask_token` and `oov_token`
371
constructor arguments of `IntegerLookup`.
372
373
You can see the `IntegerLookup` in action in the example
374
[structured data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/).
375
"""

"""
### Applying the hashing trick to an integer categorical feature

If you have a categorical feature that can take many different values (on the order of
10e3 or higher), where each value only appears a few times in the data,
it becomes impractical and ineffective to index and one-hot encode the feature values.
Instead, it can be a good idea to apply the "hashing trick": hash the values to a vector
of fixed size. This keeps the size of the feature space manageable, and removes the need
for explicit indexing.
"""

# Sample data: 10,000 random integers with values between 0 and 100,000
data = np.random.randint(0, 100000, size=(10000, 1))

# Use the Hashing layer to hash the values to the range [0, 64)
hasher = layers.Hashing(num_bins=64, salt=1337)

# Use the CategoryEncoding layer to multi-hot encode the hashed values
encoder = layers.CategoryEncoding(num_tokens=64, output_mode="multi_hot")
encoded_data = encoder(hasher(data))
print(encoded_data.shape)

"""
400
### Encoding text as a sequence of token indices
401
402
This is how you should preprocess text to be passed to an `Embedding` layer.
403
"""
404
# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)

# Create a TextVectorization layer
text_vectorizer = layers.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(input_dim=text_vectorizer.vocabulary_size(), output_dim=16)(inputs)
x = layers.GRU(8)(x)
outputs = layers.Dense(1)(x)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into int sequences
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the int sequences
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)

"""
458
You can see the `TextVectorization` layer in action, combined with an `Embedding` mode,
459
in the example
460
[text classification from scratch](https://keras.io/examples/nlp/text_classification_from_scratch/).
461
462
Note that when training such a model, for best performance, you should always
463
use the `TextVectorization` layer as part of the input pipeline.
464
"""
465
466
"""
467
### Encoding text as a dense matrix of N-grams with multi-hot encoding
468
469
This is how you should preprocess text to be passed to a `Dense` layer.
470
"""

# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "multi_hot" output_mode
# and ngrams=2 (index all bigrams)
text_vectorizer = layers.TextVectorization(output_mode="multi_hot", ngrams=2)
# Index the bigrams via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into multi-hot encoded bigram vectors
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the multi-hot encoded data
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)

"""
523
### Encoding text as a dense matrix of N-grams with TF-IDF weighting
524
525
This is an alternative way of preprocessing text before passing it to a `Dense` layer.
526
"""
527
# Define some text data to adapt the layer
adapt_data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "tf-idf" output_mode
# (multi-hot with TF-IDF weighting) and ngrams=2 (index all bigrams)
text_vectorizer = layers.TextVectorization(output_mode="tf-idf", ngrams=2)
# Index the bigrams and learn the TF-IDF weights via `adapt()`
text_vectorizer.adapt(adapt_data)

# Try out the layer
print(
    "Encoded text:\n",
    text_vectorizer(["The Brain is deeper than the sea"]).numpy(),
)

# Create a simple model
inputs = keras.Input(shape=(text_vectorizer.vocabulary_size(),))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)

# Create a labeled dataset (which includes unknown tokens)
train_dataset = tf.data.Dataset.from_tensor_slices(
    (["The Brain is deeper than the sea", "for if they are held Blue to Blue"], [1, 0])
)

# Preprocess the string inputs, turning them into TF-IDF weighted bigram vectors
train_dataset = train_dataset.batch(2).map(lambda x, y: (text_vectorizer(x), y))
# Train the model on the TF-IDF weighted data
print("\nTraining model...")
model.compile(optimizer="rmsprop", loss="mse")
model.fit(train_dataset)

# For inference, you can export a model that accepts strings as input
inputs = keras.Input(shape=(1,), dtype="string")
x = text_vectorizer(inputs)
outputs = model(x)
end_to_end_model = keras.Model(inputs, outputs)

# Call the end-to-end model on test data (which includes unknown tokens)
print("\nCalling end-to-end model on test string...")
test_data = tf.constant(["The one the other will absorb"])
test_output = end_to_end_model(test_data)
print("Model output:", test_output)


"""
580
## Important gotchas
581
582
### Working with lookup layers with very large vocabularies
583
584
You may find yourself working with a very large vocabulary in a `TextVectorization`, a `StringLookup` layer,
585
or an `IntegerLookup` layer. Typically, a vocabulary larger than 500MB would be considered "very large".
586
587
In such a case, for best performance, you should avoid using `adapt()`.
588
Instead, pre-compute your vocabulary in advance
589
(you could use Apache Beam or TF Transform for this)
590
and store it in a file. Then load the vocabulary into the layer at construction
591
time by passing the file path as the `vocabulary` argument.
592
593
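For instance, a minimal sketch (assuming a hypothetical precomputed `vocab.txt` file
with one token per line):

```python
# The file path is passed directly at construction time; adapt() is never called.
lookup = tf.keras.layers.StringLookup(vocabulary="vocab.txt")
vectorizer = tf.keras.layers.TextVectorization(vocabulary="vocab.txt")
```
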
### Using lookup layers on a TPU pod or with `ParameterServerStrategy`

There is an outstanding issue that causes performance to degrade when using
a `TextVectorization`, `StringLookup`, or `IntegerLookup` layer while
training on a TPU pod or on multiple machines via `ParameterServerStrategy`.
This is slated to be fixed in TensorFlow 2.7.
"""