Copyright 2021 The TensorFlow Authors.
Migrate tf.feature_columns to Keras preprocessing layers
Training a model usually comes with some amount of feature preprocessing, particularly when dealing with structured data. When training a tf.estimator.Estimator in TensorFlow 1, you usually perform feature preprocessing with the tf.feature_column API. In TensorFlow 2, you can do this directly with Keras preprocessing layers.
This migration guide demonstrates common feature transformations using both feature columns and preprocessing layers, followed by training a complete model with both APIs.
First, start with a couple of necessary imports:
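For example, the snippets below assume the following imports (tf1 refers to the compat.v1 API used for the feature column examples):

```python
import tensorflow as tf
import tensorflow.compat.v1 as tf1
```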
Now, add a utility function for calling a feature column for demonstration:
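A minimal sketch of such a helper, using tf1.keras.layers.DenseFeatures to evaluate feature columns outside of an estimator (the name call_feature_columns is illustrative):

```python
def call_feature_columns(feature_columns, inputs):
  # DenseFeatures wraps one or more feature columns so they can be called
  # directly on a dictionary of input tensors, outside of an estimator.
  feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)
  return feature_layer(inputs)
```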
Input handling
To use feature columns with an estimator, model inputs are always expected to be a dictionary of tensors:
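For instance, a hypothetical batch of inputs keyed by feature name:

```python
# A hypothetical batch of inputs, keyed by feature name.
inputs = {'type': tf.constant([[0], [1], [2]]),
          'weight': tf.constant([[2.3], [4.1], [1.7]])}
```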
Each feature column needs to be created with a key to index into the source data. The output of all feature columns is concatenated and used by the estimator model.
In Keras, model input is much more flexible. A tf.keras.Model can handle a single tensor input, a list of tensor features, or a dictionary of tensor features. You can handle dictionary input by passing a dictionary of tf.keras.Input objects on model creation. Inputs will not be concatenated automatically, which allows them to be used in much more flexible ways. They can be concatenated with tf.keras.layers.Concatenate.
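A sketch of dictionary input handling in Keras with an explicit Concatenate; the feature names are illustrative:

```python
keras_inputs = {
    'size': tf.keras.Input(shape=(1,), dtype=tf.float32, name='size'),
    'weight': tf.keras.Input(shape=(1,), dtype=tf.float32, name='weight'),
}
# Inputs are only combined if you do so explicitly.
concatenated = tf.keras.layers.Concatenate()(list(keras_inputs.values()))
model = tf.keras.Model(inputs=keras_inputs, outputs=concatenated)
model({'size': tf.constant([[0.], [1.]]), 'weight': tf.constant([[2.3], [4.1]])})
```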
One-hot encoding integer IDs
A common feature transformation is one-hot encoding integer inputs of a known range. Here is an example using feature columns:
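A sketch, assuming the call_feature_columns helper above; the feature name 'type' and the number of buckets are illustrative:

```python
categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=3)
indicator_col = tf1.feature_column.indicator_column(categorical_col)
call_feature_columns(indicator_col, {'type': [0, 1, 2]})
```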
Using Keras preprocessing layers, these columns can be replaced by a single tf.keras.layers.CategoryEncoding layer with output_mode set to 'one_hot':
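For example, with the same three token ids:

```python
one_hot_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=3, output_mode='one_hot')
one_hot_layer([0, 1, 2])
```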
Note: For large one-hot encodings, it is much more efficient to use a sparse representation of the output. If you pass sparse=True to the CategoryEncoding layer, the output of the layer will be a tf.sparse.SparseTensor, which can be efficiently handled as input to a tf.keras.layers.Dense layer.
Normalizing numeric features
When handling continuous, floating-point features with feature columns, you need to use a tf.feature_column.numeric_column. In the case where the input is already normalized, converting this to Keras is trivial. You can simply use a tf.keras.Input directly in your model, as shown above.

A numeric_column can also be used to normalize input:
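A sketch, with illustrative normalization statistics hard-coded into the normalizer_fn:

```python
def normalize(x):
  # Illustrative statistics; in practice, compute these from your data.
  mean, variance = 10.0, 4.0
  return (x - mean) / tf.sqrt(variance)

numeric_col = tf1.feature_column.numeric_column('col', normalizer_fn=normalize)
call_feature_columns(numeric_col, {'col': tf.constant([[0.], [10.], [20.]])})
```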
In contrast, with Keras, this normalization can be done with tf.keras.layers.Normalization.
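For example, with the same illustrative statistics passed directly to the layer:

```python
normalization_layer = tf.keras.layers.Normalization(mean=10.0, variance=4.0)
normalization_layer(tf.constant([[0.], [10.], [20.]]))
```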
Bucketizing and one-hot encoding numeric features
Another common transformation of continuous, floating-point inputs is to bucketize them into integers of a fixed range.
In feature columns, this can be achieved with a tf.feature_column.bucketized_column:
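A sketch with illustrative bucket boundaries:

```python
numeric_col = tf1.feature_column.numeric_column('col')
bucketized_col = tf1.feature_column.bucketized_column(numeric_col, [1., 4., 5.])
call_feature_columns(bucketized_col,
                     {'col': tf.constant([[1.], [2.], [3.], [4.], [5.]])})
```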
In Keras, this can be replaced by tf.keras.layers.Discretization:
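For example (the layer outputs integer bucket indices, which can then be one-hot encoded with CategoryEncoding as shown above):

```python
discretization_layer = tf.keras.layers.Discretization(bin_boundaries=[1., 4., 5.])
discretization_layer(tf.constant([[1.], [2.], [3.], [4.], [5.]]))
```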
One-hot encoding string data with a vocabulary
Handling string features often requires a vocabulary lookup to translate strings into indices. Here is an example using feature columns to look up strings and then one-hot encode the indices:
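A sketch with an illustrative vocabulary and feature name:

```python
vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'sizes', vocabulary_list=['small', 'medium', 'large'], num_oov_buckets=0)
indicator_col = tf1.feature_column.indicator_column(vocab_col)
call_feature_columns(indicator_col, {'sizes': ['small', 'large', 'medium']})
```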
Using Keras preprocessing layers, use the tf.keras.layers.StringLookup layer with output_mode set to 'one_hot':
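For example, with the same illustrative vocabulary:

```python
string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'],
    num_oov_indices=0,
    output_mode='one_hot')
string_lookup_layer(['small', 'large', 'medium'])
```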
Note: For large one-hot encodings, it is much more efficient to use a sparse representation of the output. If you pass sparse=True to the StringLookup layer, the output of the layer will be a tf.sparse.SparseTensor, which can be efficiently handled as input to a tf.keras.layers.Dense layer.
Embedding string data with a vocabulary
For larger vocabularies, an embedding is often needed for good performance. Here is an example embedding a string feature using feature columns:
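A sketch, reusing the illustrative vocabulary above with an arbitrary embedding size of 4:

```python
vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'col', vocabulary_list=['small', 'medium', 'large'], num_oov_buckets=0)
embedding_col = tf1.feature_column.embedding_column(vocab_col, 4)
call_feature_columns(embedding_col, {'col': ['small', 'large', 'medium']})
```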
Using Keras preprocessing layers, this can be achieved by combining a tf.keras.layers.StringLookup layer and a tf.keras.layers.Embedding layer. The default output of the StringLookup is integer indices, which can be fed directly into an embedding.
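A sketch of the equivalent StringLookup plus Embedding combination:

```python
string_lookup_layer = tf.keras.layers.StringLookup(
    vocabulary=['small', 'medium', 'large'], num_oov_indices=0)
embedding = tf.keras.layers.Embedding(
    # The embedding table must cover the full vocabulary, including any
    # out-of-vocabulary indices the lookup layer reserves.
    input_dim=string_lookup_layer.vocabulary_size(),
    output_dim=4)
embedding(string_lookup_layer(['small', 'large', 'medium']))
```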
Note: The Embedding layer contains trainable parameters. While the StringLookup layer can be applied to data inside or outside of a model, the Embedding must always be part of a trainable Keras model to train correctly.
Summing weighted categorical data
In some cases, you need to deal with categorical data where each occurrence of a category comes with an associated weight. In feature columns, this is handled with tf.feature_column.weighted_categorical_column. When paired with an indicator_column, this has the effect of summing weights per category.
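A sketch with illustrative ids and weights:

```python
ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
indicator_col = tf1.feature_column.indicator_column(weighted_col)
call_feature_columns(indicator_col, {'ids': ids, 'weights': weights})
```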
In Keras, this can be done by passing a count_weights input to tf.keras.layers.CategoryEncoding with output_mode='count'.
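For example, with the same illustrative ids and weights:

```python
ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count')
count_layer(ids, count_weights=weights)
```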
Embedding weighted categorical data
You might alternatively want to embed weighted categorical inputs. In feature columns, the embedding_column contains a combiner argument. If any sample contains multiple entries for a category, they will be combined according to the argument setting (by default 'mean').
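A sketch with feature columns, again using illustrative ids, weights, and an embedding size of 4:

```python
ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

categorical_col = tf1.feature_column.categorical_column_with_identity(
    'ids', num_buckets=20)
weighted_col = tf1.feature_column.weighted_categorical_column(
    categorical_col, 'weights')
embedding_col = tf1.feature_column.embedding_column(
    weighted_col, 4, combiner='mean')
call_feature_columns(embedding_col, {'ids': ids, 'weights': weights})
```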
In Keras, there is no combiner option to tf.keras.layers.Embedding, but you can achieve the same effect with tf.keras.layers.Dense. The embedding_column above is simply linearly combining embedding vectors according to category weight. Though not obvious at first, it is exactly equivalent to representing your categorical inputs as a sparse weight vector of size (num_tokens), and multiplying them by a Dense kernel of shape (num_tokens, embedding_size).
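A sketch of that equivalence: a weighted count encoding (kept sparse here) followed by a Dense layer without a bias, whose kernel plays the role of the embedding table. Note this reproduces a 'sum' combiner; for 'mean', divide the weights by their per-sample total first.

```python
ids = tf.constant([[5, 11, 5, 17, 17]])
weights = tf.constant([[0.5, 1.5, 0.7, 1.8, 0.2]])

# Sum the weights per category into a single (sparse) vector of size num_tokens...
count_layer = tf.keras.layers.CategoryEncoding(
    num_tokens=20, output_mode='count', sparse=True)
# ...then linearly combine the rows of a Dense kernel according to those weights.
embedding_layer = tf.keras.layers.Dense(units=4, use_bias=False)
embedding_layer(count_layer(ids, count_weights=weights))
```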
Complete training example
To show a complete training workflow, first prepare some data with three features of different types:
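A small, entirely hypothetical dataset with an integer feature, a string feature, and a numeric feature, plus a binary target:

```python
features = {
    'type': [[0], [1], [1]],                     # integer categorical feature
    'size': [['small'], ['small'], ['medium']],  # string categorical feature
    'weight': [[2.7], [1.8], [1.6]],             # continuous numeric feature
}
labels = [1, 1, 0]
predict_features = {'type': [[1]], 'size': [['small']], 'weight': [[1.9]]}
```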
Define some common constants for both TensorFlow 1 and TensorFlow 2 workflows:
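For instance (these values are illustrative and are reused by the sketches below):

```python
vocab = ['small', 'medium', 'large']
one_hot_dims = 3       # number of distinct integer ids for 'type'
embedding_dims = 4     # embedding size for 'size'
weight_mean = 2.0      # normalization statistics for 'weight'
weight_variance = 1.0
```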
With feature columns
Feature columns must be passed as a list to the estimator on creation, and will be called implicitly during training.
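A condensed sketch built on the hypothetical data and constants above; the choice of a DNNClassifier and its hidden units is illustrative:

```python
categorical_col = tf1.feature_column.categorical_column_with_identity(
    'type', num_buckets=one_hot_dims)
string_col = tf1.feature_column.categorical_column_with_vocabulary_list(
    'size', vocabulary_list=vocab, num_oov_buckets=1)
numeric_col = tf1.feature_column.numeric_column(
    'weight', normalizer_fn=lambda x: (x - weight_mean) / tf.sqrt(weight_variance))

estimator = tf1.estimator.DNNClassifier(
    feature_columns=[
        tf1.feature_column.indicator_column(categorical_col),
        tf1.feature_column.embedding_column(string_col, embedding_dims),
        numeric_col,
    ],
    hidden_units=[64])

def _train_input_fn():
  return tf1.data.Dataset.from_tensor_slices((features, labels)).batch(1)

estimator.train(_train_input_fn)
```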
The feature columns will also be used to transform input data when running inference on the model.
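For example, predictions on the raw predict_features pass through the same feature columns:

```python
predictions = estimator.predict(
    lambda: tf1.data.Dataset.from_tensor_slices(predict_features).batch(1))
next(predictions)
```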
With Keras preprocessing layers
Keras preprocessing layers are more flexible in where they can be called. A layer can be applied directly to tensors, used inside a tf.data input pipeline, or built directly into a trainable Keras model.

In this example, you will apply preprocessing layers inside a tf.data input pipeline. To do this, you can define a separate tf.keras.Model to preprocess your input features. This model is not trainable, but is a convenient way to group preprocessing layers.
Note: As an alternative to supplying a vocabulary and normalization statistics on layer creation, many preprocessing layers provide an adapt() method for learning layer state directly from the input data. See the preprocessing guide for more details.
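Here is a sketch of such a preprocessing model, with the vocabulary and statistics from the constants above supplied directly rather than adapted:

```python
preprocessing_inputs = {
    'type': tf.keras.Input(shape=(1,), dtype='int32', name='type'),
    'size': tf.keras.Input(shape=(1,), dtype='string', name='size'),
    'weight': tf.keras.Input(shape=(1,), dtype='float32', name='weight'),
}
# One-hot encode the integer and string inputs, and normalize the numeric input.
preprocessing_outputs = {
    'type': tf.keras.layers.CategoryEncoding(
        num_tokens=one_hot_dims, output_mode='one_hot')(preprocessing_inputs['type']),
    'size': tf.keras.layers.StringLookup(
        vocabulary=vocab, output_mode='one_hot')(preprocessing_inputs['size']),
    'weight': tf.keras.layers.Normalization(
        axis=None, mean=weight_mean,
        variance=weight_variance)(preprocessing_inputs['weight']),
}
preprocessing_model = tf.keras.Model(preprocessing_inputs, preprocessing_outputs)
```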
You can now apply this model inside a call to tf.data.Dataset.map. Please note that the function passed to map will automatically be converted into a tf.function, and the usual caveats for writing tf.function code apply (no side effects).
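For example:

```python
train_dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(1)
# Apply the (non-trainable) preprocessing model to the raw features only.
train_dataset = train_dataset.map(lambda x, y: (preprocessing_model(x), y))
```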
Next, you can define a separate Model containing the trainable layers. Note how the inputs to this model now reflect the preprocessed feature types and shapes.
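A sketch of such a trainable model; the extra slot in the 'size' one-hot accounts for the StringLookup layer's default out-of-vocabulary index:

```python
training_inputs = {
    'type': tf.keras.Input(shape=(one_hot_dims,), dtype='float32', name='type'),
    'size': tf.keras.Input(shape=(len(vocab) + 1,), dtype='float32', name='size'),
    'weight': tf.keras.Input(shape=(1,), dtype='float32', name='weight'),
}
# Concatenate the already-preprocessed features and stack trainable layers on top.
x = tf.keras.layers.Concatenate()(
    [training_inputs['type'], training_inputs['size'], training_inputs['weight']])
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
training_model = tf.keras.Model(training_inputs, output)
training_model.compile(optimizer='adam', loss='binary_crossentropy')
```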
You can now train the training_model with tf.keras.Model.fit.
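For example:

```python
training_model.fit(train_dataset, epochs=3)
```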
Finally, at inference time, it can be useful to combine these separate stages into a single model that handles raw feature inputs.
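A sketch of chaining the two models together, assuming preprocessing_model.input preserves the dictionary structure of the inputs:

```python
inference_inputs = preprocessing_model.input
inference_outputs = training_model(preprocessing_model(inference_inputs))
inference_model = tf.keras.Model(inference_inputs, inference_outputs)

predict_dataset = tf.data.Dataset.from_tensor_slices(predict_features).batch(1)
inference_model.predict(predict_dataset)
```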
This composed model can be saved as a .keras file for later use.
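For example:

```python
inference_model.save('model.keras')
restored_model = tf.keras.models.load_model('model.keras')
restored_model.predict(predict_dataset)
```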
Note: Preprocessing layers are not trainable, which allows you to apply them asynchronously using tf.data. This has performance benefits, as you can both prefetch preprocessed batches and free up any accelerators to focus on the differentiable parts of a model (learn more in the Prefetching section of the Better performance with the tf.data API guide). As this guide shows, separating preprocessing during training and composing it during inference is a flexible way to leverage these performance gains. However, if your model is small or preprocessing time is negligible, it may be simpler to build preprocessing into a complete model from the start. To do this you can build a single model starting with tf.keras.Input, followed by preprocessing layers, followed by trainable layers.
Feature column equivalence table
For reference, here is an approximate correspondence between feature columns and Keras preprocessing layers:
* The output_mode can be passed to tf.keras.layers.CategoryEncoding, tf.keras.layers.StringLookup, tf.keras.layers.IntegerLookup, and tf.keras.layers.TextVectorization.
† tf.keras.layers.TextVectorization can handle freeform text input directly (for example, entire sentences or paragraphs). This is not a one-to-one replacement for categorical sequence handling in TensorFlow 1, but may offer a convenient replacement for ad-hoc text preprocessing.
Note: Linear estimators, such as tf.estimator.LinearClassifier, can handle direct categorical input (integer indices) without an embedding_column or indicator_column. However, integer indices cannot be passed directly to tf.keras.layers.Dense or tf.keras.experimental.LinearModel. These inputs should first be encoded with tf.keras.layers.CategoryEncoding with output_mode='count' (and sparse=True if the category sizes are large) before calling into Dense or LinearModel.
Next steps
For more information on Keras preprocessing layers, go to the Working with preprocessing layers guide.
For a more in-depth example of applying preprocessing layers to structured data, refer to the Classify structured data using Keras preprocessing layers tutorial.