Copyright 2021 The TensorFlow Authors.
Preprocess data with TensorFlow Transform
The Feature Engineering Component of TensorFlow Extended (TFX)
Note: We recommend running this tutorial in a Colab notebook, with no setup required! Just click "Run in Google Colab".
This Colab notebook gives a very simple example of how TensorFlow Transform (tf.Transform) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.
TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using TensorFlow Transform you could:
Normalize an input value by using the mean and standard deviation
Convert strings to integers by generating a vocabulary over all of the input values
Convert floats to integers by assigning them to buckets, based on the observed data distribution
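The first and third items can be sketched in plain Python to show why a full pass is needed: the statistics must be computed over every training value before any single value can be transformed. This is only an illustration using the standard library, not tf.Transform's implementation; the sample values, the helper name quantile_boundaries, and the simple quantile rule are assumptions made for the example.

```python
import bisect
import statistics

# Hypothetical training column (chosen so the statistics are round numbers).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Full pass 1: mean and population standard deviation over the whole column.
mean = statistics.fmean(values)
std = statistics.pstdev(values)
z_scores = [(v - mean) / std for v in values]


# Full pass 2: bucket boundaries taken from the observed distribution.
def quantile_boundaries(column, num_buckets):
    """Return num_buckets - 1 cut points read off the sorted column."""
    ordered = sorted(column)
    return [ordered[len(ordered) * i // num_buckets] for i in range(1, num_buckets)]


boundaries = quantile_boundaries(values, num_buckets=4)

# Per-example step: look up each value's bucket using the precomputed boundaries.
buckets = [bisect.bisect_right(boundaries, v) for v in values]

print(z_scores)
print(boundaries, buckets)
```

Once mean, std, and boundaries are known, each example can be transformed independently, which is what makes it possible to reuse the same constants at serving time.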
TensorFlow has built-in support for manipulations on a single example or a batch of examples. tf.Transform extends these capabilities to support full passes over the entire training dataset.
The output of tf.Transform is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.
Upgrade Pip
To avoid upgrading Pip in a system when running locally, check to make sure that we're running in Colab. Local systems can of course be upgraded separately.
Install TensorFlow Transform
Imports
Data: Create some dummy data
We'll create some simple dummy data for our simple example:
raw_data is the initial raw data that we're going to preprocess
raw_data_metadata contains the schema that tells us the types of each of the columns in raw_data. In this case, it's very simple.
Transform: Create a preprocessing function
The preprocessing function is the most important concept of tf.Transform. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a Tensor or SparseTensor. There are two main groups of API calls that typically form the heart of a preprocessing function:
TensorFlow Ops: Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transforms raw data into transformed data one feature vector at a time. These will run for every example, during both training and serving.
TensorFlow Transform Analyzers/Mappers: Any of the analyzers/mappers provided by tf.Transform. These also accept and return tensors, and typically contain a combination of TensorFlow ops and Beam computation. Unlike TensorFlow ops, the Beam computation runs only in the Beam pipeline during analysis: it runs once, prior to training, and typically makes a full pass over the entire training dataset. Analyzers create tf.constant tensors, which are added to your graph. For example, tft.min computes the minimum of a tensor over the training dataset.
Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during training do not change. If your data has trend or seasonality components, plan accordingly.
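The split between the analysis phase and the per-example phase, and the consequence described in the caution above, can be sketched without TensorFlow. In this illustration the analyzer is a plain function that makes one pass over the training data and returns constants, and the transform closes over them. The function names and the serving value are hypothetical, and this is not how tf.Transform is implemented internally.

```python
def analyze_min_max(training_values):
    """Analysis phase: one full pass over the training data."""
    return min(training_values), max(training_values)


def make_scale_to_0_1(training_values):
    """Freeze the analysis results and return a per-example transform."""
    lo, hi = analyze_min_max(training_values)  # computed once, never updated

    def scale(value):
        return (value - lo) / (hi - lo)

    return scale


scale = make_scale_to_0_1([1.0, 2.0, 3.0])

# Training-time values land in [0, 1] as intended.
train_scaled = [scale(v) for v in [1.0, 2.0, 3.0]]

# A serving-time value outside the training range is scaled with the same
# frozen constants, so in this sketch it falls outside [0, 1].
drifted = scale(5.0)

print(train_scaled, drifted)
```

If incoming data shifts like this, the constants only change when the analysis is rerun on fresh training data.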
Note: The preprocessing_fn is not directly callable. This means that calling preprocessing_fn(raw_data) will not work. Instead, it must be passed to the Transform Beam API as shown in the following cells.
Syntax
You're almost ready to put everything together and use Apache Beam to run it.
Apache Beam uses a special syntax to define and invoke transforms. For example, in this line:
result = pass_this | 'name this step' >> to_this_call
The method to_this_call is being invoked and passed the object called pass_this, and this operation will be referred to as name this step in a stack trace. The result of the call to to_this_call is returned in result. You will often see stages of a pipeline chained together like this:
and since that started with a new pipeline, you can continue like this:
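The syntax works because of ordinary Python operator precedence: >> binds more tightly than |, so the label is attached to the transform first and the labeled transform is then applied to the value on its left. The toy classes below imitate that behaviour in plain Python to make the evaluation order visible. ToyTransform, ToyCollection, and the step names are hypothetical stand-ins, not Apache Beam classes.

```python
class ToyTransform:
    """A labeled function, loosely imitating a Beam transform."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label

    def __rrshift__(self, label):
        # Handles: 'name this step' >> transform
        return ToyTransform(self.fn, label)


class ToyCollection:
    """Holds values and records the labels of the steps applied so far."""

    def __init__(self, values, steps=()):
        self.values = list(values)
        self.steps = tuple(steps)

    def __or__(self, transform):
        # Handles: collection | transform
        new_values = [transform.fn(v) for v in self.values]
        return ToyCollection(new_values, self.steps + (transform.label,))


double = ToyTransform(lambda v: v * 2)
add_one = ToyTransform(lambda v: v + 1)

pass_this = ToyCollection([1, 2, 3])

# Parsed as: (pass_this | ('first step' >> double)) | ('second step' >> add_one)
result = pass_this | 'first step' >> double | 'second step' >> add_one

# The result is itself a collection, so the chain can be continued later.
next_result = result | 'third step' >> double

print(result.values, result.steps)
print(next_result.values, next_result.steps)
```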
Putting it all together
Now we're ready to transform our data. We'll use Apache Beam with a direct runner, and supply three inputs:
raw_data - The raw input data that we created above
raw_data_metadata - The schema for the raw data
preprocessing_fn - The function that we created to do our transformation
Is this the right answer?
Previously, we used tf.Transform to do this:
x_centered - With input of [1, 2, 3] the mean of x is 2, and we subtract it from x to center our x values at 0. So our result of [-1.0, 0.0, 1.0] is correct.
y_normalized - We wanted to scale our y values between 0 and 1. Our input was [1, 2, 3] so our result of [0.0, 0.5, 1.0] is correct.
s_integerized - We wanted to map our strings to indexes in a vocabulary, and there were only 2 words in our vocabulary ("hello" and "world"). So with input of ["hello", "world", "hello"] our result of [0, 1, 0] is correct. Since "hello" occurs most frequently in this data, it will be the first entry in the vocabulary.
x_centered_times_y_normalized - We wanted to create a new feature by crossing x_centered and y_normalized using multiplication. Note that this multiplies the results, not the original values, and our new result of [-0.0, 0.0, 1.0] is correct.
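These four results can be reproduced by hand in plain Python, which is a useful sanity check on the reasoning above. The sketch below recomputes each feature from the inputs quoted in this section. It uses collections.Counter as a stand-in for the vocabulary computation and is not the tf.Transform code itself; ties in word frequency do not arise here, so the sketch makes no attempt to match tf.Transform's tie-breaking.

```python
from collections import Counter

x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]
s = ["hello", "world", "hello"]

# x_centered: subtract the mean of x (a full-pass statistic).
x_mean = sum(x) / len(x)
x_centered = [v - x_mean for v in x]

# y_normalized: scale y into [0, 1] using the observed min and max.
y_min, y_max = min(y), max(y)
y_normalized = [(v - y_min) / (y_max - y_min) for v in y]

# s_integerized: index into a vocabulary ordered by descending frequency.
vocabulary = [word for word, _ in Counter(s).most_common()]
s_integerized = [vocabulary.index(word) for word in s]

# Cross the two transformed features (not the raw inputs) by multiplication.
x_centered_times_y_normalized = [a * b for a, b in zip(x_centered, y_normalized)]

print(x_centered)
print(y_normalized)
print(s_integerized)
print(x_centered_times_y_normalized)
```

The -0.0 in the last result is simply the floating-point product of -1.0 and 0.0; it compares equal to 0.0.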
Use the resulting transform_fn
The transform_fn/ directory contains a tf.saved_model implementing the transformation, with all the constants from the tensorflow-transform analysis results built into the graph.
It is possible to load this directly with tf.saved_model.load, but this is not easy to use:
A better approach is to load it using tft.TFTransformOutput. The TFTransformOutput.transform_features_layer method returns a tft.TransformFeaturesLayer object that can be used to apply the transformation:
This tft.TransformFeaturesLayer expects a dictionary of batched features. So create a Dict[str, tf.Tensor] from the List[Dict[str, Any]] in raw_data:
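The reshaping step itself, from a list of per-example rows to one batched column per feature, can be sketched in plain Python. The raw_data values below are reconstructed from the inputs quoted earlier in this tutorial, and rows_to_columns is a hypothetical helper. In the notebook, each resulting list would additionally be converted to a tensor (for example with tf.constant) before being passed to the layer.

```python
from typing import Any, Dict, List

# Values reconstructed from the inputs discussed in this tutorial.
raw_data: List[Dict[str, Any]] = [
    {"x": 1, "y": 1, "s": "hello"},
    {"x": 2, "y": 2, "s": "world"},
    {"x": 3, "y": 3, "s": "hello"},
]


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-example dicts into a dict of per-feature batches."""
    return {key: [row[key] for row in rows] for key in rows[0]}


raw_data_batch = rows_to_columns(raw_data)
print(raw_data_batch)
```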
You can use the tft.TransformFeaturesLayer on its own:
Export
A more typical use case would use tf.Transform to apply the transformation to the training and evaluation datasets (see the next tutorial for an example). Then, after training, before exporting the model, attach the tft.TransformFeaturesLayer as the first layer so that you can export it as part of your tf.saved_model. For a concrete example, keep reading.
An example training model
Below is a model that:
takes the transformed batch,
stacks them all together into a simple (batch, features) matrix,
runs them through a few dense layers, and
produces 10 linear outputs.
In a real use case you would apply a one-hot to the s_integerized feature.
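The stacking step and the one-hot suggestion can be illustrated on transformed values from this tutorial. The sketch uses plain Python lists instead of tensors and only three of the features; the one_hot helper and the vocabulary size of 2 are assumptions made for the example.

```python
# Transformed features for a batch of three examples (values from this tutorial).
transformed = {
    "x_centered": [-1.0, 0.0, 1.0],
    "y_normalized": [0.0, 0.5, 1.0],
    "s_integerized": [0, 1, 0],
}


def one_hot(index, depth):
    """Return a list of length depth with 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


# Build one row per example: the numeric features followed by the one-hot
# encoded vocabulary index, giving a (batch, features) matrix.
matrix = [
    [xc, yn] + one_hot(si, depth=2)
    for xc, yn, si in zip(
        transformed["x_centered"],
        transformed["y_normalized"],
        transformed["s_integerized"],
    )
]

for row in matrix:
    print(row)
```

One-hot encoding keeps the dense layers from treating the vocabulary index as an ordered quantity.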
You could train this model on a dataset transformed by tf.Transform:
Imagine we trained the model.
This model runs on the transformed inputs
An example export wrapper
Imagine you've trained the above model and want to export it.
You'll want to include the transform function in the exported model:
This combined model works on the raw data, and produces exactly the same results as calling the trained model directly:
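The idea behind the wrapper is plain function composition: the exported object applies the transform first and then the trained model, so callers only ever pass raw inputs. The sketch below shows that pattern with ordinary Python callables. ToyExportModel, toy_transform, and toy_model are hypothetical stand-ins for the export wrapper, the tft.TransformFeaturesLayer, and the trained model.

```python
class ToyExportModel:
    """Bundle an input transform with a model so it accepts raw inputs."""

    def __init__(self, trained_model, input_transform):
        self.trained_model = trained_model
        self.input_transform = input_transform

    def __call__(self, raw_inputs):
        transformed = self.input_transform(raw_inputs)
        return self.trained_model(transformed)


def toy_transform(raw_inputs):
    # Uses a constant that an analysis pass would have frozen (mean of x = 2.0).
    return {"x_centered": [v - 2.0 for v in raw_inputs["x"]]}


def toy_model(transformed):
    # Stand-in for a trained model: a fixed linear function of the feature.
    return [3.0 * v + 1.0 for v in transformed["x_centered"]]


raw_batch = {"x": [1.0, 2.0, 3.0]}
export_model = ToyExportModel(toy_model, toy_transform)

via_wrapper = export_model(raw_batch)                # raw data in
via_two_steps = toy_model(toy_transform(raw_batch))  # transform, then model

print(via_wrapper, via_two_steps)
```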
This export_model includes the tft.TransformFeaturesLayer and is entirely self-contained. You can save it and restore it in another environment and still get exactly the same result: