Copyright 2020 The TensorFlow Authors.
Preprocessing data with TensorFlow Transform
The Feature Engineering Component of TensorFlow Extended (TFX)
This example colab notebook provides a somewhat more advanced example of how TensorFlow Transform (`tf.Transform`) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.
TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset. For example, using TensorFlow Transform you could:
Normalize an input value by using the mean and standard deviation
Convert strings to integers by generating a vocabulary over all of the input values
Convert floats to integers by assigning them to buckets, based on the observed data distribution
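Each of these transforms needs statistics from a full pass over the training data before any single example can be transformed. Here is a rough pure-Python sketch of the idea, one helper per bullet above; the function names are illustrative, not the actual `tft` API:

```python
import math

def analyze(values):
    """One full pass over the training data to collect the needed statistics."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {"mean": mean, "std": std}

def normalize(value, stats):
    """Per-example transform using the dataset-wide statistics."""
    return (value - stats["mean"]) / stats["std"]

def build_vocab(strings):
    """Map each distinct string to an integer ID, in first-seen order."""
    vocab = {}
    for s in strings:
        vocab.setdefault(s, len(vocab))
    return vocab

def bucketize(value, boundaries):
    """Assign a value to the index of the first boundary it falls below."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

stats = analyze([1.0, 2.0, 3.0, 4.0])
vocab = build_vocab(["blue", "red", "blue", "green"])
print(normalize(3.0, stats))       # needs the mean/std of the whole dataset
print(vocab["green"])              # needs the full vocabulary
print(bucketize(3.0, [2.0, 3.5]))  # boundaries come from the observed distribution
```

In `tf.Transform` the `analyze`-style passes run once in a Beam pipeline, and the per-example transforms become TensorFlow ops that use the resulting constants.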
TensorFlow has built-in support for manipulations on a single example or a batch of examples. `tf.Transform` extends these capabilities to support full passes over the entire training dataset.
The output of `tf.Transform` is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.
Key Point: In order to understand `tf.Transform` and how it works with Apache Beam, you'll need to know a little bit about Apache Beam itself. The Beam Programming Guide is a great place to start.
## What we're doing in this example
In this example we'll be processing a widely used dataset containing census data, and training a model to do classification. Along the way we'll be transforming the data using `tf.Transform`.
Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is a feature relevant to the problem you want to solve or will it introduce bias? For more information, read about ML fairness.
Note: TensorFlow Model Analysis is a powerful tool for understanding how well your model predicts for various segments of your data, including understanding how your model may reinforce societal biases and disparities.
Install TensorFlow Transform
Imports and globals
First import the stuff we need.
Next download the data files:
Name our columns
We'll create some handy lists for referencing the columns in our dataset.
Here's a quick preview of the data:
The test data has 1 header line that needs to be skipped, and a trailing "." at the end of each line.
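A minimal sketch of that cleanup (the notebook handles this in a hidden cell; this standalone version only illustrates skipping the header and stripping the trailing period):

```python
def clean_test_lines(lines):
    """Skip the single header line and drop the trailing '.' on each row."""
    return [line.rstrip().rstrip(".") for line in lines[1:]]

# A header line followed by one data row, shaped like the census test file.
raw = ["|1x3 Cross validator\n",
       "25, Private, 226802, 11th, 7, <=50K.\n"]
print(clean_test_lines(raw))
```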
### Define our features and schema

Let's define a schema based on the types of the columns in our input. Among other things this will help with importing them correctly.
[Optional] Encode and decode tf.train.Example protos
This tutorial needs to convert examples from the dataset to and from `tf.train.Example` protos in a few places.
The hidden `encode_example` function below converts a dictionary of features from the dataset to a `tf.train.Example`.
Now you can convert dataset examples into `Example` protos:
You can also convert batches of serialized Example protos back into a dictionary of tensors:
In some cases the label will not be passed in, so the encode function is written so that the label is optional:
When creating an `Example` proto it will simply not contain the label key.
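A pure-Python stand-in for that behavior (the real `encode_example` builds a `tf.train.Example` proto; this dict-based sketch only shows how the optional label is handled):

```python
def encode_example(features, label=None):
    """Build the example dict; include the label key only when a label is given."""
    example = dict(features)
    if label is not None:
        example["label"] = label
    return example

with_label = encode_example({"age": 25}, label="<=50K")
without_label = encode_example({"age": 25})
print("label" in with_label, "label" in without_label)
```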
### Setting hyperparameters and basic housekeeping
Constants and hyperparameters used for training.
## Preprocessing with tf.Transform

### Create a tf.Transform preprocessing_fn

The preprocessing function is the most important concept of `tf.Transform`. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a `Tensor` or `SparseTensor`. There are two main groups of API calls that typically form the heart of a preprocessing function:
TensorFlow Ops: Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transforms raw data into transformed data one feature vector at a time. These will run for every example, during both training and serving.
TensorFlow Transform Analyzers/Mappers: Any of the analyzers/mappers provided by `tf.Transform`. These also accept and return tensors, and typically contain a combination of TensorFlow ops and Beam computation, but unlike TensorFlow ops they only run in the Beam pipeline during analysis, which requires a full pass over the entire training dataset. The Beam computation runs only once, prior to training, and its results are added to your graph as `tf.constant` tensors. For example, `tft.min` computes the minimum of a tensor over the training dataset.
Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during training do not change. If your data has trend or seasonality components, plan accordingly.
Here is a `preprocessing_fn` for this dataset. It does several things:
- Using `tft.scale_to_0_1`, it scales the numeric features to the `[0,1]` range.
- Using `tft.compute_and_apply_vocabulary`, it computes a vocabulary for each of the categorical features, and returns the integer IDs for each input as a `tf.int64`. This applies both to string and integer categorical inputs.
- It applies some manual transformations to the data using standard TensorFlow operations. Here these operations are applied to the label but could transform the features as well. The TensorFlow operations do several things:
  - They build a lookup table for the label (the `tf.init_scope` ensures that the table is only created the first time the function is called).
  - They normalize the text of the label.
  - They convert the label to a one-hot.
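In pure-Python terms (ignoring the graph and `tf.init_scope` details, and assuming a two-class label ordering for illustration), the label path amounts to:

```python
def normalize_label(text):
    """Strip whitespace and the trailing '.' found in the test split."""
    return text.strip().rstrip(".")

# Stand-in for the lookup table: label string -> index (ordering assumed).
LABELS = [">50K", "<=50K"]

def one_hot(label):
    """Look up the normalized label and convert its index to a one-hot vector."""
    index = LABELS.index(normalize_label(label))
    vec = [0.0] * len(LABELS)
    vec[index] = 1.0
    return vec

print(one_hot(" <=50K. "))  # normalized, looked up, then one-hot encoded
```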
Syntax
You're almost ready to put everything together and use Apache Beam to run it.
Apache Beam uses a special syntax to define and invoke transforms. For example, in this line:
The method `to_this_call` is being invoked and passed the object called `pass_this`, and this operation will be referred to as `name this step` in a stack trace. The result of the call to `to_this_call` is returned in `result`. You will often see stages of a pipeline chained together like this:
and since that started with a new pipeline, you can continue like this:
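Beam gets this syntax by overloading Python's `|` and `>>` operators. A tiny stand-alone sketch of the mechanism (not Beam itself; the class names here are illustrative stand-ins for `PTransform` and `PCollection`):

```python
class Transform:
    """Stand-in for a Beam PTransform: a callable step with an optional name."""
    def __init__(self, fn, name="unnamed"):
        self.fn, self.name = fn, name
    def __rrshift__(self, name):
        # Makes  'name this step' >> transform  return a named copy.
        return Transform(self.fn, name)

class PValue:
    """Stand-in for a Beam PCollection: applying | runs the transform."""
    def __init__(self, data):
        self.data = data
    def __or__(self, transform):
        # Makes  value | transform  apply the step and yield a new PValue.
        return PValue(transform.fn(self.data))

double = Transform(lambda xs: [x * 2 for x in xs])
result = PValue([1, 2, 3]) | "double each element" >> double
print(result.data)
```

In real Beam the `|` application is deferred and builds a pipeline graph rather than running immediately, but the operator plumbing is the same idea.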
Transform the data
Now we're ready to start transforming our data in an Apache Beam pipeline.
- Read in the data using the `tfxio.CsvTFXIO` CSV reader (to process lines of text in a pipeline use `tfxio.BeamRecordCsvTFXIO` instead).
- Analyse and transform the data using the `preprocessing_fn` defined above.
- Write out the result as a `TFRecord` of `Example` protos, which we will use for training a model later.
Run the pipeline:
Wrap up the output directory as a `tft.TFTransformOutput`:
If you look in the directory you'll see it contains three things:
- The `train_transformed` and `test_transformed` data files
- The `transform_fn` directory (a `tf.saved_model`)
- The `transformed_metadata`

The following sections show how to use these artifacts to train a model.
## Using our preprocessed data to train a model using tf.keras
To show how `tf.Transform` enables us to use the same code for both training and serving, and thus prevent skew, we're going to train a model. To train our model and prepare our trained model for production we need to create input functions. The main difference between our training input function and our serving input function is that training data contains the labels, and production data does not. The arguments and returns are also somewhat different.
### Create an input function for training
Running the pipeline in the previous section created `TFRecord` files containing the transformed data.
The following code uses `tf.data.experimental.make_batched_features_dataset` and `tft.TFTransformOutput.transformed_feature_spec` to read these data files as a `tf.data.Dataset`:
Below you can see a transformed sample of the data. Note how the numeric columns like `education-num` and `hours-per-week` are converted to floats with a range of `[0,1]`, and the string columns have been converted to IDs:
Train and evaluate the model
Build the model
Build the datasets
Train and evaluate the model:
Transform new data
In the previous section the training process used the hard-copies of the transformed data that were generated by `tft_beam.AnalyzeAndTransformDataset` in the `transform_dataset` function.
For operating on new data you'll need to load the final version of the `preprocessing_fn` that was saved by `tft_beam.WriteTransformFn`.
The `TFTransformOutput.transform_features_layer` method loads the `preprocessing_fn` SavedModel from the output directory.
Here's a function to load new, unprocessed batches from a source file:
Load the `tft.TransformFeaturesLayer` to transform this data with the `preprocessing_fn`:
The `tft_layer` is smart enough to still execute the transformation if only a subset of features are passed in. For example, if you only pass in two features, you'll get just the transformed versions of those features back:
Here's a more robust version that drops features that are not in the feature-spec, and returns a `(features, label)` pair if the label is in the provided features:
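The hidden helper follows roughly this shape (a pure-Python sketch; the real version works on tensor dictionaries and the `tf.Transform` feature spec, and the spec contents here are illustrative):

```python
# Stand-in for the raw feature spec and label key used in the notebook.
RAW_FEATURE_SPEC = {"age", "education", "label"}
LABEL_KEY = "label"

def prepare(features):
    """Drop keys outside the feature spec; split off the label when present."""
    kept = {k: v for k, v in features.items() if k in RAW_FEATURE_SPEC}
    if LABEL_KEY in kept:
        label = kept.pop(LABEL_KEY)
        return kept, label
    return kept

print(prepare({"age": 25, "bogus": 1, "label": "<=50K"}))
print(prepare({"age": 25}))
```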
Now you can use `Dataset.map` to apply that transformation on the fly to new data:
Export the model
So you have a trained model, and a method to apply the `preprocessing_fn` to new data. Assemble them into a new model that accepts serialized `tf.train.Example` protos as input.
Build the model and test-run it on the batch of serialized examples:
Export the model as a SavedModel:
Reload the model and test it on the same batch of examples:
## What we did

In this example we used `tf.Transform` to preprocess a dataset of census data, and train a model with the cleaned and transformed data. We also created an input function that we could use when we deploy our trained model in a production environment to perform inference. By using the same code for both training and inference we avoid any issues with data skew.

Along the way we learned about creating an Apache Beam transform to perform the transformation that we needed for cleaning the data. We also saw how to use this transformed data to train a model using `tf.keras`. This is just a small piece of what TensorFlow Transform can do! We encourage you to dive into `tf.Transform` and discover what it can do for you.
[Optional] Using our preprocessed data to train a model using tf.estimator
Warning: Estimators are not recommended for new code. Estimators run `v1.Session`-style code which is more difficult to write correctly, and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our compatibility guarantees, but will receive no fixes other than security vulnerabilities. See the migration guide for details.
### Create an input function for training

### Create an input function for serving
Let's create an input function that we could use in production, and prepare our trained model for serving.
### Wrap our input data in FeatureColumns

Our model will expect our data in TensorFlow FeatureColumns.
### Train, Evaluate, and Export our model
### Put it all together

We've created all the stuff we need to preprocess our census data, train a model, and prepare it for serving. So far we've just been getting things ready. It's time to start running!
Note: Scroll the output from this cell to see the whole process. The results will be at the bottom.