Classify structured data using Keras preprocessing layers
This tutorial demonstrates how to classify structured data, such as tabular data, using a simplified version of the PetFinder dataset from a Kaggle competition stored in a CSV file.
You will use Keras to define the model, and Keras preprocessing layers as a bridge to map from columns in a CSV file to features used to train the model. The goal is to predict if a pet will be adopted.
This tutorial contains complete code for:
- Building an input pipeline to batch and shuffle the rows using tf.data. (Visit tf.data: Build TensorFlow input pipelines for more details.)
- Mapping from columns in the CSV file to features used to train the model with the Keras preprocessing layers.
- Building, training, and evaluating a model using the Keras built-in methods.
Note: This tutorial is similar to Classify structured data with feature columns. This version uses the Keras preprocessing layers instead of the tf.feature_column API, as the former are more intuitive and can be easily included inside your model to simplify deployment.
The PetFinder.my mini dataset
There are several thousand rows in the PetFinder.my mini dataset's CSV file, where each row describes a pet (a dog or a cat) and each column describes an attribute, such as age, breed, color, and so on.
In the dataset's summary below, notice there are mostly numerical and categorical columns. In this tutorial, you will only be dealing with those two feature types, dropping Description (a free text feature) and AdoptionSpeed (a classification feature) during data preprocessing.
| Column | Pet description | Feature type | Data type |
| --- | --- | --- | --- |
| Type | Type of animal (Dog, Cat) | Categorical | String |
| Age | Age | Numerical | Integer |
| Breed1 | Primary breed | Categorical | String |
| Color1 | Color 1 | Categorical | String |
| Color2 | Color 2 | Categorical | String |
| MaturitySize | Size at maturity | Categorical | String |
| FurLength | Fur length | Categorical | String |
| Vaccinated | Pet has been vaccinated | Categorical | String |
| Sterilized | Pet has been sterilized | Categorical | String |
| Health | Health condition | Categorical | String |
| Fee | Adoption fee | Numerical | Integer |
| Description | Profile write-up | Text | String |
| PhotoAmt | Total uploaded photos | Numerical | Integer |
| AdoptionSpeed | Categorical speed of adoption | Classification | Integer |
Import TensorFlow and other libraries
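A minimal set of imports used throughout this tutorial (assuming TensorFlow 2.x, where Keras is bundled):

```python
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras import layers

print(tf.__version__)
```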
Load the dataset and read it into a pandas DataFrame
pandas is a Python library with many helpful utilities for loading and working with structured data. Use tf.keras.utils.get_file to download and extract the CSV file with the PetFinder.my mini dataset, and load it into a DataFrame with pandas.read_csv:
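A sketch of this step. The download URL and extracted file path below are those of TensorFlow's hosted copy of the dataset; treat them as assumptions if you host the file elsewhere:

```python
dataset_url = 'http://storage.googleapis.com/download.tensorflow.org/data/petfinder-mini.zip'
csv_file = 'datasets/petfinder-mini/petfinder-mini.csv'

# Download and unzip the archive into the local cache directory.
tf.keras.utils.get_file('petfinder_mini.zip', dataset_url,
                        extract=True, cache_dir='.')
dataframe = pd.read_csv(csv_file)
```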
Inspect the dataset by checking the first five rows of the DataFrame:
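For example:

```python
dataframe.head()
```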
Create a target variable
The original task in Kaggle's PetFinder.my Adoption Prediction competition was to predict the speed at which a pet will be adopted (e.g. in the first week, the first month, the first three months, and so on).
In this tutorial, you will simplify the task by transforming it into a binary classification problem, where you simply have to predict whether a pet was adopted or not.
After modifying the AdoptionSpeed column, 0 will indicate the pet was not adopted, and 1 will indicate it was.
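A sketch of this transformation, assuming (as in the Kaggle competition) that an AdoptionSpeed of 4 means the pet was not adopted:

```python
# In the original dataset, an `AdoptionSpeed` of 4 indicates
# the pet was not adopted.
dataframe['target'] = np.where(dataframe['AdoptionSpeed'] == 4, 0, 1)

# Drop unused features.
dataframe = dataframe.drop(columns=['AdoptionSpeed', 'Description'])
```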
Split the DataFrame into training, validation, and test sets
The dataset is in a single pandas DataFrame. Split it into training, validation, and test sets using, for example, an 80:10:10 ratio, respectively:
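For example, with numpy.split on a shuffled DataFrame:

```python
train, val, test = np.split(dataframe.sample(frac=1),
                            [int(0.8 * len(dataframe)), int(0.9 * len(dataframe))])

print(len(train), 'training examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
```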
Create an input pipeline using tf.data
Next, create a utility function that converts each training, validation, and test set DataFrame into a tf.data.Dataset, then shuffles and batches the data.
Note: If you were working with a very large CSV file (so large that it does not fit into memory), you would use the tf.data API to read it from disk directly. That is not covered in this tutorial.
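One possible implementation of such a helper (the name df_to_dataset matches how it is referred to below):

```python
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  df = dataframe.copy()
  labels = df.pop('target')
  # Give each column an explicit trailing feature dimension.
  df = {key: value.to_numpy()[:, tf.newaxis] for key, value in df.items()}
  ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(batch_size)
  return ds
```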
Now, use the newly created df_to_dataset function to check the format of the data the input pipeline helper returns by calling it on the training data, and use a small batch size to keep the output readable:
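For example:

```python
train_ds = df_to_dataset(train, batch_size=5)

[(train_features, label_batch)] = train_ds.take(1)
print('Every feature:', list(train_features.keys()))
print('A batch of ages:', train_features['Age'])
print('A batch of targets:', label_batch)
```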
As the output demonstrates, the training set returns a dictionary of column names (from the DataFrame) that map to column values from rows.
Apply the Keras preprocessing layers
The Keras preprocessing layers allow you to build Keras-native input processing pipelines, which can be used as independent preprocessing code in non-Keras workflows, combined directly with Keras models, and exported as part of a Keras SavedModel.
In this tutorial, you will use the following four preprocessing layers to demonstrate how to perform preprocessing, structured data encoding, and feature engineering:
- tf.keras.layers.Normalization: Performs feature-wise normalization of input features.
- tf.keras.layers.CategoryEncoding: Turns integer categorical features into one-hot, multi-hot, or tf-idf dense representations.
- tf.keras.layers.StringLookup: Turns string categorical values into integer indices.
- tf.keras.layers.IntegerLookup: Turns integer categorical values into integer indices.
You can learn more about the available layers in the Working with preprocessing layers guide.
- For numerical features of the PetFinder.my mini dataset, you will use a tf.keras.layers.Normalization layer to standardize the distribution of the data.
- For categorical features, such as pet Types (Dog and Cat strings), you will transform them to multi-hot encoded tensors with tf.keras.layers.CategoryEncoding.
Numerical columns
For each numeric feature in the PetFinder.my mini dataset, you will use a tf.keras.layers.Normalization layer to standardize the distribution of the data.
Define a new utility function that returns a layer which applies feature-wise normalization to numerical features using that Keras preprocessing layer:
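A sketch of such a helper; the name get_normalization_layer and its signature are illustrative:

```python
def get_normalization_layer(name, dataset):
  # Create a Normalization layer for the feature.
  normalizer = layers.Normalization(axis=None)

  # Prepare a Dataset that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the statistics (mean and variance) of the data.
  normalizer.adapt(feature_ds)

  return normalizer
```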
Next, test the new function by calling it on the total uploaded pet photo features to normalize 'PhotoAmt':
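For example:

```python
photo_count_col = train_features['PhotoAmt']
layer = get_normalization_layer('PhotoAmt', train_ds)
layer(photo_count_col)
```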
Note: If you have many numeric features (hundreds, or more), it is more efficient to concatenate them first and use a single tf.keras.layers.Normalization layer.
Categorical columns
Pet Types in the dataset are represented as strings—Dogs and Cats—which need to be multi-hot encoded before being fed into the model. The Age feature is an integer categorical value that will likewise be mapped to integer indices and multi-hot encoded.
Define another new utility function that returns a layer which maps values from a vocabulary to integer indices and multi-hot encodes the features using the tf.keras.layers.StringLookup, tf.keras.layers.IntegerLookup, and tf.keras.layers.CategoryEncoding preprocessing layers:
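A sketch; the helper name and signature are again illustrative:

```python
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
  # Create a layer that turns strings into integer indices.
  if dtype == 'string':
    index = layers.StringLookup(max_tokens=max_tokens)
  # Otherwise, create a layer that turns integer values into integer indices.
  else:
    index = layers.IntegerLookup(max_tokens=max_tokens)

  # Prepare a `tf.data.Dataset` that only yields the feature.
  feature_ds = dataset.map(lambda x, y: x[name])

  # Learn the set of possible values and assign them a fixed integer index.
  index.adapt(feature_ds)

  # Encode the integer indices as a multi-hot vector.
  encoder = layers.CategoryEncoding(num_tokens=index.vocabulary_size())

  # Return a callable that chains the lookup and the encoding, so the
  # combination can be reused on different inputs.
  return lambda feature: encoder(index(feature))
```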
Test the get_category_encoding_layer function by calling it on pet 'Type' features to turn them into multi-hot encoded tensors:
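For example:

```python
test_type_col = train_features['Type']
test_type_layer = get_category_encoding_layer(name='Type',
                                              dataset=train_ds,
                                              dtype='string')
test_type_layer(test_type_col)
```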
Repeat the process on the pet 'Age' features:
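For example, capping the vocabulary at five tokens:

```python
test_age_col = train_features['Age']
test_age_layer = get_category_encoding_layer(name='Age',
                                             dataset=train_ds,
                                             dtype='int64',
                                             max_tokens=5)
test_age_layer(test_age_col)
```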
Preprocess selected features to train the model on
You have learned how to use several types of Keras preprocessing layers. Next, you will:

- Apply the preprocessing utility functions defined earlier on 13 numerical and categorical features from the PetFinder.my mini dataset.
- Add all the feature inputs to a list.
As mentioned in the beginning, to train the model, you will use the PetFinder.my mini dataset's numerical ('PhotoAmt', 'Fee') and categorical ('Age', 'Type', 'Color1', 'Color2', 'Gender', 'MaturitySize', 'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1') features.
Note: If your aim is to build an accurate model, try a larger dataset of your own, and think carefully about which features are the most meaningful to include, and how they should be represented.
Earlier, you used a small batch size to demonstrate the input pipeline. Let's now create a new input pipeline with a larger batch size of 256:
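For example:

```python
batch_size = 256
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)
```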
Normalize the numerical features (the number of pet photos and the adoption fee), and add them to one list of inputs called encoded_features:
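A sketch; the second list, all_inputs, collects the raw tf.keras.Input tensors so they can be passed to the model later:

```python
all_inputs = []
encoded_features = []

# Numerical features.
for header in ['PhotoAmt', 'Fee']:
  numeric_col = tf.keras.Input(shape=(1,), name=header)
  normalization_layer = get_normalization_layer(header, train_ds)
  encoded_numeric_col = normalization_layer(numeric_col)
  all_inputs.append(numeric_col)
  encoded_features.append(encoded_numeric_col)
```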
Turn the integer categorical values from the dataset (the pet age) into integer indices, perform multi-hot encoding, and add the resulting feature inputs to encoded_features:
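For example:

```python
age_col = tf.keras.Input(shape=(1,), name='Age', dtype='int64')

encoding_layer = get_category_encoding_layer(name='Age',
                                             dataset=train_ds,
                                             dtype='int64',
                                             max_tokens=5)
encoded_age_col = encoding_layer(age_col)
all_inputs.append(age_col)
encoded_features.append(encoded_age_col)
```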
Repeat the same step for the string categorical values:
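For example, looping over the string-valued columns:

```python
categorical_cols = ['Type', 'Color1', 'Color2', 'Gender', 'MaturitySize',
                    'FurLength', 'Vaccinated', 'Sterilized', 'Health', 'Breed1']

for header in categorical_cols:
  categorical_col = tf.keras.Input(shape=(1,), name=header, dtype='string')
  encoding_layer = get_category_encoding_layer(name=header,
                                               dataset=train_ds,
                                               dtype='string',
                                               max_tokens=5)
  encoded_categorical_col = encoding_layer(categorical_col)
  all_inputs.append(categorical_col)
  encoded_features.append(encoded_categorical_col)
```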
Create, compile, and train the model
The next step is to create a model using the Keras Functional API. For the first layer in your model, merge the list of feature inputs—encoded_features—into one vector via concatenation with tf.keras.layers.concatenate.
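A minimal sketch of such a model; the head (one 32-unit Dense layer with dropout, followed by a single-logit output) is one reasonable choice, not the only one:

```python
all_features = tf.keras.layers.concatenate(encoded_features)
x = tf.keras.layers.Dense(32, activation='relu')(all_features)
x = tf.keras.layers.Dropout(0.5)(x)
output = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(all_inputs, output)
```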
Configure the model with Keras Model.compile:
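For example, since the model outputs a single logit for binary classification:

```python
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
```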
Let's visualize the connectivity graph:
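For example, with tf.keras.utils.plot_model (this requires the pydot and graphviz packages):

```python
# Use `rankdir='LR'` to draw the graph horizontally.
tf.keras.utils.plot_model(model, show_shapes=True, rankdir='LR')
```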
Next, train and test the model:
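For example, training for 10 epochs (the epoch count is arbitrary):

```python
model.fit(train_ds, epochs=10, validation_data=val_ds)

loss, accuracy = model.evaluate(test_ds)
print('Accuracy', accuracy)
```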
Perform inference
The model you have developed can now classify a row from a CSV file directly after you've included the preprocessing layers inside the model itself.
You can now save and reload the Keras model with Model.save and tf.keras.models.load_model before performing inference on new data:
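For example (the file name is arbitrary):

```python
model.save('my_pet_classifier.keras')
reloaded_model = tf.keras.models.load_model('my_pet_classifier.keras')
```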
To get a prediction for a new sample, you can simply call the Keras Model.predict method. There are just two things you need to do:

- Wrap scalars into a list so as to have a batch dimension (Models only process batches of data, not single samples).
- Call tf.convert_to_tensor on each feature.
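A sketch with illustrative sample values:

```python
sample = {
    'Type': 'Cat',
    'Age': 3,
    'Breed1': 'Tabby',
    'Gender': 'Male',
    'Color1': 'Black',
    'Color2': 'White',
    'MaturitySize': 'Small',
    'FurLength': 'Short',
    'Vaccinated': 'No',
    'Sterilized': 'No',
    'Health': 'Healthy',
    'Fee': 100,
    'PhotoAmt': 2,
}

# Wrap each scalar in a list (batch dimension) and convert it to a tensor.
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = reloaded_model.predict(input_dict)
prob = tf.nn.sigmoid(predictions[0])

print("This particular pet had a %.1f percent probability "
      "of getting adopted." % (100 * prob))
```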
Note: You will typically have better results with deep learning with larger and more complex datasets. When working with a small dataset, such as the simplified PetFinder.my one, you can use a decision tree or a random forest as a strong baseline. The goal of this tutorial is to demonstrate the mechanics of working with structured data, so you have a starting point when working with your own datasets in the future.
Next steps
To learn more about classifying structured data, try working with other datasets. To improve accuracy during training and testing your models, think carefully about which features to include in your model and how they should be represented.
Below are some suggestions for datasets:
- TensorFlow Datasets: MovieLens: A set of movie ratings from a movie recommendation service.
- TensorFlow Datasets: Wine Quality: Two datasets related to red and white variants of the Portuguese "Vinho Verde" wine. You can also find the Red Wine Quality dataset on Kaggle.
- Kaggle: arXiv Dataset: A corpus of 1.7 million scholarly articles from arXiv, covering physics, computer science, math, statistics, electrical engineering, quantitative biology, and economics.