Copyright 2018 The TensorFlow Authors.
Load text
This tutorial demonstrates two ways to load and preprocess text.
First, you will use Keras utilities and preprocessing layers. These include tf.keras.utils.text_dataset_from_directory to turn data into a tf.data.Dataset and tf.keras.layers.TextVectorization for data standardization, tokenization, and vectorization. If you are new to TensorFlow, you should start with these.

Then, you will use lower-level utilities like tf.data.TextLineDataset to load text files, and TensorFlow Text APIs, such as text.UnicodeScriptTokenizer and text.case_fold_utf8, to preprocess the data for finer-grain control.
Example 1: Predict the tag for a Stack Overflow question
As a first example, you will download a dataset of programming questions from Stack Overflow. Each question ("How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.
Download and explore the dataset
Begin by downloading the Stack Overflow dataset using tf.keras.utils.get_file, and exploring the directory structure:
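A minimal sketch of the download step might look like the following; the archive URL, cache directory, and the dataset_dir name are assumptions based on the publicly hosted copy of the dataset:

```python
import pathlib
import tensorflow as tf

# Assumed location of the Stack Overflow questions archive.
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = tf.keras.utils.get_file(
    origin=data_url,
    untar=True,                 # extract the .tar.gz after downloading
    cache_dir='stack_overflow',
    cache_subdir='')
dataset_dir = pathlib.Path(dataset_dir).parent

# List the extracted contents (expect 'train' and 'test' directories).
print(list(dataset_dir.iterdir()))
```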
The train/csharp, train/java, train/python, and train/javascript directories contain many text files, each of which is a Stack Overflow question.
Print an example file and inspect the data:
Load the dataset
Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the tf.keras.utils.text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you're new to tf.data, it's a powerful collection of tools for building input pipelines. (Learn more in the tf.data: Build TensorFlow input pipelines guide.)

The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:
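Roughly, one subdirectory per class, each containing one text file per example (the class names below match this dataset; the file names are placeholders):

```
train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
```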
When running a machine learning experiment, it is a best practice to divide your dataset into three splits: training, validation, and test.
The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.
Create a validation set using an 80:20 split of the training data by using tf.keras.utils.text_dataset_from_directory with validation_split set to 0.2 (i.e. 20%):
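A sketch of that call, reusing dataset_dir from the download step above and naming the result raw_train_ds; the batch size and seed are arbitrary choices:

```python
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    dataset_dir / 'train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
```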
As the previous cell output suggests, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. You will learn in a moment that you can train a model by passing a tf.data.Dataset directly to Model.fit.
First, iterate over the dataset and print out a few examples, to get a feel for the data.
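For example, something along these lines prints a few questions with their integer labels:

```python
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(3):
        print('Question:', text_batch.numpy()[i])
        print('Label:', label_batch.numpy()[i])
```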
Note: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words Python, CSharp, JavaScript, or Java in the programming question with the word blank.
The labels are 0, 1, 2, or 3. To check which of these correspond to which string label, you can inspect the class_names property on the dataset:
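For example:

```python
for i, label in enumerate(raw_train_ds.class_names):
    print('Label', i, 'corresponds to', label)
```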
Next, you will create a validation and a test set using tf.keras.utils.text_dataset_from_directory. You will use the remaining 1,600 examples from the training set for validation.
Note: When using the validation_split and subset arguments of tf.keras.utils.text_dataset_from_directory, make sure to either specify a random seed or pass shuffle=False, so that the validation and training splits have no overlap.
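A sketch of both calls, reusing the same seed and split fraction as the training set so the splits stay disjoint:

```python
# Validation set: the remaining 20% of the training directory.
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    dataset_dir / 'train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

# Test set: the separate 'test' directory, no split needed.
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    dataset_dir / 'test',
    batch_size=batch_size)
```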
Prepare the dataset for training
Next, you will standardize, tokenize, and vectorize the data using the tf.keras.layers.TextVectorization layer.
Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
Vectorization refers to converting tokens into numbers so they can be fed into a neural network.
All of these tasks can be accomplished with this layer. (You can learn more about each of these in the tf.keras.layers.TextVectorization API docs.)
Note that:
The default standardization converts text to lowercase and removes punctuation (standardize='lower_and_strip_punctuation').
The default tokenizer splits on whitespace (split='whitespace').
The default vectorization mode is 'int' (output_mode='int'). This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, such as 'binary', to build bag-of-words models.
You will build two models to learn more about standardization, tokenization, and vectorization with TextVectorization:

First, you will use the 'binary' vectorization mode to build a bag-of-words model.
Then, you will use the 'int' mode with a 1D ConvNet.

For the 'int' mode, in addition to maximum vocabulary size, you need to set an explicit maximum sequence length (MAX_SEQUENCE_LENGTH), which will cause the layer to pad or truncate sequences to exactly output_sequence_length values:
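A sketch of the two layers; the vocabulary size and sequence length below are arbitrary example values:

```python
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 250

binary_vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')  # newer Keras versions call this mode 'multi_hot'

int_vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
```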
Next, call TextVectorization.adapt to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index of strings to integers.
Note: It's important to only use your training data when calling TextVectorization.adapt, as using the test set would leak information.
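For example, adapt both layers on a text-only view of the training data:

```python
# Make a text-only dataset (drop the labels), then call adapt.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
```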
Print the result of using these layers to preprocess data:
As shown above, TextVectorization's 'binary' mode returns an array denoting which tokens exist at least once in the input, while the 'int' mode replaces each token by an integer, thus preserving their order.

You can look up the token (string) that each integer corresponds to by calling TextVectorization.get_vocabulary on the layer:
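For example (the index shown is arbitrary):

```python
vocab = int_vectorize_layer.get_vocabulary()
print('Token at index 2:', vocab[2])
print('Vocabulary size:', len(vocab))
```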
You are nearly ready to train your model.
As a final preprocessing step, you will apply the TextVectorization layers you created earlier to the training, validation, and test sets:
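A sketch of that step; the helper functions below expand each string batch to shape (batch, 1) before calling the layer:

```python
def binary_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return binary_vectorize_layer(text), label

def int_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return int_vectorize_layer(text), label

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)
```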
Configure the dataset for performance
These are two important methods you should use when loading data to make sure that I/O does not become blocking:

Dataset.cache keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
Dataset.prefetch overlaps data preprocessing and model execution while training. (A helper applying both is sketched after this list.)

You can learn more about both methods, as well as how to cache data to disk, in the Prefetching section of the Better performance with the tf.data API guide.
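A small helper, applied to each split, might look like this:

```python
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
```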
Train the model
It's time to create your neural network.
For the 'binary' vectorized data, define a simple bag-of-words linear model, then configure and train it:
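A minimal sketch (a single Dense layer over the multi-hot vector; the epoch count is arbitrary):

```python
binary_model = tf.keras.Sequential([tf.keras.layers.Dense(4)])

binary_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)
```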
Next, you will use the 'int' vectorized layer to build a 1D ConvNet:
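A sketch of such a ConvNet; the embedding width, filter count, kernel size, and epoch count are arbitrary example values:

```python
def create_model(vocab_size, num_labels):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),
        tf.keras.layers.Conv1D(64, 5, padding='valid',
                               activation='relu', strides=2),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(num_labels)
    ])

# vocab_size is VOCAB_SIZE + 1 because index 0 is reserved for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)
```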
Compare the two models:
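For example, by printing their summaries:

```python
print('Linear model on binary vectorized data:')
binary_model.summary()

print('ConvNet model on int vectorized data:')
int_model.summary()
```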
Evaluate both models on the test data:
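For example:

```python
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print('Binary model accuracy: {:2.2%}'.format(binary_accuracy))
print('Int model accuracy: {:2.2%}'.format(int_accuracy))
```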
Note: This example dataset represents a rather simple classification problem. More complex datasets and problems bring out subtle but significant differences in preprocessing strategies and model architectures. Be sure to try out different hyperparameters and epochs to compare various approaches.
Export the model
In the code above, you applied tf.keras.layers.TextVectorization to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model.
To do so, you can create a new model using the weights you have just trained:
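One way to do this, assuming the binary layer and model from the sketches above, is to chain them in a new Sequential model and add a final probability activation (softmax here, since there are four classes):

```python
export_model = tf.keras.Sequential([
    binary_vectorize_layer,
    binary_model,
    tf.keras.layers.Activation('softmax')
])

export_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Evaluate on the raw (string) test set to confirm the weights carried over.
loss, accuracy = export_model.evaluate(raw_test_ds)
print('Accuracy: {:2.2%}'.format(accuracy))
```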
Now, your model can take raw strings as input and predict a score for each label using Model.predict. Define a function to find the label with the maximum score:
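A sketch of such a helper, using the class_names recorded on the raw training dataset:

```python
def get_string_labels(predicted_scores_batch):
    # Pick the highest-scoring class per example, then map it to its name.
    predicted_int_labels = tf.math.argmax(predicted_scores_batch, axis=1)
    predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
    return predicted_labels
```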
Run inference on new data
Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment, and reduces the potential for train/test skew.
There is a performance difference to keep in mind when choosing where to apply tf.keras.layers.TextVectorization. Using it outside of your model enables you to do asynchronous CPU processing and buffering of your data when training on GPU. So, if you're training your model on the GPU, you probably want to go with this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you're ready to prepare for deployment.
Visit the Save and load models tutorial to learn more about saving models.
Example 2: Predict the author of Iliad translations
The following provides an example of using tf.data.TextLineDataset to load examples from text files, and TensorFlow Text to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.
Download and explore the dataset
The texts of the three translations are by William Cowper, Edward, Earl of Derby, and Samuel Butler.
The text files used in this tutorial have undergone some typical preprocessing tasks, like removing document headers and footers, line numbers, and chapter titles.
Download these lightly munged files locally:
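A sketch of the download loop; the storage URL and the three file names (one per translator) are assumptions:

```python
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
print(list(parent_dir.iterdir()))
```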
Load the dataset
Previously, with tf.keras.utils.text_dataset_from_directory all contents of a file were treated as a single example. Here, you will use tf.data.TextLineDataset, which is designed to create a tf.data.Dataset from a text file in which each example is a line of text from the original file. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use Dataset.map to apply a labeler function to each one. This will iterate over every example in the dataset, returning (example, label) pairs.
Next, you'll combine these labeled datasets into a single dataset using Dataset.concatenate, and shuffle it with Dataset.shuffle:
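A sketch of the labeling, concatenation, and shuffling steps (the shuffle buffer size is an arbitrary choice):

```python
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []
for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir / file_name))
    labeled_dataset = lines_dataset.map(lambda ex, idx=i: labeler(ex, idx))
    labeled_data_sets.append(labeled_dataset)

BUFFER_SIZE = 50000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
```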
Print out a few examples as before. The dataset hasn't been batched yet, hence each entry in all_labeled_data corresponds to one data point:
Prepare the dataset for training
Instead of using tf.keras.layers.TextVectorization to preprocess the text dataset, you will now use the TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary, and use tf.lookup.StaticVocabularyTable to map tokens to integers to feed to the model. (Learn more about TensorFlow Text.)
Define a function to convert the text to lower-case and tokenize it:
TensorFlow Text provides various tokenizers. In this example, you will use the text.UnicodeScriptTokenizer to tokenize the dataset.
You will use Dataset.map to apply the tokenization to the dataset (a sketch follows this list).
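A sketch of that step, using the TensorFlow Text package (imported here as tf_text):

```python
import tensorflow_text as tf_text

tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)
```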
You can iterate over the dataset and print out a few tokenized examples:
Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens:
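One way to do this is to count token frequencies eagerly with a Counter; this is only a sketch and can be slow for large corpora:

```python
import collections

tokenized_ds = configure_dataset(tokenized_ds)

vocab_count = collections.Counter()
for toks in tokenized_ds.as_numpy_iterator():
    vocab_count.update(toks)

vocab = [token for token, count in vocab_count.most_common(VOCAB_SIZE)]
vocab_size = len(vocab)
print('Vocab size:', vocab_size)
print('First five vocab entries:', vocab[:5])
```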
To convert the tokens into integers, use the vocab set to create a tf.lookup.StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2]. As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.
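A sketch of the lookup-table construction (num_oov_buckets=1 gives out-of-vocabulary tokens a single catch-all bucket):

```python
keys = vocab
values = list(range(2, len(vocab) + 2))  # Shift by 2 to keep 0 and 1 free.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
```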
Finally, define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:
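A sketch of that function, combining the tokenizer and table from above:

```python
def preprocess_text(text, label):
    standardized = tf_text.case_fold_utf8(text)
    tokenized = tokenizer.tokenize(standardized)
    vectorized = vocab_table.lookup(tokenized)
    return vectorized, label
```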
You can try this on a single example to print the output:
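For example:

```python
example_text, example_label = next(iter(all_labeled_data))
print('Sentence:', example_text.numpy())
vectorized_text, _ = preprocess_text(example_text, example_label)
print('Vectorized sentence:', vectorized_text.numpy())
```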
Now run the preprocess function on the dataset using Dataset.map:
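For example:

```python
all_encoded_data = all_labeled_data.map(preprocess_text)
```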
Split the dataset into training and test sets
The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words.
tf.data.Dataset supports splitting and padded-batching datasets:
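A sketch using Dataset.skip, Dataset.take, and Dataset.padded_batch; the validation size and batch size are arbitrary example values:

```python
VALIDATION_SIZE = 5000
BATCH_SIZE = 64

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

# padded_batch pads every example in a batch to the longest example in it.
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)
```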
Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays.
To illustrate this:
Since you use 0 for padding and 1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two:
Configure the datasets for better performance as before:
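For example, reusing the configure_dataset helper from Example 1, and accounting for the two reserved indices:

```python
vocab_size += 2  # account for the padding (0) and OOV (1) indices

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)
```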
Train the model
You can train a model on this dataset as before:
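For example, reusing the create_model sketch from Example 1 with three output labels, one per translator (the epoch count is arbitrary):

```python
model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)
```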
Export the model
To make the model capable of taking raw strings as input, you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function. Since you have already trained a vocabulary, you can use TextVectorization.set_vocabulary instead of TextVectorization.adapt, which would train a new vocabulary from scratch.
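A sketch of the exported pipeline; passing the custom standardize and split callables, and ending with a softmax, are assumptions about how to wire the pieces together:

```python
preprocess_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,   # same standardization as before
    split=tokenizer.tokenize,             # same tokenizer as before
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
preprocess_layer.set_vocabulary(vocab)

export_model = tf.keras.Sequential([
    preprocess_layer,
    model,
    tf.keras.layers.Activation('softmax')
])

export_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])
```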
The loss and accuracy for the model on the encoded validation set and the exported model on the raw validation set are the same, as expected.
Run inference on new data
Download more datasets using TensorFlow Datasets (TFDS)
You can download many more datasets from TensorFlow Datasets.
In this example, you will use the IMDB Large Movie Review dataset to train a model for sentiment classification:
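A sketch of loading the IMDB reviews dataset with TFDS, split into 80% training and 20% validation (the split percentages are an example choice):

```python
import tensorflow_datasets as tfds

train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)
```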
Print a few examples:
You can now preprocess the data and train a model as before.
Note: You will use tf.keras.losses.BinaryCrossentropy instead of tf.keras.losses.SparseCategoricalCrossentropy for your model, since this is a binary classification problem.
Prepare the dataset for training
Create, configure and train the model
Export the model
Conclusion
This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore additional text preprocessing tutorials from TensorFlow Text.
You can also find new datasets on TensorFlow Datasets. And, to learn more about tf.data, check out the guide on building input pipelines.