# Load external tfrecord with TFDS
If you have a `tf.train.Example` proto (inside `.tfrecord`, `.riegeli`,...) that has been generated by third-party tools and that you would like to load directly with the TFDS API, then this page is for you.
In order to load your `.tfrecord` files, you only need to:

*   Follow the TFDS naming convention.
*   Add metadata files (`dataset_info.json`, `features.json`) alongside your tfrecord files.
Limitations:

*   `tf.train.SequenceExample` is not supported, only `tf.train.Example`.
*   You need to be able to express the `tf.train.Example` in terms of `tfds.features` (see section below).
## File naming convention
TFDS supports defining a template for file names, which provides flexibility to use different file naming schemes. The template is represented by a `tfds.core.ShardedFileTemplate` and supports the following variables: `{DATASET}`, `{SPLIT}`, `{FILEFORMAT}`, `{SHARD_INDEX}`, `{NUM_SHARDS}`, and `{SHARD_X_OF_Y}`. For example, the default file naming scheme of TFDS is: `{DATASET}-{SPLIT}.{FILEFORMAT}-{SHARD_X_OF_Y}`. For MNIST, this means that file names look as follows:
*   `mnist-test.tfrecord-00000-of-00001`
*   `mnist-train.tfrecord-00000-of-00001`
## Add metadata
### Provide the feature structure
For TFDS to be able to decode the `tf.train.Example` proto, you need to provide the `tfds.features` structure matching your specs. For example:
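A minimal sketch, with illustrative field names and shapes (`image`, `label`, and `objects` are placeholders for your own fields):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

features = tfds.features.FeaturesDict({
    # Raw image bytes, decoded automatically by TFDS.
    'image': tfds.features.Image(shape=(256, 256, 3)),
    # Integer label, exposed to users with human-readable names.
    'label': tfds.features.ClassLabel(names=['dog', 'cat']),
    # Variable-length sequence of fixed-size tensors.
    'objects': tfds.features.Sequence({
        'camera/K': tfds.features.Tensor(shape=(3,), dtype=tf.float32),
    }),
})
```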
This corresponds to the following `tf.train.Example` specs:
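A sketch of those specs, assuming the feature structure above (the exact spec is what `features.tf_example_spec` produces, described below):

```python
# Approximate parse specs for the structure above; `allow_missing` is an
# assumption for the variable-length sequence field.
{
    'image': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
    'label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
    'objects/camera/K': tf.io.FixedLenSequenceFeature(
        shape=(3,), dtype=tf.float32, allow_missing=True),
}
```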
Specifying the features allows TFDS to automatically decode images, videos, etc. Like any other TFDS dataset, feature metadata (e.g. label names) will be exposed to the user (e.g. `info.features['label'].names`).
### If you control the generation pipeline
If you generate datasets outside of TFDS but still control the generation pipeline, you can use `tfds.features.FeatureConnector.serialize_example` to encode your data from `dict[np.ndarray]` to `tf.train.Example` proto `bytes`:
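For instance, a minimal sketch (the output path and the `all_examples` iterable are placeholders):

```python
with tf.io.TFRecordWriter('path/to/file.tfrecord') as writer:
  for example in all_examples:  # each `example` is a `dict[str, np.ndarray]`
    ex_bytes = features.serialize_example(example)
    writer.write(ex_bytes)
```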
This will ensure feature compatibility with TFDS.
Similarly, a `features.deserialize_example` method exists to decode the proto back into a feature dict, for example:
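A round-trip sketch, reusing `ex_bytes` from the snippet above:

```python
# Decode the serialized proto back into a decoded feature dict.
example = features.deserialize_example(ex_bytes)
```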
### If you don't control the generation pipeline
If you want to see how `tfds.features` are represented in a `tf.train.Example`, you can examine this in colab:
*   To translate `tfds.features` into the human-readable structure of the `tf.train.Example`, you can call `features.get_serialized_info()`.
*   To get the exact `FixedLenFeature`,... spec passed to `tf.io.parse_single_example`, you can use `spec = features.tf_example_spec` (both are sketched below).
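A quick sketch using the `features` object defined earlier:

```python
# Human-readable serialized structure of the `tf.train.Example`.
print(features.get_serialized_info())

# Exact spec passed to `tf.io.parse_single_example`.
spec = features.tf_example_spec
print(spec)
```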
Note: If you're using a custom feature connector, make sure to implement `to_json_content`/`from_json_content` and test with `self.assertFeature` (see the feature connector guide).
### Get statistics on splits
TFDS needs to know the exact number of examples within each shard. This is required for features like `len(ds)` or the subsplit API: `split='train[75%:]'`.
*   If you have this information, you can explicitly create a list of `tfds.core.SplitInfo` and skip to the next section (a sketch follows this list).
*   If you do not know this information, you can compute it using the `compute_split_info.py` script (or in your own script with `tfds.folder_dataset.compute_split_info`). It will launch a Beam pipeline which will read all shards in the given directory and compute the info.
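A sketch of an explicit `tfds.core.SplitInfo` list (split names and shard lengths are illustrative):

```python
split_infos = [
    tfds.core.SplitInfo(
        name='train',
        shard_lengths=[1024],  # Number of examples in each shard, in order.
        num_bytes=0,  # Total size of the split (set to 0 if unknown).
    ),
    tfds.core.SplitInfo(
        name='test',
        shard_lengths=[256],
        num_bytes=0,
    ),
]
```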
### Add metadata files
To automatically add the proper metadata files alongside your dataset, use `tfds.folder_dataset.write_metadata`:
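A minimal sketch (the path and filename template are illustrative; `features` and `split_infos` are the objects built in the previous sections):

```python
tfds.folder_dataset.write_metadata(
    data_dir='/path/to/my/dataset/1.0.0/',
    features=features,
    # Either the `out_dir` used by `compute_split_info` or an explicit
    # list of `tfds.core.SplitInfo` (see the previous section).
    split_infos=split_infos,
    # Optional custom file name template; pass None to use the
    # default TFDS template.
    filename_template='{SPLIT}-{SHARD_X_OF_Y}.{FILEFORMAT}',
)
```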
Once the function has been called on your dataset directory, the metadata files (`dataset_info.json`,...) have been added and your datasets are ready to be loaded with TFDS (see next section).
## Load dataset with TFDS
### Directly from folder
Once the metadata have been generated, datasets can be loaded using `tfds.builder_from_directory`, which returns a `tfds.core.DatasetBuilder` with the standard TFDS API (like `tfds.builder`):
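For example (the directory path and split are placeholders):

```python
builder = tfds.builder_from_directory('~/path/to/my_dataset/1.0.0/')

# Metadata are available as usual.
print(builder.info.splits['train'].num_examples)

# Construct the tf.data.Dataset pipeline.
ds = builder.as_dataset(split='train[75%:]', shuffle_files=True)
for ex in ds:
  ...
```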
### Directly from multiple folders
It is also possible to load data from multiple folders. This can happen, for example, in reinforcement learning when multiple agents are each generating a separate dataset and you want to load all of them together. Another use case is when a new dataset is produced on a regular basis, e.g. a new dataset per day, and you want to load data from a date range.
To load data from multiple folders, use `tfds.builder_from_directories`, which returns a `tfds.core.DatasetBuilder` with the standard TFDS API (like `tfds.builder`):
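A sketch following the reinforcement-learning example above (directory names are illustrative):

```python
builder = tfds.builder_from_directories(builder_dirs=[
    '~/path/my_dataset/agent1/1.0.0/',
    '~/path/my_dataset/agent2/1.0.0/',
    '~/path/my_dataset/agent3/1.0.0/',
])

# The splits are merged across the given directories.
ds = builder.as_dataset(split='train', shuffle_files=True)
```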
Note: Each folder must have its own metadata, because it contains information about the splits.
### Folder structure (optional)
For better compatibility with TFDS, you can organize your data as `<data_dir>/<dataset_name>[/<dataset_config>]/<dataset_version>`. For example:
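One possible layout (dataset and config names are illustrative):

```
data_dir/
    dataset0/
        1.0.0/
        1.0.1/
    dataset1/
        config0/
            2.0.0/
        config1/
            2.0.0/
```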
This will make your datasets compatible with the `tfds.load`/`tfds.builder` API, simply by providing `data_dir/`:
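A sketch using the layout above:

```python
ds0 = tfds.load('dataset0', data_dir='data_dir/')
ds1 = tfds.load('dataset1/config0', data_dir='data_dir/')
```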