# Load external tfrecord with TFDS
If you have a `tf.train.Example` proto (inside `.tfrecord`, `.riegeli`,...) that has been generated by third-party tools and that you would like to load directly with the TFDS API, then this page is for you.
In order to load your `.tfrecord` files, you only need to:

*   Follow the TFDS naming convention.
*   Add metadata files (`dataset_info.json`, `features.json`) alongside your tfrecord files.
Limitations:

*   `tf.train.SequenceExample` is not supported, only `tf.train.Example`.
*   You need to be able to express the `tf.train.Example` in terms of `tfds.features` (see section below).
## File naming convention
TFDS supports defining a template for file names, which provides flexibility to use different file naming schemes. The template is represented by a `tfds.core.ShardedFileTemplate` and supports the following variables: `{DATASET}`, `{SPLIT}`, `{FILEFORMAT}`, `{SHARD_INDEX}`, `{NUM_SHARDS}`, and `{SHARD_X_OF_Y}`. For example, the default file naming scheme of TFDS is: `{DATASET}-{SPLIT}.{FILEFORMAT}-{SHARD_X_OF_Y}`. For MNIST, this means that file names look as follows:
*   `mnist-test.tfrecord-00000-of-00001`
*   `mnist-train.tfrecord-00000-of-00001`
## Add metadata
### Provide the feature structure
For TFDS to be able to decode the `tf.train.Example` proto, you need to provide the `tfds.features` structure matching your specs. For example:
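A minimal sketch, with illustrative field names and shapes (`image`, `label`, and `objects` are placeholders for your own fields):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

features = tfds.features.FeaturesDict({
    # Raw image bytes, decoded automatically by TFDS.
    'image': tfds.features.Image(shape=(256, 256, 3)),
    # Integer label, exposed to users with human-readable names.
    'label': tfds.features.ClassLabel(names=['dog', 'cat']),
    # Variable-length sequence of fixed-size tensors.
    'objects': tfds.features.Sequence({
        'camera/K': tfds.features.Tensor(shape=(3,), dtype=tf.float32),
    }),
})
```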
This corresponds to the following `tf.train.Example` specs:
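A sketch of those specs, assuming the feature structure above (the exact spec is what `features.tf_example_spec` produces, described below):

```python
# Approximate parse specs for the structure above; `allow_missing` is an
# assumption for the variable-length sequence field.
{
    'image': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
    'label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
    'objects/camera/K': tf.io.FixedLenSequenceFeature(
        shape=(3,), dtype=tf.float32, allow_missing=True),
}
```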
Specifying the features allows TFDS to automatically decode images, videos, etc. Like any other TFDS dataset, feature metadata (e.g. label names) will be exposed to the user (e.g. `info.features['label'].names`).
### If you control the generation pipeline
If you generate datasets outside of TFDS but still control the generation pipeline, you can use `tfds.features.FeatureConnector.serialize_example` to encode your data from `dict[np.ndarray]` to `tf.train.Example` proto `bytes`:
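For instance, a minimal sketch (the output path and the `all_examples` iterable are placeholders):

```python
with tf.io.TFRecordWriter('path/to/file.tfrecord') as writer:
  for example in all_examples:  # each `example` is a `dict[str, np.ndarray]`
    ex_bytes = features.serialize_example(example)
    writer.write(ex_bytes)
```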
This will ensure feature compatibility with TFDS.
Similarly, a `features.deserialize_example` method exists to decode the proto back into a feature dict, for example:
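A round-trip sketch, reusing `ex_bytes` from the snippet above:

```python
# Decode the serialized proto back into a decoded feature dict.
example = features.deserialize_example(ex_bytes)
```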
### If you don't control the generation pipeline
If you want to see how `tfds.features` are represented in a `tf.train.Example`, you can examine this in colab:
*   To translate `tfds.features` into the human-readable structure of the `tf.train.Example`, you can call `features.get_serialized_info()`.
*   To get the exact `FixedLenFeature`,... spec passed to `tf.io.parse_single_example`, you can use `spec = features.tf_example_spec` (both are sketched below).
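A quick sketch using the `features` object defined earlier:

```python
# Human-readable serialized structure of the `tf.train.Example`.
print(features.get_serialized_info())

# Exact spec passed to `tf.io.parse_single_example`.
spec = features.tf_example_spec
print(spec)
```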
Note: If you're using a custom feature connector, make sure to implement `to_json_content`/`from_json_content` and test with `self.assertFeature` (see the feature connector guide).
### Get statistics on splits
TFDS needs to know the exact number of examples within each shard. This is required for features like `len(ds)` or the subsplit API: `split='train[75%:]'`.
*   If you have this information, you can explicitly create a list of `tfds.core.SplitInfo` and skip to the next section (a sketch follows this list).
*   If you do not know this information, you can compute it using the `compute_split_info.py` script (or in your own script with `tfds.folder_dataset.compute_split_info`). It will launch a Beam pipeline which will read all shards in the given directory and compute the info.
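A sketch of an explicit `tfds.core.SplitInfo` list (split names and shard lengths are illustrative):

```python
split_infos = [
    tfds.core.SplitInfo(
        name='train',
        shard_lengths=[1024],  # Number of examples in each shard, in order.
        num_bytes=0,  # Total size of the split (set to 0 if unknown).
    ),
    tfds.core.SplitInfo(
        name='test',
        shard_lengths=[256],
        num_bytes=0,
    ),
]
```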
### Add metadata files
To automatically add the proper metadata files alongside your dataset, use `tfds.folder_dataset.write_metadata`:
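A minimal sketch (the path and filename template are illustrative; `features` and `split_infos` are the objects built in the previous sections):

```python
tfds.folder_dataset.write_metadata(
    data_dir='/path/to/my/dataset/1.0.0/',
    features=features,
    # Either the `out_dir` used by `compute_split_info` or an explicit
    # list of `tfds.core.SplitInfo` (see the previous section).
    split_infos=split_infos,
    # Optional custom file name template; pass None to use the
    # default TFDS template.
    filename_template='{SPLIT}-{SHARD_X_OF_Y}.{FILEFORMAT}',
)
```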
Once the function has been called on your dataset directory, the metadata files (`dataset_info.json`,...) have been added and your datasets are ready to be loaded with TFDS (see next section).
## Load dataset with TFDS
### Directly from folder
Once the metadata have been generated, datasets can be loaded using `tfds.builder_from_directory`, which returns a `tfds.core.DatasetBuilder` with the standard TFDS API (like `tfds.builder`):
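For example (the directory path and split are placeholders):

```python
builder = tfds.builder_from_directory('~/path/to/my_dataset/1.0.0/')

# Metadata are available as usual.
print(builder.info.splits['train'].num_examples)

# Construct the tf.data.Dataset pipeline.
ds = builder.as_dataset(split='train[75%:]', shuffle_files=True)
for ex in ds:
  ...
```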
### Directly from multiple folders
It is also possible to load data from multiple folders. This can happen, for example, in reinforcement learning when multiple agents are each generating a separate dataset and you want to load all of them together. Another use case is when a new dataset is produced on a regular basis, e.g. a new dataset per day, and you want to load data from a date range.
To load data from multiple folders, use `tfds.builder_from_directories`, which returns a `tfds.core.DatasetBuilder` with the standard TFDS API (like `tfds.builder`):
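A sketch following the reinforcement-learning example above (directory names are illustrative):

```python
builder = tfds.builder_from_directories(builder_dirs=[
    '~/path/my_dataset/agent1/1.0.0/',
    '~/path/my_dataset/agent2/1.0.0/',
    '~/path/my_dataset/agent3/1.0.0/',
])

# The splits are merged across the given directories.
ds = builder.as_dataset(split='train', shuffle_files=True)
```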
Note: Each folder must have its own metadata, because it contains information about the splits.
### Folder structure (optional)
For better compatibility with TFDS, you can organize your data as `<data_dir>/<dataset_name>[/<dataset_config>]/<dataset_version>`. For example:
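One possible layout (dataset and config names are illustrative):

```
data_dir/
    dataset0/
        1.0.0/
        1.0.1/
    dataset1/
        config0/
            2.0.0/
        config1/
            2.0.0/
```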
This will make your datasets compatible with the `tfds.load`/`tfds.builder` API, simply by providing `data_dir/`:
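A sketch using the layout above:

```python
ds0 = tfds.load('dataset0', data_dir='data_dir/')
ds1 = tfds.load('dataset1/config0', data_dir='data_dir/')
```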