
TensorFlow Datasets

TFDS provides a collection of ready-to-use datasets for use with TensorFlow, JAX, and other machine learning frameworks.

It handles downloading and preparing the data deterministically and constructing a tf.data.Dataset (or np.array).

Note: Do not confuse TFDS (this library) with tf.data (the TensorFlow API to build efficient data pipelines). TFDS is a high-level wrapper around tf.data. If you're not familiar with this API, we encourage you to read the official tf.data guide first.

Copyright 2018 The TensorFlow Datasets Authors, Licensed under the Apache License, Version 2.0

Installation

TFDS exists in two packages:

  • pip install tensorflow-datasets: The stable version, released every few months.

  • pip install tfds-nightly: Released every day, contains the last versions of the datasets.

This colab uses tfds-nightly:

!pip install -q tfds-nightly tensorflow matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

Find available datasets

All dataset builders are subclasses of tfds.core.DatasetBuilder. To get the list of available builders, use tfds.list_builders() or look at our catalog.

tfds.list_builders()

Load a dataset

tfds.load

The easiest way of loading a dataset is tfds.load. It will:

  1. Download the data and save it as tfrecord files.

  2. Load the tfrecord and create the tf.data.Dataset.

ds = tfds.load('mnist', split='train', shuffle_files=True)
assert isinstance(ds, tf.data.Dataset)
print(ds)

Some common arguments (combined in the sketch after this list):

  • split=: Which split to read (e.g. 'train', ['train', 'test'], 'train[80%:]',...). See our split API guide.

  • shuffle_files=: Control whether to shuffle the files between each epoch (TFDS stores big datasets in multiple smaller files).

  • data_dir=: Location where the dataset is saved (defaults to ~/tensorflow_datasets/)

  • with_info=True: Returns the tfds.core.DatasetInfo containing dataset metadata

  • download=False: Disable download
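
As an illustration, here is a single call combining the arguments above; the split slice is an arbitrary choice and data_dir is simply set to its default:

ds, info = tfds.load(
    'mnist',
    split='train[80%:]',               # Last 20% of the train split
    shuffle_files=True,                # Shuffle files between epochs
    data_dir='~/tensorflow_datasets',  # Default location, shown explicitly here
    with_info=True,                    # Also return the tfds.core.DatasetInfo
)
print(ds)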

tfds.builder

tfds.load is a thin wrapper around tfds.core.DatasetBuilder. You can get the same output using the tfds.core.DatasetBuilder API:

builder = tfds.builder('mnist')
# 1. Create the tfrecord files (no-op if already exists)
builder.download_and_prepare()
# 2. Load the `tf.data.Dataset`
ds = builder.as_dataset(split='train', shuffle_files=True)
print(ds)

tfds build CLI

If you want to generate a specific dataset, you can use the tfds command line. For example:

tfds build mnist

See the doc for available flags.

Iterate over a dataset

As dict

By default, the tf.data.Dataset object contains a dict of tf.Tensors:

ds = tfds.load('mnist', split='train')
ds = ds.take(1)  # Only take a single example

for example in ds:  # example is `{'image': tf.Tensor, 'label': tf.Tensor}`
  print(list(example.keys()))
  image = example["image"]
  label = example["label"]
  print(image.shape, label)

To find out the dict key names and structure, look at the dataset documentation in our catalog. For example: mnist documentation.

As tuple (as_supervised=True)

By using as_supervised=True, you can get a tuple (features, label) instead for supervised datasets.

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in ds:  # example is (image, label)
  print(image.shape, label)

As numpy (tfds.as_numpy)

Use tfds.as_numpy to convert:

  • tf.Tensor -> np.array

  • tf.data.Dataset -> Iterator[Tree[np.array]] (Tree can be an arbitrarily nested Dict or Tuple)

ds = tfds.load('mnist', split='train', as_supervised=True)
ds = ds.take(1)

for image, label in tfds.as_numpy(ds):
  print(type(image), type(label), label)

As batched tf.Tensor (batch_size=-1)

By using batch_size=-1, you can load the full dataset in a single batch.

This can be combined with as_supervised=True and tfds.as_numpy to get the data as (np.array, np.array):

image, label = tfds.as_numpy(tfds.load(
    'mnist',
    split='test',
    batch_size=-1,
    as_supervised=True,
))

print(type(image), image.shape)

Be careful that your dataset can fit in memory, and that all examples have the same shape.

Benchmark your datasets

Benchmarking a dataset is a simple tfds.benchmark call on any iterable (e.g. tf.data.Dataset, tfds.as_numpy,...).

ds = tfds.load('mnist', split='train')
ds = ds.batch(32).prefetch(1)

tfds.benchmark(ds, batch_size=32)
tfds.benchmark(ds, batch_size=32)  # Second epoch much faster due to auto-caching

  • Do not forget to normalize the results per batch size with the batch_size= kwarg.

  • In the summary, the first warmup batch is separated from the other ones to capture tf.data.Dataset extra setup time (e.g. buffers initialization,...).

  • Notice how the second iteration is much faster due to TFDS auto-caching.

  • tfds.benchmark returns a tfds.core.BenchmarkResult which can be inspected for further analysis (see the sketch below).
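
A quick sketch of capturing the result; the stats attribute used below is an assumption about the BenchmarkResult fields, so check the API reference:

result = tfds.benchmark(ds, batch_size=32)
print(result)
# Assumption: `result.stats` is a pandas DataFrame with the per-batch timings.
print(result.stats)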

Build end-to-end pipeline

To go further, you can look at:

  • Our end-to-end Keras example to see a full training pipeline (with batching, shuffling,...); a condensed sketch follows this list.

  • Our performance guide to improve the speed of your pipelines (tip: use tfds.benchmark(ds) to benchmark your datasets).
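
For reference, here is a condensed sketch of such a training pipeline; the preprocessing mirrors the steps above, while the model and hyperparameters are illustrative only, not taken from the linked example:

ds_train = tfds.load('mnist', split='train', as_supervised=True, shuffle_files=True)

def normalize_img(image, label):
  # Scale uint8 pixel values to floats in [0, 1].
  return tf.cast(image, tf.float32) / 255., label

ds_train = (
    ds_train
    .map(normalize_img, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                     # Cache after preprocessing
    .shuffle(10_000)             # Shuffle examples (not only files)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)  # Overlap preprocessing and training
)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(ds_train, epochs=1)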

Visualization

tfds.as_dataframe

tf.data.Dataset objects can be converted to pandas.DataFrame with tfds.as_dataframe to be visualized on Colab.

  • Add the tfds.core.DatasetInfo as second argument of tfds.as_dataframe to visualize images, audio, texts, videos,...

  • Use ds.take(x) to only display the first x examples. pandas.DataFrame will load the full dataset in memory and can be very expensive to display.

ds, info = tfds.load('mnist', split='train', with_info=True)

tfds.as_dataframe(ds.take(4), info)

tfds.show_examples

tfds.show_examples returns a matplotlib.figure.Figure (only image datasets supported now):

ds, info = tfds.load('mnist', split='train', with_info=True)

fig = tfds.show_examples(ds, info)

Access the dataset metadata

All builders include a tfds.core.DatasetInfo object containing the dataset metadata.

It can be accessed through:

  • The tfds.load API:

ds, info = tfds.load('mnist', with_info=True)
  • The tfds.core.DatasetBuilder API:

builder = tfds.builder('mnist')
info = builder.info

The dataset info contains additional information about the dataset (version, citation, homepage, description,...).

print(info)

Features metadata (label names, image shape,...)

Access the tfds.features.FeatureDict:

info.features

Number of classes, label names:

print(info.features["label"].num_classes) print(info.features["label"].names) print(info.features["label"].int2str(7)) # Human readable version (8 -> 'cat') print(info.features["label"].str2int('7'))

Shapes, dtypes:

print(info.features.shape)
print(info.features.dtype)
print(info.features['image'].shape)
print(info.features['image'].dtype)

Split metadata (e.g. split names, number of examples,...)

Access the tfds.core.SplitDict:

print(info.splits)

Available splits:

print(list(info.splits.keys()))

Get info on an individual split:

print(info.splits['train'].num_examples)
print(info.splits['train'].filenames)
print(info.splits['train'].num_shards)

It also works with the subsplit API:

print(info.splits['train[15%:75%]'].num_examples)
print(info.splits['train[15%:75%]'].file_instructions)

Troubleshooting

Manual download (if download fails)

If the download fails for some reason (e.g. you are offline,...), you can always manually download the data yourself and place it in the manual_dir (defaults to ~/tensorflow_datasets/downloads/manual/).

To find out which URLs to download, look into the dataset's source code.
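
As a sketch, once the files are in place you can point the builder at the manual directory via tfds.download.DownloadConfig ('my_dataset' below is a placeholder name):

# Sketch only: 'my_dataset' is a placeholder for a dataset requiring manual download.
builder = tfds.builder('my_dataset')
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        manual_dir='~/tensorflow_datasets/downloads/manual/',
    )
)
ds = builder.as_dataset(split='train')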

Fixing NonMatchingChecksumError

TFDS ensures determinism by validating the checksums of downloaded URLs. If NonMatchingChecksumError is raised, it might indicate:

  • The website may be down (e.g. 503 status code). Please check the URL.

  • For Google Drive URLs, try again later as Drive sometimes rejects downloads when too many people access the same URL. See bug.

  • The original dataset files may have been updated. In this case the TFDS dataset builder should be updated. Please open a new GitHub issue or PR:

    • Register the new checksums with tfds build --register_checksums

    • If needed, update the dataset generation code.

    • Update the dataset VERSION

    • Update the dataset RELEASE_NOTES: What caused the checksums to change? Did some examples change?

    • Make sure the dataset can still be built.

    • Send us a PR

Note: You can also inspect the downloaded files in ~/tensorflow_datasets/downloads/.

Citation

If you're using tensorflow-datasets for a paper, please include the following citation, in addition to any citation specific to the datasets used (which can be found in the dataset catalog).

@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets}},
}