Copyright 2023 The TensorFlow Datasets Authors.
TFDS for Jax and PyTorch
TFDS has always been framework-agnostic. For instance, you can easily load datasets in NumPy format for usage in Jax and PyTorch.
TensorFlow and its data loading solution (tf.data) are first-class citizens in our API by design.
We extended TFDS to support TensorFlow-less, NumPy-only data loading. This can be convenient for usage in ML frameworks such as Jax and PyTorch. Indeed, for those users, TensorFlow can:
reserve GPU/TPU memory;
increase build time in CI/CD;
take time to import at runtime.
TensorFlow is no longer a dependency to read datasets.
ML pipelines need a data loader to load examples, decode them, and present them to the model. Data loaders use the "source/sampler/loader" paradigm:
The data source is responsible for accessing and decoding examples from a TFDS dataset on the fly.
The index sampler is responsible for determining the order in which records are processed. This is important to implement global transformations (e.g., global shuffling, sharding, repeating for multiple epochs) before reading any records.
The data loader orchestrates the loading by leveraging the data source and the index sampler. It enables performance optimizations (e.g., prefetching, multiprocessing, or multithreading).
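To make the paradigm concrete, here is a minimal, purely illustrative sketch of how the three components fit together (the function names are hypothetical, not part of any TFDS or PyTorch API):

```python
import random

def index_sampler(num_records, num_epochs, seed=42):
  """Hypothetical sampler: yields globally shuffled indices for each epoch."""
  rng = random.Random(seed)
  for _ in range(num_epochs):
    indices = list(range(num_records))
    rng.shuffle(indices)  # global shuffling decided before any record is read
    yield from indices

def data_loader(source, sampler):
  """Hypothetical loader: reads records in the order chosen by the sampler."""
  for index in sampler:
    yield source[index]  # the data source decodes the record on the fly
```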
TL;DR
tfds.data_source is an API to create data sources:
for fast prototyping in pure-Python pipelines;
to manage data-intensive ML pipelines at scale.
Setup
Let's install and import the needed dependencies:
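One way to do so (exact package pins may differ from the original notebook):

```python
# In a notebook cell:
# !pip install -q tensorflow-datasets array_record
import tensorflow_datasets as tfds
```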
Data sources
Data sources are basically Python sequences. So they need to implement the following protocol:
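In Python terms, the protocol can be sketched as follows (a typing.Protocol rendition for illustration; the exact interface lives in the TFDS code base):

```python
from typing import Any, Protocol, Sequence

class RandomAccessDataSource(Protocol):
  """A data source whose storage supports efficient random access."""

  def __len__(self) -> int:
    """Returns the number of records in the dataset."""

  def __getitem__(self, key: int) -> Sequence[Any]:
    """Retrieves the records for the given key(s)."""
```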
Warning: the API is still under active development. Notably, at this point, __getitem__ must support both int and list[int] as input. In the future, it will probably only support int, as per the standard.
The underlying file format needs to support efficient random access. At the moment, TFDS relies on array_record.
array_record is a new file format derived from Riegeli, achieving a new frontier of IO efficiency. In particular, ArrayRecord supports parallel read, write, and random access by record index. ArrayRecord builds on top of Riegeli and supports the same compression algorithms.
fashion_mnist is a common dataset for computer vision. To retrieve an ArrayRecord-based data source with TFDS, simply use:
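```python
ds = tfds.data_source('fashion_mnist')  # downloads and prepares the dataset if needed
```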
tfds.data_source is a convenient wrapper. It is equivalent to:
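Roughly, in terms of the builder API:

```python
builder = tfds.builder('fashion_mnist', file_format='array_record')
builder.download_and_prepare()
ds = builder.as_data_source()
```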
This outputs a dictionary of data sources:
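Printing it shows one data source per split (the exact repr may vary between TFDS versions):

```python
print(ds)
# Something like:
# {'test': DataSource(name=fashion_mnist, split='test', decoders=None),
#  'train': DataSource(name=fashion_mnist, split='train', decoders=None)}
```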
Once download_and_prepare has run and the record files have been generated, we don't need TensorFlow anymore. Everything happens in Python/NumPy!
Let's check this by uninstalling TensorFlow and re-loading the data source in another subprocess:
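A sketch of that check (the script name and the printed messages are illustrative):

```python
# First, in a notebook cell: !pip uninstall -y tensorflow
# Then run the following as a standalone script, e.g. python no_tensorflow.py:
import tensorflow_datasets as tfds

try:
  import tensorflow as tf
except ImportError:
  print('No TensorFlow found...')

ds = tfds.data_source('fashion_mnist')
print('...but the data source could still be loaded...')
ds['train'][0]
print('...and the records can be decoded.')
```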
In future versions, we are also going to make the dataset preparation TensorFlow-free.
A data source has a length:
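```python
len(ds['train'])  # 60000: the number of examples in the Fashion-MNIST training split
```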
Accessing the first element of the dataset:
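```python
%timeit ds['train'][0]  # IPython magic, for use in a notebook
```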
...is just as cheap as accessing any other element. This is the definition of random access:
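```python
%timeit ds['train'][30_000]  # an arbitrary index; about as cheap as index 0
```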
Features now use NumPy DTypes (rather than TensorFlow DTypes). You can inspect the features with:
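```python
features = tfds.builder('fashion_mnist').info.features
```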
You'll find more information about the features in our documentation. Here, notably, we can retrieve the shape of the images and the number of classes:
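```python
shape = features['image'].shape               # (28, 28, 1)
num_classes = features['label'].num_classes   # 10
```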
Use in pure Python
You can consume data sources in Python by iterating over them:
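```python
for example in ds['train']:
  print(example)  # a dict of NumPy values, e.g. {'image': ..., 'label': ...}
  break
```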
If you inspect elements, you will also notice that all features are already decoded using NumPy. Behind the scenes, we use OpenCV by default because it is fast. If you don't have OpenCV installed, we default to Pillow to provide lightweight and fast image decoding.
Note: currently, the feature is only available for Tensor, Image, and Scalar features. The Audio and Video features will come soon. Stay tuned!
Use with PyTorch
PyTorch uses the source/sampler/loader paradigm. In Torch, "data sources" are called "datasets". torch.utils.data contains all the details you need to know to build efficient input pipelines in Torch.
TFDS data sources can be used as regular map-style datasets.
First we install and import Torch:
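```python
# In a notebook cell:
# !pip install -q torch
import torch
```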
We already defined data sources for training and testing (respectively, ds['train'] and ds['test']). We can now define the sampler and the loaders:
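A sketch (the batch size and the number of sampled training examples are illustrative choices):

```python
batch_size = 128

# Draw a random subset of training examples; TFDS data sources are map-style
# datasets, so they plug directly into torch.utils.data.
train_sampler = torch.utils.data.RandomSampler(ds['train'], num_samples=5_000)
train_loader = torch.utils.data.DataLoader(
    ds['train'],
    sampler=train_sampler,
    batch_size=batch_size,
)
test_loader = torch.utils.data.DataLoader(
    ds['test'],
    sampler=None,  # iterate over the test split sequentially
    batch_size=batch_size,
)
```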
Using PyTorch, we train and evaluate a simple logistic regression on the first examples:
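A minimal sketch of such a model and loop (the architecture, optimizer, and the flattening of the 28x28x1 images are illustrative; shape and num_classes come from the features above):

```python
import torch.nn.functional as F
from torch import nn

class LinearClassifier(nn.Module):
  """Logistic regression: a single linear layer over flattened pixels."""

  def __init__(self, shape, num_classes):
    super().__init__()
    height, width, channels = shape
    self.classifier = nn.Linear(height * width * channels, num_classes)

  def forward(self, image):
    image = image.view(image.size(0), -1).float()  # flatten and cast to float
    return self.classifier(image)

model = LinearClassifier(shape, num_classes)
optimizer = torch.optim.Adam(model.parameters())

print('Training...')
model.train()
for example in train_loader:
  image, label = example['image'], example['label']
  loss = F.cross_entropy(model(image), label.long())
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

print('Testing...')
model.eval()
num_examples, num_correct = 0, 0
with torch.no_grad():
  for example in test_loader:
    image, label = example['image'], example['label']
    prediction = model(image).argmax(dim=1)
    num_examples += image.size(0)
    num_correct += (prediction == label).sum().item()
print(f'Accuracy: {num_correct / num_examples:.2%}')
```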
Coming soon: use with JAX
We are working closely with Grain, an open-source, fast, and deterministic data loader for Python. Stay tuned!
Read more
For more information, please refer to the tfds.data_source API documentation.