Copyright 2022 The TensorFlow Authors.
Dataset Collections
Overview
Dataset collections provide a simple way to group together an arbitrary number of existing datasets from TensorFlow Datasets (TFDS), and to perform simple operations over them.
They can be useful, for example, to group together different datasets related to the same task, or for easy benchmarking of models over a fixed number of different tasks.
Setup
To get started, install a few packages:
Import TensorFlow and the TensorFlow Datasets package into your development environment:
Find available dataset collections
All dataset collection builders are subclasses of tfds.core.dataset_collection_builder.DatasetCollection.
To get the list of available builders, use tfds.list_dataset_collections().
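As a minimal sketch of the lookup described above, the helper below wraps tfds.list_dataset_collections(). The helper name available_collections is hypothetical, and the import sits inside the function only so that defining the helper does not itself require TFDS. The names returned depend on the installed TFDS version.

```python
def available_collections():
    """Return the names of the registered dataset collections, sorted."""
    # Lazy import: assumes the tensorflow_datasets package is installed
    # by the time this function is called.
    import tensorflow_datasets as tfds

    return sorted(tfds.list_dataset_collections())
```

Calling available_collections() in an environment with TFDS installed should give a sorted list of collection names that can then be passed to tfds.dataset_collection.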
Load and inspect a dataset collection
The easiest way to load a dataset collection is to instantiate a DatasetCollectionLoader object using the tfds.dataset_collection function.
Specific dataset collection versions can be loaded following the same syntax as with TFDS datasets:
A dataset collection loader can display information about the collection:
The dataset loader can also display information about the datasets contained in the collection:
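The loading and inspection steps above could be sketched roughly as follows. The helper names open_collection and describe_collection are hypothetical; the sketch assumes the loader exposes print_info and print_datasets methods for the two kinds of information mentioned, and that a version is appended to the name as "name:version", as with regular TFDS datasets.

```python
def open_collection(name, version=None):
    """Return a DatasetCollectionLoader for `name`, optionally pinned to `version`."""
    # Lazy import: assumes the tensorflow_datasets package is installed.
    import tensorflow_datasets as tfds

    # Versions use the same "name:version" syntax as regular TFDS datasets.
    spec = name if version is None else f"{name}:{version}"
    return tfds.dataset_collection(spec)


def describe_collection(loader):
    """Print information about the collection, then about its datasets."""
    loader.print_info()      # description of the collection itself
    loader.print_datasets()  # the datasets contained in the collection
```

For example, open_collection("xtreme", "1.0.0") would request version 1.0.0 of a collection named xtreme, assuming that collection and version exist in the installed TFDS release.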
Loading datasets from a dataset collection
The easiest way to load one dataset from a collection is to use a DatasetCollectionLoader object's load_dataset method, which loads the required dataset by calling tfds.load.
This call returns a dictionary of split names and the corresponding tf.data.Dataset objects:
load_dataset accepts the following optional parameters:
split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, it will load all splits for the given dataset.
loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
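To make the load_dataset call and its two optional parameters concrete, here is a small sketch. The wrapper name load_splits is hypothetical, and the loader argument is assumed to be a DatasetCollectionLoader obtained from tfds.dataset_collection.

```python
def load_splits(loader, dataset_name, split=None, loader_kwargs=None):
    """Load one dataset from a collection and return {split_name: dataset}.

    `split` may be a single split name, a list of split names, or None
    (all splits). `loader_kwargs` is forwarded to tfds.load.
    """
    splits = loader.load_dataset(
        dataset_name, split=split, loader_kwargs=loader_kwargs
    )
    # Copy into a plain dict so the result is easy to inspect or iterate.
    return dict(splits)
```

A call such as load_splits(loader, "ner", split=["train", "test"]) would then be expected to return a dictionary with one entry per requested split, assuming the collection contains a dataset named ner.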
Loading multiple datasets from a dataset collection
The easiest way to load multiple datasets from a collection is to use the DatasetCollectionLoader object's load_datasets method, which loads the required datasets by calling tfds.load.
It returns a dictionary of dataset names, each of which is associated with a dictionary of split names and the corresponding tf.data.Dataset objects, as in the following example:
The load_all_datasets method loads all available datasets for a given collection:
The load_datasets method accepts the following optional parameters:
split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, it will load all splits for the given dataset.
loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options.
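The nested result of load_datasets and load_all_datasets (dataset name, then split name, then tf.data.Dataset) can be walked as in the sketch below. The helper names load_many and iter_splits are hypothetical, and the sketch assumes load_all_datasets takes the same split and loader_kwargs parameters as load_datasets.

```python
def load_many(loader, dataset_names=None, split=None, loader_kwargs=None):
    """Load the named datasets, or every dataset when `dataset_names` is None."""
    if dataset_names is None:
        # Assumption: load_all_datasets accepts the same optional parameters.
        return loader.load_all_datasets(split=split, loader_kwargs=loader_kwargs)
    return loader.load_datasets(
        dataset_names, split=split, loader_kwargs=loader_kwargs
    )


def iter_splits(datasets):
    """Yield (dataset_name, split_name, dataset) from the nested mapping."""
    for dataset_name, splits in datasets.items():
        for split_name, dataset in splits.items():
            yield dataset_name, split_name, dataset
```

Flattening the nested dictionary this way is convenient for benchmarking, since a model can be evaluated in a single loop over every dataset and split in the collection.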
Specifying loader_kwargs
The loader_kwargs are optional keyword arguments to be passed to the tfds.load function. They can be specified in three ways:
1. When initializing the DatasetCollectionLoader class.
2. Using the DatasetCollectionLoader's set_loader_kwargs method.
3. As optional parameters to the load_dataset, load_datasets and load_all_datasets methods.
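The three options listed above might look like the following sketch. The helper names and the EXAMPLE_KWARGS values are hypothetical; batch_size and try_gcs are used only because they are keyword arguments of tfds.load. How the three mechanisms interact when combined is not covered here.

```python
# Hypothetical options, chosen only for illustration.
EXAMPLE_KWARGS = {"batch_size": 10, "try_gcs": False}


def kwargs_at_init(name, loader_kwargs):
    """Option 1: pass loader_kwargs when creating the loader."""
    # Lazy import: assumes the tensorflow_datasets package is installed.
    import tensorflow_datasets as tfds

    return tfds.dataset_collection(name, loader_kwargs=loader_kwargs)


def kwargs_via_setter(loader, loader_kwargs):
    """Option 2: set loader_kwargs on an existing loader."""
    loader.set_loader_kwargs(loader_kwargs)
    return loader


def kwargs_per_call(loader, dataset_name, loader_kwargs):
    """Option 3: pass loader_kwargs to a single load call."""
    return loader.load_dataset(dataset_name, loader_kwargs=loader_kwargs)
```

Setting the arguments on the loader suits options shared by every dataset in the collection, while the per-call form suits one-off overrides.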
Feedback
We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating the dataset collection? Was any part confusing, heavy on boilerplate, or not working the first time? Please share your feedback on GitHub.