Add a new dataset collection
Follow this guide to create a new dataset collection (either in TFDS or in your own repository).
Overview
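To add a new dataset collection my_collection to TFDS, create a my_collection folder containing the collection's files: my_collection.py, which defines the collection, and optionally description.md and citations.bib (both described below).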
As a convention, new dataset collections should be added to the tensorflow_datasets/dataset_collections/ folder in the TFDS repository.
Write your dataset collection
All dataset collections are implemented as subclasses of tfds.core.dataset_collection_builder.DatasetCollection.
Here is a minimal example of a dataset collection builder, defined in the file my_collection.py:
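The following is a rough, self-contained sketch of the shape such a builder takes, written so that it runs without TFDS installed. CollectionInfo and CollectionBase are hypothetical stand-ins, not TFDS classes: a real collection would subclass tfds.core.dataset_collection_builder.DatasetCollection and return a DatasetCollectionInfo. Modelling info and datasets as properties, and the dataset identifier some_dataset:1.0.0, are also assumptions made for illustration.

```python
import abc
import dataclasses
from typing import Mapping, Optional


@dataclasses.dataclass(frozen=True)
class CollectionInfo:
    """Hypothetical stand-in for DatasetCollectionInfo (four metadata fields)."""
    name: str
    description: str
    release_notes: Mapping[str, str]
    citation: Optional[str] = None  # Optional BibTeX string.


class CollectionBase(abc.ABC):
    """Hypothetical stand-in for the DatasetCollection base class."""

    @property
    @abc.abstractmethod
    def info(self) -> CollectionInfo:
        """Returns the collection's metadata."""

    @property
    @abc.abstractmethod
    def datasets(self) -> Mapping[str, Mapping[str, str]]:
        """Returns, for each collection version, the datasets it includes."""


class MyCollection(CollectionBase):
    """Example collection that overrides both abstract members."""

    @property
    def info(self) -> CollectionInfo:
        return CollectionInfo(
            name="my_collection",
            description="my_collection description.",
            release_notes={"1.0.0": "Initial release"},
        )

    @property
    def datasets(self) -> Mapping[str, Mapping[str, str]]:
        # Collection version -> {name in the collection -> dataset identifier}.
        return {
            "1.0.0": {"dataset_1": "some_dataset:1.0.0"},
        }


collection = MyCollection()
print(collection.info.name)
print(list(collection.datasets))
```

Because both members are abstract in the stand-in base class, instantiating a subclass that omits either one raises a TypeError, which mirrors the requirement that a collection must define both.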
The next sections describe the two abstract methods to override.
info: dataset collection metadata
The info method returns the dataset_collection_builder.DatasetCollectionInfo containing the collection's metadata.
The dataset collection info contains four fields:
name: the name of the dataset collection.
description: a markdown-formatted description of the dataset collection. It can be defined in two ways: (1) as a (multi-line) string directly in the collection's my_collection.py file, as is done for TFDS datasets; (2) in a description.md file, which must be placed in the dataset collection folder.
release_notes: a mapping from the dataset collection's version to the corresponding release notes.
citation: an optional (list of) BibTeX citation(s) for the dataset collection. It can be defined in two ways: (1) as a (multi-line) string directly in the collection's my_collection.py file, as is done for TFDS datasets; (2) in a citations.bib file, which must be placed in the dataset collection folder.
datasets: define the datasets in the collection
The datasets method returns the TFDS datasets in the collection.
It is defined as a dictionary of versions, which describe the evolution of the dataset collection.
For each version, the included TFDS datasets are stored as a dictionary from dataset names to naming.DatasetReference. For example:
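A minimal sketch of that nested structure, runnable without TFDS: DatasetReference below is a hypothetical stand-in for naming.DatasetReference, and its field names (dataset_name, config, version), the dataset names, and the version numbers are all assumptions chosen for illustration.

```python
import dataclasses
from typing import Mapping, Optional


@dataclasses.dataclass(frozen=True)
class DatasetReference:
    """Hypothetical stand-in for naming.DatasetReference."""
    dataset_name: str
    version: Optional[str] = None
    config: Optional[str] = None


def collection_datasets() -> Mapping[str, Mapping[str, DatasetReference]]:
    """Returns collection version -> {dataset name -> reference}."""
    return {
        # First release: two datasets.
        "1.0.0": {
            "dataset_1": DatasetReference(
                dataset_name="some_dataset", config="default", version="0.0.2"),
            "dataset_2": DatasetReference(
                dataset_name="other_dataset", version="1.0.0"),
        },
        # Later release: dataset_1 moves to a newer version, dataset_3 is added.
        "1.1.0": {
            "dataset_1": DatasetReference(
                dataset_name="some_dataset", config="default", version="0.1.0"),
            "dataset_2": DatasetReference(
                dataset_name="other_dataset", version="1.0.0"),
            "dataset_3": DatasetReference(
                dataset_name="third_dataset", version="3.0.0"),
        },
    }


versions = collection_datasets()
latest = list(versions)[-1]  # dicts keep insertion order
print(latest, sorted(versions[latest]))
```

Listing the versions in release order keeps the evolution of the collection readable: each entry shows exactly which datasets, and which dataset versions, belong to that release.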
The naming.references_for method provides a more compact way to express the same mapping:
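To illustrate the idea behind the compact form, the sketch below turns identifier strings into references, assuming identifiers follow the name/config:version pattern used for TFDS dataset names. It is not the TFDS implementation: DatasetReference, parse_reference, and references_from_strings are hypothetical helpers, and the dataset names are placeholders.

```python
import dataclasses
from typing import Dict, Mapping, Optional


@dataclasses.dataclass(frozen=True)
class DatasetReference:
    """Hypothetical stand-in for naming.DatasetReference."""
    dataset_name: str
    version: Optional[str] = None
    config: Optional[str] = None


def parse_reference(identifier: str) -> DatasetReference:
    """Parses 'name', 'name/config', 'name:version' or 'name/config:version'."""
    name_and_config, _, version = identifier.partition(":")
    name, _, config = name_and_config.partition("/")
    # partition() yields '' for a missing part; store that as None.
    return DatasetReference(
        dataset_name=name, version=version or None, config=config or None)


def references_from_strings(
    name_to_identifier: Mapping[str, str],
) -> Dict[str, DatasetReference]:
    """Builds {dataset name -> reference} from compact identifier strings."""
    return {
        name: parse_reference(identifier)
        for name, identifier in name_to_identifier.items()
    }


refs = references_from_strings({
    "dataset_1": "some_dataset/default:0.0.2",
    "dataset_2": "other_dataset:1.0.0",
})
print(refs["dataset_1"])
```

The compact form trades explicit keyword arguments for a single string per dataset, which keeps long version dictionaries short at the cost of relying on the identifier format.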
Unit-test your dataset collection
DatasetCollectionTestBase is a base test class for dataset collections. It provides a number of simple checks to guarantee that the dataset collection is correctly registered, and its datasets exist in TFDS.
The only class attribute to set is DATASET_COLLECTION_CLASS, which specifies the class object of the dataset collection to test.
Additionally, users can set the following class attributes:
VERSION: The version of the dataset collection used to run the test (defaults to the latest version).
DATASETS_TO_TEST: List of the datasets whose existence in TFDS is checked (defaults to all datasets in the collection).
CHECK_DATASETS_VERSION: Whether to check for the existence of the versioned datasets in the dataset collection, or for their default versions (defaults to true).
The simplest valid test for a dataset collection would be:
Run the following command to test the dataset collection.
Feedback
We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating the dataset collection? Was there a part which was confusing, or wasn't working the first time?
Please share your feedback on GitHub.