Common implementation gotchas
This page describes common gotchas when implementing a new dataset.
Legacy SplitGenerator should be avoided
The old `tfds.core.SplitGenerator` API is deprecated. `_split_generators` should instead return a `dict` mapping each split name to the generator returned by `_generate_examples`.
Rationale: The new API is less verbose and more explicit. The old API will be removed in a future version.
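As a rough sketch of the dict-based form, the class below mimics the two builder methods without depending on TFDS, so that it can run on its own. `ToyBuilder`, the file names, and the `data_dir` argument are hypothetical; in a real dataset these methods live on the dataset builder class and the paths come from `dl_manager`.

```python
import pathlib
import tempfile


class ToyBuilder:
    """Stand-in for a dataset builder (hypothetical, no TFDS dependency)."""

    def _split_generators(self, data_dir: pathlib.Path):
        # New style: return a dict mapping split names to example generators,
        # instead of a list of `tfds.core.SplitGenerator(...)` objects.
        return {
            'train': self._generate_examples(path=data_dir / 'train.txt'),
            'test': self._generate_examples(path=data_dir / 'test.txt'),
        }

    def _generate_examples(self, path: pathlib.Path):
        # Yield (key, example) pairs, one per line of the file.
        for i, line in enumerate(path.read_text().splitlines()):
            yield i, {'text': line}


# Usage with two tiny, made-up files.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = pathlib.Path(tmp)
    (data_dir / 'train.txt').write_text('a\nb\n')
    (data_dir / 'test.txt').write_text('c\n')
    splits = ToyBuilder()._split_generators(data_dir)
    examples = {name: list(gen) for name, gen in splits.items()}
```

Each value of the returned dict is a generator, so the examples are only read when the split is actually consumed.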
New datasets should be self-contained in a folder
When adding a dataset inside the `tensorflow_datasets/` repository, please make sure to follow the dataset-as-folder structure (all checksums, dummy data, and implementation code self-contained in a folder).
Old datasets (bad):
<category>/<ds_name>.py
New datasets (good):
<category>/<ds_name>/<ds_name>.py
Use the TFDS CLI (`tfds new`, or `gtfds new` for Googlers) to generate the template.
Rationale: The old structure required absolute paths for checksums and fake data, and scattered the dataset files across many places. This made it harder to implement datasets outside the TFDS repository. For consistency, the new structure should now be used everywhere.
Description lists should be formatted as markdown
The `DatasetInfo.description` `str` is formatted as markdown. Markdown lists require an empty line before the first item.
Rationale: Badly formatted descriptions create visual artifacts in our catalog documentation. Without the empty lines, a description consisting of a sentence, a three-item numbered list, and a closing sentence would be rendered as:
Some text. 1. Item 1 2. Item 1 3. Item 1 Some other text
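A minimal sketch of a correctly formatted description follows. The `_DESCRIPTION` name and its text are placeholders chosen to match the rendering above; the point is only the empty lines surrounding the list.

```python
import textwrap

# Placeholder description text. The empty line before `1.` is what makes
# Markdown start a list; the one after the list keeps the closing sentence
# from being folded into the last item.
_DESCRIPTION = textwrap.dedent("""\
    Some text.

    1. Item 1
    2. Item 1
    3. Item 1

    Some other text.
    """)

lines = _DESCRIPTION.splitlines()
first_item = lines.index('1. Item 1')
last_item = lines.index('3. Item 1')
```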
Forgot ClassLabel names
When using `tfds.features.ClassLabel`, try to provide the human-readable label `str` values with `names=` or `names_file=` (instead of `num_classes=10`).
Rationale: Human-readable labels are used in many places:
* They allow yielding `str` directly in `_generate_examples`: `yield {'label': 'dog'}`
* They are exposed to users, e.g. `info.features['label'].names` (conversion methods `.str2int('dog')`, ... are also available)
* They are used in the visualization utils `tfds.show_examples` and `tfds.as_dataframe`
Forgot image shape
When using `tfds.features.Image` or `tfds.features.Video`, if the images have a static shape, it should be specified explicitly.
Rationale: It allows static shape inference (e.g. `ds.element_spec['image'].shape`), which is required for batching (batching images of unknown shape would require resizing them first).
Prefer more specific type instead of tfds.features.Tensor
When possible, prefer the more specific types `tfds.features.ClassLabel`, `tfds.features.BBoxFeature`,... instead of the generic `tfds.features.Tensor`.
Rationale: In addition to being more semantically correct, specific features provide additional metadata to users and are detected by tools.
Lazy imports in global space
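Lazy imports should not be called from the global scope. For example, accessing a lazy module such as `tfds.lazy_imports.apache_beam` at the top level of a file is wrong; it should be accessed inside the function that needs it.
Rationale: Using lazy imports in the global scope would import the module for all tfds users, defeating the purpose of lazy imports.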
Dynamically computing train/test splits
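If the dataset does not provide official splits, neither should TFDS. In particular, `_split_generators` should not carve `train` and `test` splits out of the data itself (for example by slicing the list of files).
Rationale: TFDS tries to provide datasets as close as possible to the original data. The sub-split API should be used instead to let users dynamically create the subsplits they want, for example `split='train[:80%]'` and `split='train[80%:]'`.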
Python style guide
Prefer to use pathlib API
Instead of the `tf.io.gfile` API, it is preferable to use the pathlib API. All `dl_manager` methods return pathlib-like objects compatible with GCS, S3,...
Rationale: The pathlib API is a modern, object-oriented file API which removes boilerplate. Using `.read_text()` / `.read_bytes()` also guarantees the files are correctly closed.
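As an illustration of the pathlib style, the sketch below uses the standard-library `pathlib.Path` on a temporary file. The file name and contents are made up, and `path` merely stands in for a pathlib-like object returned by `dl_manager`.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # `path` stands in for a path returned by `dl_manager` (hypothetical file).
    path = pathlib.Path(tmp) / 'labels.txt'
    path.write_text('cat\ndog\n')

    # One call opens, reads, and closes the file; no explicit `with open(...)`.
    labels = path.read_text().splitlines()

    # `/` joins path components, and `.exists()` is a method on the path itself.
    other_exists = (path.parent / 'other.txt').exists()
```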
If the method is not using self, it should be a function
If a class method is not using `self`, it should be a simple function (defined outside the class).
Rationale: It makes it explicit to the reader that the function does not have side effects or hidden inputs/outputs.
Lazy imports in Python
We lazily import big modules like TensorFlow. Lazy imports defer the actual import of the module to the first usage of the module. So users who don't need this big module will never import it.
Under the hood, the `LazyModule` class acts as a factory that only actually imports the module when an attribute is accessed (`__getattr__`).
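The mechanism can be approximated with the standard library alone. The class below is an illustrative stand-in written for this page, not the TFDS implementation, and `colorsys` is used only as a small module that is convenient to import late.

```python
import importlib
import sys
import types
from typing import Any, Optional


class LazyModule:
    """Illustrative stand-in: imports the module on first attribute access."""

    def __init__(self, module_name: str):
        self._module_name = module_name
        self._module: Optional[types.ModuleType] = None

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails, i.e. for the module's attributes.
        if self._module is None:
            self._module = importlib.import_module(self._module_name)
        return getattr(self._module, name)


sys.modules.pop('colorsys', None)          # start the demo from a clean state
colorsys = LazyModule('colorsys')          # nothing is imported yet
imported_before = 'colorsys' in sys.modules
rgb = colorsys.hsv_to_rgb(0.0, 0.0, 1.0)   # first access triggers the import
imported_after = 'colorsys' in sys.modules
```

Creating the `LazyModule` object is cheap; the cost of the real import is paid only by code paths that actually touch one of the module's attributes.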
You can also use it conveniently with a context manager: