#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Dataset Collections

Overview

Dataset collections provide a simple way to group together an arbitrary number of existing TFDS datasets, and to perform simple operations over them.

They can be useful, for example, to group together different datasets related to the same task, or for easy benchmarking of models over a fixed number of different tasks.

Setup

To get started, install a few packages:

# Use tfds-nightly to ensure access to the latest features.
!pip install -q tfds-nightly tensorflow
!pip install -U conllu

Import TensorFlow and the TensorFlow Datasets package into your development environment:

import pprint

import tensorflow as tf
import tensorflow_datasets as tfds
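
Optionally, print the installed versions to confirm the setup (both packages expose a standard __version__ attribute):

print(tf.__version__)
print(tfds.__version__)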

Find available dataset collections

All dataset collection builders are subclasses of tfds.core.dataset_collection_builder.DatasetCollection.

To get the list of available builders, use tfds.list_dataset_collections().

tfds.list_dataset_collections()

Load and inspect a dataset collection

The easiest way to load a dataset collection is to instantiate a DatasetCollectionLoader object using the tfds.dataset_collection function:

collection_loader = tfds.dataset_collection('xtreme')

Specific dataset collection versions can be loaded following the same syntax as with TFDS datasets:

collection_loader = tfds.dataset_collection('xtreme:1.0.0')

A dataset collection loader can display information about the collection:

collection_loader.print_info()

The collection loader can also display information about the datasets contained in the collection:

collection_loader.print_datasets()

Loading datasets from a dataset collection

The easiest way to load one dataset from a collection is to use a DatasetCollectionLoader object's load_dataset method, which loads the required dataset by calling tfds.load.

This call returns a dictionary of split names and the corresponding tf.data.Datasets:

splits = collection_loader.load_dataset("ner")
pprint.pprint(splits)
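
As a quick check, you can iterate over one of the returned splits; the snippet below assumes the dataset exposes a "train" split:

# Take a single example from the assumed "train" split.
for example in splits["train"].take(1):
  pprint.pprint(example)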

load_dataset accepts the following optional parameters:

  • split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, it will load all splits for the given dataset.

  • loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options; a combined example follows this list.
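
For example, the following sketch loads only the test split of the "ner" dataset and forwards an illustrative batch_size through loader_kwargs:

test_split = collection_loader.load_dataset(
    "ner", split="test", loader_kwargs=dict(batch_size=10))
pprint.pprint(test_split)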

Loading multiple datasets from a dataset collection

The easiest way to load multiple datasets from a collection is to use the DatasetCollectionLoader object's load_datasets method, which loads the required datasets by calling tfds.load.

It returns a dictionary of dataset names, each one of which is associated with a dictionary of split names and the corresponding tf.data.Datasets, as in the following example:

datasets = collection_loader.load_datasets(['xnli', 'bucc'])
pprint.pprint(datasets)
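
A minimal sketch of walking the returned structure, printing each dataset's name and its available splits:

# dataset name -> split name -> tf.data.Dataset
for name, dataset_splits in datasets.items():
  print(name, list(dataset_splits))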

The load_all_datasets method loads all available datasets for a given collection:

all_datasets = collection_loader.load_all_datasets()
pprint.pprint(all_datasets)

The load_datasets method accepts the following optional parameters:

  • split: which split(s) to load. It accepts a single split (split="test") or a list of splits (split=["train", "test"]). If not specified, it will load all splits for the given dataset.

  • loader_kwargs: keyword arguments to be passed to the tfds.load function. Refer to the tfds.load documentation for a comprehensive overview of the different loading options; an example follows this list.
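
For instance, the sketch below loads only the test splits of two datasets in a single call (assuming each dataset defines a "test" split):

test_splits = collection_loader.load_datasets(['xnli', 'bucc'], split='test')
pprint.pprint(test_splits)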

Specifying loader_kwargs

The loader_kwargs are optional keyword arguments to be passed to the tfds.load function. They can be specified in three ways:

  1. When initializing the DatasetCollectionLoader class:

collection_loader = tfds.dataset_collection('xtreme', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))

  2. Using the DatasetCollectionLoader's set_loader_kwargs method:

collection_loader.set_loader_kwargs(dict(split='train', batch_size=10, try_gcs=False))

  3. As optional parameters to the load_dataset, load_datasets and load_all_datasets methods:

dataset = collection_loader.load_dataset('ner', loader_kwargs=dict(split='train', batch_size=10, try_gcs=False))
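
Putting the pieces together, a minimal end-to-end sketch: default loader_kwargs are set when the loader is created, and per-call loader_kwargs are passed for a specific dataset (all parameter values here are illustrative):

# Defaults applied to subsequent loads from this collection.
collection_loader = tfds.dataset_collection(
    'xtreme', loader_kwargs=dict(batch_size=10, try_gcs=False))

# Per-call keyword arguments for this specific dataset.
dataset = collection_loader.load_dataset(
    'ner', loader_kwargs=dict(split='train'))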

Feedback

We are continuously trying to improve the dataset creation workflow, but can only do so if we are aware of the issues. Which issues or errors did you encounter while creating the dataset collection? Was there a part that was confusing, boilerplate, or wasn't working the first time? Please share your feedback on GitHub.