# Format-specific Dataset Builders
[TOC]
This guide documents all format-specific dataset builders currently available in TFDS.
Format-specific dataset builders are subclasses of `tfds.core.GeneratorBasedBuilder` which take care of most data processing for a specific data format.
## Datasets based on `tf.data.Dataset`
If you want to create a TFDS dataset from a dataset that's in `tf.data.Dataset` format (reference), then you can use `tfds.dataset_builders.TfDataBuilder` (see API docs).
We envision two typical uses of this class:

*   Creating experimental datasets in a notebook-like environment
*   Defining a dataset builder in code
### Creating a new dataset from a notebook
Suppose you are working in a notebook, loaded some data as a `tf.data.Dataset`, applied various transformations (map, filter, etc.), and now you want to store this data and easily share it with teammates or load it in other notebooks. Instead of having to define a new dataset builder class, you can also instantiate a `tfds.dataset_builders.TfDataBuilder` and call `download_and_prepare` to store your dataset as a TFDS dataset.
Because it's a TFDS dataset, you can version it, use configs, have different splits, and document it for easier use later. Note that you also have to tell TFDS what the features in your dataset are.
Here's a dummy example of how you can use it.
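A minimal sketch of how this could look (the dataset name, feature, values, and the `/my/folder` path are all illustrative):

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Toy data: two splits of a single-feature dataset.
my_ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
my_ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})

# Optionally define a custom data dir; if None, the default TFDS data dir is used.
custom_data_dir = "/my/folder"

single_number_builder = tfds.dataset_builders.TfDataBuilder(
    name="my_dataset",
    config="single_number",
    version="1.0.0",
    data_dir=custom_data_dir,
    split_datasets={
        "train": my_ds_train,
        "test": my_ds_test,
    },
    features=tfds.features.FeaturesDict({
        "number": tfds.features.Scalar(dtype=tf.int64),
    }),
    description="My dataset with a single number.",
    release_notes={
        "1.0.0": "Initial release with numbers up to 5!",
    },
)

# Store the data as a TFDS dataset.
single_number_builder.download_and_prepare()
```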
The `download_and_prepare` method will iterate over the input `tf.data.Dataset`s and store the corresponding TFDS dataset in `/my/folder/my_dataset/single_number/1.0.0`, which will contain both the train and test splits.
The `config` argument is optional and can be useful if you want to store different configs under the same dataset.
The `data_dir` argument can be used to store the generated TFDS dataset in a different folder, for example in your own sandbox if you don't want to share this with others (yet). Note that when doing this, you also need to pass the `data_dir` to `tfds.load`. If the `data_dir` argument is not specified, then the default TFDS data dir will be used.
### Loading your dataset
After the TFDS dataset has been stored, it can be loaded from other scripts or by teammates if they have access to the data:
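```python
import tensorflow_datasets as tfds

# The dataset name and data_dir match the illustrative example above.
splits = tfds.load("my_dataset/single_number", data_dir="/my/folder")
train_ds = splits["train"]
test_ds = splits["test"]
```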
### Adding a new version or config
After iterating further on your dataset, you may have added or changed some of the transformations of the source data. To store and share the updated data, you can easily store it as a new version.
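Continuing the illustrative example, a bumped version with an extra transformation could be stored as follows:

```python
def add_one(example):
  example["number"] = example["number"] + 1
  return example

my_ds_train_v2 = my_ds_train.map(add_one)
my_ds_test_v2 = my_ds_test.map(add_one)

single_number_builder_v2 = tfds.dataset_builders.TfDataBuilder(
    name="my_dataset",
    config="single_number",
    version="2.0.0",  # New version for the transformed data.
    data_dir=custom_data_dir,
    split_datasets={
        "train": my_ds_train_v2,
        "test": my_ds_test_v2,
    },
    features=tfds.features.FeaturesDict({
        "number": tfds.features.Scalar(dtype=tf.int64),
    }),
    description="My dataset with a single number.",
    release_notes={
        "2.0.0": "Add 1 to all numbers.",
        "1.0.0": "Initial release with numbers up to 5!",
    },
)
single_number_builder_v2.download_and_prepare()
```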
### Defining a new dataset builder class
You can also define a new `DatasetBuilder` based on this class.
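For example, a sketch that packages the illustrative single-number data in its own builder class:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

class MyDatasetBuilder(tfds.dataset_builders.TfDataBuilder):
  def __init__(self):
    ds_train = tf.data.Dataset.from_tensor_slices({"number": [1, 2, 3]})
    ds_test = tf.data.Dataset.from_tensor_slices({"number": [4, 5]})
    super().__init__(
        name="my_dataset",
        version="1.0.0",
        split_datasets={
            "train": ds_train,
            "test": ds_test,
        },
        features=tfds.features.FeaturesDict({
            "number": tfds.features.Scalar(dtype=tf.int64),
        }),
        config="single_number",
        description="My dataset with a single number.",
        release_notes={
            "1.0.0": "Initial release with numbers up to 5!",
        },
    )
```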
## CoNLL

### The format
CoNLL is a popular format used to represent annotated text data.
CoNLL-formatted data usually contain one token with its linguistic annotations per line; within the same line, annotations are usually separated by spaces or tabs. Empty lines represent sentence boundaries.
Consider as an example the following sentence from the conll2003 dataset, which follows the CoNLL annotation format:
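```
U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
```

Each line annotates one token with its part-of-speech, syntactic-chunk, and named-entity tags, and an empty line would mark the end of the sentence.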
### `ConllDatasetBuilder`
To add a new CoNLL-based dataset to TFDS, you can base your dataset builder class on `tfds.dataset_builders.ConllDatasetBuilder`. This base class contains common code to deal with the specificities of CoNLL datasets (iterating over the column-based format, precompiled lists of features and tags, ...).
`tfds.dataset_builders.ConllDatasetBuilder` implements a CoNLL-specific `GeneratorBasedBuilder`. Refer to the following class as a minimal example of a CoNLL dataset builder:
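```python
from tensorflow_datasets.core.dataset_builders.conll import conll_dataset_builder_utils as conll_lib
import tensorflow_datasets.public_api as tfds

class MyCoNLLDataset(tfds.dataset_builders.ConllDatasetBuilder):
  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {'1.0.0': 'Initial release.'}

  # conll_lib contains a set of ready-to-use CoNLL-specific configs.
  BUILDER_CONFIGS = [conll_lib.CONLL_2003_CONFIG]

  def _info(self) -> tfds.core.DatasetInfo:
    return self.create_dataset_info(
        # ...
    )

  def _split_generators(self, dl_manager):
    path = dl_manager.download_and_extract('https://data-url')

    return {
        'train': self._generate_examples(path=path / 'train.txt'),
        'test': self._generate_examples(path=path / 'test.txt'),
    }
```

Note that `https://data-url` and the file names in this sketch are placeholders to be replaced with your dataset's actual location.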
As for standard dataset builders, it requires overriding the class methods `_info` and `_split_generators`. Depending on the dataset, you might also need to update `conll_dataset_builder_utils.py` to include the features and the list of tags specific to your dataset.
The `_generate_examples` method should not require further overriding, unless your dataset needs a specific implementation.
### Examples
Consider conll2003 as an example of a dataset implemented using the CoNLL-specific dataset builder.
### CLI
The easiest way to write a new CoNLL-based dataset is to use the TFDS CLI:
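```sh
cd path/to/my/project/datasets/
tfds new my_dataset --format=conll   # Creates `my_dataset/my_dataset.py` with CoNLL-specific template files.
```

Here `my_dataset` is a placeholder for your dataset's name.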
## CoNLL-U

### The format
CoNLL-U is a popular format used to represent annotated text data.
CoNLL-U enhances the CoNLL format by adding a number of features, such as support for multi-token words. CoNLL-U formatted data usually contain one token with its linguistic annotations per line; within the same line, annotations are usually separated by single tab characters. Empty lines represent sentence boundaries.
Typically, each CoNLL-U annotated word line contains the following fields, as reported in the official documentation:
*   ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
*   FORM: Word form or punctuation symbol.
*   LEMMA: Lemma or stem of word form.
*   UPOS: Universal part-of-speech tag.
*   XPOS: Language-specific part-of-speech tag; underscore if not available.
*   FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
*   HEAD: Head of the current word, which is either a value of ID or zero (0).
*   DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
*   DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
*   MISC: Any other annotation.
Consider as an example the following CoNLL-U annotated sentence from the official documentation:
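```
1-2    vámonos   _
1      vamos     ir
2      nos       nosotros
3-4    al        _
3      a         a
4      el        el
5      mar       mar
```

Note how the multiword tokens *vámonos* and *al* span a range of IDs (`1-2` and `3-4`); for brevity, only the ID, FORM, and LEMMA columns are shown here.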
### `ConllUDatasetBuilder`
To add a new CoNLL-U based dataset to TFDS, you can base your dataset builder class on `tfds.dataset_builders.ConllUDatasetBuilder`. This base class contains common code to deal with the specificities of CoNLL-U datasets (iterating over the column-based format, precompiled lists of features and tags, ...).
`tfds.dataset_builders.ConllUDatasetBuilder` implements a CoNLL-U specific `GeneratorBasedBuilder`. Refer to the following class as a minimal example of a CoNLL-U dataset builder:
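```python
from tensorflow_datasets.core.dataset_builders.conll import conllu_dataset_builder_utils as conllu_lib
import tensorflow_datasets.public_api as tfds

class MyCoNLLUDataset(tfds.dataset_builders.ConllUDatasetBuilder):
  VERSION = tfds.core.Version('1.0.0')
  RELEASE_NOTES = {'1.0.0': 'Initial release.'}

  # conllu_lib contains a set of ready-to-use features.
  BUILDER_CONFIGS = [
      conllu_lib.get_universal_morphology_config(
          language='en',
          features=conllu_lib.UNIVERSAL_DEPENDENCIES_FEATURES,
      )
  ]

  def _info(self) -> tfds.core.DatasetInfo:
    return self.create_dataset_info(
        # ...
    )

  def _split_generators(self, dl_manager):
    path = dl_manager.download_and_extract('https://data-url')

    return {
        'train': self._generate_examples(
            path=path / 'train.txt',
            # If necessary, add optional custom processing here
            # (see conllu_lib for examples).
            # process_example_fn=...,
        ),
    }
```

As above, `https://data-url` and the file names in this sketch are placeholders.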
As for standard dataset builders, it requires overriding the class methods `_info` and `_split_generators`. Depending on the dataset, you might also need to update `conllu_dataset_builder_utils.py` to include the features and the list of tags specific to your dataset.

The `_generate_examples` method should not require further overriding, unless your dataset needs a specific implementation. Note that, if your dataset requires specific preprocessing, for example if it considers non-classic universal dependency features, you might need to update the `process_example_fn` attribute of your `generate_examples` function (see the xtreme_pos dataset as an example).
### Examples
Consider the following datasets, which use the CoNLL-U specific dataset builder, as examples:
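*   universal_dependencies
*   xtreme_pos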
### CLI
The easiest way to write a new CoNLL-U based dataset is to use the TFDS CLI:
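```sh
cd path/to/my/project/datasets/
tfds new my_dataset --format=conllu   # Creates `my_dataset/my_dataset.py` with CoNLL-U specific template files.
```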