Copyright 2020 The TensorFlow Authors.
TFDS and determinism
This document explains:
The TFDS guarantees on determinism
In which order TFDS reads examples
Various caveats and gotchas
Setup
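A minimal setup sketch (assuming tensorflow and tensorflow-datasets are already installed; the exact install cell may differ):

```python
import re

import tensorflow as tf
import tensorflow_datasets as tfds
```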
Datasets
Some context is needed to understand how TFDS reads the data.
During generation, TFDS writes the original data into standardized .tfrecord files. For big datasets, multiple .tfrecord files are created, each containing multiple examples. We call each .tfrecord file a shard.
This guide uses imagenet, which has 1024 shards:
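For example, a sketch that checks the shard count from the dataset metadata (assuming the imagenet2012 metadata is available; the data itself does not need to be downloaded to read it):

```python
imagenet = tfds.builder('imagenet2012')

train_info = imagenet.info.splits['train']
# The train split is stored in 1024 shards.
print(f'{train_info.num_examples} examples in {train_info.num_shards} shards')
```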
Finding the dataset example ids
You can skip to the following section if you only want to know about determinism.
Each dataset example is uniquely identified by an id (e.g. 'imagenet2012-train.tfrecord-01023-of-01024__32'). You can recover this id by passing read_config.add_tfds_id = True, which will add a 'tfds_id' key in the dicts yielded by the tf.data.Dataset.
In this tutorial, we define a small util which prints the example ids of the dataset (converted to integers to be more human-readable):
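A minimal sketch of such a util, assuming the id format shown above ('&lt;fname&gt;__&lt;example-index&gt;'); the exact helper in the original notebook may differ:

```python
import re

import tensorflow_datasets as tfds


def tfds_id_to_int(tfds_id: str, builder) -> int:
  """Converts a tfds_id string into a single human-readable integer."""
  # Parse e.g. 'imagenet2012-train.tfrecord-01023-of-01024__32'
  match = re.match(r'\w+-(\w+)\.\w+-(\d+)-of-\d+__(\d+)', tfds_id)
  split_name, shard_id, ex_id = match.groups()
  # Global id == number of examples in all previous shards + offset in shard.
  split_info = builder.info.splits[split_name]
  return sum(split_info.shard_lengths[:int(shard_id)]) + int(ex_id)


def print_ex_ids(builder, split, take, read_config=None, **as_dataset_kwargs):
  """Prints the integer ids of the first `take` examples of the split."""
  if read_config is None:
    read_config = tfds.ReadConfig()
  read_config.add_tfds_id = True  # Add the 'tfds_id' key to each example
  ds = builder.as_dataset(
      split=split, read_config=read_config, **as_dataset_kwargs)
  ex_ids = [
      tfds_id_to_int(ex['tfds_id'].numpy().decode('utf-8'), builder)
      for ex in ds.take(take)
  ]
  print(ex_ids)
```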
Determinism when reading
This section explains the determinism guarantees of tfds.load.
With shuffle_files=False (default)
By default, TFDS yields examples deterministically (shuffle_files=False):
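A usage sketch with the hypothetical print_ex_ids helper and imagenet builder defined above:

```python
# Deterministic: both calls print the ids in the same order.
print_ex_ids(imagenet, split='train', take=20)
print_ex_ids(imagenet, split='train', take=20)
```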
For performance, TFDS reads multiple shards at the same time using tf.data.Dataset.interleave. We see in this example that TFDS switches to shard 2 after reading 16 examples (..., 14, 15, 1251, 1252, ...). More on interleave below.
Similarly, the subsplit API is also deterministic:
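A usage sketch (the slice values are illustrative):

```python
# Reading a subsplit is deterministic too: same ids on both calls.
print_ex_ids(imagenet, split='train[67%:84%]', take=20)
print_ex_ids(imagenet, split='train[67%:84%]', take=20)
```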
If you're training for more than one epoch, the above setup is not recommended, as all epochs will read the shards in the same order (so randomness is limited to the ds = ds.shuffle(buffer) buffer size).
With shuffle_files=True
With shuffle_files=True, shards are shuffled for each epoch, so reading is not deterministic anymore.
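A usage sketch with the same helper:

```python
# Non-deterministic: the two calls will likely print different orders.
print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)
print_ex_ids(imagenet, split='train', shuffle_files=True, take=20)
```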
Note: Setting shuffle_files=True also disables deterministic in tf.data.Options to give some performance boost. So even small datasets which only have a single shard (like mnist) become non-deterministic.
See the recipe below to get deterministic file shuffling.
Determinism caveat: interleave args
Changing read_config.interleave_cycle_length or read_config.interleave_block_length will change the example order.
TFDS relies on tf.data.Dataset.interleave to only load a few shards at once, improving the performance and reducing memory usage.
The example order is only guaranteed to be the same for a fixed value of the interleave args. See the interleave doc to understand what cycle_length and block_length correspond to.
cycle_length=16, block_length=16 (default, same as above):
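A sketch passing these defaults explicitly through tfds.ReadConfig:

```python
read_config = tfds.ReadConfig(
    interleave_cycle_length=16,
    interleave_block_length=16,
)
print_ex_ids(imagenet, split='train', read_config=read_config, take=25)
```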
cycle_length=3, block_length=2:
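A usage sketch with the same helper:

```python
read_config = tfds.ReadConfig(
    interleave_cycle_length=3,
    interleave_block_length=2,
)
print_ex_ids(imagenet, split='train', read_config=read_config, take=25)
```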
In the second example, we see that the dataset reads 2 (block_length=2) examples in a shard, then switches to the next shard. Every 2 * 3 (cycle_length=3) examples, it goes back to the first shard (shard0-ex0, shard0-ex1, shard1-ex0, shard1-ex1, shard2-ex0, shard2-ex1, shard0-ex2, shard0-ex3, shard1-ex2, shard1-ex3, shard2-ex2, ...).
Subsplit and example order
Each example has an id 0, 1, ..., num_examples-1. The subsplit API selects a slice of examples (e.g. train[:x] selects 0, 1, ..., x-1).
However, within the subsplit, examples are not read in increasing id order (due to shards and interleave).
More specifically, ds.take(x) and split='train[:x]' are not equivalent!
This can be seen easily in the above interleave example, where examples come from different shards.
After the 16 (block_length) examples, .take(25) switches to the next shard while train[:25] continues reading examples from the first shard.
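A sketch contrasting the two, using the hypothetical helper from above:

```python
# Interleaves across shards after block_length examples...
print_ex_ids(imagenet, split='train', take=25)
# ...while train[:25] only ever touches the first shard.
print_ex_ids(imagenet, split='train[:25]', take=-1)  # take=-1 reads everything
```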
Recipes
Get deterministic file shuffling
There are 2 ways to have deterministic shuffling:
1. Setting the shuffle_seed. Note: this requires changing the seed at each epoch, otherwise shards will be read in the same order between epochs.
2. Using experimental_interleave_sort_fn: this gives full control over which shards are read and in which order, rather than relying on the ds.shuffle order.
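A sketch of the first option, re-seeding the file shuffling at each epoch (the epoch count and seed scheme are illustrative):

```python
num_epochs = 3  # Hypothetical epoch count

for epoch in range(num_epochs):
  read_config = tfds.ReadConfig(
      shuffle_seed=epoch,  # Deterministic shard order, changing every epoch
  )
  ds = imagenet.as_dataset(
      split='train', shuffle_files=True, read_config=read_config)
  for ex in ds:  # Train on this epoch's (deterministically) shuffled data
    ...
```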
Get deterministic preemptible pipeline
This one is more complicated. There is no easy, satisfactory solution.
1. Without ds.shuffle and with deterministic shuffling, in theory it should be possible to count the examples which have been read and deduce which examples have been read within each shard (as a function of cycle_length, block_length and the shard order). Then the skip, take for each shard could be injected through experimental_interleave_sort_fn.
2. With ds.shuffle, it's likely impossible without replaying the full training pipeline. It would require saving the ds.shuffle buffer state to deduce which examples have been read. Examples could be non-continuous (e.g. shard5_ex2, shard5_ex4 read but not shard5_ex3).
3. With ds.shuffle, one way would be to save all shard_ids/example_ids read (deduced from tfds_id), then deducing the file instructions from that.
The simplest case for 1. is to have .skip(x).take(y) match train[x:x+y]. It requires:
Set cycle_length=1 (so shards are read sequentially)
Set shuffle_files=False
Do not use ds.shuffle
It should only be used on huge datasets where the training is only 1 epoch. Examples would be read in the default shuffle order.
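A sketch of this setup (the offsets x=40, y=22 are illustrative):

```python
read_config = tfds.ReadConfig(
    interleave_cycle_length=1,  # Shards are read one after the other
)
# shuffle_files=False is the default, and no ds.shuffle is applied, so:
ds1 = imagenet.as_dataset(split='train', read_config=read_config)
ds1 = ds1.skip(40).take(22)
ds2 = imagenet.as_dataset(split='train[40:62]', read_config=read_config)
# ds1 and ds2 should yield the same examples in the same order.
```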
Find which shards/examples are read for a given subsplit
With the tfds.core.DatasetInfo, you have direct access to the read instructions.
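For example, a sketch listing the file instructions for a given subsplit (the slice is illustrative):

```python
# Which files (and example offsets) would 'train[44%:45%]' read?
for file_instruction in imagenet.info.splits['train[44%:45%]'].file_instructions:
  print(file_instruction)
```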