Introduction to TensorFlow Datasets
TensorFlow Datasets (TFDS) is a convenient way to handle large datasets as a stream of minibatches, suitable for large-scale training and parallel evaluation. It can be used from both TensorFlow and JAX code, as we illustrate below. (See the official colab for details.)
Manipulating data without using TFDS
We first illustrate how to make streams of minibatches using plain NumPy code. TFDS will then let us eliminate much of this boilerplate. As an example, let's package some small labeled datasets into two dictionaries, one for training and one for testing.
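A minimal sketch of this packaging step, using synthetic data in place of a real dataset. The sizes (100 train, 50 test, 4 features, 3 classes), the helper name `make_datasets`, and the key `'y'` are illustrative assumptions; only the key `'X'` appears in the outputs later in this notebook.

```python
import numpy as np

def make_datasets(n_train=100, n_test=50, n_features=4, n_classes=3, seed=0):
    """Return (train_ds, test_ds), each a dict with keys 'X' and 'y'.

    Hypothetical helper: the data is random and purely illustrative.
    """
    rng = np.random.default_rng(seed)
    n_total = n_train + n_test
    X = rng.normal(size=(n_total, n_features)).astype(np.float32)
    y = rng.integers(0, n_classes, size=n_total)
    # Shuffle once, then split into train and test portions.
    perm = rng.permutation(n_total)
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    train_ds = {'X': X[train_idx], 'y': y[train_idx]}
    test_ds = {'X': X[test_idx], 'y': y[test_idx]}
    return train_ds, test_ds

train_ds, test_ds = make_datasets()
print(train_ds['X'].shape, train_ds['y'].shape)  # (100, 4) (100,)
print(test_ds['X'].shape, test_ds['y'].shape)    # (50, 4) (50,)
```

Keeping inputs and labels together in one dictionary means a single index array can slice both consistently when forming minibatches.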
Now we make one pass (epoch) over the data, drawing random minibatches of size 30. There are 100 examples in total, so with a batch size of 30 only three full batches (90 examples) fit and the remaining 10 examples go unused. We can deal with such "boundary effects" later.
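One way to sketch such an epoch in NumPy is shown below. The generator name `epoch_minibatches` and the synthetic data are assumptions made for illustration; the partial final batch is simply dropped, which is the boundary effect mentioned above.

```python
import numpy as np

def epoch_minibatches(ds, batch_size, rng):
    """Yield random minibatches for one epoch, dropping the partial last batch."""
    n = len(ds['X'])
    perm = rng.permutation(n)      # visit the examples in random order
    n_full = n // batch_size       # number of complete batches
    for i in range(n_full):
        idx = perm[i * batch_size:(i + 1) * batch_size]
        # Apply the same indices to every array in the dictionary.
        yield {k: v[idx] for k, v in ds.items()}

rng = np.random.default_rng(0)
# Synthetic stand-in for the training dictionary (100 examples, 4 features).
train_ds = {'X': rng.normal(size=(100, 4)), 'y': rng.integers(0, 3, size=100)}

batches = list(epoch_minibatches(train_ds, batch_size=30, rng=rng))
print(len(batches))                      # 3
print([b['X'].shape for b in batches])   # [(30, 4), (30, 4), (30, 4)]
```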
Using TFDS
Using pre-packaged datasets
There are many standard datasets available from https://www.tensorflow.org/datasets. We give some examples below.
Downloading and preparing dataset mnist/3.0.1 (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...
Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
<class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
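As one illustration, the sketch below loads MNIST with `tfds.load` and inspects a single example, which is the kind of call that yields output like the lines above. It is a reconstruction under assumptions, not the original cell: the wrapper `peek_mnist` is a hypothetical name, and `tensorflow_datasets` is imported inside it so that defining the function requires neither the package nor a download.

```python
def peek_mnist():
    """Load the MNIST training split and print details of one example."""
    # Imported lazily: needs tensorflow and tensorflow_datasets installed,
    # and the first call downloads the dataset.
    import tensorflow_datasets as tfds

    ds = tfds.load('mnist', split='train', shuffle_files=True)
    print(type(ds))            # a tf.data.Dataset subclass
    ds = ds.take(1)            # keep only the first example
    print(type(ds))
    for example in ds:         # each example is a dict of tensors
        print(list(example.keys()))
        image, label = example['image'], example['label']
        print(image.shape, label)
    return image, label
```

Calling `peek_mnist()` requires both packages and network access on first use; the printed class names can differ between TensorFlow versions.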
Streams and iterators
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-86-f77f6b952706> in <module>()
18 print(b['X'].shape)
19
---> 20 b = my_stream.next()
21 print(type(b))
22 print(b['X'].shape)
AttributeError: 'generator' object has no attribute 'next'
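The `AttributeError` above is what Python 3 raises when a generator is advanced with the Python 2 style `.next()` method; the built-in `next()` function works instead. A minimal sketch, assuming a stream that loops over the data indefinitely: the generator `make_stream` and the synthetic data are hypothetical, while the name `my_stream` and the key `'X'` follow the traceback.

```python
import numpy as np

def make_stream(ds, batch_size, seed=0):
    """Yield minibatches forever, reshuffling at the start of each epoch."""
    rng = np.random.default_rng(seed)
    n = len(ds['X'])
    while True:
        perm = rng.permutation(n)
        # Step through full batches only; the remainder is dropped.
        for start in range(0, n - batch_size + 1, batch_size):
            idx = perm[start:start + batch_size]
            yield {k: v[idx] for k, v in ds.items()}

rng = np.random.default_rng(0)
train_ds = {'X': rng.normal(size=(100, 4)), 'y': rng.integers(0, 3, size=100)}

my_stream = make_stream(train_ds, batch_size=30)
b = next(my_stream)    # Python 3: use next(gen), not gen.next()
print(type(b))         # <class 'dict'>
print(b['X'].shape)    # (30, 4)
```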
Worked example
To process data streams efficiently, see this webpage.
Data visualization
Downloading and preparing dataset iris/2.0.0 (download: 4.44 KiB, generated: Unknown size, total: 4.44 KiB) to /root/tensorflow_datasets/iris/2.0.0...
Dataset iris downloaded and prepared to /root/tensorflow_datasets/iris/2.0.0. Subsequent calls will reuse this data.
tfds.core.DatasetInfo(
    name='iris',
    version=2.0.0,
    description='This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.',
    homepage='https://archive.ics.uci.edu/ml/datasets/iris',
    features=FeaturesDict({
        'features': Tensor(shape=(4,), dtype=tf.float32),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    }),
    total_num_examples=150,
    splits={
        'train': 150,
    },
    supervised_keys=('features', 'label'),
    citation="""@misc{Dua:2019 ,
    author = "Dua, Dheeru and Graff, Casey",
    year = "2017",
    title = "{UCI} Machine Learning Repository",
    url = "http://archive.ics.uci.edu/ml",
    institution = "University of California, Irvine, School of Information and Computer Sciences"
    }""",
    redistribution_info=,
)
<class 'tensorflow_datasets.core.as_dataframe.StyledDataFrame'>
               features  label
0  [5.1, 3.4, 1.5, 0.2]      0
1  [7.7, 3.0, 6.1, 2.3]      2
2  [5.7, 2.8, 4.5, 1.3]      1
3  [6.8, 3.2, 5.9, 2.3]      2
Downloading and preparing dataset cifar10/3.0.2 (download: 162.17 MiB, generated: 132.40 MiB, total: 294.58 MiB) to /root/tensorflow_datasets/cifar10/3.0.2...
Dataset cifar10 downloaded and prepared to /root/tensorflow_datasets/cifar10/3.0.2. Subsequent calls will reuse this data.
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-70-d648afc052fa> in <module>()
----> 1 import tensorflow_data_validation
2
3 tfds.show_statistics(info)
ModuleNotFoundError: No module named 'tensorflow_data_validation'
Graveyard
Here we store some code we don't need (for now).