Manipulating datasets
In this colab, we briefly discuss ways to access and manipulate common datasets that are used in the ML literature. Most of these are used for supervised learning experiments.
Tabular datasets
The UCI ML repository contains many smallish datasets, mostly tabular.
Kaggle also hosts many interesting datasets.
Sklearn has many small datasets built in, making them easy to use for prototyping, as we illustrate below.
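For example, here is a minimal sketch of loading the built-in iris dataset (the other loaders, such as load_digits or load_wine, work the same way):

```python
# Load one of sklearn's built-in toy datasets (iris) into numpy arrays.
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
print(X.shape, y.shape)  # (150, 4) (150,)
print(iris.feature_names)
print(iris.target_names)
```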
Tensorflow datasets
TFDS is a handy way to handle large datasets as a stream of minibatches, suitable for large-scale training and parallel evaluation. It can be used from TensorFlow and JAX code, as we illustrate below. (See the official colab for details.)
Minibatching without using TFDS
We first illustrate how to make streams of minibatches using vanilla numpy code; TFDS will then let us eliminate a lot of this boilerplate. As an example, let's package a small labeled dataset into two dictionaries, one for train and one for test.
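Here is one possible way to do this; using iris from sklearn and a 100/50 split are just illustrative choices:

```python
# Package the iris data into train/test dictionaries of numpy arrays.
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target  # 150 examples, 4 features

# Shuffle once so the train/test split is random.
rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

train_data = {"X": X[:100], "y": y[:100]}
test_data = {"X": X[100:], "y": y[100:]}
print(train_data["X"].shape, test_data["X"].shape)  # (100, 4) (50, 4)
```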
Now we make one pass (epoch) over the data, computing random minibatches of size 30. There are 100 training examples, so with a batch size of 30 the last 10 examples are never used; we will address such "boundary effects" later.
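A sketch of one such epoch, using the train_data dictionary defined above:

```python
import numpy as np

batch_size = 30
num_train = train_data["y"].shape[0]   # 100 examples
num_batches = num_train // batch_size  # 3 full batches; the last 10 examples are dropped

rng = np.random.default_rng(42)
perm = rng.permutation(num_train)  # visit the data in random order
for i in range(num_batches):
    idx = perm[i * batch_size : (i + 1) * batch_size]
    X_batch, y_batch = train_data["X"][idx], train_data["y"][idx]
    print(i, X_batch.shape, y_batch.shape)
```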
Minibatching with TFDS
Below we show how to convert numpy arrays into a TFDS data stream. We shuffle the records and group them into minibatches, then repeat these batches indefinitely to create an infinite stream, which we can convert to a python iterator and pass to our training loop.
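A minimal sketch of this pipeline, reusing the train_data dictionary from above:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

ds = tf.data.Dataset.from_tensor_slices(train_data)  # wrap the dict of numpy arrays
ds = ds.shuffle(buffer_size=100).batch(30).repeat()  # infinite stream of minibatches
batch_stream = iter(tfds.as_numpy(ds))               # python iterator over numpy batches

batch = next(batch_stream)
print(batch["X"].shape, batch["y"].shape)  # (30, 4) (30,)
```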
Preprocessing the data
We can preprocess the data before creating minibatches. We can also use pre-fetching to speed things up (see this TF tutorial for details). We illustrate this below for MNIST.
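A sketch of such a pipeline; tf.data.AUTOTUNE assumes TF 2.4+, and the batch size of 32 is arbitrary:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

def preprocess(example):
    # Convert uint8 pixels in [0, 255] to floats in [0, 1].
    image = tf.cast(example["image"], tf.float32) / 255.0
    return {"image": image, "label": example["label"]}

ds = tfds.load("mnist", split="train")
ds = ds.map(preprocess).shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
for batch in ds.take(1):
    print(batch["image"].shape, batch["label"].shape)  # (32, 28, 28, 1) (32,)
```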
Vision datasets
MNIST
There are many standard versions of MNIST, some of which are available from https://www.tensorflow.org/datasets. We give some examples below.
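For example, a cell like the following produces output similar to what is shown below (the exact dataset class names depend on the TF version):

```python
import tensorflow_datasets as tfds

ds = tfds.load("mnist", split="train")
print(type(ds))   # a PrefetchDataset
ds = ds.take(1)   # keep just one example
print(type(ds))   # a TakeDataset
for example in ds:
    print(list(example.keys()))
    print(example["image"].shape, example["label"])
```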
Downloading and preparing dataset mnist/3.0.1 (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...
Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
<class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>
<class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>
['image', 'label']
(28, 28, 1) tf.Tensor(4, shape=(), dtype=int64)
CIFAR
The CIFAR dataset is commonly used for prototyping. The CIFAR-10 version consists of 60000 32x32 colour images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. (In the original distribution, the data is divided into five training batches and one test batch, each with 10000 images.) There is also a 100-class version, CIFAR-100.
An easy way to get this data is to use TFDS, as we show below.
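A minimal sketch, loading the train and test splits as supervised (image, label) pairs:

```python
import tensorflow_datasets as tfds

(ds_train, ds_test), info = tfds.load(
    "cifar10",
    split=["train", "test"],
    as_supervised=True,  # yield (image, label) tuples instead of dicts
    with_info=True,
)
print(info.splits["train"].num_examples)  # 50000
print(info.splits["test"].num_examples)   # 10000
```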
Downloading and preparing dataset cifar10/3.0.2 (download: 162.17 MiB, generated: 132.40 MiB, total: 294.58 MiB) to /root/tensorflow_datasets/cifar10/3.0.2...
Shuffling and writing examples to /root/tensorflow_datasets/cifar10/3.0.2.incompleteM2APQM/cifar10-train.tfrecord
Shuffling and writing examples to /root/tensorflow_datasets/cifar10/3.0.2.incompleteM2APQM/cifar10-test.tfrecord
Dataset cifar10 downloaded and prepared to /root/tensorflow_datasets/cifar10/3.0.2. Subsequent calls will reuse this data.
Imagenet
A lot of vision experiments use the Imagenet dataset, with 1000 classes and ~1M images. However, this takes a long time to download and process. The FastAI team made a smaller version called Imagenette, which only has 10 classes, and comes in variants where the largest image dimension is resized to 160 or 320 pixels (as well as full size). This is good for prototyping, and the images tend to be easier to interpret than those in CIFAR. A version of the raw data, in a more convenient format (all images 224x224, no dependence on the FastAI library), can be found here. It is also bundled into TFDS, as we show below.
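A sketch using the TFDS builder API with the full-size config; note that as_dataset is called on the builder object itself:

```python
import tensorflow_datasets as tfds

imagenette_builder = tfds.builder("imagenette/full-size")
print(imagenette_builder.info)  # prints the DatasetInfo shown below

imagenette_builder.download_and_prepare()
datasets = imagenette_builder.as_dataset(as_supervised=True)
train_ds, valid_ds = datasets["train"], datasets["validation"]
```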
tfds.core.DatasetInfo(
    name='imagenette',
    version=0.1.0,
    description='Imagenette is a subset of 10 easily classified classes from the Imagenet
dataset. It was originally prepared by Jeremy Howard of FastAI. The objective
behind putting together a small version of the Imagenet dataset was mainly
because running new ideas/algorithms/experiments on the whole Imagenet take a
lot of time.

This version of the dataset allows researchers/practitioners to quickly try out
ideas and share with others. The dataset comes in three variants:

  * Full size
  * 320 px
  * 160 px

Note: The v2 config correspond to the new 70/30 train/valid split (released
in Dec 6 2019).',
    features=FeaturesDict({
        'image': Image(shape=(None, None, 3), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=13394,
    splits={
        'train': 12894,
        'validation': 500,
    },
    supervised_keys=('image', 'label'),
    citation="""@misc{imagenette,
      author = "Jeremy Howard",
      title = "imagenette",
      url = "https://github.com/fastai/imagenette/"
    }""",
    redistribution_info=,
)
Downloading and preparing dataset imagenette/full-size/0.1.0 (download: 1.45 GiB, generated: Unknown size, total: 1.45 GiB) to /root/tensorflow_datasets/imagenette/full-size/0.1.0...
Shuffling and writing examples to /root/tensorflow_datasets/imagenette/full-size/0.1.0.incompleteAGT375/imagenette-train.tfrecord
Dataset imagenette downloaded and prepared to /root/tensorflow_datasets/imagenette/full-size/0.1.0. Subsequent calls will reuse this data.
Language datasets
Various datasets are used in the natural language processing (NLP) community.
TODO: fill in.
Graveyard
Here we store some scratch code that you can ignore.