Distributed hyperparameter tuning
Authors: Tom O'Malley, Haifeng Jin
Date created: 2019/10/24
Last modified: 2021/06/02
Description: Tuning the hyperparameters of your models with multiple GPUs and multiple machines.
Introduction
KerasTuner makes it easy to perform distributed hyperparameter search. No changes to your code are needed to scale up from running single-threaded locally to running on dozens or hundreds of workers in parallel. Distributed KerasTuner uses a chief-worker model. The chief runs a service to which the workers report results and query for the hyperparameters to try next. The chief should be run on a single-threaded CPU instance (or alternatively as a separate process on one of the workers).
Configuring distributed mode
Configuring distributed mode for KerasTuner only requires setting three environment variables:
KERASTUNER_TUNER_ID: This should be set to "chief" for the chief process. Other workers should be passed a unique ID (by convention, "tuner0", "tuner1", etc).
KERASTUNER_ORACLE_IP: The IP address or hostname that the chief service should run on. All workers should be able to resolve and access this address.
KERASTUNER_ORACLE_PORT: The port that the chief service should run on. This can be freely chosen, but must be a port that is accessible to the other workers. Instances communicate via the gRPC protocol.
The same code can be run on all workers. Additional considerations for distributed mode are:
All workers should have access to a centralized file system to which they can write their results.
All workers should be able to access the necessary training and validation data needed for tuning.
To support fault-tolerance, overwrite should be kept as False in Tuner.__init__ (False is the default).
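For example, each process (chief and workers alike) might construct its tuner against a results directory on the shared file system. The following is a minimal sketch in which the hypermodel, the shared path, and the project name are placeholders:

```python
import keras_tuner
from tensorflow import keras


def build_model(hp):
    # Minimal placeholder hypermodel; replace with your own.
    model = keras.Sequential(
        [
            keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
            keras.layers.Dense(1),
        ]
    )
    model.compile(optimizer="adam", loss="mse")
    return model


tuner = keras_tuner.RandomSearch(
    build_model,
    objective="val_loss",
    max_trials=100,
    # A directory on the shared file system that the chief and all workers can write to
    # (the path here is a placeholder).
    directory="/shared_volume/results_dir",
    project_name="my_search",
    # Leave overwrite at its default of False so interrupted trials can be resumed.
    overwrite=False,
)
```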
Example bash script for chief service (sample code for run_tuning.py at the bottom of the page):
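(A minimal sketch; the IP address and port below are placeholders to adapt to your setup.)

```shell
export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="10.0.0.1"   # placeholder: an address all workers can reach
export KERASTUNER_ORACLE_PORT="8000"     # placeholder: any port accessible to the workers
python run_tuning.py
```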
Example bash script for worker:
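(The same sketch for a worker; only the tuner ID changes.)

```shell
export KERASTUNER_TUNER_ID="tuner0"
export KERASTUNER_ORACLE_IP="10.0.0.1"   # placeholder: same address as the chief
export KERASTUNER_ORACLE_PORT="8000"     # placeholder: same port as the chief
python run_tuning.py
```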
Data parallelism with tf.distribute
KerasTuner also supports data parallelism via tf.distribute. Data parallelism and distributed tuning can be combined. For example, if you have 10 workers with 4 GPUs on each worker, you can run 10 parallel trials with each trial training on 4 GPUs by using tf.distribute.MirroredStrategy. You can also run each trial on TPUs via tf.distribute.TPUStrategy. Currently tf.distribute.MultiWorkerMirroredStrategy is not supported, but support for this is on the roadmap.
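For instance, passing a distribution strategy to the tuner makes every trial train with data parallelism across the local GPUs. A minimal sketch, where build_model stands in for your hypermodel (as in the earlier sketch) and the directory and project name are placeholders:

```python
import tensorflow as tf
import keras_tuner

tuner = keras_tuner.Hyperband(
    build_model,  # your hypermodel function, as in the sketch above
    objective="val_loss",
    max_epochs=10,
    # Each trial trains on all GPUs visible to this worker.
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="/shared_volume/results_dir",
    project_name="my_search",
)
```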
Example code
When the environment variables described above are set, the example below will run distributed tuning and use data parallelism within each trial via tf.distribute. The example loads MNIST from tensorflow_datasets and uses Hyperband for the hyperparameter search.
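A condensed sketch of what run_tuning.py can look like follows. The model architecture and search settings here are illustrative stand-ins rather than the guide's exact original values, but the overall structure matches the description above: a hypermodel, MNIST loaded from tensorflow_datasets, and a Hyperband tuner with a MirroredStrategy for data parallelism within each trial.

```python
import keras_tuner
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow import keras


def build_model(hp):
    """Builds a small convolutional model whose depth and width are tuned."""
    inputs = keras.Input(shape=(28, 28, 1))
    x = inputs
    for i in range(hp.Int("conv_layers", 1, 3, default=2)):
        x = keras.layers.Conv2D(
            filters=hp.Int(f"filters_{i}", 8, 64, step=8),
            kernel_size=3,
            activation="relu",
            padding="same",
        )(x)
        x = keras.layers.MaxPooling2D()(x)
    x = keras.layers.GlobalAveragePooling2D()(x)
    outputs = keras.layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=hp.Choice("optimizer", ["adam", "sgd"]),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def convert_dataset(item):
    """Converts a tensorflow_datasets MNIST element to a (features, label) pair."""
    image = tf.cast(item["image"], "float32") / 255.0
    return image, item["label"]


def main():
    # The KERASTUNER_* environment variables decide whether this process acts
    # as the chief or as a worker; the code itself is identical on every machine.
    tuner = keras_tuner.Hyperband(
        hypermodel=build_model,
        objective="val_accuracy",
        max_epochs=8,
        factor=3,
        # Data parallelism within each trial: train on all local GPUs.
        distribution_strategy=tf.distribute.MirroredStrategy(),
        # A directory on the shared file system visible to the chief and all workers.
        directory="results_dir",
        project_name="mnist",
    )

    mnist_data = tfds.load("mnist")
    mnist_train = (
        mnist_data["train"].map(convert_dataset).shuffle(1000).batch(100).repeat()
    )
    mnist_test = mnist_data["test"].map(convert_dataset).batch(100)

    tuner.search(
        mnist_train,
        steps_per_epoch=600,
        validation_data=mnist_test,
        validation_steps=100,
        callbacks=[keras.callbacks.EarlyStopping("val_accuracy")],
    )


if __name__ == "__main__":
    main()
```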