Path: blob/master/guides/keras_tuner/distributed_tuning.py
3293 views
"""1Title: Distributed hyperparameter tuning2Authors: Tom O'Malley, Haifeng Jin3Date created: 2019/10/244Last modified: 2021/06/025Description: Tuning the hyperparameters of the models with multiple GPUs and multiple machines.6Accelerator: None7"""89"""shell10pip install keras-tuner -q11"""1213"""14## Introduction1516KerasTuner makes it easy to perform distributed hyperparameter search. No17changes to your code are needed to scale up from running single-threaded18locally to running on dozens or hundreds of workers in parallel. Distributed19KerasTuner uses a chief-worker model. The chief runs a service to which the20workers report results and query for the hyperparameters to try next. The chief21should be run on a single-threaded CPU instance (or alternatively as a separate22process on one of the workers).2324### Configuring distributed mode2526Configuring distributed mode for KerasTuner only requires setting three27environment variables:2829**KERASTUNER_TUNER_ID**: This should be set to "chief" for the chief process.30Other workers should be passed a unique ID (by convention, "tuner0", "tuner1",31etc).3233**KERASTUNER_ORACLE_IP**: The IP address or hostname that the chief service34should run on. All workers should be able to resolve and access this address.3536**KERASTUNER_ORACLE_PORT**: The port that the chief service should run on. This37can be freely chosen, but must be a port that is accessible to the other38workers. Instances communicate via the [gRPC](https://www.grpc.io) protocol.3940The same code can be run on all workers. Additional considerations for41distributed mode are:4243- All workers should have access to a centralized file system to which they can44write their results.45- All workers should be able to access the necessary training and validation46data needed for tuning.47- To support fault-tolerance, `overwrite` should be kept as `False` in48`Tuner.__init__` (`False` is the default).4950Example bash script for chief service (sample code for `run_tuning.py` at51bottom of page):5253```54export KERASTUNER_TUNER_ID="chief"55export KERASTUNER_ORACLE_IP="127.0.0.1"56export KERASTUNER_ORACLE_PORT="8000"57python run_tuning.py58```5960Example bash script for worker:6162```63export KERASTUNER_TUNER_ID="tuner0"64export KERASTUNER_ORACLE_IP="127.0.0.1"65export KERASTUNER_ORACLE_PORT="8000"66python run_tuning.py67```68"""6970"""71### Data parallelism with `tf.distribute`7273KerasTuner also supports data parallelism via74[tf.distribute](https://www.tensorflow.org/tutorials/distribute/keras). Data75parallelism and distributed tuning can be combined. For example, if you have 1076workers with 4 GPUs on each worker, you can run 10 parallel trials with each77trial training on 4 GPUs by using78[tf.distribute.MirroredStrategy](79https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy).80You can also run each trial on TPUs via81[tf.distribute.TPUStrategy](82https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/TPUStrategy).83Currently84[tf.distribute.MultiWorkerMirroredStrategy](85https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy)86is not supported, but support for this is on the roadmap.878889### Example code9091When the environment variables described above are set, the example below will92run distributed tuning and use data parallelism within each trial via93`tf.distribute`. The example loads MNIST from `tensorflow_datasets` and uses94[Hyperband](https://arxiv.org/abs/1603.06560) for the hyperparameter95search.96"""979899import keras100import keras_tuner101import tensorflow as tf102import numpy as np103104105def build_model(hp):106"""Builds a convolutional model."""107inputs = keras.Input(shape=(28, 28, 1))108x = inputs109for i in range(hp.Int("conv_layers", 1, 3, default=3)):110x = keras.layers.Conv2D(111filters=hp.Int("filters_" + str(i), 4, 32, step=4, default=8),112kernel_size=hp.Int("kernel_size_" + str(i), 3, 5),113activation="relu",114padding="same",115)(x)116117if hp.Choice("pooling" + str(i), ["max", "avg"]) == "max":118x = keras.layers.MaxPooling2D()(x)119else:120x = keras.layers.AveragePooling2D()(x)121122x = keras.layers.BatchNormalization()(x)123x = keras.layers.ReLU()(x)124125if hp.Choice("global_pooling", ["max", "avg"]) == "max":126x = keras.layers.GlobalMaxPooling2D()(x)127else:128x = keras.layers.GlobalAveragePooling2D()(x)129outputs = keras.layers.Dense(10, activation="softmax")(x)130131model = keras.Model(inputs, outputs)132133optimizer = hp.Choice("optimizer", ["adam", "sgd"])134model.compile(135optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"]136)137return model138139140tuner = keras_tuner.Hyperband(141hypermodel=build_model,142objective="val_accuracy",143max_epochs=2,144factor=3,145hyperband_iterations=1,146distribution_strategy=tf.distribute.MirroredStrategy(),147directory="results_dir",148project_name="mnist",149overwrite=True,150)151152(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()153154# Reshape the images to have the channel dimension.155x_train = (x_train.reshape(x_train.shape + (1,)) / 255.0)[:1000]156y_train = y_train.astype(np.int64)[:1000]157x_test = (x_test.reshape(x_test.shape + (1,)) / 255.0)[:100]158y_test = y_test.astype(np.int64)[:100]159160tuner.search(161x_train,162y_train,163steps_per_epoch=600,164validation_data=(x_test, y_test),165validation_steps=100,166callbacks=[keras.callbacks.EarlyStopping("val_accuracy")],167)168169170