"""
Title: Distributed hyperparameter tuning
Authors: Tom O'Malley, Haifeng Jin
Date created: 2019/10/24
Last modified: 2021/06/02
Description: Tuning the hyperparameters of the models with multiple GPUs and multiple machines.
Accelerator: None
"""

"""shell
pip install keras-tuner -q
"""

"""
## Introduction

KerasTuner makes it easy to perform distributed hyperparameter search. No
changes to your code are needed to scale up from running single-threaded
locally to running on dozens or hundreds of workers in parallel. Distributed
KerasTuner uses a chief-worker model. The chief runs a service to which the
workers report results and query for the hyperparameters to try next. The chief
should be run on a single-threaded CPU instance (or alternatively as a separate
process on one of the workers).

### Configuring distributed mode

Configuring distributed mode for KerasTuner only requires setting three
environment variables:

**KERASTUNER_TUNER_ID**: This should be set to "chief" for the chief process.
Other workers should be passed a unique ID (by convention, "tuner0", "tuner1",
etc.).

**KERASTUNER_ORACLE_IP**: The IP address or hostname that the chief service
should run on. All workers should be able to resolve and access this address.

**KERASTUNER_ORACLE_PORT**: The port that the chief service should run on. This
can be freely chosen, but must be a port that is accessible to the other
workers. Instances communicate via the [gRPC](https://www.grpc.io) protocol.

The same code can be run on all workers. Additional considerations for
distributed mode are:

- All workers should have access to a centralized file system to which they can
write their results.
- All workers should be able to access the necessary training and validation
data needed for tuning.
- To support fault-tolerance, `overwrite` should be kept as `False` in
`Tuner.__init__` (`False` is the default); see the sketch below.
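
For illustration, here is a minimal sketch of a tuner configured for fault tolerance. It
assumes a hypothetical shared path `/shared/results_dir` mounted on every machine and the
`build_model` function defined in the example code at the bottom of the page. Because
`overwrite` is left at its default of `False`, a restarted chief or worker resumes from the
trial records already on disk instead of starting over:

```
import keras_tuner

tuner = keras_tuner.RandomSearch(
    hypermodel=build_model,  # model-building function, defined at the bottom of the page
    objective="val_accuracy",
    max_trials=20,
    directory="/shared/results_dir",  # hypothetical centralized file system path
    project_name="mnist",
    overwrite=False,  # the default; keep it False so interrupted runs can resume
)
```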

Example bash script for chief service (sample code for `run_tuning.py` at the
bottom of the page):

```
export KERASTUNER_TUNER_ID="chief"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8000"
python run_tuning.py
```

Example bash script for worker:

```
export KERASTUNER_TUNER_ID="tuner0"
export KERASTUNER_ORACLE_IP="127.0.0.1"
export KERASTUNER_ORACLE_PORT="8000"
python run_tuning.py
```
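
The same three variables can also be set from Python, before the tuner is constructed, if you
prefer to keep everything in `run_tuning.py`. A minimal sketch, using only the standard
library and a hypothetical helper name `configure_tuner_role`:

```
import os

# Equivalent to the `export` lines in the bash scripts above.
def configure_tuner_role(tuner_id, oracle_ip="127.0.0.1", oracle_port="8000"):
    os.environ["KERASTUNER_TUNER_ID"] = tuner_id  # "chief" or "tuner0", "tuner1", ...
    os.environ["KERASTUNER_ORACLE_IP"] = oracle_ip  # address every worker can reach
    os.environ["KERASTUNER_ORACLE_PORT"] = oracle_port  # any port open to the workers

# On the chief machine:
# configure_tuner_role("chief")
# On the n-th worker machine:
# configure_tuner_role(f"tuner{n}")
```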
"""

"""
### Data parallelism with `tf.distribute`

KerasTuner also supports data parallelism via
[tf.distribute](https://www.tensorflow.org/tutorials/distribute/keras). Data
parallelism and distributed tuning can be combined. For example, if you have 10
workers with 4 GPUs on each worker, you can run 10 parallel trials with each
trial training on 4 GPUs by using
[tf.distribute.MirroredStrategy](
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy).
You can also run each trial on TPUs via
[tf.distribute.TPUStrategy](
https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/TPUStrategy).
Currently
[tf.distribute.MultiWorkerMirroredStrategy](
https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy)
is not supported, but support for this is on the roadmap.
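
The strategy for a trial is passed to the tuner through its `distribution_strategy` argument,
as the example code below does with `MirroredStrategy`. The TPU variant is only a sketch: it
assumes a reachable Cloud TPU at a hypothetical address `tpu_address` and reuses the
`build_model` function defined in the example code below.

```
import tensorflow as tf
import keras_tuner

# Standard TPU initialization (requires an actual TPU to run).
# `tpu_address` is a hypothetical TPU name or address.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

tuner = keras_tuner.Hyperband(
    hypermodel=build_model,          # defined in the example code below
    objective="val_accuracy",
    max_epochs=2,
    distribution_strategy=strategy,  # each trial trains under this strategy
    directory="results_dir",
    project_name="mnist_tpu",
)
```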


### Example code

When the environment variables described above are set, the example below will
run distributed tuning and use data parallelism within each trial via
`tf.distribute`. The example loads MNIST from `keras.datasets` and uses
[Hyperband](https://arxiv.org/abs/1603.06560) for the hyperparameter
search.
"""


import keras
import keras_tuner
import tensorflow as tf
import numpy as np


def build_model(hp):
    """Builds a convolutional model."""
    inputs = keras.Input(shape=(28, 28, 1))
    x = inputs
    for i in range(hp.Int("conv_layers", 1, 3, default=3)):
        x = keras.layers.Conv2D(
            filters=hp.Int("filters_" + str(i), 4, 32, step=4, default=8),
            kernel_size=hp.Int("kernel_size_" + str(i), 3, 5),
            activation="relu",
            padding="same",
        )(x)

        if hp.Choice("pooling" + str(i), ["max", "avg"]) == "max":
            x = keras.layers.MaxPooling2D()(x)
        else:
            x = keras.layers.AveragePooling2D()(x)

        x = keras.layers.BatchNormalization()(x)
        x = keras.layers.ReLU()(x)

    if hp.Choice("global_pooling", ["max", "avg"]) == "max":
        x = keras.layers.GlobalMaxPooling2D()(x)
    else:
        x = keras.layers.GlobalAveragePooling2D()(x)
    outputs = keras.layers.Dense(10, activation="softmax")(x)

    model = keras.Model(inputs, outputs)

    optimizer = hp.Choice("optimizer", ["adam", "sgd"])
    model.compile(
        optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"]
    )
    return model


tuner = keras_tuner.Hyperband(
    hypermodel=build_model,
    objective="val_accuracy",
    max_epochs=2,
    factor=3,
    hyperband_iterations=1,
    distribution_strategy=tf.distribute.MirroredStrategy(),
    directory="results_dir",
    project_name="mnist",
    # NOTE: for a fault-tolerant distributed run, keep `overwrite` at its default of False.
    overwrite=True,
)

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Reshape the images to have the channel dimension and keep a small subset
# so that the example runs quickly.
x_train = (x_train.reshape(x_train.shape + (1,)) / 255.0)[:1000]
y_train = y_train.astype(np.int64)[:1000]
x_test = (x_test.reshape(x_test.shape + (1,)) / 255.0)[:100]
y_test = y_test.astype(np.int64)[:100]

tuner.search(
    x_train,
    y_train,
    steps_per_epoch=600,
    validation_data=(x_test, y_test),
    validation_steps=100,
    callbacks=[keras.callbacks.EarlyStopping("val_accuracy")],
)
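
"""
After the search completes, you typically want to inspect the results and retrieve the best
configuration. A minimal sketch, reusing the `tuner` object constructed above:

```
# Summarize the trials that were run.
tuner.results_summary()

# Retrieve the best hyperparameters and the best model found during the search.
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
best_model = tuner.get_best_models(num_models=1)[0]
```
"""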