CoCalc -- distributed

GitHub Repository: keras-team/keras-io
Path: blob/master/guides/keras_tuner/distributed_tuning.py
³²⁹³ views
1
"""
2
Title: Distributed hyperparameter tuning
3
Authors: Tom O'Malley, Haifeng Jin
4
Date created: 2019/10/24
5
Last modified: 2021/06/02
6
Description: Tuning the hyperparameters of the models with multiple GPUs and multiple machines.
7
Accelerator: None
8
"""
9

10
"""shell
11
pip install keras-tuner -q
12
"""
13

14
"""
15
## Introduction
16

17
KerasTuner makes it easy to perform distributed hyperparameter search. No
18
changes to your code are needed to scale up from running single-threaded
19
locally to running on dozens or hundreds of workers in parallel. Distributed
20
KerasTuner uses a chief-worker model. The chief runs a service to which the
21
workers report results and query for the hyperparameters to try next. The chief
22
should be run on a single-threaded CPU instance (or alternatively as a separate
23
process on one of the workers).
24

25
### Configuring distributed mode
26

27
Configuring distributed mode for KerasTuner only requires setting three
28
environment variables:
29

30
**KERASTUNER_TUNER_ID**: This should be set to "chief" for the chief process.
31
Other workers should be passed a unique ID (by convention, "tuner0", "tuner1",
32
etc).
33

34
**KERASTUNER_ORACLE_IP**: The IP address or hostname that the chief service
35
should run on. All workers should be able to resolve and access this address.
36

37
**KERASTUNER_ORACLE_PORT**: The port that the chief service should run on. This
38
can be freely chosen, but must be a port that is accessible to the other
39
workers. Instances communicate via the [gRPC](https://www.grpc.io) protocol.
40

41
The same code can be run on all workers. Additional considerations for
42
distributed mode are:
43

44
- All workers should have access to a centralized file system to which they can
45
write their results.
46
- All workers should be able to access the necessary training and validation
47
data needed for tuning.
48
- To support fault-tolerance, `overwrite` should be kept as `False` in
49
`Tuner.__init__` (`False` is the default).
50

51
Example bash script for chief service (sample code for `run_tuning.py` at
52
bottom of page):
53

54
```
55
export KERASTUNER_TUNER_ID="chief"
56
export KERASTUNER_ORACLE_IP="127.0.0.1"
57
export KERASTUNER_ORACLE_PORT="8000"
58
python run_tuning.py
59
```
60

61
Example bash script for worker:
62

63
```
64
export KERASTUNER_TUNER_ID="tuner0"
65
export KERASTUNER_ORACLE_IP="127.0.0.1"
66
export KERASTUNER_ORACLE_PORT="8000"
67
python run_tuning.py
68
```
69
"""
70

71
"""
72
### Data parallelism with `tf.distribute`
73

74
KerasTuner also supports data parallelism via
75
[tf.distribute](https://www.tensorflow.org/tutorials/distribute/keras). Data
76
parallelism and distributed tuning can be combined. For example, if you have 10
77
workers with 4 GPUs on each worker, you can run 10 parallel trials with each
78
trial training on 4 GPUs by using
79
[tf.distribute.MirroredStrategy](
80
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy).
81
You can also run each trial on TPUs via
82
[tf.distribute.TPUStrategy](
83
https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/TPUStrategy).
84
Currently
85
[tf.distribute.MultiWorkerMirroredStrategy](
86
https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy)
87
is not supported, but support for this is on the roadmap.
88

89

90
### Example code
91

92
When the environment variables described above are set, the example below will
93
run distributed tuning and use data parallelism within each trial via
94
`tf.distribute`. The example loads MNIST from `tensorflow_datasets` and uses
95
[Hyperband](https://arxiv.org/abs/1603.06560) for the hyperparameter
96
search.
97
"""
98

99

100
import keras
101
import keras_tuner
102
import tensorflow as tf
103
import numpy as np
104

105

106
def build_model(hp):
107
    """Builds a convolutional model."""
108
    inputs = keras.Input(shape=(28, 28, 1))
109
    x = inputs
110
    for i in range(hp.Int("conv_layers", 1, 3, default=3)):
111
        x = keras.layers.Conv2D(
112
            filters=hp.Int("filters_" + str(i), 4, 32, step=4, default=8),
113
            kernel_size=hp.Int("kernel_size_" + str(i), 3, 5),
114
            activation="relu",
115
            padding="same",
116
        )(x)
117

118
        if hp.Choice("pooling" + str(i), ["max", "avg"]) == "max":
119
            x = keras.layers.MaxPooling2D()(x)
120
        else:
121
            x = keras.layers.AveragePooling2D()(x)
122

123
        x = keras.layers.BatchNormalization()(x)
124
        x = keras.layers.ReLU()(x)
125

126
    if hp.Choice("global_pooling", ["max", "avg"]) == "max":
127
        x = keras.layers.GlobalMaxPooling2D()(x)
128
    else:
129
        x = keras.layers.GlobalAveragePooling2D()(x)
130
    outputs = keras.layers.Dense(10, activation="softmax")(x)
131

132
    model = keras.Model(inputs, outputs)
133

134
    optimizer = hp.Choice("optimizer", ["adam", "sgd"])
135
    model.compile(
136
        optimizer, loss="sparse_categorical_crossentropy", metrics=["accuracy"]
137
    )
138
    return model
139

140

141
tuner = keras_tuner.Hyperband(
142
    hypermodel=build_model,
143
    objective="val_accuracy",
144
    max_epochs=2,
145
    factor=3,
146
    hyperband_iterations=1,
147
    distribution_strategy=tf.distribute.MirroredStrategy(),
148
    directory="results_dir",
149
    project_name="mnist",
150
    overwrite=True,
151
)
152

153
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
154

155
# Reshape the images to have the channel dimension.
156
x_train = (x_train.reshape(x_train.shape + (1,)) / 255.0)[:1000]
157
y_train = y_train.astype(np.int64)[:1000]
158
x_test = (x_test.reshape(x_test.shape + (1,)) / 255.0)[:100]
159
y_test = y_test.astype(np.int64)[:100]
160

161
tuner.search(
162
    x_train,
163
    y_train,
164
    steps_per_epoch=600,
165
    validation_data=(x_test, y_test),
166
    validation_steps=100,
167
    callbacks=[keras.callbacks.EarlyStopping("val_accuracy")],
168
)
169

170
Product

Resources

Company