GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/guide/migrate/fault_tolerance.ipynb
Kernel: Python 3

Copyright 2021 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Migrate the fault tolerance mechanism

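Fault tolerance refers to a mechanism that periodically saves the state of trackable objects, such as parameters and models. If a program or machine failure occurs during training, you can use the saved state to recover.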

This guide first shows how to add fault tolerance to training with tf.estimator.Estimator in TensorFlow 1 by specifying metric-saving settings with tf.estimator.RunConfig. You will then learn two ways to implement fault tolerance for training in TensorFlow 2:

If you use the Keras Model.fit API, you can pass the tf.keras.callbacks.BackupAndRestore callback to it.
If you use a custom training loop (with tf.GradientTape), you can save checkpoints at arbitrary points using the tf.train.Checkpoint and tf.train.CheckpointManager APIs.

Both methods back up and restore the training state in checkpoint files.
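The back-up-and-restore idea can be illustrated without TensorFlow at all. The following is only a minimal sketch under stated assumptions: the `train` function, the JSON state file, and the `save_every` and `fail_at` parameters are hypothetical names invented for this illustration and are not part of any TensorFlow API.

```python
import json
import os
import tempfile


def train(total_steps, ckpt_path, save_every=3, fail_at=None):
  """Runs a toy loop that saves state periodically and resumes if a file exists."""
  state = {"step": 0, "value": 0.0}
  if os.path.exists(ckpt_path):
    with open(ckpt_path) as f:
      state = json.load(f)  # Resume from the last saved state.
  while state["step"] < total_steps:
    if fail_at is not None and state["step"] == fail_at:
      raise RuntimeError("Interruption")  # Simulated failure.
    state["step"] += 1
    state["value"] += 0.5  # Stand-in for a parameter update.
    if state["step"] % save_every == 0:
      with open(ckpt_path, "w") as f:
        json.dump(state, f)
  return state


ckpt_path = os.path.join(tempfile.mkdtemp(), "state.json")

# First run: fails before finishing, after the state was last saved at step 6.
try:
  train(total_steps=10, ckpt_path=ckpt_path, fail_at=7)
except RuntimeError as e:
  print(f"{type(e).__name__}:{e}")

with open(ckpt_path) as f:
  saved = json.load(f)

# Second run: picks up from the saved state instead of starting from step 0.
final = train(total_steps=10, ckpt_path=ckpt_path)
```

Note that the work done between the last save and the failure (step 7 here) is repeated after the restart. The same trade-off between checkpoint frequency and lost work applies to the TensorFlow mechanisms in this guide.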

Setup

Install tf-nightly, because the ability to control how often checkpoints are saved at specific steps, through the save_freq argument of tf.keras.callbacks.BackupAndRestore, was introduced in TensorFlow 2.10.

In [ ]:

!pip install tf-nightly

In [ ]:

import tensorflow.compat.v1 as tf1
import tensorflow as tf
import numpy as np
import tempfile
import time

In [ ]:

mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

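TensorFlow 1: Save checkpoints with `tf.estimator.RunConfig`

In TensorFlow 1, you can configure a tf.estimator to save a checkpoint at every step by configuring tf.estimator.RunConfig.

In this example, start by writing a hook that artificially raises an error partway through training, once its internal step counter reaches 5.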

In [ ]:

class InterruptHook(tf1.train.SessionRunHook):
  # A hook for artificially interrupting training.
  def begin(self):
    self._step = -1

  def before_run(self, run_context):
    self._step += 1

  def after_run(self, run_context, run_values):
    if self._step == 5:
      raise RuntimeError('Interruption')
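To see which call actually triggers the interruption, the counter logic of the hook can be traced in plain Python. This is only a sketch: `CountingInterrupt` and `runs_until_interrupt` are hypothetical stand-ins written for this illustration. They mirror the counter arithmetic of `InterruptHook` and do not involve a TensorFlow session.

```python
class CountingInterrupt:
  """Plain-Python stand-in that mirrors the counter logic of InterruptHook."""

  def begin(self):
    self._step = -1

  def before_run(self):
    self._step += 1

  def after_run(self):
    if self._step == 5:
      raise RuntimeError('Interruption')


def runs_until_interrupt(max_runs=10):
  """Returns the 1-based number of the run call that raises, or None."""
  hook = CountingInterrupt()
  hook.begin()
  for run_number in range(1, max_runs + 1):
    hook.before_run()
    try:
      hook.after_run()
    except RuntimeError:
      return run_number
  return None


interrupted_on = runs_until_interrupt()
print(interrupted_on)
```

Because the counter starts at -1 and is incremented before each run, it is zero-based: in this stand-in the value 5 is reached on the sixth run call.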

Next, configure tf.estimator.Estimator to save every checkpoint and to use the MNIST dataset.

In [ ]:

feature_columns = [tf1.feature_column.numeric_column("x", shape=[28, 28])]
config = tf1.estimator.RunConfig(save_summary_steps=1,
                                 save_checkpoints_steps=1)

path = tempfile.mkdtemp()

classifier = tf1.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tf1.train.AdamOptimizer(0.001),
    n_classes=10,
    dropout=0.2,
    model_dir=path,
    config = config
)

train_input_fn = tf1.estimator.inputs.numpy_input_fn(
    x={"x": x_train},
    y=y_train.astype(np.int32),
    num_epochs=10,
    batch_size=50,
    shuffle=True,
)

Begin training the model. The hook you defined earlier raises an artificial exception.

In [ ]:

try:
  classifier.train(input_fn=train_input_fn,
                   hooks=[InterruptHook()],
                   max_steps=10)
except Exception as e:
  print(f'{type(e).__name__}:{e}')

Rebuild the tf.estimator.Estimator using the last saved checkpoint and continue training.

In [ ]:

classifier = tf1.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[256, 32],
    optimizer=tf1.train.AdamOptimizer(0.001),
    n_classes=10,
    dropout=0.2,
    model_dir=path,
    config = config
)
classifier.train(input_fn=train_input_fn,
                 max_steps=10)

TensorFlow 2: Back up and restore with a callback and `Model.fit`

In TensorFlow 2, if you use the Keras Model.fit API for training, you can add fault tolerance by providing the tf.keras.callbacks.BackupAndRestore callback.

To demonstrate this, first define a Keras Callback class that artificially raises an error at the end of the fourth epoch.

In [ ]:

class InterruptAtEpoch(tf.keras.callbacks.Callback):
  # A callback for artificially interrupting training.
  def __init__(self, interrupting_epoch=3):
    super().__init__()
    self.interrupting_epoch = interrupting_epoch

  def on_epoch_end(self, epoch, logs=None):
    if epoch == self.interrupting_epoch:
      raise RuntimeError('Interruption')

Then, define and instantiate a simple Keras model, define the loss function, call Model.compile, and set up a tf.keras.callbacks.BackupAndRestore callback that saves checkpoints to a temporary directory at epoch boundaries.

In [ ]:

def create_model():
  return tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
  ])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model = create_model()
model.compile(optimizer='adam',
              loss=loss,
              metrics=['accuracy'])
log_dir = tempfile.mkdtemp()
backup_restore_callback = tf.keras.callbacks.BackupAndRestore(
    backup_dir = log_dir)

Start training the model with Model.fit. During training, checkpoints are saved thanks to the tf.keras.callbacks.BackupAndRestore instance created above, while the InterruptAtEpoch class raises an artificial exception to simulate a failure after the fourth epoch.

In [ ]:

try:
  model.fit(x=x_train,
            y=y_train,
            epochs=10,
            steps_per_epoch=100,
            validation_data=(x_test, y_test),
            callbacks=[backup_restore_callback, InterruptAtEpoch()])
except Exception as e:
  print(f'{type(e).__name__}:{e}')

Next, instantiate the Keras model, call Model.compile, and continue training with Model.fit from the previously saved checkpoint.

In [ ]:

model = create_model()
model.compile(optimizer='adam',
              loss=loss,
              metrics=['accuracy'],
              steps_per_execution=10)
model.fit(x=x_train,
          y=y_train,
          epochs=10,
          steps_per_epoch=100,
          validation_data=(x_test, y_test),
          callbacks=[backup_restore_callback])

Define another Callback class that artificially raises an error at the 140th step.

In [ ]:

class InterruptAtStep(tf.keras.callbacks.Callback):
  # A callback for artificially interrupting training.
  def __init__(self, interrupting_step=140):
    super().__init__()
    self.total_step_count = 0
    self.interrupting_step = interrupting_step

  def on_batch_begin(self, batch, logs=None):
    self.total_step_count += 1

  def on_batch_end(self, batch, logs=None):
    if self.total_step_count == self.interrupting_step:
      print("\nInterrupting at step count", self.total_step_count)
      raise RuntimeError('Interruption')

Note: This section uses features that are only available in tf-nightly until TensorFlow 2.10 is released.

To make sure a checkpoint is saved every 30 steps, set save_freq in the BackupAndRestore callback to 30. InterruptAtStep raises an artificial exception to simulate a failure at epoch 1, step 40 (total step count 140). The checkpoint will have been last saved at epoch 1, step 20.

In [ ]:

log_dir_2 = tempfile.mkdtemp()

backup_restore_callback = tf.keras.callbacks.BackupAndRestore(
    backup_dir = log_dir_2, save_freq=30
)
model = create_model()
model.compile(optimizer='adam',
              loss=loss,
              metrics=['accuracy'])
try:
  model.fit(x=x_train,
            y=y_train,
            epochs=10,
            steps_per_epoch=100,
            validation_data=(x_test, y_test),
            callbacks=[backup_restore_callback, InterruptAtStep()])
except Exception as e:
  print(f'{type(e).__name__}:{e}')
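The step arithmetic for this run can be checked with a few lines of plain Python. This is a sketch under an assumption that matches the description in this section: a checkpoint is written whenever the cumulative step count is a multiple of save_freq. The helper name `last_checkpoint_before` is hypothetical and is not part of Keras.

```python
def last_checkpoint_before(interrupt_step, save_freq, steps_per_epoch):
  """Returns (total_step, epoch_index, step_in_epoch) of the last save.

  Assumes a save happens whenever the cumulative step count is a
  multiple of save_freq, and that epochs are indexed from zero.
  """
  last_saved = (interrupt_step // save_freq) * save_freq
  epoch_index, step_in_epoch = divmod(last_saved, steps_per_epoch)
  return last_saved, epoch_index, step_in_epoch


result = last_checkpoint_before(
    interrupt_step=140, save_freq=30, steps_per_epoch=100)
print(result)
```

With the values used here, the last save falls at total step 120, which is epoch index 1, step 20, so the 20 steps between the save and the interruption are the work that has to be repeated.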

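Next, instantiate the Keras model, call Model.compile, and continue training with Model.fit from the previously saved checkpoint. Training resumes from epoch 2, step 21. (The Keras progress output counts epochs from 1, so this is the step right after the checkpoint saved at epoch index 1, step 20.)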

In [ ]:

model = create_model()
model.compile(optimizer='adam',
              loss=loss,
              metrics=['accuracy'],
              steps_per_execution=10)
model.fit(x=x_train,
          y=y_train,
          epochs=10,
          steps_per_epoch=100,
          validation_data=(x_test, y_test),
          callbacks=[backup_restore_callback])

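TensorFlow 2: Write manual checkpoints with a custom training loop

If you use a custom training loop in TensorFlow 2, you can implement a fault tolerance mechanism with the tf.train.Checkpoint and tf.train.CheckpointManager APIs.

This example demonstrates how to:

Use a tf.train.Checkpoint object to manually create a checkpoint, where the trackable objects you want to save are set as attributes.
Use a tf.train.CheckpointManager to manage multiple checkpoints.

Start by defining and instantiating the Keras model, the optimizer, and the loss function. Then, create a Checkpoint that manages two objects with trackable state (the model and the optimizer), as well as a CheckpointManager that writes and keeps several checkpoints in a temporary directory.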

In [ ]:

model = create_model()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
log_dir = tempfile.mkdtemp()
epochs = 5
steps_per_epoch = 5

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
checkpoint_manager = tf.train.CheckpointManager(
            checkpoint, log_dir, max_to_keep=2)

Now, implement a custom training loop that, after the first epoch, loads the last checkpoint every time a new epoch starts.

In [ ]:

for epoch in range(epochs):
  if epoch > 0:
    # Restore the model and optimizer state from the most recent checkpoint.
    checkpoint.restore(save_path)
  print(f"\nStart of epoch {epoch}")

  for step in range(steps_per_epoch):
    with tf.GradientTape() as tape:
      logits = model(x_train, training=True)
      loss_value = loss_fn(y_train, logits)

    # Compute and apply the gradients outside the tape context.
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

    save_path = checkpoint_manager.save()
    print(f"Checkpoint saved to {save_path}")
    print(f"Training loss at step {step}: {loss_value}")
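The loop above restores state within a single process. After a real restart, the Python objects are gone, so a common pattern is to rebuild them and restore from the latest checkpoint in the directory. The following is a minimal sketch of that pattern, assuming a single tf.Variable step counter as a stand-in for the model and optimizer; `make_state` and the variable names are hypothetical and chosen only for this illustration.

```python
import tempfile

import tensorflow as tf

ckpt_dir = tempfile.mkdtemp()


def make_state():
  # A step counter stands in for model and optimizer state.
  return tf.train.Checkpoint(step=tf.Variable(0, dtype=tf.int64))


# First "run": advance the counter and save three checkpoints.
state = make_state()
manager = tf.train.CheckpointManager(state, ckpt_dir, max_to_keep=2)
for _ in range(3):
  state.step.assign_add(1)
  manager.save()

# Second "run": freshly built objects start at zero, then get restored
# from the most recent checkpoint found in the same directory.
restored = make_state()
restored_manager = tf.train.CheckpointManager(
    restored, ckpt_dir, max_to_keep=2)
restored.restore(restored_manager.latest_checkpoint)

resumed_step = int(restored.step.numpy())
kept = len(restored_manager.checkpoints)
print(f"Resumed at step {resumed_step} with {kept} checkpoints kept")
```

Because max_to_keep is 2, only the two most recent checkpoints remain on disk even though three were written, which keeps disk usage bounded during long training runs.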

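Next steps

To learn more about fault tolerance and checkpointing in TensorFlow 2, consider the following documentation:

The tf.keras.callbacks.BackupAndRestore callback API docs.
The tf.train.Checkpoint and tf.train.CheckpointManager API docs.
The Training checkpoints guide, including the Writing checkpoints section.

You may also find the following material related to distributed training useful:

The Fault tolerance section of the Multi-worker training with Keras guide.
The Handling task failure section of the Parameter server training guide.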
