GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ko/guide/tpu.ipynb

Kernel: Python 3

Copyright 2018 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

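Use TPUs

This guide demonstrates how to perform basic training on Tensor Processing Units (TPUs) and TPU Pods, which are collections of TPU devices connected by dedicated high-speed network interfaces, using tf.keras and custom training loops.

TPUs are application-specific integrated circuits (ASICs) that Google custom-built to accelerate machine learning workloads. They are available through Google Colab, the TPU Research Cloud, and Cloud TPU.

Setup

Before you run this Colab notebook, check your notebook settings to make sure the hardware accelerator is a TPU: Runtime > Change runtime type > Hardware accelerator > TPU.

Import the necessary libraries, including TensorFlow Datasets: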

In [ ]:

import tensorflow as tf

import os
import tensorflow_datasets as tfds

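TPU initialization

TPUs are typically Cloud TPU workers, which are separate from the local process running your Python program. You therefore need to do some initialization work to connect to the remote cluster and initialize the TPUs. The tpu argument passed to tf.distribute.cluster_resolver.TPUClusterResolver is a special address just for Colab. If you are running your code on Google Compute Engine (GCE), pass in the name of your Cloud TPU instead.

Note: The TPU initialization code has to be at the beginning of your program.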

In [ ]:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

Manual device placement

After the TPU is initialized, you can use manual device placement to place a computation on a single TPU device:

In [ ]:

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

with tf.device('/TPU:0'):
  c = tf.matmul(a, b)

print("c device: ", c.device)
print(c)
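As a quick sanity check, the product above can be worked out by hand. The sketch below recomputes it in plain Python so the values can be compared against what the TPU prints. The helper `matmul_rows` and the names `a_rows`, `b_rows`, and `expected_c` are illustrative only and are not part of TensorFlow.

```python
def matmul_rows(x, y):
  """Multiplies two matrices given as nested lists (illustrative helper)."""
  return [[sum(x_ik * y_kj for x_ik, y_kj in zip(row, col))
           for col in zip(*y)]
          for row in x]

# Same values as the `a` and `b` constants in the cell above.
a_rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

expected_c = matmul_rows(a_rows, b_rows)
print(expected_c)  # [[22.0, 28.0], [49.0, 64.0]]
```

Whichever device runs the computation, `c` should hold these same 2x2 values.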

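Distribution strategies

Usually, you run your model on multiple TPUs in a data-parallel way. To distribute a model across multiple TPUs (as well as multiple GPUs or multiple machines), TensorFlow offers the tf.distribute.Strategy API. Distribution strategies are interchangeable, and the model runs on whichever (TPU) devices you specify. Learn more in the Distributed training with TensorFlow guide.

Using tf.distribute.TPUStrategy implements synchronous distributed training. TPUs provide their own efficient implementation of all-reduce and other collective operations across multiple TPU cores, and TPUStrategy uses these.

To demonstrate this, create a tf.distribute.TPUStrategy object: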

In [ ]:

strategy = tf.distribute.TPUStrategy(resolver)

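To replicate a computation so that it runs on all TPU cores, pass it to the Strategy.run API. In the example below, every core receives the same inputs (a, b) and performs the matrix multiplication independently. The output holds the values from all the replicas.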

In [ ]:

@tf.function
def matmul_fn(x, y):
  z = tf.matmul(x, y)
  return z

z = strategy.run(matmul_fn, args=(a, b))
print(z)
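The value returned by Strategy.run holds one result per replica. A minimal sketch of unpacking it with Strategy.experimental_local_results follows. To stay runnable without a TPU, it assumes the default single-replica strategy returned by tf.distribute.get_strategy(). The names `demo_strategy`, `demo_matmul_fn`, and `local_results` are illustrative. With a TPUStrategy, the tuple would be expected to contain one entry per local replica.

```python
import tensorflow as tf

# Assumption: no TPU is attached, so fall back to the default
# single-replica strategy instead of `tf.distribute.TPUStrategy`.
demo_strategy = tf.distribute.get_strategy()

x = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

@tf.function
def demo_matmul_fn(p, q):
  return tf.matmul(p, q)

per_replica_z = demo_strategy.run(demo_matmul_fn, args=(x, y))

# Unpack the per-replica container into a plain tuple of tensors.
local_results = demo_strategy.experimental_local_results(per_replica_z)
for replica_id, value in enumerate(local_results):
  print("replica", replica_id, value.numpy().tolist())
```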

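Classification on TPUs

With the basic concepts covered, consider a more concrete example. This section demonstrates how to use the distribution strategy (tf.distribute.TPUStrategy) to train a Keras model on a Cloud TPU.

Define a Keras model

Start by defining a Sequential Keras model for image classification on the MNIST dataset. It is no different from what you would use when training on CPUs or GPUs. Keras model creation needs to happen inside Strategy.scope so that the variables can be created on each TPU device. Other parts of the code do not need to be inside the Strategy scope.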

In [ ]:

def create_model():
  regularizer = tf.keras.regularizers.L2(1e-5)
  return tf.keras.Sequential(
      [tf.keras.layers.Conv2D(256, 3, input_shape=(28, 28, 1),
                              activation='relu',
                              kernel_regularizer=regularizer),
       tf.keras.layers.Conv2D(256, 3,
                              activation='relu',
                              kernel_regularizer=regularizer),
       tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(256,
                             activation='relu',
                             kernel_regularizer=regularizer),
       tf.keras.layers.Dense(128,
                             activation='relu',
                             kernel_regularizer=regularizer),
       tf.keras.layers.Dense(10,
                             kernel_regularizer=regularizer)])

This model puts L2 regularization terms on the weights of each layer, so that the custom training loop below can show how to pick those terms up from Model.losses.
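To make the role of Model.losses concrete, here is a small sketch on a deliberately tiny model rather than the MNIST model above. The layer sizes, the all-ones initializer, and the names `tiny_model` and `regularization_loss` are assumptions chosen so the penalty is easy to verify by hand: six weights equal to 1 with an L2 factor of 0.5 give 0.5 * 6 = 3.0.

```python
import tensorflow as tf

# A deliberately tiny model: one Dense layer with a 3x2 kernel of ones.
tiny_model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(2,
                          use_bias=False,
                          kernel_initializer='ones',
                          kernel_regularizer=tf.keras.regularizers.L2(0.5)),
])

# `Model.losses` holds one regularization term per regularized weight.
print(len(tiny_model.losses))  # 1

# Sum the terms, as the custom training loop does with `tf.add_n`.
regularization_loss = tf.add_n(tiny_model.losses)
print(float(regularization_loss))  # 0.5 * sum(w ** 2) = 0.5 * 6 = 3.0
```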

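Load the dataset

Efficient use of the tf.data.Dataset API is critical when using a Cloud TPU. You can learn more about dataset performance in the Input pipeline performance guide.

If you are using TPU Nodes, you need to store all data files read by the TensorFlow Dataset in Google Cloud Storage (GCS) buckets. If you are using TPU VMs, you can store data wherever you like. For more information on TPU Nodes and TPU VMs, refer to the TPU System Architecture documentation.

For most use cases, it is recommended to convert your data into the TFRecord format and read it with tf.data.TFRecordDataset. Check the TFRecord and tf.Example tutorial for details on how to do this. This is not a hard requirement, and you can use other dataset readers such as tf.data.FixedLengthRecordDataset or tf.data.TextLineDataset.

You can load small datasets entirely into memory using tf.data.Dataset.cache.

Regardless of the data format, it is recommended to use large files, on the order of 100 MB. This is especially important in a networked setting, where the overhead of opening a file is significantly higher.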
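The 100 MB guideline translates into a simple shard-count calculation. The sketch below is plain arithmetic under the assumption that shards should be roughly equal in size. The helper `num_shards` and the constant `TARGET_SHARD_BYTES` are hypothetical names, not part of tf.data.

```python
import math

# Assumption: aim for shards of roughly 100 MiB each.
TARGET_SHARD_BYTES = 100 * 1024 * 1024

def num_shards(total_bytes, target_bytes=TARGET_SHARD_BYTES):
  """Returns how many files to split a dataset into (illustrative helper)."""
  return max(1, math.ceil(total_bytes / target_bytes))

print(num_shards(5 * 1024**3))   # A 5 GiB dataset -> 52 shards.
print(num_shards(10 * 1024**2))  # A 10 MiB dataset -> 1 shard.
```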

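As shown in the code below, use the TensorFlow Datasets tfds.load function to get a copy of the MNIST training and test data. try_gcs is specified so that a copy available in a public GCS bucket is used. If you don't specify it, the TPU will not be able to access the downloaded data.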

In [ ]:

def get_dataset(batch_size, is_training=True):
  split = 'train' if is_training else 'test'
  dataset, info = tfds.load(name='mnist', split=split, with_info=True,
                            as_supervised=True, try_gcs=True)

  # Normalize the input data.
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255.0
    return image, label

  dataset = dataset.map(scale)

  # Only shuffle and repeat the dataset in training. The advantage of having an
  # infinite dataset for training is to avoid the potential last partial batch
  # in each epoch, so that you don't need to think about scaling the gradients
  # based on the actual batch size.
  if is_training:
    dataset = dataset.shuffle(10000)
    dataset = dataset.repeat()

  dataset = dataset.batch(batch_size)

  return dataset

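Train the model using Keras high-level APIs

You can train your model with the Keras Model.fit and Model.compile APIs. Nothing in this step is TPU-specific: you write the code just as you would when using multiple GPUs and a MirroredStrategy instead of the TPUStrategy. You can learn more in the Distributed training with Keras tutorial.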

In [ ]:

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

batch_size = 200
steps_per_epoch = 60000 // batch_size
validation_steps = 10000 // batch_size

train_dataset = get_dataset(batch_size, is_training=True)
test_dataset = get_dataset(batch_size, is_training=False)

model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
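The step counts in the cell above follow from the MNIST split sizes (60,000 training and 10,000 test examples) and the batch size. The plain-Python sketch below reproduces that arithmetic. The helper name `steps_for` is hypothetical. It also shows that integer division drops any final partial batch when the batch size does not divide the dataset evenly.

```python
def steps_for(num_examples, batch_size):
  """Returns (full steps, examples left over) for one pass (illustrative)."""
  return num_examples // batch_size, num_examples % batch_size

print(steps_for(60000, 200))  # (300, 0): the `steps_per_epoch` used above.
print(steps_for(10000, 200))  # (50, 0): the `validation_steps` used above.
print(steps_for(60000, 256))  # (234, 96): 96 examples fall outside full steps.
```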

To reduce Python overhead and maximize the performance of your TPU, pass the steps_per_execution argument to Keras Model.compile. In this example, it increases throughput by about 50%:

In [ ]:

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                # Anything between 2 and `steps_per_epoch` could help here.
                steps_per_execution = 50,
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
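One way to see why steps_per_execution helps is to count how many times per epoch the Python host has to launch work on the device. The sketch below is back-of-the-envelope arithmetic, not a measurement. The helper `executions_per_epoch` is a hypothetical name, and it assumes a final, shorter execution handles any leftover steps.

```python
import math

def executions_per_epoch(steps_per_epoch, steps_per_execution):
  """Returns the number of host-side launches per epoch (illustrative)."""
  return math.ceil(steps_per_epoch / steps_per_execution)

print(executions_per_epoch(300, 1))   # 300 launches without the argument.
print(executions_per_epoch(300, 50))  # 6 launches with steps_per_execution=50.
```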

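Train the model using a custom training loop

You can also create and train your model using the tf.function and tf.distribute APIs directly. Use the Strategy.distribute_datasets_from_function API (exposed as Strategy.experimental_distribute_datasets_from_function in older TensorFlow releases) to distribute a tf.data.Dataset given a dataset function. In the example below, the batch size passed into the Dataset is the per-replica batch size, not the global batch size. To learn more, check out the Custom training with tf.distribute.Strategy tutorial.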
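The relationship between the global and per-replica batch sizes is simple integer arithmetic. The sketch below assumes 8 replicas, which is an assumption for illustration; the real value comes from strategy.num_replicas_in_sync. The helper name `split_batch` is hypothetical.

```python
def split_batch(global_batch_size, num_replicas):
  """Returns (per-replica size, effective global size); illustrative only."""
  per_replica = global_batch_size // num_replicas
  return per_replica, per_replica * num_replicas

# Assumption: 8 replicas in sync.
print(split_batch(200, 8))  # (25, 200): divides evenly.
print(split_batch(100, 8))  # (12, 96): the effective global batch shrinks.
```

Choosing a global batch size that is a multiple of the replica count keeps the effective batch size equal to the one you asked for.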

First, create the model, datasets, and tf.functions:

In [ ]:

# Create the model, optimizer and metrics inside the `tf.distribute.Strategy`
# scope, so that the variables can be mirrored on each device.
with strategy.scope():
  model = create_model()
  optimizer = tf.keras.optimizers.Adam()
  training_loss = tf.keras.metrics.Mean('training_loss', dtype=tf.float32)
  training_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      'training_accuracy', dtype=tf.float32)

# Calculate per replica batch size, and distribute the `tf.data.Dataset`s
# on each TPU worker.
per_replica_batch_size = batch_size // strategy.num_replicas_in_sync

train_dataset = strategy.distribute_datasets_from_function(
    lambda _: get_dataset(per_replica_batch_size, is_training=True))

@tf.function
def train_step(iterator):
  """The step function for one training step."""

  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(per_example_loss)
      model_losses = model.losses
      if model_losses:
        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)

  strategy.run(step_fn, args=(next(iterator),))
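The loss scaling in `step_fn` can be checked with a small worked example. The plain-Python sketch below imitates the default behavior of tf.nn.compute_average_loss, which divides each replica's summed loss by the global batch size. It is an illustration under assumed numbers (2 replicas, 2 examples each), not TensorFlow's implementation, and every variable name in it is made up for the example.

```python
# Assumption: 2 replicas, each holding a per-replica batch of 2 examples.
per_replica_example_losses = [[1.0, 3.0], [2.0, 6.0]]
num_replicas = len(per_replica_example_losses)
global_batch_size = sum(len(losses) for losses in per_replica_example_losses)

# Each replica divides its summed loss by the GLOBAL batch size.
replica_losses = [sum(losses) / global_batch_size
                  for losses in per_replica_example_losses]
print(replica_losses)  # [1.0, 2.0]

# Summing across replicas therefore yields the mean over the global batch.
global_mean = sum(sum(losses) for losses in per_replica_example_losses)
global_mean /= global_batch_size
print(sum(replica_losses), global_mean)  # 3.0 3.0

# Multiplying by the replica count recovers each replica's own mean loss,
# which is the value passed to `training_loss.update_state` above.
reported = [loss * num_replicas for loss in replica_losses]
print(reported)  # [2.0, 4.0]
```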

Then, run the training loop:

In [ ]:

steps_per_eval = 10000 // batch_size

train_iterator = iter(train_dataset)
for epoch in range(5):
  print('Epoch: {}/5'.format(epoch + 1))

  for step in range(steps_per_epoch):
    train_step(train_iterator)
  print('Current step: {}, training loss: {}, training accuracy: {}%'.format(
      optimizer.iterations.numpy(),
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))
  training_loss.reset_state()
  training_accuracy.reset_state()

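Improving performance with multiple steps inside tf.function

You can improve performance by running multiple steps within a tf.function. To do this, wrap the Strategy.run call in a tf.range loop inside the tf.function; AutoGraph converts the loop into a tf.while_loop on the TPU worker. You can learn more about tf.function in the Better performance with tf.function guide.

Despite the improved performance, this method has tradeoffs compared to running a single step inside a tf.function. Running multiple steps in a tf.function is less flexible: you cannot run things eagerly or execute arbitrary Python code within the steps.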

In [ ]:

@tf.function
def train_multiple_steps(iterator, steps):
  """Runs `steps` training steps inside a single `tf.function` call."""

  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(per_example_loss)
      model_losses = model.losses
      if model_losses:
        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)

  for _ in tf.range(steps):
    strategy.run(step_fn, args=(next(iterator),))

# Convert `steps_per_epoch` to `tf.Tensor` so the `tf.function` won't get
# retraced if the value changes.
train_multiple_steps(train_iterator, tf.convert_to_tensor(steps_per_epoch))

print('Current step: {}, training loss: {}, training accuracy: {}%'.format(
      optimizer.iterations.numpy(),
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))
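The comment about converting `steps_per_epoch` to a tf.Tensor can be demonstrated on the CPU with a toy loop. The sketch below counts how often the Python body of a tf.function is traced when the step count is passed as a tensor versus as a Python integer. The factory `make_runner` and all names inside it are hypothetical, and the loop body only increments a counter instead of calling Strategy.run.

```python
import tensorflow as tf

def make_runner():
  """Builds an independent counter, trace log, and tf.function (toy example)."""
  counter = tf.Variable(0, dtype=tf.int32)
  trace_log = []

  @tf.function
  def run_steps(steps):
    trace_log.append('traced')  # Python side effect: runs only while tracing.
    for _ in tf.range(steps):   # AutoGraph turns this into a tf.while_loop.
      counter.assign_add(1)

  return run_steps, counter, trace_log

# Tensor arguments with the same dtype and shape reuse a single trace.
tensor_runner, tensor_counter, tensor_traces = make_runner()
tensor_runner(tf.constant(3))
tensor_runner(tf.constant(5))
print(len(tensor_traces), int(tensor_counter.numpy()))  # 1 8

# Each new Python integer value triggers a fresh trace.
int_runner, int_counter, int_traces = make_runner()
int_runner(3)
int_runner(5)
print(len(int_traces), int(int_counter.numpy()))  # 2 8
```

Both variants run the same eight steps; only the number of traces differs.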

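Next steps

To learn more about Cloud TPUs and how to use them:

Google Cloud TPU: The Google Cloud TPU homepage.
Google Cloud TPU documentation: Google Cloud TPU documentation, which includes:
- Introduction to Cloud TPU: An overview of working with Cloud TPUs.
- Cloud TPU quickstarts: Quickstart introductions to working with Cloud TPU VMs using TensorFlow and other main machine learning frameworks.
Google Cloud TPU Colab notebooks: End-to-end training examples.
Google Cloud TPU performance guide: Enhance Cloud TPU performance by adjusting Cloud TPU configuration parameters for your application.
Distributed training with TensorFlow: How to use distribution strategies, including tf.distribute.TPUStrategy, with examples showing best practices.
TPU embeddings: TensorFlow includes specialized support for training embeddings on TPUs via tf.tpu.experimental.embedding. In addition, TensorFlow Recommenders has tfrs.layers.embedding.TPUEmbedding. Embeddings provide efficient, dense representations that capture complex similarities and relationships between features. TensorFlow's TPU-specific embedding support lets you train embeddings that are larger than the memory of a single TPU device and use sparse and ragged inputs on TPUs.
TPU Research Cloud (TRC): TRC lets researchers apply for access to a cluster of more than 1,000 Cloud TPU devices.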
