GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/guide/tpu.ipynb
Kernel: Python 3

Copyright 2018 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

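Use TPUs

This guide demonstrates how to perform basic training and custom training loops with tf.keras on Tensor Processing Units (TPUs) and TPU Pods, which are collections of TPU devices connected by dedicated high-speed network interfaces.

TPUs are Google's custom-developed application-specific integrated circuits (ASICs) used to accelerate machine learning workloads. They are available through Google Colab, the TPU Research Cloud, and Cloud TPU.

Setup

Before you run this Colab notebook, make sure that your hardware accelerator is a TPU by checking your notebook settings: Runtime > Change runtime type > Hardware accelerator > TPU.

Import some necessary libraries, including TensorFlow Datasets: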

In [ ]:

import tensorflow as tf

import os
import tensorflow_datasets as tfds

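TPU initialization

TPUs are typically Cloud TPU workers, which are different from the local process running the user's Python program. You therefore need to do some initialization work to connect to the remote cluster and initialize the TPUs. Note that the tpu argument passed to tf.distribute.cluster_resolver.TPUClusterResolver here is a special address just for Colab. If you are running your code on Google Compute Engine (GCE), pass in the name of your Cloud TPU instead.

Note: The TPU initialization code has to be at the beginning of your program.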

In [ ]:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
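
The choice of the tpu argument described in this section can be sketched as a small helper. This is a minimal sketch, not part of the TensorFlow API: the helper name tpu_argument and the TPU_NAME environment variable are assumptions made for illustration. It returns the empty string used on Colab when no name is configured, and the configured Cloud TPU name otherwise.

```python
import os


def tpu_argument(environ=None):
    """Return a value for the `tpu` argument of TPUClusterResolver.

    Hypothetical helper: `TPU_NAME` is an assumed environment variable that
    holds the name of your Cloud TPU when running on GCE.
    """
    environ = os.environ if environ is None else environ
    # On Colab the special address is the empty string; on GCE pass the name.
    return environ.get("TPU_NAME", "")


print(repr(tpu_argument({})))                      # Colab-style setup
print(repr(tpu_argument({"TPU_NAME": "my-tpu"})))  # GCE-style setup
```

The returned value would then be passed as TPUClusterResolver(tpu=...), in place of the literal '' used in the cell above.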

Manual device placement

After the TPU is initialized, you can use manual device placement to place the computation on a single TPU device:

In [ ]:

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

with tf.device('/TPU:0'):
  c = tf.matmul(a, b)

print("c device: ", c.device)
print(c)

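Distribution strategies

Usually, you run your model on multiple TPUs in parallel. To distribute your model on multiple TPUs (as well as multiple GPUs or multiple machines), TensorFlow offers the tf.distribute.Strategy API. You can swap out the distribution strategy and the model will run on any given (TPU) device. Learn more in the distribution strategy guide.

tf.distribute.TPUStrategy implements synchronous distributed training. TPUs provide their own implementation of efficient all-reduce and other collective operations across multiple TPU cores, which TPUStrategy uses.

To demonstrate this, create a tf.distribute.TPUStrategy object: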

In [ ]:

strategy = tf.distribute.TPUStrategy(resolver)

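To replicate a computation so it can run on all TPU cores, pass it to the strategy.run API. In the example below, all cores receive the same inputs (a, b) and perform the matrix multiplication on each core independently. The output contains the values from all the replicas.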

In [ ]:

@tf.function
def matmul_fn(x, y):
  z = tf.matmul(x, y)
  return z

z = strategy.run(matmul_fn, args=(a, b))
print(z)
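
To make the shape of that result concrete, here is a plain-Python sketch of what replication means. It does not use TensorFlow or a TPU; the replica count of 8 is an assumption (one replica per core on an 8-core TPU), and the helper matmul is written only for this illustration.

```python
def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

num_replicas = 8  # Assumption: one replica per core on an 8-core TPU.

# Every replica receives the same (a, b) and computes its own copy of z.
per_replica_z = [matmul(a, b) for _ in range(num_replicas)]

print(len(per_replica_z))  # one result per replica
print(per_replica_z[0])    # each result is the same 2x2 product
```

Because every replica gets identical inputs here, all per-replica values are equal; with a distributed dataset each replica would instead see a different slice of the batch.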

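Classification on TPUs

Having covered the basic concepts, consider a more concrete example. This section demonstrates how to use the distribution strategy tf.distribute.TPUStrategy to train a Keras model on a Cloud TPU.

Define a Keras model

Start with the definition of a Sequential Keras model for image classification on the MNIST dataset. It is no different from what you would use if you were training on CPUs or GPUs. Note that the Keras model must be created inside Strategy.scope, so that the variables are created on each TPU device. Other parts of the code do not need to be inside the Strategy scope.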

In [ ]:

def create_model():
  regularizer = tf.keras.regularizers.L2(1e-5)
  return tf.keras.Sequential(
      [tf.keras.layers.Conv2D(256, 3, input_shape=(28, 28, 1),
                              activation='relu',
                              kernel_regularizer=regularizer),
       tf.keras.layers.Conv2D(256, 3,
                              activation='relu',
                              kernel_regularizer=regularizer),
       tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(256,
                             activation='relu',
                             kernel_regularizer=regularizer),
       tf.keras.layers.Dense(128,
                             activation='relu',
                             kernel_regularizer=regularizer),
       tf.keras.layers.Dense(10,
                             kernel_regularizer=regularizer)])

This model puts L2 regularization terms on the weights of each layer, so that the custom training loop below can show how to pick them up from Model.losses.
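
As a rough illustration of what each of those terms is, the sketch below computes an L2 penalty in plain Python, following the formula l2 * sum(w ** 2) with the same factor 1e-5 used in create_model. The kernel values are hypothetical and chosen only to keep the arithmetic easy to check; in the real model, Model.losses holds one such scalar per regularized layer.

```python
def l2_penalty(weights, l2=1e-5):
    """L2 regularization term: l2 times the sum of squared weights."""
    return l2 * sum(w * w for w in weights)


# Hypothetical flattened kernels for two regularized layers.
layer_kernels = [[0.5, -1.0, 2.0], [3.0, -4.0]]

# One scalar per layer, analogous to the entries of Model.losses.
layer_losses = [l2_penalty(kernel) for kernel in layer_kernels]

# The training loop adds the sum of these terms to the data loss.
total_regularization = sum(layer_losses)
print(layer_losses, total_regularization)
```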

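Load the dataset

Efficient use of the tf.data.Dataset API is critical when using a Cloud TPU. Learn more about dataset performance in the input pipeline performance guide.

If you are using TPU Nodes, you need to store all data files read by the TensorFlow Dataset in Google Cloud Storage (GCS) buckets. If you are using TPU VMs, you can store data wherever you like. For more information on TPU Nodes and TPU VMs, refer to the TPU system architecture documentation.

For most use cases, it is recommended to convert your data into the TFRecord format and read it with tf.data.TFRecordDataset. Refer to the TFRecord and tf.Example tutorial for details on how to do this. This is not a hard requirement, so you can also use other dataset readers, such as tf.data.FixedLengthRecordDataset or tf.data.TextLineDataset.

You can load entire small datasets into memory using tf.data.Dataset.cache.

Regardless of the data format, it is recommended to use large files, on the order of 100 MB each. This is especially important in this networked setting, because the overhead of opening a file is significantly higher.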
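
The file-size recommendation translates into a simple shard-count calculation. The following is a minimal sketch: the helper name num_shards and the 2.5 GiB example size are assumptions for illustration, and 100 MB is treated as a target rather than a hard limit.

```python
import math


def num_shards(total_bytes, target_shard_bytes=100 * 1024 * 1024):
    """Number of files needed so that each is roughly the target size."""
    return max(1, math.ceil(total_bytes / target_shard_bytes))


# Hypothetical dataset of 2.5 GiB.
dataset_bytes = 5 * 1024**3 // 2
print(num_shards(dataset_bytes))

# A tiny dataset still needs at least one file.
print(num_shards(10))
```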

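As shown in the code below, use the tfds.load function from TensorFlow Datasets to get a copy of the MNIST training and test data. Note that try_gcs is specified so that a copy available in a public GCS bucket is used. If you do not specify this, the TPU will not be able to access the downloaded data.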

In [ ]:

def get_dataset(batch_size, is_training=True):
  split = 'train' if is_training else 'test'
  dataset, info = tfds.load(name='mnist', split=split, with_info=True,
                            as_supervised=True, try_gcs=True)

  # Normalize the input data.
  def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255.0
    return image, label

  dataset = dataset.map(scale)

  # Only shuffle and repeat the dataset in training. The advantage of having an
  # infinite dataset for training is to avoid the potential last partial batch
  # in each epoch, so that you don't need to think about scaling the gradients
  # based on the actual batch size.
  if is_training:
    dataset = dataset.shuffle(10000)
    dataset = dataset.repeat()

  dataset = dataset.batch(batch_size)

  return dataset

Train the model using Keras high-level APIs

You can train your model with the Keras Model.fit and Model.compile APIs. There is nothing TPU-specific in this step: you write the code as if you were using multiple GPUs and a MirroredStrategy instead of the TPUStrategy. Learn more in the Distributed training with Keras tutorial.

In [ ]:

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

batch_size = 200
steps_per_epoch = 60000 // batch_size
validation_steps = 10000 // batch_size

train_dataset = get_dataset(batch_size, is_training=True)
test_dataset = get_dataset(batch_size, is_training=False)

model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)

To reduce Python overhead and maximize the performance of your TPU, pass the steps_per_execution argument to Keras Model.compile. In this example, it increases throughput by about 50%:

In [ ]:

with strategy.scope():
  model = create_model()
  model.compile(optimizer='adam',
                # Anything between 2 and `steps_per_epoch` could help here.
                steps_per_execution = 50,
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                metrics=['sparse_categorical_accuracy'])

model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=steps_per_epoch,
          validation_data=test_dataset,
          validation_steps=validation_steps)
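
The effect of steps_per_execution on Python overhead can be seen with a little arithmetic: it sets how many batches run in each compiled call, so the number of calls issued from Python per epoch shrinks accordingly. The sketch below is plain Python; the helper name executions_per_epoch is an assumption for illustration.

```python
import math

batch_size = 200
steps_per_epoch = 60000 // batch_size  # 300 steps, as in the cells above


def executions_per_epoch(steps_per_epoch, steps_per_execution):
    """How many times Python dispatches work to the device in one epoch."""
    return math.ceil(steps_per_epoch / steps_per_execution)


print(executions_per_epoch(steps_per_epoch, 1))   # default: one call per step
print(executions_per_epoch(steps_per_epoch, 50))  # with steps_per_execution=50
```

One tradeoff to keep in mind is that per-batch callbacks and progress logging only run between executions, so larger values give coarser-grained feedback.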

Train the model using a custom training loop

You can also create and train your model using the tf.function and tf.distribute APIs directly. The strategy.experimental_distribute_datasets_from_function API distributes a tf.data.Dataset given a dataset function. Note that in the example below, the batch size passed into the Dataset is the per-replica batch size, not the global batch size. To learn more, refer to the Custom training with tf.distribute.Strategy tutorial.

First, create the model, datasets, and tf.functions:

In [ ]:

# Create the model, optimizer and metrics inside the `tf.distribute.Strategy`
# scope, so that the variables can be mirrored on each device.
with strategy.scope():
  model = create_model()
  optimizer = tf.keras.optimizers.Adam()
  training_loss = tf.keras.metrics.Mean('training_loss', dtype=tf.float32)
  training_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      'training_accuracy', dtype=tf.float32)

# Calculate per replica batch size, and distribute the `tf.data.Dataset`s
# on each TPU worker.
per_replica_batch_size = batch_size // strategy.num_replicas_in_sync

train_dataset = strategy.experimental_distribute_datasets_from_function(
    lambda _: get_dataset(per_replica_batch_size, is_training=True))

@tf.function
def train_step(iterator):
  """The step function for one training step."""

  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(per_example_loss)
      model_losses = model.losses
      if model_losses:
        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)

  strategy.run(step_fn, args=(next(iterator),))
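
The loss scaling in step_fn is easier to follow with numbers. The plain-Python sketch below mirrors the arithmetic of tf.nn.compute_average_loss, which divides the summed per-example loss by the global batch size. The replica count of 8 and the per-example loss values are assumptions for illustration, and the local compute_average_loss is a stand-in, not the TensorFlow function itself.

```python
import math

num_replicas = 8  # Assumption: 8 replicas in sync.
global_batch_size = 200
per_replica_batch_size = global_batch_size // num_replicas  # 25


def compute_average_loss(per_example_loss, global_batch_size):
    """Stand-in for the scaling done by tf.nn.compute_average_loss."""
    return sum(per_example_loss) / global_batch_size


# Hypothetical per-example losses, one list per replica.
replica_losses = [
    [0.1 * (r + 1) + 0.01 * i for i in range(per_replica_batch_size)]
    for r in range(num_replicas)
]

scaled = [compute_average_loss(losses, global_batch_size)
          for losses in replica_losses]

# Summing the scaled per-replica losses gives the mean over the global batch.
global_mean = sum(map(sum, replica_losses)) / global_batch_size
print(math.isclose(sum(scaled), global_mean))

# Multiplying a scaled loss by the replica count recovers that replica's own
# mean, which is what the training_loss metric is updated with.
replica0_mean = sum(replica_losses[0]) / per_replica_batch_size
print(math.isclose(scaled[0] * num_replicas, replica0_mean))
```

Because each replica divides by the global batch size, adding up the per-replica contributions yields the average over the whole global batch without any further rescaling.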

Then, run the training loop:

In [ ]:

steps_per_eval = 10000 // batch_size

train_iterator = iter(train_dataset)
for epoch in range(5):
  print('Epoch: {}/5'.format(epoch))

  for step in range(steps_per_epoch):
    train_step(train_iterator)
  print('Current step: {}, training loss: {}, training accuracy: {}%'.format(
      optimizer.iterations.numpy(),
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))
  training_loss.reset_states()
  training_accuracy.reset_states()

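Improving performance with multiple steps inside tf.function

You can improve performance by running multiple steps within a tf.function. This is achieved by wrapping the Strategy.run call in a tf.range loop inside the tf.function, which AutoGraph converts to a tf.while_loop on the TPU worker. You can learn more about tf.function in the tf.function guide.

Despite the improved performance, this method has tradeoffs compared to running a single step inside a tf.function. Running multiple steps in a tf.function is less flexible: you cannot run things eagerly or run arbitrary Python code within the steps.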

In [ ]:

@tf.function
def train_multiple_steps(iterator, steps):
  """Runs `steps` training steps inside a single `tf.function` call."""

  def step_fn(inputs):
    """The computation to run on each TPU device."""
    images, labels = inputs
    with tf.GradientTape() as tape:
      logits = model(images, training=True)
      per_example_loss = tf.keras.losses.sparse_categorical_crossentropy(
          labels, logits, from_logits=True)
      loss = tf.nn.compute_average_loss(per_example_loss)
      model_losses = model.losses
      if model_losses:
        loss += tf.nn.scale_regularization_loss(tf.add_n(model_losses))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(list(zip(grads, model.trainable_variables)))
    training_loss.update_state(loss * strategy.num_replicas_in_sync)
    training_accuracy.update_state(labels, logits)

  for _ in tf.range(steps):
    strategy.run(step_fn, args=(next(iterator),))

# Convert `steps_per_epoch` to `tf.Tensor` so the `tf.function` won't get
# retraced if the value changes.
train_multiple_steps(train_iterator, tf.convert_to_tensor(steps_per_epoch))

print('Current step: {}, training loss: {}, training accuracy: {}%'.format(
      optimizer.iterations.numpy(),
      round(float(training_loss.result()), 4),
      round(float(training_accuracy.result()) * 100, 2)))

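Next steps

To learn more about Cloud TPUs and how to use them, refer to the following:

Google Cloud TPU: The Google Cloud TPU homepage.
Google Cloud TPU documentation: Google Cloud TPU documentation, which includes:
- Introduction to Cloud TPU: An overview of working with Cloud TPUs.
- Cloud TPU quickstarts: Quickstart introductions to working with Cloud TPU VMs using TensorFlow and other major machine learning frameworks.
Google Cloud TPU Colab notebooks: End-to-end training examples.
Google Cloud TPU performance guide: Enhance Cloud TPU performance further by adjusting Cloud TPU configuration parameters for your application.
Distributed training with TensorFlow: How to use distribution strategies, including tf.distribute.TPUStrategy, with examples showing best practices.
TPU embeddings: TensorFlow includes specialized support for training embeddings on TPUs via tf.tpu.experimental.embedding. In addition, TensorFlow Recommenders has tfrs.layers.embedding.TPUEmbedding. Embeddings provide efficient and dense representations, capturing complex similarities and relationships between features. TensorFlow's TPU-specific embedding support allows you to train embeddings that are larger than the memory of a single TPU device, and to use sparse and ragged inputs on TPUs.
TPU Research Cloud (TRC): TRC enables researchers to apply for access to a cluster of more than 1,000 Cloud TPU devices.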
