GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/tutorials/generative/data_compression.ipynb
Kernel: Python 3

Copyright 2022 The TensorFlow Compression Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

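Learned data compression

Overview

This notebook shows how to do lossy data compression using neural networks and TensorFlow Compression.

Lossy compression involves a trade-off between rate, the expected number of bits needed to encode a sample, and distortion, the expected error in the reconstruction of the sample.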
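The trade-off can be seen without any neural network. The following is a minimal sketch, not part of the original notebook, that quantizes Gaussian samples with a uniform scalar quantizer at three step sizes. It uses the empirical entropy of the quantized values as an idealized stand-in for the rate and the mean absolute error as the distortion. The helper names and the step sizes are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def quantize(samples, step):
    """Rounds each sample to the nearest multiple of `step`."""
    return [step * round(s / step) for s in samples]

def empirical_rate_bits(symbols):
    """Empirical entropy in bits per sample, an idealized estimate of the rate."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mean_abs_distortion(samples, reconstructions):
    """Mean absolute difference between samples and their reconstructions."""
    return sum(abs(s - r) for s, r in zip(samples, reconstructions)) / len(samples)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

results = {}
for step in (0.25, 1.0, 4.0):
    recon = quantize(samples, step)
    results[step] = (empirical_rate_bits(recon), mean_abs_distortion(samples, recon))
    print(f"step={step}: rate={results[step][0]:.2f} bits, "
          f"distortion={results[step][1]:.3f}")
```

A coarser step lowers the rate and raises the distortion. The learned model in this notebook navigates the same trade-off, but with trained transforms instead of a fixed quantizer.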

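The examples below use an autoencoder-like model to compress images from the MNIST dataset. The method is based on the paper "End-to-end Optimized Image Compression".

For more background on learned data compression, see this paper aimed at readers familiar with classical data compression, or this survey aimed at a machine learning audience.

Setup

Install TensorFlow Compression via pip.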

In [ ]:

%%bash
# Installs the latest version of TFC compatible with the installed TF version.

read MAJOR MINOR <<< "$(pip show tensorflow | perl -p -0777 -e 's/.*Version: (\d+)\.(\d+).*/\1 \2/sg')"
pip install "tensorflow-compression<$MAJOR.$(($MINOR+1))"

Import library dependencies.

In [ ]:

import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_compression as tfc
import tensorflow_datasets as tfds

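Define the trainer model

Because the model resembles an autoencoder, and because it has to perform different functions during training and inference, the setup differs a little from that of, say, a classifier.

The training model consists of three parts:

The analysis (or encoder) transform, which converts an image into a latent space.
The synthesis (or decoder) transform, which converts from the latent space back into image space.
A prior and entropy model, which model the marginal distribution of the latents.

First, define the transforms.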

In [ ]:

def make_analysis_transform(latent_dims):
  """Creates the analysis (encoder) transform."""
  return tf.keras.Sequential([
      tf.keras.layers.Conv2D(
          20, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_1"),
      tf.keras.layers.Conv2D(
          50, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_2"),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(
          500, use_bias=True, activation="leaky_relu", name="fc_1"),
      tf.keras.layers.Dense(
          latent_dims, use_bias=True, activation=None, name="fc_2"),
  ], name="analysis_transform")

In [ ]:

def make_synthesis_transform():
  """Creates the synthesis (decoder) transform."""
  return tf.keras.Sequential([
      tf.keras.layers.Dense(
          500, use_bias=True, activation="leaky_relu", name="fc_1"),
      tf.keras.layers.Dense(
          2450, use_bias=True, activation="leaky_relu", name="fc_2"),
      tf.keras.layers.Reshape((7, 7, 50)),
      tf.keras.layers.Conv2DTranspose(
          20, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_1"),
      tf.keras.layers.Conv2DTranspose(
          1, 5, use_bias=True, strides=2, padding="same",
          activation="leaky_relu", name="conv_2"),
  ], name="synthesis_transform")
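The constants 2450 and (7, 7, 50) in the synthesis transform mirror the shape produced by the analysis transform. A short sketch of that arithmetic follows, assuming the usual rule that a strided convolution with padding="same" produces an output of size ceil(input / stride). The helper name is an assumption for illustration.

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size of a strided convolution with padding="same"."""
    return math.ceil(input_size / stride)

size = 28  # MNIST images are 28x28 pixels.
for _ in range(2):  # conv_1 and conv_2 both use strides=2.
    size = same_padding_output_size(size, 2)

flattened = size * size * 50  # conv_2 has 50 output channels.
print(size, flattened)  # 7 2450
```

In the other direction, each Conv2DTranspose with strides=2 and padding="same" doubles the spatial size, so the decoder goes from 7 to 14 to 28 pixels.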

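The trainer holds an instance of both transforms, as well as the parameters of the prior.

Its call method is set up to compute:

Rate: an estimate of the number of bits needed to represent the batch of digits.
Distortion: the mean absolute difference between the pixels of the original digits and their reconstructions.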

In [ ]:

class MNISTCompressionTrainer(tf.keras.Model):
  """Model that trains a compressor/decompressor for MNIST."""

  def __init__(self, latent_dims):
    super().__init__()
    self.analysis_transform = make_analysis_transform(latent_dims)
    self.synthesis_transform = make_synthesis_transform()
    self.prior_log_scales = tf.Variable(tf.zeros((latent_dims,)))

  @property
  def prior(self):
    return tfc.NoisyLogistic(loc=0., scale=tf.exp(self.prior_log_scales))

  def call(self, x, training):
    """Computes rate and distortion losses."""
    # Ensure inputs are floats in the range (0, 1).
    x = tf.cast(x, self.compute_dtype) / 255.
    x = tf.reshape(x, (-1, 28, 28, 1))

    # Compute latent space representation y, perturb it and model its entropy,
    # then compute the reconstructed pixel-level representation x_hat.
    y = self.analysis_transform(x)
    entropy_model = tfc.ContinuousBatchedEntropyModel(
        self.prior, coding_rank=1, compression=False)
    y_tilde, rate = entropy_model(y, training=training)
    x_tilde = self.synthesis_transform(y_tilde)

    # Average number of bits per MNIST digit.
    rate = tf.reduce_mean(rate)

    # Mean absolute difference across pixels.
    distortion = tf.reduce_mean(abs(x - x_tilde))

    return dict(rate=rate, distortion=distortion)

Compute rate and distortion

Let's walk through this step by step, using one image from the dataset. Load the MNIST dataset for training and validation.

In [ ]:

training_dataset, validation_dataset = tfds.load(
    "mnist",
    split=["train", "test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=False,
)

Extract one image $x$.

In [ ]:

(x, _), = validation_dataset.take(1)

plt.imshow(tf.squeeze(x))
print(f"Data type: {x.dtype}")
print(f"Shape: {x.shape}")

To get the latent representation $y$, cast the image to float32, scale it to the range [0, 1], add a batch dimension, and pass it through the analysis transform.

In [ ]:

x = tf.cast(x, tf.float32) / 255.
x = tf.reshape(x, (-1, 28, 28, 1))
y = make_analysis_transform(10)(x)

print("y:", y)

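The latents are quantized at test time. To model this in a differentiable way during training, add uniform noise in the interval $(-.5, .5)$ and call the result $\tilde y$. This is the same terminology as used in the paper "End-to-end Optimized Image Compression".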

In [ ]:

y_tilde = y + tf.random.uniform(y.shape, -.5, .5)

print("y_tilde:", y_tilde)
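To see why uniform noise is a reasonable stand-in for quantization, the sketch below compares the error introduced by rounding to the nearest integer with noise drawn uniformly from (-0.5, 0.5). It is a plain-Python illustration, not part of the original notebook. The Gaussian "latents" and the sample count are assumptions.

```python
import random

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
latents = [rng.gauss(0.0, 3.0) for _ in range(20_000)]

# Error introduced by actual quantization (rounding to the nearest integer).
rounding_error = [round(y) - y for y in latents]
# Error introduced by the training-time proxy (additive uniform noise).
uniform_noise = [rng.uniform(-0.5, 0.5) for _ in latents]

rounding_var = variance(rounding_error)
noise_var = variance(uniform_noise)
print(f"rounding error variance: {rounding_var:.4f}")
print(f"uniform noise variance:  {noise_var:.4f}")
print(f"theoretical value 1/12:  {1 / 12:.4f}")
```

Both errors lie within half a unit of zero and have nearly the same variance, but only the additive noise leaves the output differentiable with respect to the latent.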

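The "prior" is a probability density that is trained to model the marginal distribution of the noisy latents. For example, it could be a set of independent logistic distributions with a different scale for each latent dimension. tfc.NoisyLogistic accounts for the fact that the latents have additive noise. As the scale approaches zero, a logistic distribution approaches a Dirac delta (a spike), but the added noise causes the "noisy" distribution to approach the uniform distribution instead.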

In [ ]:

prior = tfc.NoisyLogistic(loc=0., scale=tf.linspace(.01, 2., 10))

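# Evaluate the prior density on a grid; the extra axis broadcasts over the 10 scales.
y_grid = tf.linspace(-6., 6., 501)[:, None]
plt.plot(y_grid, prior.prob(y_grid));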
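The limiting behavior described above can be checked by hand. Adding independent uniform noise on (-0.5, 0.5) to a logistic variable gives a density equal to the difference of the logistic CDF at y + 0.5 and y - 0.5. The sketch below implements that formula in plain Python. It is an illustration of the math under that assumption, not the tfc.NoisyLogistic implementation, and the function names are hypothetical.

```python
import math

def logistic_cdf(x, scale):
    """CDF of a zero-mean logistic distribution, written to avoid overflow."""
    z = x / scale
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def noisy_logistic_pdf(y, scale):
    """Density of a logistic variable plus uniform noise on (-0.5, 0.5)."""
    return logistic_cdf(y + 0.5, scale) - logistic_cdf(y - 0.5, scale)

for scale in (2.0, 0.5, 0.01):
    print(f"scale={scale}: p(0)={noisy_logistic_pdf(0.0, scale):.4f}, "
          f"p(1)={noisy_logistic_pdf(1.0, scale):.4f}")
```

For the smallest scale the density is close to 1 at y = 0 and close to 0 at y = 1, which is the uniform density on (-0.5, 0.5).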

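During training, tfc.ContinuousBatchedEntropyModel adds uniform noise, and uses the noise and the prior to compute a (differentiable) upper bound on the rate, the average number of bits needed to encode the latent representation. That bound can be minimized as a loss.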

In [ ]:

entropy_model = tfc.ContinuousBatchedEntropyModel(
    prior, coding_rank=1, compression=False)
y_tilde, rate = entropy_model(y, training=True)

print("rate:", rate)
print("y_tilde:", y_tilde)
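Conceptually, the rate reported above is a negative log-likelihood measured in bits. The sketch below sums -log2 p over the latent dimensions under a factorized noisy logistic prior. It assumes this matches what the entropy model estimates with coding_rank=1, and it reuses a hand-written density rather than any TensorFlow Compression call. All function names and example values are hypothetical.

```python
import math

def logistic_cdf(x, scale):
    """CDF of a zero-mean logistic distribution, written to avoid overflow."""
    z = x / scale
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def noisy_logistic_pdf(y, scale):
    """Density of a logistic variable plus uniform noise on (-0.5, 0.5)."""
    return logistic_cdf(y + 0.5, scale) - logistic_cdf(y - 0.5, scale)

def rate_in_bits(noisy_latents, scales):
    """Sum over dimensions of -log2 p(y_tilde) under the factorized prior."""
    return sum(-math.log2(noisy_logistic_pdf(y, s))
               for y, s in zip(noisy_latents, scales))

scales = [1.0, 1.0, 1.0]       # Hypothetical prior scales.
typical = [0.1, -0.3, 0.2]     # Latents near the mode of the prior.
unlikely = [4.0, -5.0, 6.0]    # Latents far out in the tails.

typical_bits = rate_in_bits(typical, scales)
unlikely_bits = rate_in_bits(unlikely, scales)
print(f"typical: {typical_bits:.1f} bits, unlikely: {unlikely_bits:.1f} bits")
```

Latents that the prior considers unlikely cost more bits, which is why training pushes the analysis transform and the prior toward each other.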

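Lastly, the noisy latents are passed back through the synthesis transform to produce an image reconstruction $\tilde x$. Distortion is the error between the original image and the reconstruction. Since the transforms are untrained, the reconstruction is not very useful yet.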

In [ ]:

x_tilde = make_synthesis_transform()(y_tilde)

# Mean absolute difference across pixels.
distortion = tf.reduce_mean(abs(x - x_tilde))
print("distortion:", distortion)

x_tilde = tf.saturate_cast(x_tilde[0] * 255, tf.uint8)
plt.imshow(tf.squeeze(x_tilde))
print(f"Data type: {x_tilde.dtype}")
print(f"Shape: {x_tilde.shape}")

For every batch of digits, calling MNISTCompressionTrainer produces the rate and distortion as averages over that batch.

In [ ]:

(example_batch, _), = validation_dataset.batch(32).take(1)
trainer = MNISTCompressionTrainer(10)
example_output = trainer(example_batch)

print("rate: ", example_output["rate"])
print("distortion: ", example_output["distortion"])

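In the next section, we set up the model to do gradient descent on these two losses.

Train the model

We compile the trainer so that it optimizes the rate-distortion Lagrangian, that is, a sum of rate and distortion in which one of the terms is weighted by the Lagrange parameter $\lambda$. Here the distortion term carries the weight, so the loss is rate $+ \lambda \cdot$ distortion.

This loss function affects the different parts of the model differently:

The analysis transform is trained to produce a latent representation that achieves the desired trade-off between rate and distortion.
The synthesis transform is trained to minimize distortion, given the latent representation.
The parameters of the prior are trained to minimize the rate, given the latent representation. This is identical to fitting the prior to the marginal distribution of the latents in the maximum likelihood sense.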

In [ ]:

def pass_through_loss(_, x):
  # Since rate and distortion are unsupervised, the loss doesn't need a target.
  return x

def make_mnist_compression_trainer(lmbda, latent_dims=50):
  trainer = MNISTCompressionTrainer(latent_dims)
  trainer.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    # Just pass through rate and distortion as losses/metrics.
    loss=dict(rate=pass_through_loss, distortion=pass_through_loss),
    metrics=dict(rate=pass_through_loss, distortion=pass_through_loss),
    loss_weights=dict(rate=1., distortion=lmbda),
  )
  return trainer
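With loss_weights=dict(rate=1., distortion=lmbda), the total loss that Keras minimizes should amount to rate + lmbda * distortion. The sketch below evaluates that weighted sum for two operating points to show how $\lambda$ shifts the preferred trade-off. The function name and the rate and distortion values are hypothetical and are not results from this notebook.

```python
def rate_distortion_loss(rate, distortion, lmbda):
    """Weighted sum mirroring loss_weights=dict(rate=1., distortion=lmbda)."""
    return 1.0 * rate + lmbda * distortion

# Hypothetical operating points: bits per digit and mean absolute pixel error.
low_rate_point = dict(rate=20.0, distortion=0.06)
high_rate_point = dict(rate=60.0, distortion=0.03)

preferred = {}
for lmbda in (300, 2000):
    low = rate_distortion_loss(lmbda=lmbda, **low_rate_point)
    high = rate_distortion_loss(lmbda=lmbda, **high_rate_point)
    preferred[lmbda] = "low-rate" if low < high else "high-rate"
    print(f"lmbda={lmbda}: low-rate loss={low:.1f}, "
          f"high-rate loss={high:.1f} -> {preferred[lmbda]}")
```

A small $\lambda$ favors the point that spends fewer bits, while a large $\lambda$ favors the point with the smaller reconstruction error.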

Next, train the model. Human annotations are not needed here, since we only want to compress the images, so we drop them using map and add "dummy" targets for rate and distortion instead.

In [ ]:

def add_rd_targets(image, label):
  # Training is unsupervised, so labels aren't necessary here. However, we
  # need to add "dummy" targets for rate and distortion.
  return image, dict(rate=0., distortion=0.)

def train_mnist_model(lmbda):
  trainer = make_mnist_compression_trainer(lmbda)
  trainer.fit(
      training_dataset.map(add_rd_targets).batch(128).prefetch(8),
      epochs=15,
      validation_data=validation_dataset.map(add_rd_targets).batch(128).cache(),
      validation_freq=1,
      verbose=1,
  )
  return trainer

trainer = train_mnist_model(lmbda=2000)

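Compress some MNIST images

For compression and decompression at test time, we split the trained model into two parts:

The encoder side consists of the analysis transform and the entropy model.
The decoder side consists of the synthesis transform and the same entropy model.

At test time, the latents do not have additive noise. Instead, they are quantized and then losslessly compressed, so we give them new names. We call the quantized latents $\hat y$ and the image reconstruction $\hat x$ (following "End-to-end Optimized Image Compression").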

In [ ]:

class MNISTCompressor(tf.keras.Model):
  """Compresses MNIST images to strings."""

  def __init__(self, analysis_transform, entropy_model):
    super().__init__()
    self.analysis_transform = analysis_transform
    self.entropy_model = entropy_model

  def call(self, x):
    # Ensure inputs are floats in the range (0, 1).
    x = tf.cast(x, self.compute_dtype) / 255.
    y = self.analysis_transform(x)
    # Also return the exact information content of each digit.
    _, bits = self.entropy_model(y, training=False)
    return self.entropy_model.compress(y), bits

In [ ]:

class MNISTDecompressor(tf.keras.Model):
  """Decompresses MNIST images from strings."""

  def __init__(self, entropy_model, synthesis_transform):
    super().__init__()
    self.entropy_model = entropy_model
    self.synthesis_transform = synthesis_transform

  def call(self, string):
    y_hat = self.entropy_model.decompress(string, ())
    x_hat = self.synthesis_transform(y_hat)
    # Scale and cast back to 8-bit integer.
    return tf.saturate_cast(tf.round(x_hat * 255.), tf.uint8)

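When instantiated with compression=True, the entropy model converts the learned prior into tables for a range coding algorithm. Calling compress() invokes this algorithm to convert the latent space vector into a bit sequence. The length of each binary string approximates the information content of the latent (the negative log likelihood of the latent under the prior).

The entropy model for compression and decompression must be the same instance, because the range coding tables need to be exactly identical on both sides. Otherwise, decoding errors can occur.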

In [ ]:

def make_mnist_codec(trainer, **kwargs):
  # The entropy model must be created with `compression=True` and the same
  # instance must be shared between compressor and decompressor.
  entropy_model = tfc.ContinuousBatchedEntropyModel(
      trainer.prior, coding_rank=1, compression=True, **kwargs)
  compressor = MNISTCompressor(trainer.analysis_transform, entropy_model)
  decompressor = MNISTDecompressor(entropy_model, trainer.synthesis_transform)
  return compressor, decompressor

compressor, decompressor = make_mnist_codec(trainer)

Grab 16 images from the validation dataset. You can select a different subset by changing the argument to skip.

In [ ]:

(originals, _), = validation_dataset.batch(16).skip(3).take(1)

Compress them to strings, and keep track of the information content of each one in bits.

In [ ]:

strings, entropies = compressor(originals)

print(f"String representation of first digit in hexadecimal: 0x{strings[0].numpy().hex()}")
print(f"Number of bits actually needed to represent it: {entropies[0]:0.2f}")

Decompress the images back from the strings.

In [ ]:

reconstructions = decompressor(strings)

Display each of the 16 original digits together with its compressed binary representation and the reconstructed digit.

In [ ]:

#@title

def display_digits(originals, strings, entropies, reconstructions):
  """Visualizes 16 digits together with their reconstructions."""
  fig, axes = plt.subplots(4, 4, sharex=True, sharey=True, figsize=(12.5, 5))
  axes = axes.ravel()
  for i in range(len(axes)):
    image = tf.concat([
        tf.squeeze(originals[i]),
        tf.zeros((28, 14), tf.uint8),
        tf.squeeze(reconstructions[i]),
    ], 1)
    axes[i].imshow(image)
    axes[i].text(
        .5, .5, f"→ 0x{strings[i].numpy().hex()} →\n{entropies[i]:0.2f} bits",
        ha="center", va="top", color="white", fontsize="small",
        transform=axes[i].transAxes)
    axes[i].axis("off")
  plt.subplots_adjust(wspace=0, hspace=0, left=0, right=1, bottom=0, top=1)

In [ ]:

display_digits(originals, strings, entropies, reconstructions)

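Note that the length of the encoded string differs from the information content of each digit.

This is because the range coding process works with discrete probabilities and has a small amount of overhead. So, especially for short strings, the correspondence is only approximate. However, range coding is asymptotically optimal: in the limit, the expected bit count approaches the cross entropy (the expected information content), for which the rate term in the training model is an upper bound.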
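One source of the mismatch is easy to quantify: a string consists of whole bytes, so its length in bits is a multiple of 8. The sketch below counts only this rounding effect. It is a simplification that ignores any other overhead of the range coder, and the information contents used are hypothetical values, not outputs of the model above.

```python
import math

def min_whole_bytes(information_bits):
    """Smallest number of whole bytes that can hold `information_bits` bits."""
    return math.ceil(information_bits / 8)

for bits in (5.0, 20.5, 200.5):  # Hypothetical information contents in bits.
    n_bytes = min_whole_bytes(bits)
    unused = 8 * n_bytes - bits
    print(f"{bits} bits -> {n_bytes} bytes, "
          f"{unused} bits unused ({unused / bits:.1%})")
```

The relative cost of the rounding shrinks as the message grows, which is consistent with the correspondence being loosest for short strings.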

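The rate-distortion trade-off

Above, the model was trained for a specific trade-off (given by lmbda=2000) between the average number of bits used to represent each digit and the error incurred in the reconstruction.

What happens when we repeat the experiment with different values?

Let's start by reducing $\lambda$ to 500.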

In [ ]:

def train_and_visualize_model(lmbda):
  trainer = train_mnist_model(lmbda=lmbda)
  compressor, decompressor = make_mnist_codec(trainer)
  strings, entropies = compressor(originals)
  reconstructions = decompressor(strings)
  display_digits(originals, strings, entropies, reconstructions)

train_and_visualize_model(lmbda=500)

The bit rate of the code goes down, as does the fidelity of the digits. However, most of the digits remain recognizable.

Let's reduce $\lambda$ further.

In [ ]:

train_and_visualize_model(lmbda=300)

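The strings are now much shorter, on the order of one byte per digit. However, this comes at a cost: more digits are becoming unrecognizable.

This demonstrates that the model is agnostic to human perception of error. It only measures the absolute deviation in terms of pixel values. To achieve better perceived image quality, the pixel loss would need to be replaced with a perceptual loss.

Use the decoder as a generative model

If we feed the decoder random bits, it effectively samples from the distribution that the model learned to represent digits.

First, re-instantiate the compressor/decompressor without the sanity check that would detect whether the input string has been completely decoded.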

In [ ]:

compressor, decompressor = make_mnist_codec(trainer, decode_sanity_check=False)

Now, feed sufficiently long random strings into the decompressor so that it can decode/sample digits from them.

In [ ]:

import os

strings = tf.constant([os.urandom(8) for _ in range(16)])
samples = decompressor(strings)

fig, axes = plt.subplots(4, 4, sharex=True, sharey=True, figsize=(5, 5))
axes = axes.ravel()
for i in range(len(axes)):
  axes[i].imshow(tf.squeeze(samples[i]))
  axes[i].axis("off")
plt.subplots_adjust(wspace=0, hspace=0, left=0, right=1, bottom=0, top=1)
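The reason random input yields plausible latents can be sketched with inverse-CDF sampling: pushing uniform random numbers through the inverse CDF of the prior and rounding gives integer latents distributed according to the discretized prior. The plain-Python sketch below illustrates that idea under the assumption of a zero-mean logistic prior. It is a conceptual stand-in, not the range decoder used by TensorFlow Compression, and the scales and function names are hypothetical.

```python
import math
import random

def logistic_quantile(u, scale):
    """Inverse CDF of a zero-mean logistic distribution, for 0 < u < 1."""
    return scale * math.log(u / (1.0 - u))

def sample_quantized_latents(rng, scales):
    """Draws one integer latent per dimension from uniform random input."""
    latents = []
    for scale in scales:
        u = rng.random()
        while u == 0.0:  # Avoid log(0); rng.random() can return exactly 0.
            u = rng.random()
        latents.append(round(logistic_quantile(u, scale)))
    return latents

rng = random.Random(0)
scales = [0.5, 1.0, 2.0]  # Hypothetical learned prior scales.
samples = [sample_quantized_latents(rng, scales) for _ in range(5)]
for row in samples:
    print(row)
```

Passing such latents through a trained synthesis transform would then produce images, which is what the decompressor does with the random strings above.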