GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/tutorials/structured_data/imbalanced_data.ipynb
²⁵¹¹⁸ views

Kernel: Python 3

Copyright 2019 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

不均衡データの分類

このチュートリアルでは、1 つのクラスの例の数が他のクラスの例の数を大幅に上回る、非常に不均衡なデータセットを分類する方法を示します。Kaggle でホストされているクレジットカード不正検出データセットを使用します。目的は、合計 284,807 件のトランザクションからわずか 492 件の不正なトランザクションを検出することです。Keras を使用してモデルを定義し、クラスの重み付けを使用してモデルが不均衡なデータから学習できるようにします。

このチュートリアルには、次の完全なコードが含まれています。

Pandas を使用して CSV ファイルを読み込む。
トレーニングセット、検証セット、テストセットを作成する。
Keras を使用してモデルの定義してトレーニングする（クラスの重みの設定を含む）。
様々なメトリクス（適合率や再現率を含む）を使用してモデルを評価する。
不均衡データを扱うための一般的なテクニックを試す。
- クラスの重み付け
- オーバーサンプリング

Setup

In [ ]:

import tensorflow as tf
from tensorflow import keras

import os
import tempfile

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [ ]:

mpl.rcParams['figure.figsize'] = (12, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

データの処理と調査

Kaggle Credit Card Fraud データセットをダウンロードする

Pandas は、構造化データの読み込みと処理を支援するユーティリティが多数含まれる Python ライブラリです。Pandas を使用し、URL から CSV を Pandas DataFrame にダウンロードします。

注意: このデータセットは、Worldline と ULB (Université Libre de Bruxelles) の機械学習グループによるビッグデータマイニングと不正検出に関する共同研究で収集および分析されたものです。関連トピックに関する現在および過去のプロジェクトの詳細は、こちらと DefeatFraud プロジェクトのページをご覧ください。

In [ ]:

file = tf.keras.utils
raw_df = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv')
raw_df.head()

In [ ]:

raw_df[['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V26', 'V27', 'V28', 'Amount', 'Class']].describe()

クラスラベルの不均衡を調べる

データセットの不均衡を見てみましょう。

In [ ]:

neg, pos = np.bincount(raw_df['Class'])
total = neg + pos
print('Examples:\n    Total: {}\n    Positive: {} ({:.2f}% of total)\n'.format(
    total, pos, 100 * pos / total))

これは、陽性サンプルの割合が少ないことを示しています。

データをクリーニング、分割、正規化する

生データにはいくつかの問題があります。まず、TimeカラムとAmountカラムはむらがあり過ぎてそのままでは使用できません。Timeカラムは意味が明確ではないため削除し、Amountカラムのログを取って範囲を縮小します。

In [ ]:

cleaned_df = raw_df.copy()

# You don't want the `Time` column.
cleaned_df.pop('Time')

# The `Amount` column covers a huge range. Convert to log-space.
eps = 0.001 # 0 => 0.1¢
cleaned_df['Log Amount'] = np.log(cleaned_df.pop('Amount')+eps)

データセットをトレーニングセット、検証セット、テストセットに分割します。検証セットはモデルを適合させる間に使用され、損失とメトリクスを評価しますが、モデルはこのデータに適合しません。テストセットはトレーニング段階では全く使用されず、モデルがどの程度新しいデータを一般化したかを評価するために最後にだけ使用されます。これはトレーニングデータ不足による過学習が重大な懸念事項である不均衡データセットでは特に重要です。

In [ ]:

# Use a utility from sklearn to split and shuffle your dataset.
train_df, test_df = train_test_split(cleaned_df, test_size=0.2)
train_df, val_df = train_test_split(train_df, test_size=0.2)

# Form np arrays of labels and features.
train_labels = np.array(train_df.pop('Class'))
bool_train_labels = train_labels != 0
val_labels = np.array(val_df.pop('Class'))
test_labels = np.array(test_df.pop('Class'))

train_features = np.array(train_df)
val_features = np.array(val_df)
test_features = np.array(test_df)

sklearn の StandardScaler を使用して入力特徴を正規化します。これで平均は 0、標準偏差は 1 に設定されます。

注意: StandardScalerはtrain_featuresを使用する場合にのみ適合し、モデルが検証セットやテストセットでピークを迎えることがないようにします。

In [ ]:

scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)

val_features = scaler.transform(val_features)
test_features = scaler.transform(test_features)

train_features = np.clip(train_features, -5, 5)
val_features = np.clip(val_features, -5, 5)
test_features = np.clip(test_features, -5, 5)


print('Training labels shape:', train_labels.shape)
print('Validation labels shape:', val_labels.shape)
print('Test labels shape:', test_labels.shape)

print('Training features shape:', train_features.shape)
print('Validation features shape:', val_features.shape)
print('Test features shape:', test_features.shape)

警告: モデルをデプロイする場合には、前処理の計算を保存することが非常に重要です。最も簡単なのは、それらをレイヤーとして実装し、エクスポート前にモデルに加える方法です。

データ分散を確認する

次に、いくつかの特徴における陽性の例と陰性の例の分散を比較します。この時点で自問すべき点は、次のとおりです。

それらの分散には意味がありますか？
- はい。入力を正規化したので、ほとんどが+/- 2の範囲内に集中しています。
分散間の差は見られますか？
- はい。陽性の例には、はるかに高い極値が含まれています。

In [ ]:

pos_df = pd.DataFrame(train_features[ bool_train_labels], columns=train_df.columns)
neg_df = pd.DataFrame(train_features[~bool_train_labels], columns=train_df.columns)

sns.jointplot(x=pos_df['V5'], y=pos_df['V6'],
              kind='hex', xlim=(-5,5), ylim=(-5,5))
plt.suptitle("Positive distribution")

sns.jointplot(x=neg_df['V5'], y=neg_df['V6'],
              kind='hex', xlim=(-5,5), ylim=(-5,5))
_ = plt.suptitle("Negative distribution")

モデルとメトリクスを定義する

密に接続された非表示レイヤー、過学習を防ぐドロップアウトレイヤー、取引が不正である確率を返す出力シグモイドレイヤーを持つ単純なニューラルネットワークを作成する関数を定義します。

In [ ]:

METRICS = [
      keras.metrics.TruePositives(name='tp'),
      keras.metrics.FalsePositives(name='fp'),
      keras.metrics.TrueNegatives(name='tn'),
      keras.metrics.FalseNegatives(name='fn'), 
      keras.metrics.BinaryAccuracy(name='accuracy'),
      keras.metrics.Precision(name='precision'),
      keras.metrics.Recall(name='recall'),
      keras.metrics.AUC(name='auc'),
      keras.metrics.AUC(name='prc', curve='PR'), # precision-recall curve
]

def make_model(metrics=METRICS, output_bias=None):
  if output_bias is not None:
    output_bias = tf.keras.initializers.Constant(output_bias)
  model = keras.Sequential([
      keras.layers.Dense(
          16, activation='relu',
          input_shape=(train_features.shape[-1],)),
      keras.layers.Dropout(0.5),
      keras.layers.Dense(1, activation='sigmoid',
                         bias_initializer=output_bias),
  ])

  model.compile(
      optimizer=keras.optimizers.Adam(learning_rate=1e-3),
      loss=keras.losses.BinaryCrossentropy(),
      metrics=metrics)

  return model

有用なメトリクスを理解する

上記で定義したメトリクスのいくつかは、モデルで計算できるため、パフォーマンス評価の際に有用なことに着目してください。

偽陰性と偽陽性は誤って分類されたサンプルです。
真陰性と真陽性は正しく分類されたサンプルです。
正解率は正しく分類された例の割合です。

$\frac{\text{true samples}}{\text{total samples}}$

適合率は正しく分類された予測陽性の割合です。

$\frac{\text{true positives}}{\text{true positives + false positives}}$

再現率は正しく分類された実際の陽性の割合です。

$\frac{\text{true positives}}{\text{true positives + false negatives}}$

AUC は受信者動作特性曲線 (ROC-AUC) の曲線下の面積を指します。この指標は、分類器がランダムな正のサンプルをランダムな負のサンプルよりも高くランク付けする確率に等しくなります。
AUPRC は適合率-再現率曲線の曲線下の面積を指します。この指標は、さまざまな確率しきい値の適合率と再現率のペアを計算します。

注意: 精度は、このタスクに役立つ指標ではありません。常に False を予測することで、このタスクの精度を 99.8％以上にすることができるからです。

詳細は以下を参照してください。

ベースラインモデル

モデルを構築する

次に、前に定義した関数を使用してモデルを作成し、トレーニングします。モデルはデフォルトよりも大きいバッチサイズ 2048 を使って適合されていることに注目してください。これは、各バッチに必ずいくつかの陽性サンプルが含まれるようにするために重要です。もし、バッチサイズが小さすぎると、学習できる不正取引が全くないという可能性があります。

注意: このモデルはクラスの不均衡をうまく処理できません。後ほどこのチュートリアル内で改善します。

In [ ]:

EPOCHS = 100
BATCH_SIZE = 2048

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_prc', 
    verbose=1,
    patience=10,
    mode='max',
    restore_best_weights=True)

In [ ]:

model = make_model()
model.summary()

モデルをテスト実行します。

In [ ]:

model.predict(train_features[:10])

オプション: 正しい初期バイアスを設定する

これら初期の推測はあまり良いとは言えません。データセットは不均衡であることが分かっています。それを反映できるように、出力レイヤーのバイアスを設定します。（参照: ニューラルネットワークのトレーニングのレシピ: 「init well」）これは初期収束に有用です。

デフォルトのバイアス初期化では、損失はmath.log(2) = 0.69314程度になります。

In [ ]:

results = model.evaluate(train_features, train_labels, batch_size=BATCH_SIZE, verbose=0)
print("Loss: {:0.4f}".format(results[0]))

設定する正しいバイアスは、以下から導き出すことができます。

p_0 = pos/(pos + neg) = 1/(1+e^{-b_0})

In [ ]:

initial_bias = np.log([pos/neg])
initial_bias

それを初期バイアスとして設定すると、モデルははるかに合理的な初期推測ができるようになります。

これはpos/total = 0.0018に近い値になるはずです。

In [ ]:

model = make_model(output_bias=initial_bias)
model.predict(train_features[:10])

この初期化では、初期損失はおおよそ次のようになります。

-p_0log(p_0)-(1-p_0)log(1-p_0) = 0.01317

In [ ]:

results = model.evaluate(train_features, train_labels, batch_size=BATCH_SIZE, verbose=0)
print("Loss: {:0.4f}".format(results[0]))

この初期の損失は、単純な初期化を行った場合の約 50 分の 1 です。

この方法だと、陽性の例がないことを学習するだけのためにモデルが最初の数エポックを費やす必要がありません。また、これによって、トレーニング中の損失のプロットが読みやすくなります。

初期の重みをチェックポイントする

さまざまなトレーニングの実行を比較しやすくするために、この初期モデルの重みをチェックポイントファイルに保持し、トレーニングの前に各モデルにロードします。

In [ ]:

initial_weights = os.path.join(tempfile.mkdtemp(), 'initial_weights')
model.save_weights(initial_weights)

バイアス修正が有効であることを確認する

先に進む前に、慎重なバイアス初期化が実際に役立ったかどうかを素早く確認します。

この慎重な初期化を行った場合と行わなかった場合でモデルを 20 エポックトレーニングしてから損失を比較します。

In [ ]:

model = make_model()
model.load_weights(initial_weights)
model.layers[-1].bias.assign([0.0])
zero_bias_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_data=(val_features, val_labels), 
    verbose=0)

In [ ]:

model = make_model()
model.load_weights(initial_weights)
careful_bias_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=20,
    validation_data=(val_features, val_labels), 
    verbose=0)

In [ ]:

def plot_loss(history, label, n):
  # Use a log scale on y-axis to show the wide range of values.
  plt.semilogy(history.epoch, history.history['loss'],
               color=colors[n], label='Train ' + label)
  plt.semilogy(history.epoch, history.history['val_loss'],
               color=colors[n], label='Val ' + label,
               linestyle="--")
  plt.xlabel('Epoch')
  plt.ylabel('Loss')

In [ ]:

plot_loss(zero_bias_history, "Zero Bias", 0)
plot_loss(careful_bias_history, "Careful Bias", 1)

上の図を見れば一目瞭然ですが、検証損失に関しては、この問題ではこのように慎重に初期化することによって、明確なアドバンテージを得ることができます。

モデルをトレーニングする

In [ ]:

model = make_model()
model.load_weights(initial_weights)
baseline_history = model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=[early_stopping],
    validation_data=(val_features, val_labels))

トレーニング履歴を確認する

このセクションでは、トレーニングと検証のセットでモデルの精度と損失のプロットを作成します。これらは、過適合をチェックするのに役立ちます。詳細については、過適合と学習不足チュートリアルを参照してください。

さらに、上で作成した任意のメトリクスのプロットを作成することができます。例として、下記には偽陰性が含まれています。

In [ ]:

def plot_metrics(history):
  metrics = ['loss', 'prc', 'precision', 'recall']
  for n, metric in enumerate(metrics):
    name = metric.replace("_"," ").capitalize()
    plt.subplot(2,2,n+1)
    plt.plot(history.epoch, history.history[metric], color=colors[0], label='Train')
    plt.plot(history.epoch, history.history['val_'+metric],
             color=colors[0], linestyle="--", label='Val')
    plt.xlabel('Epoch')
    plt.ylabel(name)
    if metric == 'loss':
      plt.ylim([0, plt.ylim()[1]])
    elif metric == 'auc':
      plt.ylim([0.8,1])
    else:
      plt.ylim([0,1])

    plt.legend()

In [ ]:

plot_metrics(baseline_history)

注意: 一般的に、検証曲線はトレーニング曲線よりも優れています。これは主に、モデルを評価する際にドロップアウトレイヤーがアクティブでないということに起因します。

メトリクスを評価する

混同行列を使用して、実際のラベルと予測されたラベルを要約できます。ここで、X 軸は予測されたラベルであり、Y 軸は実際のラベルです。

In [ ]:

train_predictions_baseline = model.predict(train_features, batch_size=BATCH_SIZE)
test_predictions_baseline = model.predict(test_features, batch_size=BATCH_SIZE)

In [ ]:

def plot_cm(labels, predictions, p=0.5):
  cm = confusion_matrix(labels, predictions > p)
  plt.figure(figsize=(5,5))
  sns.heatmap(cm, annot=True, fmt="d")
  plt.title('Confusion matrix @{:.2f}'.format(p))
  plt.ylabel('Actual label')
  plt.xlabel('Predicted label')

  print('Legitimate Transactions Detected (True Negatives): ', cm[0][0])
  print('Legitimate Transactions Incorrectly Detected (False Positives): ', cm[0][1])
  print('Fraudulent Transactions Missed (False Negatives): ', cm[1][0])
  print('Fraudulent Transactions Detected (True Positives): ', cm[1][1])
  print('Total Fraudulent Transactions: ', np.sum(cm[1]))

テストデータセットでモデルを評価し、上記で作成した行列の結果を表示します。

In [ ]:

baseline_results = model.evaluate(test_features, test_labels,
                                  batch_size=BATCH_SIZE, verbose=0)
for name, value in zip(model.metrics_names, baseline_results):
  print(name, ': ', value)
print()

plot_cm(test_labels, test_predictions_baseline)

モデルがすべてを完璧に予測した場合は、これは対角行列になり、主な対角線から外れた値が不正確な予測を示してゼロになります。この場合、行列は偽陽性が比較的少ないことを示し、これは誤ってフラグが立てられた正当な取引が比較的少ないことを意味します。しかし、偽陽性の数が増えればコストがかかる可能性はありますが、偽陰性の数はさらに少なくした方が良いでしょう。偽陽性は顧客にカード利用履歴の確認を求めるメールを送信する可能性があるのに対し、偽陰性は不正な取引を成立させてしまう可能性があるため、このトレードオフはむしろ望ましいといえます。

ROC をプロットする

次に、ROC をプロットします。このプロットは、出力しきい値を調整するだけでモデルが到達できるパフォーマンス範囲が一目で分かるので有用です。

In [ ]:

def plot_roc(name, labels, predictions, **kwargs):
  fp, tp, _ = sklearn.metrics.roc_curve(labels, predictions)

  plt.plot(100*fp, 100*tp, label=name, linewidth=2, **kwargs)
  plt.xlabel('False positives [%]')
  plt.ylabel('True positives [%]')
  plt.xlim([-0.5,20])
  plt.ylim([80,100.5])
  plt.grid(True)
  ax = plt.gca()
  ax.set_aspect('equal')

In [ ]:

plot_roc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')
plt.legend(loc='lower right');

AUPRC をプロットする

AUPRC をプロットします。補間された適合率-再現率曲線の下の領域は、分類しきい値のさまざまな値に対して（再現率、適合率）点をプロットすることにより取得できます。計算方法によっては、PR AUC はモデルの平均適合率と同等になる場合があります。

In [ ]:

def plot_prc(name, labels, predictions, **kwargs):
    precision, recall, _ = sklearn.metrics.precision_recall_curve(labels, predictions)

    plt.plot(precision, recall, label=name, linewidth=2, **kwargs)
    plt.xlabel('Precision')
    plt.ylabel('Recall')
    plt.grid(True)
    ax = plt.gca()
    ax.set_aspect('equal')

In [ ]:

plot_prc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_prc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')
plt.legend(loc='lower right');

適合率は比較的高いように見えますが、再現率と ROC 曲線の下の曲線下面積 (AUC) は、期待するほど高いものではありません。適合率と再現率の両方を最大化しようとすると、分類器はしばしば課題に直面します。不均衡データセットを扱う場合は特にそうです。大切な問題のコンテキストでは異なるタイプのエラーにかかるコストを考慮することが重要です。この例では、偽陰性（不正な取引が見逃されている）は金銭的コストを伴う可能性がある一方で、偽陽性（取引が不正であると誤ってフラグが立てられている）はユーザーの幸福度を低下させる可能性があります。

クラスの重み

クラスの重みを計算する

最終目的は不正な取引を特定することですが、処理する陽性サンプルがそれほど多くないので、利用可能な数少ない例の分類器に大きな重み付けをします。これを行うには、パラメータを介して各クラスの重みを Keras に渡します。これにより、モデルは十分に表現されていないクラスの例にも「より注意を払う」ようになります。

In [ ]:

# Scaling by total/2 helps keep the loss to a similar magnitude.
# The sum of the weights of all examples stays the same.
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)

class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))

クラスの重みでモデルをトレーニングする

次に、クラスの重みでモデルを再トレーニングして評価し、それが予測にどのように影響するかを確認します。

注意: class_weights を使用すると、損失の範囲が変更されます。オプティマイザにもよりますが、これはトレーニングの安定性に影響を与える可能性があります。tf.keras.optimizers.SGD のように、ステップサイズが勾配の大きさに依存するオプティマイザは失敗する可能性があります。ここで使用されているオプティマイザ tf.keras.optimizers.Adam は、スケーリングの変更による影響を受けません。また、重み付けのため、総損失は 2 つのモデル間で比較できないことに注意してください。

In [ ]:

weighted_model = make_model()
weighted_model.load_weights(initial_weights)

weighted_history = weighted_model.fit(
    train_features,
    train_labels,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    callbacks=[early_stopping],
    validation_data=(val_features, val_labels),
    # The class weights go here
    class_weight=class_weight)

トレーニング履歴を確認する

In [ ]:

plot_metrics(weighted_history)

メトリクスを評価する

In [ ]:

train_predictions_weighted = weighted_model.predict(train_features, batch_size=BATCH_SIZE)
test_predictions_weighted = weighted_model.predict(test_features, batch_size=BATCH_SIZE)

In [ ]:

weighted_results = weighted_model.evaluate(test_features, test_labels,
                                           batch_size=BATCH_SIZE, verbose=0)
for name, value in zip(weighted_model.metrics_names, weighted_results):
  print(name, ': ', value)
print()

plot_cm(test_labels, test_predictions_weighted)

ここでは、クラスの重みを使用すると偽陽性が多くなるため、正解率と適合率が低くなりますが、逆にモデルがより多くの真陽性を検出したため、再現率と AUC が高くなっていることが分かります。このモデルは正解率は低いものの、再現率が高くなるので、より多くの不正取引を特定します。もちろん、両タイプのエラーにはコストがかかります。（あまりにも多くの正当な取引を不正取引としてフラグを立ててユーザーに迷惑をかけたくはないはずです。）アプリケーションのこういった異なるタイプのエラー間のトレードオフは、慎重に検討してください

ROC をプロットする

In [ ]:

plot_roc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')

plot_roc("Train Weighted", train_labels, train_predictions_weighted, color=colors[1])
plot_roc("Test Weighted", test_labels, test_predictions_weighted, color=colors[1], linestyle='--')


plt.legend(loc='lower right');

AUPRC をプロットする

In [ ]:

plot_prc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_prc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')

plot_prc("Train Weighted", train_labels, train_predictions_weighted, color=colors[1])
plot_prc("Test Weighted", test_labels, test_predictions_weighted, color=colors[1], linestyle='--')


plt.legend(loc='lower right');

オーバーサンプリング

マイノリティクラスをオーバーサンプリングする

関連したアプローチとして、マイノリティクラスをオーバーサンプリングしてデータセットを再サンプルするという方法があります。

In [ ]:

pos_features = train_features[bool_train_labels]
neg_features = train_features[~bool_train_labels]

pos_labels = train_labels[bool_train_labels]
neg_labels = train_labels[~bool_train_labels]

NumPy を使用する

陽性の例から適切な数のランダムインデックスを選択して、手動でデータセットのバランスをとることができます。

In [ ]:

ids = np.arange(len(pos_features))
choices = np.random.choice(ids, len(neg_features))

res_pos_features = pos_features[choices]
res_pos_labels = pos_labels[choices]

res_pos_features.shape

In [ ]:

resampled_features = np.concatenate([res_pos_features, neg_features], axis=0)
resampled_labels = np.concatenate([res_pos_labels, neg_labels], axis=0)

order = np.arange(len(resampled_labels))
np.random.shuffle(order)
resampled_features = resampled_features[order]
resampled_labels = resampled_labels[order]

resampled_features.shape

`tf.data`を使用する

もしtf.dataを使用している場合、バランスの取れた例を作成する最も簡単な方法は、positiveとnegativeのデータセットから開始し、それらをマージすることです。その他の例については、tf.data ガイドをご覧ください。

In [ ]:

BUFFER_SIZE = 100000

def make_ds(features, labels):
  ds = tf.data.Dataset.from_tensor_slices((features, labels))#.cache()
  ds = ds.shuffle(BUFFER_SIZE).repeat()
  return ds

pos_ds = make_ds(pos_features, pos_labels)
neg_ds = make_ds(neg_features, neg_labels)

各データセットは(feature, label)のペアを提供します。

In [ ]:

for features, label in pos_ds.take(1):
  print("Features:\n", features.numpy())
  print()
  print("Label: ", label.numpy())

tf.data.Dataset.sample_from_datasets を使用し、この 2 つをマージします。

In [ ]:

resampled_ds = tf.data.Dataset.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
resampled_ds = resampled_ds.batch(BATCH_SIZE).prefetch(2)

In [ ]:

for features, label in resampled_ds.take(1):
  print(label.numpy().mean())

このデータセットを使用するには、エポックごとのステップ数が必要です。

この場合の「エポック」の定義はあまり明確ではありません。それぞれの陰性の例を 1 度見るのに必要なバッチ数だとしましょう。

In [ ]:

resampled_steps_per_epoch = np.ceil(2.0*neg/BATCH_SIZE)
resampled_steps_per_epoch

オーバーサンプリングデータをトレーニングする

ここで、クラスの重みを使用する代わりに、再サンプルされたデータセットを使用してモデルをトレーニングし、それらの手法がどう比較されるかを確認してみましょう。

注意: 陽性の例を複製することでデータのバランスをとっているため、データセットの総サイズは大きくなり、各エポックではより多くのトレーニングステップが実行されます。

In [ ]:

resampled_model = make_model()
resampled_model.load_weights(initial_weights)

# Reset the bias to zero, since this dataset is balanced.
output_layer = resampled_model.layers[-1] 
output_layer.bias.assign([0])

val_ds = tf.data.Dataset.from_tensor_slices((val_features, val_labels)).cache()
val_ds = val_ds.batch(BATCH_SIZE).prefetch(2) 

resampled_history = resampled_model.fit(
    resampled_ds,
    epochs=EPOCHS,
    steps_per_epoch=resampled_steps_per_epoch,
    callbacks=[early_stopping],
    validation_data=val_ds)

トレーニングプロセスが勾配の更新ごとにデータセット全体を考慮する場合は、このオーバーサンプリングは基本的にクラスの重み付けと同じになります。

しかし、ここで行ったようにバッチ単位でモデルをトレーニングする場合、オーバーサンプリングされたデータはより滑らかな勾配信号を提供します。それぞれの陽性の例を大きな重みを持つ 1 つのバッチで表示する代わりに、毎回小さな重みを持つ多くの異なるバッチで表示します。

このような滑らかな勾配信号は、モデルのトレーニングを容易にします。

トレーニング履歴を確認する

トレーニングデータは検証データやテストデータとは全く異なる分散を持つため、ここでのメトリクスの分散は異なることに注意してください。

In [ ]:

plot_metrics(resampled_history)

再トレーニングする

バランスの取れたデータの方がトレーニングしやすいため、上記のトレーニング方法ではすぐに過学習してしまう可能性があります。

したがって、エポックを分割して、tf.keras.callbacks.EarlyStopping がトレーニングを停止するタイミングをより細かく制御できるようにします。

In [ ]:

resampled_model = make_model()
resampled_model.load_weights(initial_weights)

# Reset the bias to zero, since this dataset is balanced.
output_layer = resampled_model.layers[-1] 
output_layer.bias.assign([0])

resampled_history = resampled_model.fit(
    resampled_ds,
    # These are not real epochs
    steps_per_epoch=20,
    epochs=10*EPOCHS,
    callbacks=[early_stopping],
    validation_data=(val_ds))

トレーニング履歴を再確認する

In [ ]:

plot_metrics(resampled_history)

メトリクスを評価する

In [ ]:

train_predictions_resampled = resampled_model.predict(train_features, batch_size=BATCH_SIZE)
test_predictions_resampled = resampled_model.predict(test_features, batch_size=BATCH_SIZE)

In [ ]:

resampled_results = resampled_model.evaluate(test_features, test_labels,
                                             batch_size=BATCH_SIZE, verbose=0)
for name, value in zip(resampled_model.metrics_names, resampled_results):
  print(name, ': ', value)
print()

plot_cm(test_labels, test_predictions_resampled)

ROC をプロットする

In [ ]:

plot_roc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_roc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')

plot_roc("Train Weighted", train_labels, train_predictions_weighted, color=colors[1])
plot_roc("Test Weighted", test_labels, test_predictions_weighted, color=colors[1], linestyle='--')

plot_roc("Train Resampled", train_labels, train_predictions_resampled, color=colors[2])
plot_roc("Test Resampled", test_labels, test_predictions_resampled, color=colors[2], linestyle='--')
plt.legend(loc='lower right');

AUPRC をプロットする

In [ ]:

plot_prc("Train Baseline", train_labels, train_predictions_baseline, color=colors[0])
plot_prc("Test Baseline", test_labels, test_predictions_baseline, color=colors[0], linestyle='--')

plot_prc("Train Weighted", train_labels, train_predictions_weighted, color=colors[1])
plot_prc("Test Weighted", test_labels, test_predictions_weighted, color=colors[1], linestyle='--')

plot_prc("Train Resampled", train_labels, train_predictions_resampled, color=colors[2])
plot_prc("Test Resampled", test_labels, test_predictions_resampled, color=colors[2], linestyle='--')
plt.legend(loc='lower right');

このチュートリアルを問題に応用する

不均衡データの分類は、学習できるサンプルが非常に少ないため、本質的に難しい作業です。常に最初にデータから始め、できるだけ多くのサンプルを収集するよう最善を尽くし、モデルがマイノリティクラスを最大限に活用するのに適切な特徴はどれかを十分に検討する必要があります。ある時点では、モデルがなかなか改善せず望む結果をうまく生成できないことがあるので、問題のコンテキストおよび異なるタイプのエラー間のトレードオフを考慮することが重要です。

Copyright 2019 The TensorFlow Authors.

不均衡データの分類

Setup

データの処理と調査

Kaggle Credit Card Fraud データセットをダウンロードする

クラスラベルの不均衡を調べる

データをクリーニング、分割、正規化する

データ分散を確認する

モデルとメトリクスを定義する

有用なメトリクスを理解する

ベースラインモデル

モデルを構築する

オプション: 正しい初期バイアスを設定する

初期の重みをチェックポイントする

バイアス修正が有効であることを確認する

モデルをトレーニングする

トレーニング履歴を確認する

メトリクスを評価する

ROC をプロットする

AUPRC をプロットする

クラスの重み

クラスの重みを計算する

クラスの重みでモデルをトレーニングする

トレーニング履歴を確認する

メトリクスを評価する

ROC をプロットする

AUPRC をプロットする

オーバーサンプリング

マイノリティクラスをオーバーサンプリングする

NumPy を使用する

tf.dataを使用する

オーバーサンプリングデータをトレーニングする

トレーニング履歴を確認する

再トレーニングする

トレーニング履歴を再確認する

メトリクスを評価する

ROC をプロットする

AUPRC をプロットする

このチュートリアルを問題に応用する

`tf.data`を使用する