GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/hub/tutorials/bangla_article_classifier.ipynb

Kernel: Python 3

Copyright 2019 The TensorFlow Hub Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [ ]:

# Copyright 2019 The TensorFlow Hub Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, 
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

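Bangla Article Classification With TF-Hub

Caution: In addition to installing Python packages with pip, this notebook uses sudo apt install to install a system package: unzip.

This Colab demonstrates how to use TensorFlow Hub for text classification in non-English/local languages. Here we choose Bangla as the local language and use pretrained word embeddings to solve a multiclass classification task: classifying Bangla news articles into 5 categories. The pretrained embeddings for Bangla come from fastText, a library by Facebook that has released pretrained word vectors for 157 languages.

We will use TF-Hub's pretrained embedding exporter to first convert the word embeddings into a text embedding module, and then use that module to train a classifier with tf.keras, TensorFlow's high-level, user-friendly API for building deep learning models. Although we use fastText embeddings here, you can export any other embeddings pretrained on other tasks and quickly get results with TensorFlow Hub.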

Setup

In [ ]:

%%bash
# https://github.com/pypa/setuptools/issues/1694#issuecomment-466010982
pip install gdown --no-use-pep517

In [ ]:

%%bash
sudo apt-get install -y unzip

In [ ]:

import os

import tensorflow as tf
import tensorflow_hub as hub

import gdown
import numpy as np
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import seaborn as sns

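Dataset

We will use BARD (Bangla Article Dataset), which has around 376,226 articles collected from different Bangla news portals and labelled with 5 categories: economy, state, international, sports, and entertainment. We download the file from Google Drive; the bit.ly/BARD_DATASET link is the one referenced in the dataset's GitHub repository.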

In [ ]:

gdown.download(
    url='https://drive.google.com/uc?id=1Ag0jd21oRwJhVFIBohmX_ogeojVtapLy',
    output='bard.zip',
    quiet=True
)

In [ ]:

%%bash
unzip -qo bard.zip

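Export pretrained word vectors to a TF-Hub module

TF-Hub provides a handy script for converting word embeddings into TF-Hub text embedding modules. To build a module for Bangla or any other language, download the word embedding .txt or .vec file into the same directory as export_v2.py and run the script.

The exporter reads the embedding vectors and exports them to a TensorFlow SavedModel. A SavedModel contains a complete TensorFlow program, including weights and graph. TF-Hub can load the SavedModel as a module. Because we build the model with tf.keras, we use hub.KerasLayer, which wraps a hub module so that it can be used as a Keras layer.

First, we get the word embeddings from fastText and the embedding exporter from the TF-Hub repo.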

In [ ]:

%%bash
curl -O https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.bn.300.vec.gz
curl -O https://raw.githubusercontent.com/tensorflow/hub/master/examples/text_embeddings_v2/export_v2.py
gunzip -qfk cc.bn.300.vec.gz

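Next, we run the exporter script on the embedding file. Because fastText embeddings have a header line and are quite large (around 3.3 GB for Bangla after conversion to a module), we ignore the header line and export only the first 100,000 tokens to the text embedding module.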

In [ ]:

%%bash
python export_v2.py --embedding_file=cc.bn.300.vec --export_path=text_module --num_lines_to_ignore=1 --num_lines_to_use=100000
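To make the two flags above more concrete, here is a minimal sketch of what "ignore the header line and keep only the first N tokens" means for a fastText-style .vec file. It is not the exporter's actual code: read_vec_prefix is a hypothetical helper, the sample data is made up, and the sketch assumes one token per line followed by space-separated floats.

```python
import io

def read_vec_prefix(lines, num_lines_to_ignore=1, num_lines_to_use=None):
    """Parse fastText-style text vectors: a token, then its float values."""
    vocab, vectors = [], []
    for i, line in enumerate(lines):
        if i < num_lines_to_ignore:
            continue  # skip the "<vocab_size> <dim>" header line
        if num_lines_to_use is not None and len(vocab) >= num_lines_to_use:
            break  # keep only the first num_lines_to_use tokens
        token, *values = line.rstrip().split(" ")
        vocab.append(token)
        vectors.append([float(v) for v in values])
    return vocab, vectors

# Tiny made-up file in the same layout (3 tokens, 4 dimensions).
sample = io.StringIO(
    "3 4\n"
    "বাস 0.1 0.2 0.3 0.4\n"
    "ট্রেন 0.5 0.6 0.7 0.8\n"
    "যাত্রী 0.9 1.0 1.1 1.2\n"
)
vocab, vectors = read_vec_prefix(sample, num_lines_to_ignore=1, num_lines_to_use=2)
print(vocab)            # the first two tokens after the header
print(len(vectors[0]))  # the embedding dimension
```

fastText lists tokens in descending frequency order, so truncating the file keeps the most common words.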

In [ ]:

module_path = "text_module"
embedding_layer = hub.KerasLayer(module_path, trainable=False)

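The text embedding module takes a batch of sentences in a 1-D tensor of strings as input and outputs embedding vectors of shape (batch_size, embedding_dim), one per sentence. It preprocesses the input by splitting on spaces. Word embeddings are combined into sentence embeddings with the sqrtn combiner. As a demonstration, we pass a list of Bangla words as input and get the corresponding embedding vectors.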

In [ ]:

embedding_layer(['বাস', 'বসবাস', 'ট্রেন', 'যাত্রী', 'ট্রাক'])
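As a rough illustration of the sqrtn combiner, the sketch below splits a sentence on spaces, sums the word vectors, and divides by the square root of the number of words. The toy vocabulary and the function name are hypothetical, every word is assumed to be in the vocabulary, and the exported module handles this step internally.

```python
import math

# Hypothetical 3-dimensional word vectors; the real module uses 300 dimensions.
toy_embeddings = {
    "a": [1.0, 0.0, 2.0],
    "b": [3.0, 4.0, 0.0],
}

def sqrtn_sentence_embedding(sentence, embeddings):
    """Split on spaces, sum the word vectors, divide by sqrt(number of words)."""
    words = sentence.split(" ")
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        for j, value in enumerate(embeddings[word]):
            total[j] += value
    scale = math.sqrt(len(words))
    return [value / scale for value in total]

single = sqrtn_sentence_embedding("a", toy_embeddings)    # one word: vector unchanged
pair = sqrtn_sentence_embedding("a b", toy_embeddings)    # (a + b) / sqrt(2)
print(single)
print(pair)
```

For a single-word input, as in the cell above, the scale factor is 1, so the output is simply that word's embedding.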

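Convert to a TensorFlow Dataset

Because the dataset is very large, we do not load all of it into memory. Instead, we collect only the file paths and labels and let tf.data read the article text from disk and produce batches at run time. The file list is built one category directory at a time, so it starts out ordered by class, and the dataset is also very imbalanced; we therefore shuffle the paths and labels before splitting them.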

In [ ]:

dir_names = ['economy', 'sports', 'entertainment', 'state', 'international']

file_paths = []
labels = []
for i, dir_name in enumerate(dir_names):
  file_names = ["/".join([dir_name, name]) for name in os.listdir(dir_name)]
  file_paths += file_names
  labels += [i] * len(file_names)

np.random.seed(42)
permutation = np.random.permutation(len(file_paths))

file_paths = np.array(file_paths)[permutation]
labels = np.array(labels)[permutation]

After shuffling, we can check the distribution of labels in the training and validation examples.

In [ ]:

train_frac = 0.8
train_size = int(len(file_paths) * train_frac)

In [ ]:

# plot training vs validation distribution
plt.subplot(1, 2, 1)
plt.hist(labels[0:train_size])
plt.title("Train labels")
plt.subplot(1, 2, 2)
plt.hist(labels[train_size:])
plt.title("Validation labels")
plt.tight_layout()
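The histograms give a visual check. As a complementary sketch, the class fractions of each split can also be computed numerically. The label list below is made up and label_fractions is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def label_fractions(label_list):
    """Return {label: fraction of examples} for a list of integer labels."""
    counts = Counter(label_list)
    total = len(label_list)
    return {label: counts[label] / total for label in sorted(counts)}

# Made-up, already shuffled labels standing in for the notebook's `labels` array.
toy_labels = [0, 1, 1, 3, 1, 0, 4, 1, 2, 1]
toy_train_size = int(len(toy_labels) * 0.8)

train_fractions = label_fractions(toy_labels[:toy_train_size])
validation_fractions = label_fractions(toy_labels[toy_train_size:])
print(train_fractions)
print(validation_fractions)
```

In the notebook the same helper could be called as label_fractions(labels[:train_size].tolist()). With a large, well-shuffled dataset the two splits should show similar fractions; the toy list is too small for that.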

To create the datasets, we slice the shuffled file_paths and labels arrays 80/20 into training and validation parts and pass each pair to tf.data.Dataset.from_tensor_slices. The load_file function, applied with map, reads each article with tf.io.read_file, so every example is a tuple of the article text as a tf.string and its integer class label. The training dataset is additionally shuffled with a buffer of 5,000 examples, and both datasets are batched in groups of 256 and prefetched.

In [ ]:

def load_file(path, label):
    return tf.io.read_file(path), label

In [ ]:

def make_datasets(train_size):
  batch_size = 256

  train_files = file_paths[:train_size]
  train_labels = labels[:train_size]
  train_ds = tf.data.Dataset.from_tensor_slices((train_files, train_labels))
  train_ds = train_ds.map(load_file).shuffle(5000)
  train_ds = train_ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

  test_files = file_paths[train_size:]
  test_labels = labels[train_size:]
  test_ds = tf.data.Dataset.from_tensor_slices((test_files, test_labels))
  test_ds = test_ds.map(load_file)
  test_ds = test_ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)


  return train_ds, test_ds

In [ ]:

train_data, validation_data = make_datasets(train_size)

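Model training and evaluation

Since we have already added a wrapper around our module so that it can be used like any other Keras layer, we can create a small Sequential model, which is a linear stack of layers. The text embedding module is passed in the layer list just like any other layer. We compile the model by specifying the loss and optimizer and train it for 5 epochs. The tf.keras API accepts TensorFlow Datasets as input, so we can pass a Dataset instance to the fit method; tf.data then handles reading the samples, batching them, and feeding them to the model.

Model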

In [ ]:

def create_model():
  model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[], dtype=tf.string),
    embedding_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(5),
  ])
  model.compile(loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
      optimizer="adam", metrics=['accuracy'])
  return model

In [ ]:

model = create_model()
# Create earlystopping callback
early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=3)
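To show what patience=3 and min_delta=0 mean in practice, here is a simplified pure-Python sketch of the stopping rule: training stops once the monitored validation loss has failed to improve for 3 consecutive epochs. The function name and loss values are made up, and the sketch mirrors only the core rule, not the full Keras callback.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1    # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

stalled = stopping_epoch([0.50, 0.40, 0.41, 0.42, 0.43, 0.30])
improving = stopping_epoch([0.50, 0.40, 0.35, 0.36, 0.34])
print(stalled)    # stops after three epochs without improvement
print(improving)  # never stalls for three epochs in a row
```

Under this rule, a run capped at 5 epochs can end early at epoch 4 at the soonest, and only if the validation loss never improves on its first-epoch value.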

Training

In [ ]:

history = model.fit(train_data, 
                    validation_data=validation_data, 
                    epochs=5, 
                    callbacks=[early_stopping_callback])

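Evaluation

Using the tf.keras.callbacks.History object returned by the tf.keras.Model.fit method, which holds the loss and accuracy values for each epoch, we can visualize the accuracy and loss curves for the training and validation data.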

In [ ]:

# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

# Plot training & validation loss values
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()

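Prediction

We can get the predictions for the validation data and check the confusion matrix to see how the model performs for each of the 5 classes. Because the final Dense layer has no softmax activation, tf.keras.Model.predict returns an N-D array of per-class logits rather than probabilities; np.argmax converts them to class labels either way.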

In [ ]:

y_pred = model.predict(validation_data)

In [ ]:

y_pred = np.argmax(y_pred, axis=1)
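The model's last layer is Dense(5) without an activation, and the loss is built with from_logits=True, so the values passed to np.argmax are logits. The sketch below, with made-up logits and hypothetical helper names, illustrates that applying softmax first would not change the predicted class, because softmax preserves the ordering of its inputs.

```python
import math

def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

toy_logits = [1.2, -0.3, 4.0, 0.5, 2.1]  # made-up model output for one article
probs = softmax(toy_logits)
print(argmax(toy_logits), argmax(probs))  # same class index either way
print(round(sum(probs), 6))               # probabilities sum to 1
```

If per-class probabilities are needed in the notebook, tf.nn.softmax can be applied to the predict output before taking the argmax.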

In [ ]:

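# y_pred follows the order of the (unshuffled) validation split,
# so the sample files must come from that split as well.
samples = file_paths[train_size:train_size + 3]
for i, sample in enumerate(samples):
  with open(sample) as f:
    text = f.read()
  print(text[0:100])
  print("True Class: ", sample.split("/")[0])
  print("Predicted Class: ", dir_names[y_pred[i]])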

Compare performance

Now that we have the correct labels for the validation data from labels, we can compare them with our predictions to get a classification_report.

In [ ]:

y_true = np.array(labels[train_size:])

In [ ]:

print(classification_report(y_true, y_pred, target_names=dir_names))
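The report's per-class precision and recall both come from the confusion matrix. As a sketch of that relationship, the pure-Python example below builds a small confusion matrix from made-up labels and derives both metrics from it. The helper names and data are hypothetical and use 3 classes instead of the notebook's 5.

```python
def toy_confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts examples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def precision_recall(matrix, cls):
    """Precision = TP / column sum, recall = TP / row sum for one class."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # everything predicted as cls
    actual = sum(matrix[cls])                    # everything that truly is cls
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

toy_true = [0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 2, 0]
cm = toy_confusion_matrix(toy_true, toy_pred, num_classes=3)
for row in cm:
    print(row)
for cls in range(3):
    print(cls, precision_recall(cm, cls))
```

For the real predictions, sklearn.metrics.confusion_matrix(y_true, y_pred) returns the same kind of matrix, with true classes as rows and predicted classes as columns.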

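We can also compare our model's performance with the result published in the original paper, which reported a precision of 0.96. The original authors describe many preprocessing steps applied to the dataset, such as dropping punctuation and digits and removing the 25 most frequent stop words. As the classification_report shows, we also obtain a precision and accuracy of 0.96 after training for only 5 epochs and without any preprocessing!

In this example, we set trainable=False when creating the Keras layer from the embedding module, which means the embedding weights are not updated during training. Try setting it to True to reach around 97% accuracy on this dataset after only 2 epochs.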