GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/tutorials/load_data/text.ipynb
²⁵¹¹⁸ views

Kernel: Python 3

Copyright 2018 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

テキストを読み込む

このチュートリアルでは、テキストを読み込んで前処理する 2 つの方法を紹介します。

まず、Keras ユーティリティと前処理レイヤーを使用します。これには、データを tf.data.Dataset に変換するための tf.keras.utils.text_dataset_from_directory とデータを標準化、トークン化、およびベクトル化するための tf.keras.layers.TextVectorization が含まれます。TensorFlow を初めて使用する場合は、これらから始める必要があります。
次に、tf.data.TextLineDataset などの低レベルのユーティリティを使用してテキストファイルを読み込み、text.UnicodeScriptTokenizer や text.case_fold_utf8 などの TensorFlow Text APIを使用して、よりきめ細かい制御のためにデータを前処理します。

In [ ]:

!pip install "tensorflow-text==2.11.*"

In [ ]:

import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

例 1: StackOverflow の質問のタグを予測する

最初の例として、StackOverflow からプログラミングの質問のデータセットをダウンロードします。それぞれの質問 (「ディクショナリを値で並べ替えるにはどうすればよいですか？」) は、1 つのタグ (Python、CSharp、JavaScript、またはJava) でラベルされています。このタスクでは、質問のタグを予測するモデルを開発します。これは、マルチクラス分類の例です。マルチクラス分類は、重要で広く適用できる機械学習の問題です。

データセットをダウンロードして調査する

まず、tf.keras.utils.get_file を使用して Stack Overflow データセットをダウンロードし、ディレクトリの構造を調べます。

In [ ]:

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent

In [ ]:

list(dataset_dir.iterdir())

In [ ]:

train_dir = dataset_dir/'train'
list(train_dir.iterdir())

train/csharp、train/java、train/python および train/javascript ディレクトリには、多くのテキストファイルが含まれています。それぞれが Stack Overflow の質問です。

サンプルファイルを出力してデータを調べます。

In [ ]:

sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

データセットを読み込む

次に、データをディスクから読み込み、トレーニングに適した形式に準備します。これを行うには、tf.keras.utils.text_dataset_from_directory ユーティリティを使用して、ラベル付きの tf.data.Dataset を作成します。これは、入力パイプラインを構築するための強力なツールのコレクションです。tf.data を始めて使用する場合は、tf.data: TensorFlow 入力パイプラインを構築するを参照してください。

tf.keras.utils.text_dataset_from_directory API は、次のようなディレクトリ構造を想定しています。

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt

機械学習実験を実行するときは、データセットをトレーニング、検証、および、テストの 3 つに分割することをお勧めします。

Stack Overflow データセットは、すでにトレーニングセットとテストセットに分割されていますが、検証セットはありません。

tf.keras.utils.text_dataset_from_directory を使用し、validation_split を 0.2 (20%) に設定し、トレーニングデータを 80:20 に分割して検証セットを作成します。

In [ ]:

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

前のセル出力が示すように、トレーニングフォルダには 8,000 の例があり、そのうち 80％ (6,400) をトレーニングに使用します。tf.data.Dataset を Model.fit に直接渡すことで、モデルをトレーニングできます。詳細は、後ほど見ていきます。

まず、データセットを反復処理し、いくつかの例を出力して、データを確認します。

注意: 分類問題の難易度を上げるために、データセットの作成者は、プログラミングの質問で、Python、CSharp、JavaScript、Java という単語を blank に置き換えました。

In [ ]:

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

ラベルは、0、1、2 または 3 です。これらのどれがどの文字列ラベルに対応するかを確認するには、データセットの class_names プロパティを確認します。

In [ ]:

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

次に、tf.keras.utils.text_dataset_from_directory を使って検証およびテスト用データセットを作成します。トレーニング用セットの残りの 1,600 件のレビューを検証に使用します。

注意: tf.keras.utils.text_dataset_from_directory の validation_split および subset 引数を使用する場合は、必ずランダムシードを指定するか、shuffle=Falseを渡して、検証とトレーニング分割に重複がないようにします。

In [ ]:

# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

In [ ]:

test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

トレーニング用データセットを準備する

次に、tf.keras.layers.TextVectorization レイヤーを使用して、データを標準化、トークン化、およびベクトル化します。

標準化とは、テキストを前処理することを指します。通常、句読点や HTML 要素を削除して、データセットを簡素化します。
トークン化とは、文字列をトークンに分割することです（たとえば、空白で分割することにより、文を個々の単語に分割します）。
ベクトル化とは、トークンを数値に変換して、ニューラルネットワークに入力できるようにすることです。

これらのタスクはすべて、このレイヤーで実行できます。これらの詳細については、tf.keras.layers.TextVectorization API ドキュメントを参照してください。

注意点 :

デフォルトの標準化では、テキストが小文字に変換され、句読点が削除されます (standardize='lower_and_strip_punctuation')。
デフォルトのトークナイザーは空白で分割されます (split='whitespace')。
デフォルトのベクトル化モードは int です (output_mode='int')。これは整数インデックスを出力します（トークンごとに1つ）。このモードは、語順を考慮したモデルを構築するために使用できます。binary などの他のモードを使用して、bag-of-word モデルを構築することもできます。

TextVectorization を使用した標準化、トークン化、およびベクトル化について詳しくみるために、2 つのモデルを作成します。

まず、'binary' ベクトル化モードを使用して、bag-of-words モデルを構築します。
次に、1D ConvNet で 'int' モードを使用します。

In [ ]:

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

'int' モードの場合、最大語彙サイズに加えて、明示的な最大シーケンス長 (MAX_SEQUENCE_LENGTH) を設定する必要があります。これにより、レイヤーはシーケンスを正確に output_sequence_length 値にパディングまたは切り捨てます。

In [ ]:

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

次に、TextVectorization.adapt を呼び出して、前処理レイヤーの状態をデータセットに適合させます。これにより、モデルは文字列から整数へのインデックスを作成します。

注意: TextVectorization.adapt を呼び出すときは、トレーニング用データのみを使用することが重要です (テスト用セットを使用すると情報が漏洩します)。

In [ ]:

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

これらのレイヤーを使用してデータを前処理した結果を出力します。

In [ ]:

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

In [ ]:

def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [ ]:

# Retrieve a batch (of 32 reviews and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

In [ ]:

print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

In [ ]:

print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

上に示したように、TextVectorization の 'binary' モードは、入力に少なくとも 1 回存在するトークンを示す配列を返しますが、'int' モードでは、各トークンが整数に置き換えられるため、トークンの順序が保持されます。

レイヤーで TextVectorization.get_vocabulary を呼び出すことにより、各整数が対応するトークン (文字列) を検索できます。

In [ ]:

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

モデルをトレーニングする準備がほぼ整いました。

最後の前処理ステップとして、トレーニング、検証、およびデータセットのテストのために前に作成した TextVectorization レイヤーを適用します。

In [ ]:

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

パフォーマンスのためにデータセットを構成する

以下は、データを読み込むときに I/O がブロックされないようにするために使用する必要がある 2 つの重要な方法です。

Dataset.cache はデータをディスクから読み込んだ後、データをメモリに保持します。これにより、モデルのトレーニング中にデータセットがボトルネックになることを回避できます。データセットが大きすぎてメモリに収まらない場合は、この方法を使用して、パフォーマンスの高いオンディスクキャッシュを作成することもできます。これは、多くの小さなファイルを読み込むより効率的です。
Dataset.prefetch はトレーニング中にデータの前処理とモデルの実行をオーバーラップさせます。

以上の 2 つの方法とデータをディスクにキャッシュする方法についての詳細は、データパフォーマンスガイドの プリフェッチを参照してください。

In [ ]:

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [ ]:

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

モデルをトレーニングする

ニューラルネットワークを作成します。

'binary' のベクトル化されたデータの場合、単純な bag-of-words 線形モデルを定義し、それを構成してトレーニングします。

In [ ]:

binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

次に、'int' ベクトル化レイヤーを使用して、1D ConvNet を構築します。

In [ ]:

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

In [ ]:

# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

2 つのモデルを比較します。

In [ ]:

print("Linear model on binary vectorized data:")
print(binary_model.summary())

In [ ]:

print("ConvNet model on int vectorized data:")
print(int_model.summary())

テストデータで両方のモデルを評価します。

In [ ]:

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

注意: このサンプルデータセットは、かなり単純な分類問題を表しています。より複雑なデータセットと問題は、前処理戦略とモデルアーキテクチャに微妙ながら重要な違いをもたらします。さまざまなアプローチを比較するために、さまざまなハイパーパラメータとエポックを試してみてください。

モデルをエクスポートする

上記のコードでは、モデルにテキストをフィードする前に、tf.keras.layers.TextVectorization レイヤーをデータセットに適用しました。モデルで生の文字列を処理できるようにする場合 (たとえば、展開を簡素化するため)、モデル内に TextVectorization レイヤーを含めることができます。

これを行うには、トレーニングしたばかりの重みを使用して新しいモデルを作成できます。

In [ ]:

export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

これで、モデルは生の文字列を入力として受け取り、Model.predict を使用して各ラベルのスコアを予測できます。最大スコアのラベルを見つける関数を定義します。

In [ ]:

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.math.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels

新しいデータで推論を実行する

In [ ]:

inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

モデル内にテキスト前処理ロジックを含めると、モデルを本番環境にエクスポートして展開を簡素化し、トレーニング/テストスキューの可能性を減らすことができます。

tf.keras.layers.TextVectorization を適用する場所を選択する際に性能の違いに留意する必要があります。モデルの外部で使用すると、GPU でトレーニングするときに非同期 CPU 処理とデータのバッファリングを行うことができます。したがって、GPU でモデルをトレーニングしている場合は、モデルの開発中に最高のパフォーマンスを得るためにこのオプションを使用し、デプロイの準備ができたらモデル内に TextVectorization レイヤーを含めるように切り替えることをお勧めします。

モデルの保存の詳細については、モデルの保存と読み込みチュートリアルをご覧ください。

例 2: イーリアスの翻訳者を予測する

以下に、tf.data.TextLineDataset を使用してテキストファイルから例を読み込み、TensorFlow Text を使用してデータを前処理する例を示します。この例では、ホーマーのイーリアスの 3 つの異なる英語翻訳を使用し、与えられた 1 行のテキストから翻訳者を識別するようにモデルをトレーニングします。

データセットをダウンロードして調査する

3 つのテキストの翻訳者は次のとおりです。

このチュートリアルで使われているテキストファイルは、ヘッダ、フッタ、行番号、章のタイトルの削除など、いくつかの典型的な前処理が行われています。

前処理後のファイルをローカルにダウンロードします。

In [ ]:

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

データセットを読み込む

以前は、tf.keras.utils.text_dataset_from_directory では、ファイルのすべてのコンテンツが 1 つの例として扱われていました。ここでは、tf.data.TextLineDataset を使用します。これは、テキストファイルから tf.data.Dataset を作成するように設計されています。それぞれの例は、元のファイルからの行です。TextLineDataset は、主に行ベースのテキストデータ (詩やエラーログなど) に役立ちます。

これらのファイルを繰り返し処理し、各ファイルを独自のデータセットに読み込みます。各例には個別にラベルを付ける必要があるため、Dataset.map を使用して、それぞれにラベラー関数を適用します。これにより、データセット内のすべての例が繰り返され、 (example, label) ペアが返されます。

In [ ]:

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

In [ ]:

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

次に、Dataset.concatenate を使用し、これらのラベル付きデータセットを 1 つのデータセットに結合し、Dataset.shuffle を使用してシャッフルします。

In [ ]:

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

In [ ]:

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

前述の手順でいくつかの例を出力します。データセットはまだバッチ処理されていないため、all_labeled_data の各エントリは 1 つのデータポイントに対応します。

In [ ]:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())

トレーニング用データセットを準備する

tf.keras.layers.TextVectorization を使用してテキストデータセットを前処理する代わりに、TensorFlow Text API を使用してデータを標準化およびトークン化し、語彙を作成し、tf.lookup.StaticVocabularyTable を使用してトークンを整数にマッピングし、モデルにフィードします。(詳細については TensorFlow Text を参照してください)。

テキストを小文字に変換してトークン化する関数を定義します。

TensorFlow Text は、さまざまなトークナイザーを提供します。この例では、text.UnicodeScriptTokenizer を使用してデータセットをトークン化します。
Dataset.map を使用して、トークン化をデータセットに適用します。

In [ ]:

tokenizer = tf_text.UnicodeScriptTokenizer()

In [ ]:

def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

In [ ]:

tokenized_ds = all_labeled_data.map(tokenize)

データセットを反復処理して、トークン化されたいくつかの例を出力します。

In [ ]:

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())

次に、トークンを頻度で並べ替え、上位の VOCAB_SIZE トークンを保持することにより、語彙を構築します。

In [ ]:

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])

トークンを整数に変換するには、vocab セットを使用して、tf.lookup.StaticVocabularyTable を作成します。トークンを [2, vocab_size + 2] の範囲の整数にマップします。TextVectorization レイヤーと同様に、0 はパディングを示すために予約されており、1 は語彙外 (OOV) トークンを示すために予約されています。

In [ ]:

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

最後に、トークナイザーとルックアップテーブルを使用して、データセットを標準化、トークン化、およびベクトル化する関数を定義します。

In [ ]:

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label

1 つの例でこれを試して、出力を確認します。

In [ ]:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

次に、Dataset.map を使用して、データセットに対して前処理関数を実行します。

In [ ]:

all_encoded_data = all_labeled_data.map(preprocess_text)

データセットをトレーニング用セットとテスト用セットに分割する

Keras TextVectorization レイヤーでも、ベクトル化されたデータをバッチ処理してパディングします。バッチ内の例は同じサイズと形状である必要があるため、パディングが必要です。これらのデータセットの例はすべて同じサイズではありません。テキストの各行には、異なる数の単語があります。

tf.data.Dataset は、データセットの分割とパディングのバッチ処理をサポートしています

In [ ]:

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

In [ ]:

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

validation_data および train_data は (example, label) ペアのコレクションではなく、バッチのコレクションです。各バッチは、配列として表される (多くの例、多くのラベル) のペアです。

以下に示します。

In [ ]:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

パディングに 0 を使用し、語彙外 (OOV) トークンに 1 を使用するため、語彙のサイズが 2 つ増えました。

In [ ]:

vocab_size += 2

以前と同じように、パフォーマンスを向上させるためにデータセットを構成します。

In [ ]:

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

モデルをトレーニングする

以前と同じように、このデータセットでモデルをトレーニングできます。

In [ ]:

model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

In [ ]:

loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

モデルをエクスポートする

モデルが生の文字列を入力として受け取ることができるようにするには、カスタム前処理関数と同じ手順を実行する TextVectorization レイヤーを作成します。すでに語彙をトレーニングしているので、新しい語彙をトレーニングする TextVectorization.adapt の代わりに、TextVectorization.set_vocabulary を使用できます。

In [ ]:

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

In [ ]:

export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

In [ ]:

# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

エンコードされた検証セットのモデルと生の検証セットのエクスポートされたモデルの損失と正確度は、予想どおり同じです。

新しいデータで推論を実行する

In [ ]:

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.math.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())

TensorFlow Datasets (TFDS) を使用して、より多くのデータセットをダウンロードする

TensorFlow Dataset からより多くのデータセットをダウンロードできます。

この例では、IMDB 大規模映画レビューデータセットを使用して、感情分類のモデルをトレーニングします。

In [ ]:

# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

In [ ]:

# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

いくつかの例を出力します。

In [ ]:

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())

これで、以前と同じようにデータを前処理してモデルをトレーニングできます。

注意: これは二項分類の問題であるため、モデルには tf.keras.losses.SparseCategoricalCrossentropy の代わりに tf.keras.losses.BinaryCrossentropy を使用します。

トレーニング用データセットを準備する

In [ ]:

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

In [ ]:

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [ ]:

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

In [ ]:

# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

モデルを作成、構成、およびトレーニングする

In [ ]:

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

In [ ]:

model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

In [ ]:

history = model.fit(train_ds, validation_data=val_ds, epochs=3)

In [ ]:

loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

モデルをエクスポートする

In [ ]:

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

In [ ]:

# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label)

まとめ

このチュートリアルでは、テキストを読み込んで前処理するいくつかの方法を示しました。次のステップとして、以下のような TensorFlow Text テキスト前処理チュートリアルをご覧ください。

TensorFlow Datasets でも新しいデータセットを見つけることができます。また、tf.data の詳細については、入力パイプラインの構築に関するガイドをご覧ください。

Copyright 2018 The TensorFlow Authors.

テキストを読み込む

例 1: StackOverflow の質問のタグを予測する

データセットをダウンロードして調査する

データセットを読み込む

トレーニング用データセットを準備する

パフォーマンスのためにデータセットを構成する

モデルをトレーニングする

モデルをエクスポートする

新しいデータで推論を実行する

例 2: イーリアスの翻訳者を予測する

データセットをダウンロードして調査する

データセットを読み込む

トレーニング用データセットを準備する

データセットをトレーニング用セットとテスト用セットに分割する

モデルをトレーニングする

モデルをエクスポートする

新しいデータで推論を実行する

TensorFlow Datasets (TFDS) を使用して、より多くのデータセットをダウンロードする

トレーニング用データセットを準備する

モデルを作成、構成、およびトレーニングする

モデルをエクスポートする

まとめ

Product

Resources

Company