GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/zh-cn/tutorials/load_data/text.ipynb
Kernel: Python 3

Copyright 2018 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

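Load text

This tutorial demonstrates two ways to load and preprocess text.

First, you will use Keras utilities and preprocessing layers. These include tf.keras.utils.text_dataset_from_directory, which turns data into a tf.data.Dataset, and tf.keras.layers.TextVectorization, which standardizes, tokenizes, and vectorizes data. If you are new to TensorFlow, start with these.
Then, you will use lower-level utilities such as tf.data.TextLineDataset to load text files, and TensorFlow Text APIs such as text.UnicodeScriptTokenizer and text.case_fold_utf8 to preprocess the data with finer-grained control.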

In [ ]:

!pip install "tensorflow-text==2.11.*"

In [ ]:

import collections
import pathlib

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

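Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question (for example, "How do I sort a dictionary by value?") is labeled with exactly one tag (Python, CSharp, JavaScript, or Java). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification, an important and widely applicable kind of machine learning problem.

Download and explore the dataset

Begin by downloading the Stack Overflow dataset with tf.keras.utils.get_file, then explore the directory structure: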

In [ ]:

data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent

In [ ]:

list(dataset_dir.iterdir())

In [ ]:

train_dir = dataset_dir/'train'
list(train_dir.iterdir())

The train/csharp, train/java, train/python, and train/javascript directories contain many text files, each of which is a Stack Overflow question.

Print an example file and inspect the data:

In [ ]:

sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
  print(f.read())

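Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the tf.keras.utils.text_dataset_from_directory utility to create a labeled tf.data.Dataset. If you are new to tf.data, it is a powerful collection of tools for building input pipelines. (To learn more, see the tf.data: Build TensorFlow input pipelines guide.)

The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows: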

train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
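
As a minimal sketch of this layout, the snippet below builds the tree in a temporary directory with pathlib and derives integer labels from the sorted subdirectory names. The file contents are made up for illustration. Sorting the names mirrors how the utility is documented to infer class_names, but this shows the convention only; it is not the utility's implementation.

```python
import pathlib
import tempfile

# Hypothetical toy data: two tiny "questions" per class.
CLASS_DIRS = ["python", "csharp", "javascript", "java"]

root = pathlib.Path(tempfile.mkdtemp()) / "train"
for name in CLASS_DIRS:
    class_dir = root / name
    class_dir.mkdir(parents=True)
    for i in (1, 2):
        (class_dir / f"{i}.txt").write_text(f"sample {name} question {i}")

# One subdirectory per class; the sorted names define the integer labels.
class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
label_of = {name: index for index, name in enumerate(class_names)}

# Pair every file with the label of the directory that contains it.
examples = [
    (path.read_text(), label_of[path.parent.name])
    for path in sorted(root.glob("*/*.txt"))
]
print(class_names)
print(len(examples), "labeled examples")
```

Each file becomes one example, and its parent directory alone decides its label.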

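When running a machine learning experiment, it is a best practice to divide your dataset into three splits: training, validation, and test.

The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.

Create a validation set using an 80:20 split of the training data by calling tf.keras.utils.text_dataset_from_directory with validation_split set to 0.2 (that is, 20%):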

In [ ]:

batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

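As the previous cell output shows, there are 8,000 examples in the training folder, of which you will use 80% (6,400) for training. As you will see later, you can train a model by passing a tf.data.Dataset directly to Model.fit.

First, iterate over the dataset and print out a few examples to get a feel for the data.

Note: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words Python, CSharp, JavaScript, and Java in the programming questions with the word blank.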

In [ ]:

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Question: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

The labels are 0, 1, 2, or 3. To check which of these corresponds to which string label, inspect the class_names property on the dataset:

In [ ]:

for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

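Next, you will create a validation set and a test set using tf.keras.utils.text_dataset_from_directory. You will use the remaining 1,600 questions from the training set for validation.

Note: When using the validation_split and subset arguments of tf.keras.utils.text_dataset_from_directory, make sure to either specify a random seed or pass shuffle=False, so that the validation and training splits do not overlap.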

In [ ]:

# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
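
The following sketch shows why the seed matters. It assumes a simplified model of the split: shuffle the example indices with a seeded generator, then cut off the last 20%. The helper split_indices is hypothetical and is not how Keras implements the split internally.

```python
import random


def split_indices(num_examples, validation_split, seed):
    """Shuffle indices with a fixed seed, then cut off the validation share."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_val = int(validation_split * num_examples)
    return indices[:-num_val], indices[-num_val:]


# Both calls use the same seed, so they agree on the shuffled order.
train_idx, _ = split_indices(8000, 0.2, seed=42)
_, val_idx = split_indices(8000, 0.2, seed=42)
overlap_same_seed = set(train_idx) & set(val_idx)

# A different seed gives a different order, so the subsets can overlap.
_, val_other = split_indices(8000, 0.2, seed=7)
overlap_other_seed = set(train_idx) & set(val_other)

print(len(train_idx), len(val_idx), len(overlap_same_seed))
print("overlap with mismatched seeds:", len(overlap_other_seed) > 0)
```

With matching seeds the 6,400 training and 1,600 validation indices are disjoint. With mismatched seeds some validation examples would also be seen during training.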

In [ ]:

test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)

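Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the tf.keras.layers.TextVectorization layer.

Standardization refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
Tokenization refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
Vectorization refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. (You can learn more about each of them in the tf.keras.layers.TextVectorization API docs.)

Note that:

The default standardization converts text to lowercase and removes punctuation (standardize='lower_and_strip_punctuation').
The default tokenizer splits on whitespace (split='whitespace').
The default vectorization mode is 'int' (output_mode='int'). This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes, such as 'binary', to build bag-of-words models.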
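
To make the first two defaults concrete, here is a rough pure-Python approximation. The helpers standardize and tokenize are hypothetical names. Using string.punctuation for the punctuation set is an assumption that approximates the layer's built-in rule; it is not a copy of it.

```python
import re
import string

# Character class that matches any single ASCII punctuation character.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")


def standardize(text):
    """Lowercase the text and strip punctuation."""
    return PUNCTUATION.sub("", text.lower())


def tokenize(text):
    """Split on runs of whitespace."""
    return text.split()


tokens = tokenize(standardize("How do I sort a dictionary by value?"))
print(tokens)
# ['how', 'do', 'i', 'sort', 'a', 'dictionary', 'by', 'value']
```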

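You will build two models to learn more about standardization, tokenization, and vectorization with TextVectorization:

First, you will use the 'binary' vectorization mode to build a bag-of-words model.
Then, you will use the 'int' mode with a 1D ConvNet.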

In [ ]:

VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

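For the 'int' mode, in addition to the maximum vocabulary size, you need to set an explicit maximum sequence length (MAX_SEQUENCE_LENGTH), which causes the layer to pad or truncate sequences to exactly output_sequence_length values: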

In [ ]:

MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

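Next, call TextVectorization.adapt to fit the state of the preprocessing layer to the dataset. This causes the layer to build an index of strings to integers.

Note: It is important to use only your training data when calling TextVectorization.adapt, as using the test set would leak information.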

In [ ]:

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

Print the result of using these layers to preprocess data:

In [ ]:

def binary_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return binary_vectorize_layer(text), label

In [ ]:

def int_vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return int_vectorize_layer(text), label

In [ ]:

# Retrieve a batch (of 32 questions and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print("Label", first_label)

In [ ]:

print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

In [ ]:

print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

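As shown above, TextVectorization's 'binary' mode returns an array denoting which tokens occur at least once in the input, while the 'int' mode replaces each token with an integer, thus preserving their order.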
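
The difference between the two modes can be sketched in plain Python with a toy vocabulary. The vocabulary, the helper names, and the index layout are assumptions made for illustration: 0 for padding and 1 for out-of-vocabulary tokens in 'int' mode, and a single out-of-vocabulary slot at position 0 in 'binary' mode.

```python
vocab = ["sort", "list", "dict"]  # hypothetical toy vocabulary


def int_encode(tokens, sequence_length):
    """Map tokens to ids in order, then truncate or pad to a fixed length."""
    index = {tok: i + 2 for i, tok in enumerate(vocab)}  # 0 = pad, 1 = OOV
    ids = [index.get(tok, 1) for tok in tokens][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


def binary_encode(tokens):
    """Mark which vocabulary entries occur at least once; order is lost."""
    index = {tok: i + 1 for i, tok in enumerate(vocab)}  # slot 0 = OOV
    row = [0] * (len(vocab) + 1)
    for tok in tokens:
        row[index.get(tok, 0)] = 1
    return row


tokens = ["sort", "a", "dict", "sort"]
int_ids = int_encode(tokens, sequence_length=6)
binary_row = binary_encode(tokens)
print(int_ids)     # [2, 1, 4, 2, 0, 0]
print(binary_row)  # [1, 1, 0, 1]
```

The 'int' output keeps one id per token in order and records that "sort" appears twice. The 'binary' output has a fixed width and only records presence.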

You can look up the token (string) that each integer corresponds to by calling TextVectorization.get_vocabulary on the layer:

In [ ]:

print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

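You are nearly ready to train your model.

As a final preprocessing step, apply the TextVectorization layers you created earlier to the training, validation, and test sets: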

In [ ]:

binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

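Configure the dataset for performance

These are two important methods you should use when loading data to make sure that I/O does not become blocking.

Dataset.cache keeps data in memory after it is loaded off disk. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
Dataset.prefetch overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk, in the Prefetching section of the Better performance with the tf.data API guide.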

In [ ]:

AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
  return dataset.cache().prefetch(buffer_size=AUTOTUNE)
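
The idea behind the two calls can be illustrated without TensorFlow. The sketch below is a conceptual analogy only. CachedSource and prefetch are hypothetical helpers, and tf.data's real implementation differs: a cache runs the slow loader once, and a prefetcher prepares upcoming items on a background thread while the consumer works.

```python
import queue
import threading


class CachedSource:
    """Runs the slow loader once and replays the result on later passes."""

    def __init__(self, loader):
        self._loader = loader
        self._items = None
        self.load_calls = 0

    def __iter__(self):
        if self._items is None:
            self._items = list(self._loader())
            self.load_calls += 1
        return iter(self._items)


def prefetch(iterable, buffer_size):
    """Yield items while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            buffer.put(item)
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


# Hypothetical stand-in for reading many small files from disk.
source = CachedSource(lambda: (f"question {i}" for i in range(4)))
epochs = [list(prefetch(source, buffer_size=2)) for _ in range(3)]
print(source.load_calls)  # the loader ran once for three epochs
```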

In [ ]:

binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

Train the model

It is time to create your neural network.

For the 'binary' vectorized data, define a simple bag-of-words linear model, then configure and train it:

In [ ]:

binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10)

Next, you will use the 'int' vectorized layer to build a 1D ConvNet:

In [ ]:

def create_model(vocab_size, num_labels):
  model = tf.keras.Sequential([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])
  return model

In [ ]:

# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

Compare the two models:

In [ ]:

print("Linear model on binary vectorized data:")
print(binary_model.summary())

In [ ]:

print("ConvNet model on int vectorized data:")
print(int_model.summary())
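
The parameter counts you would expect from the two summaries can be checked by hand with the standard formulas for dense, embedding, and convolution layers. The helper functions below are hypothetical. The arithmetic assumes the 'binary' layer emits vectors of width VOCAB_SIZE (which holds when the training vocabulary fills max_tokens) and that the ConvNet sees sequences of length 250.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


def embedding_params(vocab_size, dim):
    """One vector of size `dim` per vocabulary index."""
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def conv1d_valid_length(length, kernel_size, strides):
    """Number of output steps with padding='valid'."""
    return (length - kernel_size) // strides + 1


VOCAB_SIZE = 10000

linear_total = dense_params(VOCAB_SIZE, 4)
convnet_total = (
    embedding_params(VOCAB_SIZE + 1, 64)
    + conv1d_params(5, 64, 64)
    + dense_params(64, 4)
)
conv_steps = conv1d_valid_length(250, kernel_size=5, strides=2)

print(linear_total)   # 40004
print(convnet_total)  # 660868
print(conv_steps)     # 123
```

Most of the ConvNet's parameters sit in the embedding table, which is why vocabulary size dominates the model size.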

Evaluate both models on the test data:

In [ ]:

binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

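Note: This example dataset represents a rather simple classification problem. More complex datasets and problems bring out subtle but significant differences in preprocessing strategies and model architectures. Be sure to try out different hyperparameters and numbers of epochs to compare the various approaches.

Export the model

In the code above, you applied tf.keras.layers.TextVectorization to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the TextVectorization layer inside your model.

To do so, you can create a new model using the weights you have just trained: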

In [ ]:

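export_model = tf.keras.Sequential(
    [binary_vectorize_layer, binary_model,
     # Softmax turns the four logits into class probabilities, as expected
     # by `SparseCategoricalCrossentropy(from_logits=False)` below.
     layers.Activation('softmax')])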

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Now, your model can take raw strings as input and predict a score for each label using Model.predict. Define a function to find the label with the maximum score:

In [ ]:

def get_string_labels(predicted_scores_batch):
  predicted_int_labels = tf.math.argmax(predicted_scores_batch, axis=1)
  predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
  return predicted_labels

Run inference on new data

In [ ]:

inputs = [
    "how do I extract keys from a dict into a list?",  # 'python'
    "debug public static void main(string[] args) {...}",  # 'java'
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
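
The label lookup performed by get_string_labels can be sketched without TensorFlow. The scores below are made up. The sketch also shows that squashing the scores with an order-preserving activation leaves the winning label unchanged; softmax is used here as one such activation.

```python
import math

class_names = ["csharp", "java", "javascript", "python"]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def softmax(row):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(x - max(row)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for two questions.
scores_batch = [[-1.2, 0.3, -0.5, 2.1], [0.1, 1.7, 0.4, -0.9]]

labels_from_scores = [class_names[argmax(r)] for r in scores_batch]
labels_from_probs = [class_names[argmax(softmax(r))] for r in scores_batch]
print(labels_from_scores)  # ['python', 'java']
print(labels_from_probs)   # ['python', 'java']
```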

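Including the text preprocessing logic inside your model enables you to export a model for production that simplifies deployment and reduces the potential for train/test skew.

There is a performance difference to keep in mind when choosing where to apply your tf.keras.layers.TextVectorization layer. Using it outside of your model enables asynchronous CPU processing and buffering of your data when training on a GPU. So, if you are training your model on a GPU, you probably want to use this option to get the best performance while developing your model, then switch to including the TextVectorization layer inside your model when you are ready to prepare for deployment.

Visit the Save and load models tutorial to learn more about saving models.

Example 2: Predict the author of Iliad translations

The following provides an example of using tf.data.TextLineDataset to load examples from text files, and TensorFlow Text to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

Download and explore the dataset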

The texts of the three translations are by:

William Cowper
Edward, Earl of Derby
Samuel Butler

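The text files used in this tutorial have undergone some typical preprocessing tasks, such as removing the document header and footer, line numbers, and chapter titles.

Download these lightly munged files locally: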

In [ ]:

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

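Load the dataset

Previously, with tf.keras.utils.text_dataset_from_directory, all contents of a file were treated as a single example. Here, you will use tf.data.TextLineDataset, which is designed to create a tf.data.Dataset from a text file where each example is a line of text from the original file. TextLineDataset is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use Dataset.map to apply a labeler function to each one. This iterates over every example in the dataset, returning (example, label) pairs.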

In [ ]:

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

In [ ]:

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)
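
The same labeling idea, one example per line tagged with the index of its source file, can be sketched in plain Python. The file names and contents below are hypothetical stand-ins for the three translations, and labeled_lines is an illustrative helper, not part of any library.

```python
import pathlib
import tempfile

# Hypothetical stand-ins for the three translation files.
toy_files = {
    "a.txt": "line one\nline two\n",
    "b.txt": "line three\n",
    "c.txt": "line four\nline five\n",
}

tmp_dir = pathlib.Path(tempfile.mkdtemp())
for name, content in toy_files.items():
    (tmp_dir / name).write_text(content)


def labeled_lines(path, index):
    """Yield (line, label) pairs, one per line of the file."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n"), index


labeled = []
for i, name in enumerate(toy_files):
    labeled.extend(labeled_lines(tmp_dir / name, i))

print(labeled)
```

Note that the result is ordered by file, so all examples with label 0 come first. This is why the combined dataset needs to be shuffled before it is split.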

Next, you will combine these labeled datasets into a single dataset using Dataset.concatenate, and shuffle it with Dataset.shuffle:

In [ ]:

BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

In [ ]:

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
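
A buffer-based shuffle can be sketched in a few lines of plain Python. This is a simplified model of the idea behind Dataset.shuffle, not its implementation, and buffered_shuffle is a hypothetical helper. It fills a buffer, then repeatedly emits a random buffered element and replaces it with the next incoming one.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Shuffle a stream using a fixed-size buffer of pending elements."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new one in its place.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # The stream is exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


small_buffer = list(buffered_shuffle(range(100), buffer_size=10, seed=0))
no_shuffle = list(buffered_shuffle(range(5), buffer_size=1))

print(small_buffer[:10])
print(no_shuffle)  # [0, 1, 2, 3, 4]
```

With a small buffer the first emitted element can only come from the first few inputs, so data that arrives sorted by label stays largely sorted. A buffer at least as large as the dataset gives a full shuffle.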

Print out a few examples as before. The dataset has not been batched yet, so each entry in all_labeled_data corresponds to one data point:

In [ ]:

for text, label in all_labeled_data.take(10):
  print("Sentence: ", text.numpy())
  print("Label:", label.numpy())

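Prepare the dataset for training

Instead of using tf.keras.layers.TextVectorization to preprocess the text dataset, you will now use the TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary, and use tf.lookup.StaticVocabularyTable to map tokens to integers to feed to the model. (Learn more about TensorFlow Text.)

Define a function to convert the text to lowercase and tokenize it:

TensorFlow Text provides various tokenizers. In this example, you will use text.UnicodeScriptTokenizer to tokenize the dataset.
You will use Dataset.map to apply the tokenization to the dataset.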

In [ ]:

tokenizer = tf_text.UnicodeScriptTokenizer()

In [ ]:

def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

In [ ]:

tokenized_ds = all_labeled_data.map(tokenize)

You can iterate over the dataset and print out a few tokenized examples:

In [ ]:

for text_batch in tokenized_ds.take(5):
  print("Tokens: ", text_batch.numpy())

Next, you will build a vocabulary by sorting tokens by frequency and keeping the top VOCAB_SIZE tokens:

In [ ]:

tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])
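
For comparison, the same frequency-based vocabulary can be built with collections.Counter from the standard library. The tokenized lines and the toy vocabulary size below are made up for illustration.

```python
import collections

# Hypothetical tokenized lines.
tokenized_lines = [
    ["the", "wrath", "of", "achilles"],
    ["the", "ships", "of", "the", "greeks"],
    ["of", "achilles", "the", "swift"],
]
TOY_VOCAB_SIZE = 3

# Count every token, then keep the most frequent ones.
counts = collections.Counter(tok for line in tokenized_lines for tok in line)
toy_vocab = [tok for tok, _ in counts.most_common(TOY_VOCAB_SIZE)]

print(counts["the"], counts["of"], counts["achilles"])  # 4 3 2
print(toy_vocab)  # ['the', 'of', 'achilles']
```

Tokens that fall outside the top entries ("wrath", "ships", and so on) are the ones that will later be treated as out-of-vocabulary.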

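To convert the tokens into integers, use the vocab set to create a tf.lookup.StaticVocabularyTable. You will map tokens to integers in the range [2, vocab_size + 2). As with the TextVectorization layer, 0 is reserved to denote padding and 1 is reserved to denote an out-of-vocabulary (OOV) token.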

In [ ]:

keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

Finally, define a function to standardize, tokenize, and vectorize the dataset using the tokenizer and lookup table:

In [ ]:

def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label
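
The three steps (case folding, tokenizing, table lookup) can be mimicked in plain Python to see what a vectorized line looks like. This is a rough sketch under loud assumptions. The regular expression is only a stand-in for text.UnicodeScriptTokenizer and will differ from it on some inputs. The toy vocabulary is made up. The OOV id follows the numbering described in the text (0 for padding, 1 for OOV) and does not reproduce tf.lookup.StaticVocabularyTable.

```python
import re

toy_vocab = ["the", "of", "achilles"]  # hypothetical toy vocabulary
table = {tok: i for i, tok in enumerate(toy_vocab, start=2)}
OOV_ID = 1  # assumed id for tokens missing from the vocabulary


def preprocess(text):
    """Case-fold, split into word and punctuation tokens, map to ids."""
    standardized = text.casefold()
    tokens = re.findall(r"\w+|[^\w\s]+", standardized)
    return [table.get(tok, OOV_ID) for tok in tokens]


ids = preprocess("The wrath of Achilles!")
print(ids)  # [2, 1, 3, 4, 1]
```

Here "wrath" and "!" are not in the toy vocabulary, so both map to the OOV id.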

You can try this on a single example and print the output:

In [ ]:

example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Now run the preprocess function on the dataset using Dataset.map:

In [ ]:

all_encoded_data = all_labeled_data.map(preprocess_text)

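Split the dataset into training and validation sets

The Keras TextVectorization layer also batches and pads the vectorized data. Padding is required because the examples inside of a batch need to be the same size and shape, but the examples in these datasets are not all the same size: each line of text has a different number of words.

tf.data.Dataset supports splitting and padded-batching datasets: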

In [ ]:

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

In [ ]:

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)
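
What padded batching does to variable-length sequences can be shown with a small pure-Python sketch. The helper padded_batch below is hypothetical and only handles lists of integer ids. It pads each batch to the length of its own longest sequence, using 0 as the padding value.

```python
def padded_batch(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        chunk = sequences[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_value] * (width - len(seq)) for seq in chunk])
    return batches


# Hypothetical encoded lines of different lengths.
encoded = [[5, 2, 9], [7], [3, 4], [8, 8, 8, 8, 1]]
batches = padded_batch(encoded, batch_size=2)

for batch in batches:
    print(batch)
# [[5, 2, 9], [7, 0, 0]]
# [[3, 4, 0, 0, 0], [8, 8, 8, 8, 1]]
```

Different batches can have different widths, which is why the text batch shape printed for one batch need not match the next.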

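Now, validation_data and train_data are not collections of (example, label) pairs, but collections of batches. Each batch is a pair of (many examples, many labels) represented as arrays.

To illustrate this: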

In [ ]:

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Since you use 0 for padding and 1 for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two:

In [ ]:

vocab_size += 2

Configure the datasets for better performance as before:

In [ ]:

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

Train the model

You can train a model on this dataset as before:

In [ ]:

model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

history = model.fit(train_data, validation_data=validation_data, epochs=3)

In [ ]:

loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

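Export the model

To make the model capable of taking raw strings as input, you will create a Keras TextVectorization layer that performs the same steps as your custom preprocessing function. Since you have already built a vocabulary, you can use TextVectorization.set_vocabulary instead of TextVectorization.adapt, which would train a new vocabulary.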

In [ ]:

preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

preprocess_layer.set_vocabulary(vocab)

In [ ]:

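export_model = tf.keras.Sequential(
    [preprocess_layer, model,
     # Softmax turns the three logits into class probabilities, as expected
     # by `SparseCategoricalCrossentropy(from_logits=False)` below.
     layers.Activation('softmax')])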

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

In [ ]:

# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

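As expected, the loss and accuracy of the model on the encoded validation set and of the exported model on the raw validation set should closely match.

Run inference on new data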

In [ ]:

inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.math.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
  print("Sentence: ", input)
  print("Predicted label: ", label.numpy())

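Download more datasets using TensorFlow Datasets (TFDS)

You can download many more datasets from TensorFlow Datasets.

In this example, you will use the IMDB Large Movie Review dataset to train a model for sentiment classification: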

In [ ]:

# Training set.
train_ds = tfds.load(
    'imdb_reviews',
    split='train[:80%]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

In [ ]:

# Validation set.
val_ds = tfds.load(
    'imdb_reviews',
    split='train[80%:]',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True)

Print a few examples:

In [ ]:

for review_batch, label_batch in val_ds.take(1):
  for i in range(5):
    print("Review: ", review_batch[i].numpy())
    print("Label: ", label_batch[i].numpy())

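You can now preprocess the data and train a model as before.

Note: You will use tf.keras.losses.BinaryCrossentropy instead of tf.keras.losses.SparseCategoricalCrossentropy for your model, since this is a binary classification problem.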
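
As a worked illustration of the loss named in the note, binary cross-entropy for a single example can be computed from one logit with the standard formula. The function names and the sample logits below are chosen for illustration and are not taken from the tutorial.

```python
import math


def sigmoid(x):
    """Squash a logit into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_crossentropy_from_logit(logit, label):
    """Negative log-likelihood of a 0/1 label under sigmoid(logit)."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


uncertain = binary_crossentropy_from_logit(0.0, 1)        # p = 0.5
confident_right = binary_crossentropy_from_logit(4.0, 1)  # p is about 0.98
confident_wrong = binary_crossentropy_from_logit(4.0, 0)

print(uncertain, confident_right, confident_wrong)
```

A logit of 0 gives a loss of ln 2 regardless of the label. Confident correct predictions are rewarded with a small loss and confident mistakes are penalized heavily. A single output unit is enough because the probability of the negative class is one minus the probability of the positive class.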

Prepare the dataset for training

In [ ]:

vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

In [ ]:

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

In [ ]:

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

In [ ]:

# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

Create, configure, and train the model

In [ ]:

model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

In [ ]:

model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

In [ ]:

history = model.fit(train_ds, validation_data=val_ds, epochs=3)

In [ ]:

loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Export the model

In [ ]:

export_model = tf.keras.Sequential(
    [vectorize_layer, model,
     layers.Activation('sigmoid')])

export_model.compile(
    # A single sigmoid output calls for binary cross-entropy on probabilities.
    loss=losses.BinaryCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy'])

In [ ]:

# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
  print("Review: ", input)
  print("Predicted label: ", label)

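Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore the other TensorFlow Text tutorials on text preprocessing.

You can also find new datasets on TensorFlow Datasets. And, to learn more about tf.data, check out the guide on building input pipelines.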
