GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/zh-cn/tutorials/estimator/premade.ipynb

Kernel: Python 3

Copyright 2019 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Premade Estimators

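Warning: Estimators are not recommended for new code. Estimators run v1.Session-style code, which is more difficult to write correctly and can behave unexpectedly, especially when combined with TF 2 code. Estimators do fall under our compatibility guarantees, but they will receive no fixes other than for security vulnerabilities. See the migration guide for details.

This tutorial shows you how to solve the Iris classification problem in TensorFlow using Estimators. An Estimator is a legacy TensorFlow high-level representation of a complete model. For more details, see Estimators.

Note: In TensorFlow 2.0, the Keras API can accomplish these same tasks and is considered an easier API to learn. If you are just getting started, it is recommended that you begin with Keras.

First things first

To get started, first import TensorFlow and the other libraries you need.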

In [ ]:

import tensorflow as tf

import pandas as pd

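The data set

The sample program in this document builds and tests a model that classifies Iris flowers into three species based on the size of their sepals and petals.

You will train the model using the Iris data set. The data set contains four features and one label. The four features identify the following botanical characteristics of an individual Iris flower:

sepal length
sepal width
petal length
petal width

Based on this information, you can define a few helpful constants for parsing the data: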

In [ ]:

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

Next, download and parse the Iris data set using Keras and Pandas. Note that you keep separate datasets for training and testing.

In [ ]:

train_path = tf.keras.utils.get_file(
    "iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)

You can inspect the data to see that it has four float feature columns and one integer label column.

In [ ]:

train.head()

For each of the datasets, split out the labels, which the model will be trained to predict.

In [ ]:

train_y = train.pop('Species')
test_y = test.pop('Species')

# The label column has now been removed from the features.
train.head()
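The parsing and label-splitting steps above can be tried on a tiny in-memory table. This is only a sketch: the CSV text below, including its header row, is made up for illustration and is not the real Iris file. It shows that `names=` together with `header=0` replaces the file's own header row with `CSV_COLUMN_NAMES`, and that `pop` removes the label column from the features while returning it.

```python
import io

import pandas as pd

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']

# Made-up stand-in for a downloaded file: one header row, then two records.
csv_text = """a,b,c,d,e
6.4,2.8,5.6,2.2,2
5.0,2.3,3.3,1.0,1
"""

# header=0 marks the first row as a header; names= then replaces that header.
demo = pd.read_csv(io.StringIO(csv_text), names=CSV_COLUMN_NAMES, header=0)

# pop() removes the label column in place and returns it as a Series.
demo_y = demo.pop('Species')

print(list(demo.columns))   # the four feature names only
print(demo_y.tolist())      # the integer labels
print(demo_y.dtype.kind)    # 'i' means a signed integer dtype
```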

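Overview of programming with Estimators

Now that you have the data set up, you can define a model using a TensorFlow Estimator. An Estimator is any class derived from tf.estimator.Estimator. TensorFlow provides a collection of premade Estimators in tf.estimator (for example, LinearRegressor) that implement common ML algorithms. Beyond those, you can write your own custom Estimators. Using a premade Estimator is recommended when you are just getting started.

To write a TensorFlow program based on premade Estimators, you must perform the following tasks:

Create one or more input functions.
Define the model's feature columns.
Instantiate an Estimator, specifying the feature columns and various hyperparameters.
Call one or more methods on the Estimator object, passing the appropriate input function as the source of the data.

Let's see how these tasks are implemented for Iris classification.

Create input functions

You must create input functions to supply data for training, evaluation, and prediction.

An input function is a function that returns a tf.data.Dataset object which outputs the following two-element tuple:

features: a Python dictionary in which:
- Each key is the name of a feature.
- Each value is an array containing all of that feature's values.
label: an array containing the value of the label for every example.

To demonstrate the format of the input function, here is a simple implementation: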

In [ ]:

import numpy as np  # required for the arrays below; not imported earlier

def input_evaluation_set():
    features = {'SepalLength': np.array([6.4, 5.0]),
                'SepalWidth':  np.array([2.8, 2.3]),
                'PetalLength': np.array([5.6, 3.3]),
                'PetalWidth':  np.array([2.2, 1.0])}
    labels = np.array([2, 1])
    return features, labels

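Your input function may generate the features dictionary and the label list any way you like. However, using TensorFlow's Dataset API, which can parse all sorts of data, is recommended.

The Dataset API can handle a lot of common cases for you. For example, using the Dataset API, you can easily read records from a large collection of files in parallel and join them into a single stream.

To keep things simple in this example, you will load the data with pandas and build an input pipeline from this in-memory data: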

In [ ]:

def input_fn(features, labels, training=True, batch_size=256):
    """An input function for training or evaluating"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle and repeat if you are in training mode.
    if training:
        dataset = dataset.shuffle(1000).repeat()
    
    return dataset.batch(batch_size)
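What the `training` flag changes is the shape of the stream: with shuffle and repeat the batches never run out, while without them the data is batched once, in order, and then ends. The sketch below imitates that behavior in pure Python so it can be run without TensorFlow. It is a stand-in, not tf.data, and the helper name `batches` and its `seed` parameter are hypothetical.

```python
import itertools
import random


def batches(examples, training, batch_size, seed=0):
    """Pure-Python imitation of shuffle().repeat().batch(); NOT tf.data."""
    examples = list(examples)

    def stream():
        if training:
            rng = random.Random(seed)
            while True:               # repeat(): start a new epoch forever
                epoch = examples[:]
                rng.shuffle(epoch)    # shuffle(): new order for each epoch
                yield from epoch
        else:
            yield from examples       # a single pass, in the original order

    it = stream()
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:                 # the stream is exhausted (evaluation only)
            return
        yield batch


# Evaluation mode: one epoch, the last batch may be smaller.
eval_batches = list(batches(range(5), training=False, batch_size=2))
print(eval_batches)

# Training mode: the stream is endless, so the caller decides when to stop.
train_batches = list(itertools.islice(batches(range(5), training=True, batch_size=2), 4))
print(len(train_batches))
```

Because the training stream never ends on its own, something else has to bound the run, which is the role a step count plays when training.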

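Define the feature columns

A feature column is an object describing how the model should use raw input data from the features dictionary. When you build an Estimator model, you pass it a list of feature columns that describes each of the features you want the model to use. The tf.feature_column module provides many options for representing data to the model.

For Iris, the four raw features are numeric values, so you will build a list of feature columns that tells the Estimator model to represent each of the four features as a 32-bit floating-point value. The code to create the feature columns is therefore: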

In [ ]:

# Feature columns describe how to use the input.
my_feature_columns = []
for key in train.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))

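Feature columns can be far more sophisticated than those shown here. You can read more about feature columns in this guide.

Now that you have described how the model should represent the raw features, you can build the Estimator.

Instantiate an Estimator

The Iris problem is a classic classification problem. Fortunately, TensorFlow provides several premade classifier Estimators, including:

tf.estimator.DNNClassifier for deep models that perform multi-class classification.
tf.estimator.DNNLinearCombinedClassifier for wide and deep models.
tf.estimator.LinearClassifier for classifiers based on linear models.

For the Iris problem, tf.estimator.DNNClassifier seems like the best choice. Here is how to instantiate this Estimator: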

In [ ]:

# Build a DNN with 2 hidden layers of 30 and 10 hidden nodes respectively.
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 30 and 10 nodes respectively.
    hidden_units=[30, 10],
    # The model must choose between 3 classes.
    n_classes=3)

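Train, evaluate, and predict

Now that you have an Estimator object, you can call methods to do the following:

Train the model.
Evaluate the trained model.
Use the trained model to make predictions.

Train the model

Train the model by calling the Estimator's train method as follows: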

In [ ]:

# Train the Model.
classifier.train(
    input_fn=lambda: input_fn(train, train_y, training=True),
    steps=5000)

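Note that you wrap the input_fn call in a lambda to capture the arguments while providing an input function that takes no arguments, as the Estimator expects. The steps argument tells the method to stop training after that number of training steps.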
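The wrapping pattern itself is plain Python and can be checked without TensorFlow. In the sketch below, `input_fn` is a hypothetical stand-in that only records its arguments, and `call_without_arguments` is a made-up helper that imitates a caller which, like the Estimator, invokes the function it receives with no arguments. `functools.partial` is shown as an equivalent alternative to the lambda.

```python
import functools


def input_fn(features, labels, training=True, batch_size=256):
    """Hypothetical stand-in: records its arguments instead of building a Dataset."""
    return {
        'n_features': len(features),
        'n_labels': len(labels),
        'training': training,
        'batch_size': batch_size,
    }


def call_without_arguments(fn):
    """Imitates a caller that invokes the input function with no arguments."""
    return fn()


features = {'SepalLength': [6.4, 5.0]}
labels = [2, 1]

# The lambda captures features and labels; the caller still passes nothing.
via_lambda = call_without_arguments(
    lambda: input_fn(features, labels, training=True))

# functools.partial binds the same arguments ahead of time.
via_partial = call_without_arguments(
    functools.partial(input_fn, features, labels, training=True))

print(via_lambda == via_partial)
```

Passing `input_fn(features, labels)` directly would instead call the function immediately and hand its result, rather than a callable, to the caller.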

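Evaluate the trained model

Now that the model has been trained, you can get some statistics on its performance. The following code block evaluates the accuracy of the trained model on the test data: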

In [ ]:

eval_result = classifier.evaluate(
    input_fn=lambda: input_fn(test, test_y, training=False))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))

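Unlike the call to the train method, you did not pass the steps argument to evaluate. The input_fn for evaluation yields only a single epoch of data.

The eval_result dictionary also contains the average_loss (mean loss per sample), the loss (mean loss per mini-batch), and the value of the Estimator's global_step (the number of training iterations it underwent).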
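The `format(**eval_result)` idiom used in the evaluation cell picks out one key by name and ignores the rest. The sketch below uses a dictionary with the keys named above; every value in it is a made-up placeholder, not a result produced by the model.

```python
# Placeholder values only: these are NOT real evaluation results.
eval_result = {
    'accuracy': 0.9,
    'average_loss': 0.5,
    'loss': 0.5,
    'global_step': 5000,
}

# **eval_result passes every key as a keyword argument; the template
# uses only 'accuracy', and str.format ignores the unused ones.
message = 'Test set accuracy: {accuracy:0.3f}'.format(**eval_result)
print(message)

# The other metrics can be read directly by key.
for key in ('average_loss', 'loss', 'global_step'):
    print(key, eval_result[key])
```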

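Making predictions (inferring) from the trained model

You now have a trained model that produces good evaluation results. You can use the trained model to predict the species of an Iris flower based on some unlabeled measurements. As with training and evaluation, you make predictions using a single function call: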

In [ ]:

# Generate predictions from the model
expected = ['Setosa', 'Versicolor', 'Virginica']
predict_x = {
    'SepalLength': [5.1, 5.9, 6.9],
    'SepalWidth': [3.3, 3.0, 3.1],
    'PetalLength': [1.7, 4.2, 5.4],
    'PetalWidth': [0.5, 1.5, 2.1],
}

def predict_input_fn(features, batch_size=256):
    """An input function for prediction."""
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

# A distinct name avoids shadowing the training/evaluation input_fn defined earlier.
predictions = classifier.predict(
    input_fn=lambda: predict_input_fn(predict_x))

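The predict method returns a Python iterable, yielding a dictionary of prediction results for each example. The following code prints a few predictions and their probabilities: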

In [ ]:

for pred_dict, expec in zip(predictions, expected):
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]

    print('Prediction is "{}" ({:.1f}%), expected "{}"'.format(
        SPECIES[class_id], 100 * probability, expec))
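The loop above reads two entries from each prediction dictionary: `class_ids` and `probabilities`. The sketch below walks through the same lookup on a single hand-written dictionary. The numbers are made up for illustration, and plain lists stand in for the arrays the Estimator would yield.

```python
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

# Made-up prediction for one example; NOT output from the trained model.
pred_dict = {
    'class_ids': [1],                      # index of the predicted class
    'probabilities': [0.05, 0.80, 0.15],   # one probability per class
}

class_id = pred_dict['class_ids'][0]
probability = pred_dict['probabilities'][class_id]

# In this example the predicted class is the one with the highest probability.
most_likely = max(range(len(SPECIES)),
                  key=lambda i: pred_dict['probabilities'][i])

message = 'Prediction is "{}" ({:.1f}%)'.format(
    SPECIES[class_id], 100 * probability)
print(message)
```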