GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/zh-cn/io/tutorials/orc.ipynb

Kernel: Python 3

Copyright 2021 The TensorFlow Authors.

In [1]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Apache ORC Reader

Overview

Apache ORC is a popular columnar storage format. The tensorflow-io package provides a default implementation for reading Apache ORC files.

Setup

Install the required packages, then restart the runtime.

In [2]:

!pip install tensorflow-io

In [3]:

import tensorflow as tf
import tensorflow_io as tfio

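Download a sample dataset file in ORC format

The dataset you will use here is the Iris dataset from UCI. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 4 attributes: (1) sepal length, (2) sepal width, (3) petal length, (4) petal width, and the last column contains the class label.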

In [4]:

!curl -OL https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc
!ls -l iris.orc

Create a dataset from the file

In [35]:

dataset = tfio.IODataset.from_orc("iris.orc", capacity=15).batch(1)

Examine the dataset:

In [42]:

for item in dataset.take(1):
    print(item)

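Let's walk through an end-to-end example that trains a tf.keras model on the iris dataset, read through the ORC dataset.

Data preprocessing

Configure which columns are features and which column is the label: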

In [47]:

feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
label_cols = ["species"]

# select feature columns
feature_dataset = tfio.IODataset.from_orc("iris.orc", columns=feature_cols)
# select label columns
label_dataset = tfio.IODataset.from_orc("iris.orc", columns=label_cols)

The following utility maps each species name to an integer class ID for model training:

In [48]:

vocab_init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["virginica", "versicolor", "setosa"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
vocab_table = tf.lookup.StaticVocabularyTable(
    vocab_init,
    num_oov_buckets=4)
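
To see how this table behaves, the sketch below rebuilds it and looks up a few strings. The three known species map to 0, 1 and 2, while any other string falls into one of the 4 out-of-vocabulary (OOV) buckets and receives an ID from 3 to 6. The input strings, including "tulip", are made up for illustration.

```python
import tensorflow as tf

# Same table as above: three known species plus 4 out-of-vocabulary buckets.
vocab_init = tf.lookup.KeyValueTensorInitializer(
    keys=tf.constant(["virginica", "versicolor", "setosa"]),
    values=tf.constant([0, 1, 2], dtype=tf.int64))
vocab_table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets=4)

# "tulip" is a made-up value that is not in the vocabulary.
ids = vocab_table.lookup(tf.constant(["setosa", "virginica", "tulip"]))
known_ids = ids.numpy().tolist()[:2]
oov_id = int(ids.numpy()[2])
print(known_ids, oov_id)
```

Because the model built later in this tutorial has only 3 output units, an OOV ID of 3 or higher would not be a valid label, so it may be worth confirming that the data contains only the three known species.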

In [49]:

label_dataset = label_dataset.map(vocab_table.lookup)
dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))
dataset = dataset.batch(1)

def pack_features_vector(features, labels):
    """Pack the features into a single array."""
    features = tf.stack(list(features), axis=1)
    return features, labels

dataset = dataset.map(pack_features_vector)
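
The effect of pack_features_vector on shapes can be checked without the ORC file. The sketch below is an illustration under assumptions: it substitutes a few hand-typed rows for the ORC-backed datasets and uses a batch size of 2 so the batch dimension is easy to see.

```python
import tensorflow as tf

# Hand-typed stand-ins for the four feature columns (two rows each).
columns = tuple(
    tf.constant(values, dtype=tf.float32)
    for values in ([5.1, 4.9], [3.5, 3.0], [1.4, 1.4], [0.2, 0.2])
)
feature_ds = tf.data.Dataset.from_tensor_slices(columns)
label_ds = tf.data.Dataset.from_tensor_slices(tf.constant([2, 2], dtype=tf.int64))

# Each batched element is ((col1, col2, col3, col4), label), every tensor of shape (2,).
ds = tf.data.Dataset.zip((feature_ds, label_ds)).batch(2)

def pack_features_vector(features, labels):
    """Stack the per-column tensors into one (batch, num_features) tensor."""
    return tf.stack(list(features), axis=1), labels

features, labels = next(iter(ds.map(pack_features_vector)))
print(features.shape, labels.shape)
```

Stacking along axis 1 turns the four per-column tensors into one row vector per example, which matches the input_shape=(4,) that the first Dense layer expects.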

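Build, compile and train the model

Finally, you are ready to build the model and train it. You will build a 3-layer Keras model that predicts the class of the iris plant from the dataset you just processed.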

In [50]:

model = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(
            10, activation=tf.nn.relu, input_shape=(4,)
        ),
        tf.keras.layers.Dense(10, activation=tf.nn.relu),
        tf.keras.layers.Dense(3),
    ]
)

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(dataset, epochs=5)
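
The last layer has no activation, so the model outputs logits rather than probabilities. A minimal sketch of turning one row of logits back into a species name follows, assuming the ID order from the vocabulary table above (virginica=0, versicolor=1, setosa=2). The helper decode_prediction and the logit values are hypothetical and are not part of the tutorial.

```python
# Class IDs follow the vocabulary table: virginica=0, versicolor=1, setosa=2.
SPECIES = ["virginica", "versicolor", "setosa"]

def decode_prediction(logits):
    """Return the species name whose logit is largest."""
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return SPECIES[best_id]

# Hypothetical logits for a single example, e.g. one row of model.predict(...).
example_logits = [-1.2, 0.3, 2.5]
predicted = decode_prediction(example_logits)
print(predicted)
```

The helper accepts any indexable sequence of three numbers, so a row taken from model.predict(...) could be passed in the same way.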
