Copyright 2020 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Title

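Overview

This notebook demonstrates the basic usage and features of the tensorflow_io.bigtable module. Before continuing, make sure you are familiar with the following topics:

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.

Setup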

In [ ]:

!pip install tensorflow-io

Note: When you execute the cell below, you will be asked to log in to Google Cloud.

In [ ]:

!mkdir -p /tools/google-cloud-sdk/.install
!gcloud --quiet components install beta cbt bigtable
!gcloud init

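This example uses the Bigtable emulator. If you already have a Bigtable instance set up and populated with values, skip these steps and go straight to the "Quickstart" section.

Start the emulator in the background.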

In [ ]:

import subprocess

# Launch the Bigtable emulator in the background on 127.0.0.1:8086 and discard its output.
_emulator = subprocess.Popen(
    ['/tools/google-cloud-sdk/bin/gcloud', 'beta', 'emulators', 'bigtable',
     'start', '--host-port=127.0.0.1:8086'],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL, bufsize=0)
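
Because the emulator is started in the background, the next cell may run before it is ready to accept connections. The following is a minimal sketch of one way to wait for it, using only the standard library. The helper name `wait_for_port` and its default timeout values are hypothetical and not part of the original notebook.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait a little and retry.
            time.sleep(interval)
    return False


# Hypothetical usage with the emulator address used in this notebook:
# wait_for_port("127.0.0.1", 8086)
```

A successful TCP connection only shows that the port is open, so treat this as a readiness hint rather than a guarantee that the emulator is fully initialized.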

Create a table

In [ ]:

%env BIGTABLE_EMULATOR_HOST=127.0.0.1:8086
!cbt -project "test-project" -instance "test-instance" createtable t1 families=cf1 splits=row-a,row-h,row-p,row-z
!cbt -project "test-project" -instance "test-instance" ls

Populate the table with values

In [ ]:

!cbt -project "test-project" -instance "test-instance" set t1 row-a cf1:c1=A
!cbt -project "test-project" -instance "test-instance" set t1 row-b cf1:c1=B
!cbt -project "test-project" -instance "test-instance" set t1 row-c cf1:c1=C
!cbt -project "test-project" -instance "test-instance" set t1 row-d cf1:c1=D
!cbt -project "test-project" -instance "test-instance" set t1 row-e cf1:c1=E
!cbt -project "test-project" -instance "test-instance" set t1 row-f cf1:c1=F
!cbt -project "test-project" -instance "test-instance" set t1 row-g cf1:c1=G
!cbt -project "test-project" -instance "test-instance" set t1 row-h cf1:c1=H
!cbt -project "test-project" -instance "test-instance" set t1 row-i cf1:c1=I
!cbt -project "test-project" -instance "test-instance" set t1 row-j cf1:c1=J
!cbt -project "test-project" -instance "test-instance" set t1 row-k cf1:c1=K
!cbt -project "test-project" -instance "test-instance" set t1 row-l cf1:c1=L
!cbt -project "test-project" -instance "test-instance" set t1 row-m cf1:c1=M
!cbt -project "test-project" -instance "test-instance" set t1 row-n cf1:c1=N
!cbt -project "test-project" -instance "test-instance" set t1 row-o cf1:c1=O
!cbt -project "test-project" -instance "test-instance" set t1 row-p cf1:c1=P
!cbt -project "test-project" -instance "test-instance" set t1 row-q cf1:c1=Q
!cbt -project "test-project" -instance "test-instance" set t1 row-r cf1:c1=R
!cbt -project "test-project" -instance "test-instance" set t1 row-s cf1:c1=S
!cbt -project "test-project" -instance "test-instance" set t1 row-t cf1:c1=T
!cbt -project "test-project" -instance "test-instance" set t1 row-u cf1:c1=U
!cbt -project "test-project" -instance "test-instance" set t1 row-v cf1:c1=V
!cbt -project "test-project" -instance "test-instance" set t1 row-w cf1:c1=W
!cbt -project "test-project" -instance "test-instance" set t1 row-x cf1:c1=X
!cbt -project "test-project" -instance "test-instance" set t1 row-y cf1:c1=Y
!cbt -project "test-project" -instance "test-instance" set t1 row-z cf1:c1=Z
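
The 26 commands above follow one pattern, so they can also be generated in a loop. This sketch builds the same argument lists in Python; the variable names `PROJECT`, `INSTANCE`, and `commands` are illustrative, and the commented-out part that runs them assumes `cbt` is on the PATH and `BIGTABLE_EMULATOR_HOST` is set as in the earlier cell.

```python
import string

PROJECT = "test-project"
INSTANCE = "test-instance"

# One argument list per row: row-a gets "A", row-b gets "B", and so on.
commands = [
    ["cbt", "-project", PROJECT, "-instance", INSTANCE,
     "set", "t1", f"row-{letter.lower()}", f"cf1:c1={letter}"]
    for letter in string.ascii_uppercase
]

# Show the first generated command for a quick visual check.
print(" ".join(commands[0]))

# To actually run them (assumption: cbt is installed and the emulator is up):
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```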

In [ ]:

import tensorflow as tf
import numpy as np
import tensorflow_io as tfio
import random

random.seed(10)

Quickstart

First, you need to create a client and a table that you would like to read from.

In [ ]:

# If you are using your own Bigtable instance, replace project_id, instance_id,
# and the table name with suitable values.

client = tfio.bigtable.BigtableClient(project_id="test-project", instance_id="test-instance")
train_table = client.get_table("t1")

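Great! Now you can create a TensorFlow dataset that reads the data from our table.

To do that, you have to provide the type of the data you wish to read, a list of column names in the format column_family:column_name, and the row_set that you would like to read.

To create a row_set, use the utility methods provided in the tensorflow_io.bigtable.row_set and tensorflow_io.bigtable.row_range modules. Here, a row_set containing all rows is created.

Keep in mind that Bigtable returns rows in lexicographic order of their row keys, not in the order in which they were inserted. In this example the keys row-a through row-z were inserted in sorted order, so the two orders coincide; with random row keys, the rows would come back shuffled relative to insertion order.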

In [ ]:

row_set = tfio.bigtable.row_set.from_rows_or_ranges(tfio.bigtable.row_range.infinite())

train_dataset = train_table.read_rows(["cf1:c1"], row_set, output_type=tf.string)

for tensor in train_dataset:
  print(tensor)
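
The note on ordering can be illustrated without Bigtable at all. This sketch sorts a few made-up row keys as byte strings to show what lexicographic order looks like; the keys are hypothetical and are not in table t1.

```python
# Row keys are compared byte by byte, so "row-10" sorts before "row-2"
# and uppercase letters sort before lowercase ones.
keys = [b"row-2", b"row-10", b"row-1", b"row-b", b"row-B"]
ordered = sorted(keys)
print(ordered)

# Zero-padding numeric suffixes makes lexicographic order match numeric order.
padded = sorted(f"row-{i:03d}".encode() for i in (2, 10, 1))
print(padded)
```

If rows need to come back in numeric order, fixed-width keys such as the zero-padded ones above are a common choice.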

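That's it! Congratulations!

Parallel read

Our dataset supports reading from Bigtable in parallel. To do that, use the parallel_read_rows method and pass num_parallel_calls as an argument. When this method is called, the work is first split among workers based on SampleRowKeys.

Note: Keep in mind that when reading in parallel, the rows are not returned in any particular order.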

In [ ]:

for tensor in train_table.parallel_read_rows(["cf1:c1"], row_set=row_set, num_parallel_calls=2):
  print(tensor)
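
To build intuition for splitting work by sampled row keys, here is a conceptual sketch in plain Python. It is not the connector's implementation, whose details are not described in this notebook. The function names `split_into_ranges` and `assign_round_robin`, and the round-robin policy, are assumptions made for illustration only.

```python
def split_into_ranges(sample_keys):
    """Turn boundary keys into half-open (start, end) ranges.

    "" stands for "from the first row" and None for "to the last row".
    """
    boundaries = [""] + sorted(sample_keys) + [None]
    return list(zip(boundaries[:-1], boundaries[1:]))


def assign_round_robin(ranges, num_workers):
    """Distribute the ranges over workers in round-robin fashion."""
    buckets = [[] for _ in range(num_workers)]
    for i, key_range in enumerate(ranges):
        buckets[i % num_workers].append(key_range)
    return buckets


# Boundaries taken from two of the split points used when creating t1.
ranges = split_into_ranges(["row-h", "row-p"])
workers = assign_round_robin(ranges, num_workers=2)
print(workers)
```

Each worker reads its own ranges independently, which is one way to see why the combined output has no guaranteed order.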

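Reading specific row_keys

To read data from Bigtable, you can specify a set of rows, a range, or a combination of the two.

The read_rows method expects you to provide a row set. You can construct a row set from specific row keys or row ranges as follows: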

In [ ]:

row_range_below_300 = tfio.bigtable.row_range.right_open("row000", "row300")

my_row_set = tfio.bigtable.row_set.from_rows_or_ranges(row_range_below_300, "row585", "row832")
print(my_row_set)

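Such a row_set contains the row range [row000, row300) as well as the rows row585 and row832.

You can also create a row_set from an infinite range, an empty range, or a prefix. You can also intersect it with a row_range.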
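
The half-open notation [row000, row300) means that the start key is included and the end key is excluded. The sketch below models that membership rule in plain Python. The helper `in_row_set` is hypothetical and only mirrors the semantics described here; it does not call the library.

```python
def in_row_set(key, ranges, single_keys):
    """True if key is one of single_keys or lies in any [start, end) range."""
    if key in single_keys:
        return True
    return any(start <= key < end for start, end in ranges)


ranges = [("row000", "row300")]
single_keys = {"row585", "row832"}

for key in ["row000", "row299", "row300", "row585", "row586"]:
    print(key, in_row_set(key, ranges, single_keys))
```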

In [ ]:

my_truncated_row_set = tfio.bigtable.row_set.intersect(
    my_row_set, tfio.bigtable.row_range.right_open("row200", "row700"))
print(my_truncated_row_set)
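
As a rough check on what the intersection should contain, the same operation can be modeled in plain Python. This is a sketch under the half-open range semantics described above, with a hypothetical helper named `intersect`; it does not reproduce the library's printed representation.

```python
def intersect(ranges, single_keys, lo, hi):
    """Intersect a row set with the half-open range [lo, hi)."""
    clipped = []
    for start, end in ranges:
        new_start, new_end = max(start, lo), min(end, hi)
        if new_start < new_end:  # drop ranges that become empty
            clipped.append((new_start, new_end))
    # Single keys survive only if they fall inside [lo, hi).
    kept = {key for key in single_keys if lo <= key < hi}
    return clipped, kept


clipped, kept = intersect(
    [("row000", "row300")], {"row585", "row832"}, "row200", "row700")
print(clipped, kept)
```

Under this model the range shrinks to [row200, row300), row585 is kept, and row832 is dropped because it lies outside [row200, row700).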

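Specifying a version of a value

Bigtable lets you keep multiple values with different timestamps in a single cell. You can specify which version to pick by using version filters. However, the tensorflow_io.bigtable connector can only retrieve a two-dimensional vector, so the latest filter is always appended to the user-specified version filter. This means that if more than one value in a cell passes the provided filter, the newest one is used.

You can either use the latest filter to get the newest value, or you can specify a time range. The time range can be provided either as Python datetime objects or as numbers representing seconds or microseconds since the epoch.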

In [ ]:

from datetime import datetime

start = datetime(2020, 10, 10, 12, 0, 0)
end = datetime(2100, 10, 10, 13, 0, 0)
from_datetime = tfio.bigtable.filters.timestamp_range(start, end)
from_posix_timestamp = tfio.bigtable.filters.timestamp_range(int(start.timestamp()), int(end.timestamp()))
print("from_datetime:", from_datetime)

print("from_posix_timestamp:", from_posix_timestamp)
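
When converting datetimes to epoch values by hand, note that `datetime.timestamp()` interprets a naive datetime in the local time zone, so the resulting number depends on the machine. The sketch below attaches UTC explicitly to get a reproducible value; choosing UTC is an assumption made here for illustration and is not something the original notebook does.

```python
from datetime import datetime, timezone

# Attach UTC explicitly so the result does not depend on the local time zone.
start = datetime(2020, 10, 10, 12, 0, 0, tzinfo=timezone.utc)

start_seconds = int(start.timestamp())      # seconds since the epoch
start_micros = start_seconds * 1_000_000    # microseconds since the epoch
print(start_seconds, start_micros)
```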
