GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/zh-cn/io/tutorials/bigtable.ipynb
Kernel: Python 3
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

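Title

Overview

This notebook demonstrates the basic usage and features of the tensorflow_io.bigtable module. Before continuing, make sure you are familiar with the following topics:

  1. Creating a GCP project.

  2. Installing the Cloud SDK for Bigtable

  3. Overview of the cbt tool

  4. Using the emulator

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.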

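Setup

!pip install tensorflow-io

Note: When you execute the cell below, you will be asked to log in to Google Cloud.

!mkdir /tools/google-cloud-sdk/.install
!gcloud --quiet components install beta cbt bigtable
!gcloud init

This example uses the Bigtable emulator. If you already have a Bigtable instance set up and populated with values, skip these steps and go straight to the Quickstart section.

Start the emulator in the background.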

import os
import subprocess

_emulator = subprocess.Popen(
    ['/tools/google-cloud-sdk/bin/gcloud', 'beta', 'emulators', 'bigtable', 'start', '--host-port=127.0.0.1:8086'],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, bufsize=0)
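Because the emulator is launched in the background with its output discarded, nothing in the cell above signals when it is ready to accept connections. The following is a minimal sketch of one way to wait for it, assuming the emulator listens on the same host and port passed to --host-port. The helper name wait_for_port is hypothetical and not part of the original notebook or of tensorflow_io.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or connection refused); retry after a pause.
            time.sleep(interval)
    return False


# Hypothetical usage, matching the --host-port value used above:
# if not wait_for_port("127.0.0.1", 8086):
#     raise RuntimeError("Bigtable emulator did not start listening in time")
```

An open port only shows that a process is listening; it does not prove that the emulator is fully initialized, so treat this as a coarse readiness check.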

Create a table

%env BIGTABLE_EMULATOR_HOST=127.0.0.1:8086
!cbt -project "test-project" -instance "test-instance" createtable t1 families=cf1 splits=row-a,row-h,row-p,row-z
!cbt -project "test-project" -instance "test-instance" ls

Populate the table with values

!cbt -project "test-project" -instance "test-instance" set t1 row-a cf1:c1=A
!cbt -project "test-project" -instance "test-instance" set t1 row-b cf1:c1=B
!cbt -project "test-project" -instance "test-instance" set t1 row-c cf1:c1=C
!cbt -project "test-project" -instance "test-instance" set t1 row-d cf1:c1=D
!cbt -project "test-project" -instance "test-instance" set t1 row-e cf1:c1=E
!cbt -project "test-project" -instance "test-instance" set t1 row-f cf1:c1=F
!cbt -project "test-project" -instance "test-instance" set t1 row-g cf1:c1=G
!cbt -project "test-project" -instance "test-instance" set t1 row-h cf1:c1=H
!cbt -project "test-project" -instance "test-instance" set t1 row-i cf1:c1=I
!cbt -project "test-project" -instance "test-instance" set t1 row-j cf1:c1=J
!cbt -project "test-project" -instance "test-instance" set t1 row-k cf1:c1=K
!cbt -project "test-project" -instance "test-instance" set t1 row-l cf1:c1=L
!cbt -project "test-project" -instance "test-instance" set t1 row-m cf1:c1=M
!cbt -project "test-project" -instance "test-instance" set t1 row-n cf1:c1=N
!cbt -project "test-project" -instance "test-instance" set t1 row-o cf1:c1=O
!cbt -project "test-project" -instance "test-instance" set t1 row-p cf1:c1=P
!cbt -project "test-project" -instance "test-instance" set t1 row-q cf1:c1=Q
!cbt -project "test-project" -instance "test-instance" set t1 row-r cf1:c1=R
!cbt -project "test-project" -instance "test-instance" set t1 row-s cf1:c1=S
!cbt -project "test-project" -instance "test-instance" set t1 row-t cf1:c1=T
!cbt -project "test-project" -instance "test-instance" set t1 row-u cf1:c1=U
!cbt -project "test-project" -instance "test-instance" set t1 row-v cf1:c1=V
!cbt -project "test-project" -instance "test-instance" set t1 row-w cf1:c1=W
!cbt -project "test-project" -instance "test-instance" set t1 row-x cf1:c1=X
!cbt -project "test-project" -instance "test-instance" set t1 row-y cf1:c1=Y
!cbt -project "test-project" -instance "test-instance" set t1 row-z cf1:c1=Z
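The 26 commands above follow a single pattern, so they can also be generated programmatically. The sketch below builds the same cbt set invocations as argument lists. The helper name build_set_commands and the module-level constants are hypothetical, and the block only constructs the commands; actually running them is left commented out because it requires cbt on the PATH and a running emulator.

```python
import string

PROJECT = "test-project"
INSTANCE = "test-instance"
TABLE = "t1"


def build_set_commands():
    """Build one `cbt set` argument list per letter: row-a -> A, ..., row-z -> Z."""
    commands = []
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
        commands.append([
            "cbt", "-project", PROJECT, "-instance", INSTANCE,
            "set", TABLE, f"row-{lower}", f"cf1:c1={upper}",
        ])
    return commands


commands = build_set_commands()
print(" ".join(commands[0]))

# To actually run them (requires cbt on PATH and BIGTABLE_EMULATOR_HOST set):
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```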
import tensorflow as tf
import numpy as np
import tensorflow_io as tfio
import random

random.seed(10)

Quickstart

First, you need to create a client and a table to read values from.

# If using your bigtable instance replace the project_id, instance_id
# and the name of the table with suitable values.

client = tfio.bigtable.BigtableClient(project_id="test-project", instance_id="test-instance")
train_table = client.get_table("t1")

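Great! Now you can create a tensorflow dataset that will read data from our table.

To do that, you have to provide the type of the data you wish to read, a list of column names in column_family:column_name format, and the row_set you would like to read.

To create a row_set, use the utility methods provided in the tfio.bigtable.row_set and tfio.bigtable.row_range modules. Here, a row_set containing all rows is created.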

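Keep in mind that Bigtable returns rows in lexicographic order of their row keys, not in the order they were written. In this example the keys row-a through row-z were written in sorted order, so the two orders happen to coincide; with arbitrary (for example, random) row keys, rows would come back in a different order than they were inserted.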

row_set = tfio.bigtable.row_set.from_rows_or_ranges(tfio.bigtable.row_range.infinite())

train_dataset = train_table.read_rows(["cf1:c1"], row_set, output_type=tf.string)

for tensor in train_dataset:
  print(tensor)

That's it! Congrats!
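The note on read order in this section can be illustrated without a Bigtable connection. The sketch below sorts a handful of made-up row keys byte-wise, which is how a lexicographic scan would order them. The function name read_order and the sample keys are assumptions for illustration only.

```python
def read_order(row_keys):
    """Order row keys the way a lexicographic, byte-wise scan would return them."""
    return sorted(row_keys, key=lambda k: k.encode("utf-8"))


# Keys listed in (hypothetical) insertion order.
inserted = ["row-b", "row-10", "row-a", "row-2"]
print(read_order(inserted))
# → ['row-10', 'row-2', 'row-a', 'row-b']
```

Note that "row-10" sorts before "row-2" because the comparison is character by character, not numeric. Zero-padded keys such as row002 and row010 avoid that surprise.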

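Parallel read

Our dataset supports reading from Bigtable in parallel. To do so, use the parallel_read_rows method and pass num_parallel_calls as an argument. When this method is called, the work is first split among workers based on SampleRowKeys.

Note: Keep in mind that when reading in parallel, the rows are not read in any particular order.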

for tensor in train_table.parallel_read_rows(["cf1:c1"], row_set=row_set, num_parallel_calls=2):
  print(tensor)
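To give an intuition for how sample row keys can divide a table into chunks of work, here is a purely conceptual sketch. It is not the connector's implementation: the text above does not describe the actual splitting strategy, so the round-robin assignment and the function names split_into_ranges and assign_round_robin are assumptions. The sample keys reuse the split points given when table t1 was created.

```python
def split_into_ranges(sample_keys):
    """Turn sample keys into contiguous [start, end) ranges.

    An empty string stands for an open bound (start or end of the table).
    """
    bounds = [""] + sorted(sample_keys) + [""]
    return list(zip(bounds[:-1], bounds[1:]))


def assign_round_robin(ranges, num_workers):
    """Distribute ranges over workers; a simple policy chosen for illustration."""
    shards = [[] for _ in range(num_workers)]
    for i, key_range in enumerate(ranges):
        shards[i % num_workers].append(key_range)
    return shards


samples = ["row-a", "row-h", "row-p", "row-z"]
shards = assign_round_robin(split_into_ranges(samples), num_workers=2)
for worker, worker_ranges in enumerate(shards):
    print(worker, worker_ranges)
```

Each worker scans its own ranges independently, which is one way to see why results from a parallel read are interleaved rather than globally sorted.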

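Reading specific row_keys

To read data from Bigtable, you can specify a set of rows, a range, or a combination of both.

The read_rows method expects you to provide a row set. You can construct a row set from specific row keys or row ranges as follows: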

row_range_below_300 = tfio.bigtable.row_range.right_open("row000", "row300")

my_row_set = tfio.bigtable.row_set.from_rows_or_ranges(row_range_below_300, "row585", "row832")
print(my_row_set)

Such a row_set contains the row range [row000, row300) as well as the rows row585 and row832.
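The meaning of the right-open range can be checked with a small pure-Python membership test. This is a sketch of the semantics only and does not call tensorflow_io. The function name in_row_set is hypothetical, and the keys are assumed to be ASCII so that Python string comparison matches byte-wise ordering.

```python
def in_row_set(key, ranges, rows):
    """True if key is one of `rows` or falls in any right-open [start, end) range."""
    if key in rows:
        return True
    return any(start <= key < end for start, end in ranges)


ranges = [("row000", "row300")]
rows = {"row585", "row832"}

for key in ["row000", "row299", "row300", "row585", "row600"]:
    print(key, in_row_set(key, ranges, rows))
```

The start key row000 is included, while the end key row300 is not, which is what "right-open" refers to.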

You can also create a row_set from an infinite range, an empty range, or a prefix. You can also intersect it with a row_range.

my_truncated_row_set = tfio.bigtable.row_set.intersect(my_row_set,
    tfio.bigtable.row_range.right_open("row200", "row700"))
print(my_truncated_row_set)
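What the intersection should contain can be worked out by hand: the range [row000, row300) is clipped to [row200, row300), row585 survives because it lies inside [row200, row700), and row832 is dropped. The sketch below reproduces that reasoning in plain Python. The function intersect here is a hypothetical stand-in written for illustration, not the tensorflow_io function of the same name.

```python
def intersect(ranges, rows, clip_start, clip_end):
    """Intersect a row set (ranges plus single rows) with [clip_start, clip_end)."""
    clipped_ranges = []
    for start, end in ranges:
        new_start = max(start, clip_start)
        new_end = min(end, clip_end)
        if new_start < new_end:  # drop ranges that become empty
            clipped_ranges.append((new_start, new_end))
    kept_rows = {r for r in rows if clip_start <= r < clip_end}
    return clipped_ranges, kept_rows


clipped_ranges, kept_rows = intersect(
    [("row000", "row300")], {"row585", "row832"}, "row200", "row700")
print(clipped_ranges, sorted(kept_rows))
# → [('row200', 'row300')] ['row585']
```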

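Specifying a version of a value

Bigtable lets you keep many values in one cell with different timestamps. You can specify which version to pick using version filters. However, the tensorflow_io.bigtable connector can only retrieve a two-dimensional vector, so the latest filter is always appended to the user-specified version filter. This means that if more than one value in a cell passes the provided filter, the newest one is used.

You can either use the latest filter to pass the newest value, or you can specify a time range. The time range can be provided either as Python datetime objects or as a number representing seconds or microseconds since the epoch.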

from datetime import datetime

start = datetime(2020, 10, 10, 12, 0, 0)
end = datetime(2100, 10, 10, 13, 0, 0)
from_datetime = tfio.bigtable.filters.timestamp_range(start, end)
from_posix_timestamp = tfio.bigtable.filters.timestamp_range(int(start.timestamp()), int(end.timestamp()))
print("from_datetime:", from_datetime)
print("from_posix_timestamp:", from_posix_timestamp)
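The combined effect of a time range filter followed by the implicit latest filter can be sketched in plain Python. The cell versions below are made-up data, and the helper names to_micros and latest_in_range are hypothetical. The sketch assumes a start-inclusive, end-exclusive range, and it uses timezone-aware UTC datetimes so the result does not depend on the local timezone (the notebook cell above uses naive datetimes).

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt):
    """Convert a timezone-aware datetime to integer microseconds since the epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def latest_in_range(versions, start_micros, end_micros):
    """Pick the newest (timestamp_micros, value) with start <= ts < end, or None."""
    matching = [v for v in versions if start_micros <= v[0] < end_micros]
    return max(matching, key=lambda v: v[0]) if matching else None


start = datetime(2020, 10, 10, 12, 0, 0, tzinfo=timezone.utc)
end = datetime(2100, 10, 10, 13, 0, 0, tzinfo=timezone.utc)

# Hypothetical versions of a single cell: (timestamp in microseconds, value).
versions = [
    (to_micros(datetime(2019, 1, 1, tzinfo=timezone.utc)), "too old"),
    (to_micros(datetime(2021, 1, 1, tzinfo=timezone.utc)), "older match"),
    (to_micros(datetime(2022, 1, 1, tzinfo=timezone.utc)), "newest match"),
]

result = latest_in_range(versions, to_micros(start), to_micros(end))
print(result[1])
# → newest match
```

Two versions fall inside the range, and only the newer one is returned, mirroring the "one value per cell" behavior described above.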