GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/ja/io/tutorials/bigtable.ipynb
Kernel: Python 3

Copyright 2020 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

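Title

Overview

This notebook demonstrates the basic usage and features of the tensorflow_io.bigtable module. Before continuing, make sure you are familiar with the tools used below: the Google Cloud SDK (gcloud), the cbt command-line tool, and the Bigtable emulator.

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.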

Setup

In [ ]:

!pip install tensorflow-io

Note: Running the cell below will prompt you to log in to Google Cloud.

In [ ]:

!mkdir /tools/google-cloud-sdk/.install
!gcloud --quiet components install beta cbt bigtable
!gcloud init

This example uses the Bigtable emulator. If you already have a Bigtable instance set up and populated with values, skip these steps and go directly to the Quickstart section.

Start the emulator in the background.

In [ ]:

import os
import subprocess
_emulator = subprocess.Popen(['/tools/google-cloud-sdk/bin/gcloud', 'beta', 'emulators', 'bigtable', 'start', '--host-port=127.0.0.1:8086'],
                                      stdout=subprocess.DEVNULL,
                                      stderr=subprocess.DEVNULL, bufsize=0)
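
Because subprocess.Popen returns immediately, the emulator may not be accepting connections yet when the next cell runs. The following is a minimal sketch, not part of the original notebook, of one way to wait for it: it polls the TCP port until a connection succeeds. The helper name wait_for_port and the default timeout values are hypothetical.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or connection refused): wait and retry.
            time.sleep(interval)
    return False
```

With the emulator started as above, calling wait_for_port("127.0.0.1", 8086) before creating the table would block until the port is reachable or the timeout expires.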

Create a table

In [ ]:

%env BIGTABLE_EMULATOR_HOST=127.0.0.1:8086
!cbt -project "test-project" -instance "test-instance" createtable t1 families=cf1 splits=row-a,row-h,row-p,row-z
!cbt -project "test-project" -instance "test-instance" ls

Populate the table with values

In [ ]:

!cbt -project "test-project" -instance "test-instance" set t1 row-a cf1:c1=A
!cbt -project "test-project" -instance "test-instance" set t1 row-b cf1:c1=B
!cbt -project "test-project" -instance "test-instance" set t1 row-c cf1:c1=C
!cbt -project "test-project" -instance "test-instance" set t1 row-d cf1:c1=D
!cbt -project "test-project" -instance "test-instance" set t1 row-e cf1:c1=E
!cbt -project "test-project" -instance "test-instance" set t1 row-f cf1:c1=F
!cbt -project "test-project" -instance "test-instance" set t1 row-g cf1:c1=G
!cbt -project "test-project" -instance "test-instance" set t1 row-h cf1:c1=H
!cbt -project "test-project" -instance "test-instance" set t1 row-i cf1:c1=I
!cbt -project "test-project" -instance "test-instance" set t1 row-j cf1:c1=J
!cbt -project "test-project" -instance "test-instance" set t1 row-k cf1:c1=K
!cbt -project "test-project" -instance "test-instance" set t1 row-l cf1:c1=L
!cbt -project "test-project" -instance "test-instance" set t1 row-m cf1:c1=M
!cbt -project "test-project" -instance "test-instance" set t1 row-n cf1:c1=N
!cbt -project "test-project" -instance "test-instance" set t1 row-o cf1:c1=O
!cbt -project "test-project" -instance "test-instance" set t1 row-p cf1:c1=P
!cbt -project "test-project" -instance "test-instance" set t1 row-q cf1:c1=Q
!cbt -project "test-project" -instance "test-instance" set t1 row-r cf1:c1=R
!cbt -project "test-project" -instance "test-instance" set t1 row-s cf1:c1=S
!cbt -project "test-project" -instance "test-instance" set t1 row-t cf1:c1=T
!cbt -project "test-project" -instance "test-instance" set t1 row-u cf1:c1=U
!cbt -project "test-project" -instance "test-instance" set t1 row-v cf1:c1=V
!cbt -project "test-project" -instance "test-instance" set t1 row-w cf1:c1=W
!cbt -project "test-project" -instance "test-instance" set t1 row-x cf1:c1=X
!cbt -project "test-project" -instance "test-instance" set t1 row-y cf1:c1=Y
!cbt -project "test-project" -instance "test-instance" set t1 row-z cf1:c1=Z
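
The 26 commands above follow a single pattern: row-a receives A, row-b receives B, and so on. As a sketch (not part of the original notebook), the same argument lists could be generated in a loop. The helper name build_set_commands is hypothetical, and the code below only builds and prints the commands rather than executing them.

```python
import string


def build_set_commands(project="test-project", instance="test-instance",
                       table="t1", column="cf1:c1"):
    """Build one `cbt set` argument list per letter (row-a gets A, row-b gets B, ...)."""
    commands = []
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
        commands.append([
            "cbt", "-project", project, "-instance", instance,
            "set", table, f"row-{lower}", f"{column}={upper}",
        ])
    return commands


commands = build_set_commands()
print(len(commands))
print(" ".join(commands[0]))
```

Each list could then be passed to subprocess.run(cmd, check=True), provided BIGTABLE_EMULATOR_HOST is set in the environment of the child process.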

In [ ]:

import tensorflow as tf
import numpy as np
import tensorflow_io as tfio
import random

random.seed(10)

Quickstart

First, create a client and the table you want to read from.

In [ ]:

# If you are using your own Bigtable instance, replace project_id, instance_id,
# and the table name with suitable values.

client = tfio.bigtable.BigtableClient(project_id="test-project", instance_id="test-instance")
train_table = client.get_table("t1")

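Great! Now you can create a TensorFlow dataset that reads data from the table.

To do so, you need to provide the type of the data you want to read, a list of column names in the format column_family:column_name, and the row_set you want to read.

To create a row_set, use the utility methods provided in the tensorflow_io.bigtable.row_set and tensorflow_io.bigtable.row_range modules. Here, a row_set containing all rows is created.

Keep in mind that Bigtable returns rows in lexicographic order of their row keys, not in the order in which they were written. If the rows are given random row keys, they are effectively shuffled; in this example the keys row-a through row-z were written in sorted order already.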
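
Lexicographic ordering compares keys character by character, which can be surprising for keys with numeric suffixes. The short example below uses plain Python string sorting as a stand-in for this ordering (for ASCII keys, comparing strings and comparing raw bytes give the same result). The key names are made up for illustration.

```python
# Keys in the order they might have been written.
written = ["row-2", "row-10", "row-1"]

# Lexicographic order: "row-10" sorts before "row-2" because "1" < "2".
print(sorted(written))

# Zero-padding the numeric part makes lexicographic order match numeric order.
padded = [f"row-{int(key.split('-')[1]):03d}" for key in written]
print(sorted(padded))
```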

In [ ]:

row_set = tfio.bigtable.row_set.from_rows_or_ranges(tfio.bigtable.row_range.infinite())

train_dataset = train_table.read_rows(["cf1:c1"],row_set, output_type=tf.string)

for tensor in train_dataset:
  print(tensor)

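That's it! Congratulations!

Parallel read

The dataset supports reading from Bigtable in parallel. To do that, use the parallel_read_rows method and pass num_parallel_calls as an argument. When this method is called, the work is first split among workers based on SampleRowKeys.

Note: When reading in parallel, the rows are not returned in any particular order.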

In [ ]:

for tensor in train_table.parallel_read_rows(["cf1:c1"],row_set=row_set, num_parallel_calls=2):
  print(tensor)
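
To make the splitting step more concrete, here is a purely illustrative sketch of how a sorted list of sampled row keys could divide the key space into contiguous ranges that are then dealt out to workers. This is not the connector's actual implementation: the function names, the round-robin assignment, and the use of None for an unbounded end are all assumptions.

```python
def split_by_samples(sample_keys):
    """Turn sample keys into right-open ranges covering the whole key space.

    None stands for "unbounded" at either end.
    """
    bounds = [None] + sorted(sample_keys) + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def assign_round_robin(ranges, num_workers):
    """Deal the ranges out to workers in turn."""
    shards = [[] for _ in range(num_workers)]
    for i, key_range in enumerate(ranges):
        shards[i % num_workers].append(key_range)
    return shards


ranges = split_by_samples(["row-h", "row-p"])
for worker, shard in enumerate(assign_round_robin(ranges, num_workers=2)):
    print(worker, shard)
```

Since each worker would read its own ranges independently, results from different ranges can interleave, which is consistent with the note that parallel reads come back in no particular order.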

Reading specific row_keys

To read data from Bigtable, you can specify a set of rows, a range, or a combination of both.

The read_rows method expects you to provide a RowSet. You can construct a RowSet from specific row keys or from RowRanges as follows:

In [ ]:

row_range_below_300 = tfio.bigtable.row_range.right_open("row000", "row300")

my_row_set = tfio.bigtable.row_set.from_rows_or_ranges(row_range_below_300, "row585", "row832")
print(my_row_set)

Such a row_set contains the range of rows [row000, row300) and the individual rows row585 and row832.
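
The notation [row000, row300) denotes a right-open range: the start key is included and the end key is not. The following plain-Python model of membership in such a row set is written only to illustrate these semantics. The function in_row_set is hypothetical and does not use the connector.

```python
def in_row_set(key, start="row000", end="row300",
               extra_keys=("row585", "row832")):
    """True if key lies in the right-open range [start, end) or is an extra key."""
    return start <= key < end or key in extra_keys


for key in ["row000", "row299", "row300", "row585", "row600"]:
    print(key, in_row_set(key))
```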

You can also create a row_set from an infinite range, an empty range, or a prefix. A row_set can also be intersected with a row_range.

In [ ]:

my_truncated_row_set = tfio.bigtable.row_set.intersect(my_row_set,
                                         tfio.bigtable.row_range.right_open("row200", "row700"))
print(my_truncated_row_set)
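
Assuming the intersection behaves like ordinary set intersection, clipping the range [row000, row300) and the keys row585 and row832 to [row200, row700) should leave the range [row200, row300) and the single key row585, with row832 dropped. The sketch below reproduces that reasoning in plain Python; intersect_model is a hypothetical helper, not the connector's API.

```python
def intersect_model(ranges, keys, clip_start, clip_end):
    """Intersect right-open ranges and single keys with [clip_start, clip_end)."""
    clipped_ranges = []
    for start, end in ranges:
        new_start, new_end = max(start, clip_start), min(end, clip_end)
        if new_start < new_end:  # keep only non-empty overlaps
            clipped_ranges.append((new_start, new_end))
    kept_keys = [key for key in keys if clip_start <= key < clip_end]
    return clipped_ranges, kept_keys


print(intersect_model([("row000", "row300")], ["row585", "row832"],
                      "row200", "row700"))
```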

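Specifying a version of a value

Bigtable lets you keep many values with different timestamps in a single cell. You can specify which versions to select by using version filters. However, the tensorflow_io.bigtable connector can only return a two-dimensional result (one value per row and column), so the latest filter is always appended to the user-specified version filter. In other words, if more than one value in a cell passes the provided filter, the newest one is used.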
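
The effect of combining a user-supplied filter with latest can be modelled in plain Python: first keep the versions that pass the filter, then take the newest survivor. The data and the helper pick_value below are made up for illustration, and treating the time range as right-open is an assumption.

```python
def pick_value(versions, start, end):
    """Return the newest value whose timestamp lies in [start, end), or None.

    `versions` is a list of (timestamp, value) pairs for a single cell.
    """
    passing = [(ts, value) for ts, value in versions if start <= ts < end]
    if not passing:
        return None
    # "latest": of the versions that passed the filter, keep the newest.
    return max(passing, key=lambda pair: pair[0])[1]


cell = [(100, "old"), (200, "newer"), (300, "newest")]
print(pick_value(cell, 50, 250))
```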

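You can either use the latest filter, which passes only the newest value, or specify a time range. The time range can be given either as Python datetime objects or as numbers representing seconds or microseconds since the epoch.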

In [ ]:

from datetime import datetime

start = datetime(2020, 10, 10, 12, 0, 0)
end = datetime(2100, 10, 10, 13, 0, 0)
from_datetime = tfio.bigtable.filters.timestamp_range(start, end)
from_posix_timestamp = tfio.bigtable.filters.timestamp_range(int(start.timestamp()), int(end.timestamp()))
print("from_datetime:", from_datetime)

print("from_posix_timestamp:", from_posix_timestamp)
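
One detail worth keeping in mind when passing numbers instead of datetime objects: datetime.timestamp() interprets a naive datetime (one without tzinfo) in the local time zone, so the same naive datetime can yield different integers on machines in different zones. The following small sketch shows a zone-independent conversion to seconds and microseconds since the epoch, using only the standard library.

```python
from datetime import datetime, timezone

# Attaching UTC makes the result independent of the machine's local time zone.
start_utc = datetime(2020, 10, 10, 12, 0, 0, tzinfo=timezone.utc)

seconds = int(start_utc.timestamp())
microseconds = seconds * 1_000_000 + start_utc.microsecond

print(seconds)       # 1602331200
print(microseconds)  # 1602331200000000
```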