GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/en-snapshot/io/tutorials/bigtable.ipynb
Kernel: Python 3
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Title

Overview

This notebook demonstrates the basic usage and features of the tensorflow_io.bigtable module. Make sure you are familiar with these topics before continuing:

  1. Creating a GCP project.

  2. Installing the Cloud SDK for Bigtable

  3. cbt tool overview

  4. Using the emulator

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.
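Outside a notebook, the same effect can be approximated with the standard library. The sketch below only illustrates what a line such as `!echo $table_name` roughly does; the variable `table_name` is a hypothetical example, a POSIX shell is assumed, and IPython's actual implementation differs in its details.

```python
import shlex
import subprocess

table_name = "t1"  # hypothetical variable, used only for this illustration

# Notebook form:   !echo $table_name
# Rough plain-Python equivalent: substitute the value, then run it in a shell.
command = f"echo {shlex.quote(table_name)}"
result = subprocess.run(command, shell=True, capture_output=True, text=True, check=True)
shell_output = result.stdout.strip()
print(shell_output)
```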

Setup

!pip install tensorflow-io

Note: When you execute the cell below, you will be asked to log in to Google Cloud.

!mkdir /tools/google-cloud-sdk/.install
!gcloud --quiet components install beta cbt bigtable
!gcloud init

For the sake of this example, the Bigtable emulator is used. If you already have your own Bigtable instance set up and populated with values, skip these steps and go straight to the Quickstart section.

Start the emulator in the background.

import os
import subprocess

_emulator = subprocess.Popen(
    ['/tools/google-cloud-sdk/bin/gcloud', 'beta', 'emulators', 'bigtable',
     'start', '--host-port=127.0.0.1:8086'],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, bufsize=0)
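Because `subprocess.Popen` returns immediately, the emulator may not be accepting connections yet when the next cell runs. One possible way to guard against that is to poll the port first. The helper `wait_for_port` below is a hypothetical name introduced for this sketch; it is not part of tensorflow_io or the Cloud SDK.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait briefly and retry.
            time.sleep(interval)
    return False
```

With the emulator settings used here, a call such as `wait_for_port("127.0.0.1", 8086)` could be placed before creating the table.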

Create a table

%env BIGTABLE_EMULATOR_HOST=127.0.0.1:8086
!cbt -project "test-project" -instance "test-instance" createtable t1 families=cf1 splits=row-a,row-h,row-p,row-z
!cbt -project "test-project" -instance "test-instance" ls

Populate table with values

!cbt -project "test-project" -instance "test-instance" set t1 row-a cf1:c1=A
!cbt -project "test-project" -instance "test-instance" set t1 row-b cf1:c1=B
!cbt -project "test-project" -instance "test-instance" set t1 row-c cf1:c1=C
!cbt -project "test-project" -instance "test-instance" set t1 row-d cf1:c1=D
!cbt -project "test-project" -instance "test-instance" set t1 row-e cf1:c1=E
!cbt -project "test-project" -instance "test-instance" set t1 row-f cf1:c1=F
!cbt -project "test-project" -instance "test-instance" set t1 row-g cf1:c1=G
!cbt -project "test-project" -instance "test-instance" set t1 row-h cf1:c1=H
!cbt -project "test-project" -instance "test-instance" set t1 row-i cf1:c1=I
!cbt -project "test-project" -instance "test-instance" set t1 row-j cf1:c1=J
!cbt -project "test-project" -instance "test-instance" set t1 row-k cf1:c1=K
!cbt -project "test-project" -instance "test-instance" set t1 row-l cf1:c1=L
!cbt -project "test-project" -instance "test-instance" set t1 row-m cf1:c1=M
!cbt -project "test-project" -instance "test-instance" set t1 row-n cf1:c1=N
!cbt -project "test-project" -instance "test-instance" set t1 row-o cf1:c1=O
!cbt -project "test-project" -instance "test-instance" set t1 row-p cf1:c1=P
!cbt -project "test-project" -instance "test-instance" set t1 row-q cf1:c1=Q
!cbt -project "test-project" -instance "test-instance" set t1 row-r cf1:c1=R
!cbt -project "test-project" -instance "test-instance" set t1 row-s cf1:c1=S
!cbt -project "test-project" -instance "test-instance" set t1 row-t cf1:c1=T
!cbt -project "test-project" -instance "test-instance" set t1 row-u cf1:c1=U
!cbt -project "test-project" -instance "test-instance" set t1 row-v cf1:c1=V
!cbt -project "test-project" -instance "test-instance" set t1 row-w cf1:c1=W
!cbt -project "test-project" -instance "test-instance" set t1 row-x cf1:c1=X
!cbt -project "test-project" -instance "test-instance" set t1 row-y cf1:c1=Y
!cbt -project "test-project" -instance "test-instance" set t1 row-z cf1:c1=Z
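The 26 commands above follow a single pattern, so they could also be generated programmatically. The sketch below only builds the argument lists and does not contact Bigtable; the constant and function names are illustrative, not part of any library.

```python
import string

PROJECT = "test-project"
INSTANCE = "test-instance"
TABLE = "t1"


def build_set_commands():
    """Return one cbt `set` argument list per letter, for row-a through row-z."""
    commands = []
    for lower, upper in zip(string.ascii_lowercase, string.ascii_uppercase):
        commands.append([
            "cbt", "-project", PROJECT, "-instance", INSTANCE,
            "set", TABLE, f"row-{lower}", f"cf1:c1={upper}",
        ])
    return commands


commands = build_set_commands()
print(len(commands))
print(" ".join(commands[0]))
```

Each list could then be passed to `subprocess.run(cmd, check=True)`, assuming `cbt` is on the PATH and `BIGTABLE_EMULATOR_HOST` is set in the environment.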
import tensorflow as tf
import numpy as np
import tensorflow_io as tfio
import random

random.seed(10)

Quickstart

First you need to create a client and a table you would like to read from.

# If using your bigtable instance replace the project_id, instance_id
# and the name of the table with suitable values.

client = tfio.bigtable.BigtableClient(project_id="test-project", instance_id="test-instance")
train_table = client.get_table("t1")

Great! Now you can create a TensorFlow dataset that will read the data from the table.

To do that, you have to provide the type of the data you wish to read, a list of column names in the format column_family:column_name, and the row_set that you would like to read.
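To make the column_family:column_name format concrete, here is a minimal sketch that splits such a string into its two parts. `split_column` is a hypothetical helper written for this illustration; it is not a tensorflow_io function, and the connector's own parsing may differ.

```python
def split_column(column):
    """Split 'family:qualifier' at the first colon and validate both parts."""
    family, separator, qualifier = column.partition(":")
    if not separator or not family or not qualifier:
        raise ValueError(f"expected 'column_family:column_name', got {column!r}")
    return family, qualifier


family, qualifier = split_column("cf1:c1")
print(family, qualifier)
```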

To create a row_set, use the utility methods provided in the tensorflow_io.bigtable.row_set and tensorflow_io.bigtable.row_range modules. Here a row_set containing all rows is created.

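Keep in mind that Bigtable returns rows in lexicographical order of their row keys, not in the order they were inserted. In this example the keys row-a through row-z were written in sorted order, so the rows come back in that same order; with arbitrary keys the read order generally differs from the insertion order.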

row_set = tfio.bigtable.row_set.from_rows_or_ranges(tfio.bigtable.row_range.infinite())

train_dataset = train_table.read_rows(["cf1:c1"], row_set, output_type=tf.string)

for tensor in train_dataset:
  print(tensor)
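Lexicographical ordering can be surprising when row keys contain numbers. The short sketch below uses plain Python byte strings, which sort byte by byte, to show the effect; the keys are made up for the illustration and are not the keys of table t1.

```python
# Hypothetical row keys, listed in insertion order.
row_keys = [b"row-10", b"row-2", b"row-1", b"row-b", b"row-a", b"row-B"]

# Byte-wise sort: digits come before uppercase letters, which come before
# lowercase letters, and "row-10" sorts before "row-2".
sorted_keys = sorted(row_keys)
print(sorted_keys)

# Zero-padding numeric suffixes makes byte order agree with numeric order.
padded_keys = sorted([b"row-10", b"row-02", b"row-01"])
print(padded_keys)
```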

That's it! Congrats!

Parallel read

The dataset supports reading in parallel from Bigtable. To do that, use the parallel_read_rows method and specify num_parallel_calls as an argument. When this method is called, work is first split between workers based on SampleRowKeys.

Note: Keep in mind that when reading in parallel, the rows are not going to be read in any particular order.

for tensor in train_table.parallel_read_rows(["cf1:c1"], row_set=row_set, num_parallel_calls=2):
  print(tensor)
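To give an intuition for how sampled row keys can divide the work, the sketch below cuts the key space into contiguous right-open ranges and hands them to workers in round-robin fashion. This is only a conceptual model: the function names are hypothetical, the sample keys simply reuse the split points passed to createtable earlier, and the connector's real partitioning logic may differ.

```python
def split_into_ranges(sample_keys):
    """Turn boundary keys into contiguous right-open (start, end) ranges.

    An empty string stands for "unbounded" on that side.
    """
    boundaries = [""] + sorted(sample_keys) + [""]
    return list(zip(boundaries[:-1], boundaries[1:]))


def assign_round_robin(ranges, num_workers):
    """Distribute ranges over workers one at a time, in order."""
    buckets = [[] for _ in range(num_workers)]
    for index, key_range in enumerate(ranges):
        buckets[index % num_workers].append(key_range)
    return buckets


sample_keys = ["row-a", "row-h", "row-p", "row-z"]  # assumed sample keys
ranges = split_into_ranges(sample_keys)
workers = assign_round_robin(ranges, num_workers=2)

for worker_id, assigned in enumerate(workers):
    print(worker_id, assigned)
```

Since each worker reads its own ranges independently, the combined output interleaves rows from different ranges, which is consistent with the note above about ordering.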

Reading specific row_keys

To read the data from Bigtable, you can specify a set of rows, a range, or a combination of both.

The read_rows method expects you to provide a RowSet. You can construct a RowSet from specific row keys or RowRanges as follows:

row_range_below_300 = tfio.bigtable.row_range.right_open("row000", "row300")

my_row_set = tfio.bigtable.row_set.from_rows_or_ranges(row_range_below_300, "row585", "row832")
print(my_row_set)

Such a row_set contains the range of rows [row000, row300) and the individual rows row585 and row832.
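The right-open notation [row000, row300) means the start key is included and the end key is excluded. The following pure-Python model, which uses hypothetical names and does not call tensorflow_io, shows which keys such a row set would match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RightOpenRange:
    """A key range that includes `start` and excludes `end`."""
    start: str
    end: str

    def __contains__(self, key):
        return self.start <= key < self.end


def in_row_set(key, ranges, keys):
    """True if `key` is one of the explicit keys or falls inside any range."""
    return key in keys or any(key in key_range for key_range in ranges)


ranges = [RightOpenRange("row000", "row300")]
keys = {"row585", "row832"}

for candidate in ["row000", "row299", "row300", "row585", "row586"]:
    print(candidate, in_row_set(candidate, ranges, keys))
```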

You can also create a row_set from an infinite range, an empty range, or a prefix. You can also intersect it with a row_range.

my_truncated_row_set = tfio.bigtable.row_set.intersect(my_row_set,
    tfio.bigtable.row_range.right_open("row200", "row700"))
print(my_truncated_row_set)
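As a rough model of what such an intersection is expected to keep, the sketch below clips each right-open range to the limiting range and drops explicit keys that fall outside it. The functions are hypothetical stand-ins operating on plain tuples and strings; they are not the tensorflow_io implementation.

```python
def intersect_range(first, second):
    """Intersect two right-open (start, end) ranges; None if they do not overlap."""
    start = max(first[0], second[0])
    end = min(first[1], second[1])
    return (start, end) if start < end else None


def intersect_row_set(ranges, keys, limit):
    """Clip every range to `limit` and keep only the keys inside `limit`."""
    clipped = (intersect_range(key_range, limit) for key_range in ranges)
    kept_ranges = [key_range for key_range in clipped if key_range is not None]
    kept_keys = [key for key in keys if limit[0] <= key < limit[1]]
    return kept_ranges, kept_keys


kept_ranges, kept_keys = intersect_row_set(
    ranges=[("row000", "row300")],
    keys=["row585", "row832"],
    limit=("row200", "row700"),
)
print(kept_ranges)
print(kept_keys)
```

In this model the range shrinks to [row200, row300), row585 is kept, and row832 is dropped because it lies beyond row700.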

Specifying a version of a value

Bigtable lets you keep many values in one cell, each with a different timestamp. You can specify which version you want to pick using version filters. However, the tensorflow_io.bigtable connector can only return a two-dimensional vector, so the latest filter is always appended to the user-specified version filter. This means that if more than one value for a cell passes the provided filter, the newest one is used.

You can either use the latest filter, which passes only the newest value, or specify a time range. The time range can be provided either as Python datetime objects or as numbers representing seconds or microseconds since the epoch.

from datetime import datetime

start = datetime(2020, 10, 10, 12, 0, 0)
end = datetime(2100, 10, 10, 13, 0, 0)
from_datetime = tfio.bigtable.filters.timestamp_range(start, end)
from_posix_timestamp = tfio.bigtable.filters.timestamp_range(int(start.timestamp()), int(end.timestamp()))
print("from_datetime:", from_datetime)
print("from_posix_timestamp:", from_posix_timestamp)
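The combined effect of a timestamp range followed by the latest filter can be modelled in plain Python. The sketch below makes several assumptions: the helper names are hypothetical, the datetimes are timezone-aware UTC (the cell above uses naive ones), the cell versions are invented, and the range is treated as including its start and excluding its end.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(moment):
    """Convert an aware datetime to integer microseconds since the epoch."""
    return (moment - EPOCH) // timedelta(microseconds=1)


def pick_version(versions, start_micros, end_micros):
    """Keep versions with start <= timestamp < end, then return the newest value."""
    in_range = [v for v in versions if start_micros <= v[0] < end_micros]
    if not in_range:
        return None
    return max(in_range, key=lambda version: version[0])[1]


start = to_micros(datetime(2020, 10, 10, 12, 0, 0, tzinfo=timezone.utc))
end = to_micros(datetime(2100, 10, 10, 13, 0, 0, tzinfo=timezone.utc))

# Invented (timestamp_micros, value) versions of a single cell.
versions = [
    (to_micros(datetime(2019, 1, 1, tzinfo=timezone.utc)), "too old"),
    (to_micros(datetime(2021, 1, 1, tzinfo=timezone.utc)), "older match"),
    (to_micros(datetime(2022, 1, 1, tzinfo=timezone.utc)), "newest match"),
]

chosen = pick_version(versions, start, end)
print(chosen)
```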