Copyright 2020 The TensorFlow Authors.
Title
Overview
This notebook demonstrates the basic usage and features of the tensorflow_io.bigtable module. Make sure you are familiar with these topics before continuing:
Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.
Setup
Note: When executing the cell below, you will be asked to log in to Google Cloud.
For the sake of this example, the Bigtable emulator is used. If you already have a Bigtable instance set up and populated with values, skip these steps and go straight to the Quickstart section.
Start the emulator in the background.
Create a table
Populate table with values
Quickstart
First you need to create a client and a table you would like to read from.
Great! Now you can create a TensorFlow dataset that will read the data from the table.
To do that, you have to provide the type of the data you wish to read, a list of column names in the format column_family:column_name, and a row_set that you would like to read.
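To make the column_family:column_name format concrete, here is a minimal sketch that splits such a string into its two parts. The helper parse_column is hypothetical and is not part of tensorflow_io; it only illustrates the naming convention described above, assuming the first colon separates the family from the column name.

```python
def parse_column(spec):
    """Split 'column_family:column_name' at the first colon (hypothetical helper)."""
    family, sep, name = spec.partition(":")
    if not sep or not family:
        raise ValueError(f"expected 'column_family:column_name', got {spec!r}")
    return family, name


# Example column names in the format used by this tutorial.
columns = ["cf1:c1", "cf1:c2"]
parsed = [parse_column(c) for c in columns]
```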
To create a row_set, use the utility methods provided in the tensorflow_io.bigtable.row_set and tensorflow_io.bigtable.row_range modules. Here a row_set containing all rows is created.
Keep in mind that Bigtable reads values in lexicographical order of their row keys, not in the order they were written. The rows were given random row keys, so they will appear shuffled.
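The ordering can be illustrated without Bigtable at all. The sketch below uses plain Python byte strings as stand-ins for row keys (the key names are made up for illustration) and shows why zero-padding numeric suffixes keeps lexicographical and numeric order aligned.

```python
# Row keys are compared byte by byte, so "row10" sorts before "row2".
keys = [b"row2", b"row10", b"row1"]
sorted_keys = sorted(keys)

# Zero-padding the numeric part makes lexicographical order match numeric order.
padded_keys = [b"row%03d" % n for n in (2, 10, 1)]
sorted_padded = sorted(padded_keys)

print(sorted_keys)
print(sorted_padded)
```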
That's it! Congrats!
Parallel read
Our dataset supports reading in parallel from Bigtable. To do that, use the parallel_read_rows method and specify num_parallel_calls as an argument. When this method is called, the work is first split between workers based on SampleRowKeys.
Note: Keep in mind that when reading in parallel, the rows are not going to be read in any particular order.
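The idea of splitting work at sampled row keys can be sketched in plain Python. This is only a conceptual model, not how the connector is implemented: the sample keys, the split_by_samples helper, and the read_shard stand-in are all assumptions made for illustration.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_by_samples(sorted_keys, sample_keys):
    """Split sorted row keys into contiguous shards at the sample boundaries."""
    shards = [[] for _ in range(len(sample_keys) + 1)]
    for key in sorted_keys:
        shards[bisect_right(sample_keys, key)].append(key)
    return [shard for shard in shards if shard]


def read_shard(shard):
    """Stand-in for a worker reading one contiguous key range."""
    return list(shard)


all_keys = [f"row{n:03d}" for n in range(10)]
samples = ["row003", "row007"]  # hypothetical sampled row keys
shards = split_by_samples(all_keys, samples)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(read_shard, shard) for shard in shards]
    # Shards are collected as they finish, so the overall order is not guaranteed.
    rows = [row for future in as_completed(futures) for row in future.result()]
```

Every row is read exactly once, but because the shards complete independently, the combined result has to be treated as unordered.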
Reading specific row_keys
To read the data from Bigtable, you can specify a set of rows, a range of rows, or a combination of those.
The read_rows method expects you to provide a RowSet. You can construct a RowSet from specific row keys or RowRanges as follows:
Such a row_set would contain the range of rows [row000, row300) and the rows row585 and row832.
You can also create a row_set from an infinite range, an empty range, or a prefix. You can also intersect it with a row_range.
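To see what such a row_set selects, here is a small plain-Python model of the semantics. RightOpenRange and in_row_set are hypothetical names used only for this sketch and are not the tensorflow_io API; the model assumes keys are compared lexicographically and that the range excludes its end key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RightOpenRange:
    """Half-open key range [start, end), compared lexicographically."""
    start: str
    end: str

    def contains(self, key):
        return self.start <= key < self.end


def in_row_set(key, row_keys, ranges):
    """A key is selected if it is listed explicitly or falls inside any range."""
    return key in row_keys or any(r.contains(key) for r in ranges)


ranges = [RightOpenRange("row000", "row300")]
row_keys = {"row585", "row832"}

candidates = ["row000", "row299", "row300", "row585", "row586", "row832"]
selected = [k for k in candidates if in_row_set(k, row_keys, ranges)]

# Intersecting with another range keeps only keys that fall inside both.
bound = RightOpenRange("row200", "row600")
intersected = [k for k in selected if bound.contains(k)]
```

Note that row300 is not selected, because the end of the range is exclusive.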
Specifying a version of a value
Bigtable lets you keep many values in one cell with different timestamps. You can specify which version you want to pick using version filters. However, the tensorflow_io.bigtable connector can only retrieve a two-dimensional vector, so the latest filter is always appended to the user-specified version filter. This means that if more than one value for a cell passes the provided filter, the newest one is used.
You can either use the latest filter, which passes only the newest value, or you can specify a time range. The time range can be provided either as Python datetime objects or as a number representing seconds or microseconds since the epoch.
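The combined effect of a time range followed by the latest filter can be modeled in a few lines of plain Python. The helpers to_micros and latest_in_range are hypothetical and not part of the connector. The sketch also assumes the range includes its start and excludes its end, which is how Bigtable timestamp ranges are generally defined.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(dt):
    """Convert a timezone-aware datetime to whole microseconds since the epoch."""
    return (dt - EPOCH) // timedelta(microseconds=1)


def latest_in_range(versions, start_micros=None, end_micros=None):
    """Return the newest value whose timestamp lies in [start, end), or None."""
    candidates = [
        (ts, value)
        for ts, value in versions
        if (start_micros is None or ts >= start_micros)
        and (end_micros is None or ts < end_micros)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# One cell holding three versions of a value (timestamps in microseconds).
cell = [
    (to_micros(utc(2020, 1, 1)), "old"),
    (to_micros(utc(2021, 1, 1)), "mid"),
    (to_micros(utc(2022, 1, 1)), "new"),
]

newest = latest_in_range(cell)
within = latest_in_range(cell, to_micros(utc(2020, 6, 1)), to_micros(utc(2021, 6, 1)))
nothing = latest_in_range(cell, end_micros=to_micros(utc(2020, 1, 1)))
```

With no range the newest version wins. With a range, only versions inside it compete, and an empty result means the cell yields no value.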