Development
This document contains the information needed to set up the development environment and build the tensorflow-io package from source on various platforms. Once the setup is complete, please refer to the STYLE_GUIDE for guidelines on adding new ops.
IDE Setup
For instructions on how to configure Visual Studio Code for developing TensorFlow I/O, please refer to this doc.
Lint
TensorFlow I/O's code conforms to Bazel Buildifier, Clang Format, Black, and Pyupgrade. Please use the following command to check the source code and identify lint issues:
For Bazel Buildifier and Clang Format, the following command will automatically identify and fix any lint errors:
Alternatively, if you only want to run a lint check with individual linters, you can selectively pass black, pyupgrade, bazel, or clang to the above commands.
For example, a black-specific lint check can be done using:
Lint fix using Bazel Buildifier and Clang Format can be done using:
Lint check using black and pyupgrade for an individual Python file can be done using:
Lint fix for an individual Python file with black and pyupgrade can be done using:
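As a rough illustration of how the linter names and file paths described above combine into one invocation, the following Python sketch assembles the argument list. The Bazel targets //tools/lint:check and //tools/lint:lint, the "--" separators, and the helper name lint_command are assumptions that do not appear in the text above, so verify them against the repository before relying on them.

```python
from typing import List, Sequence

# ASSUMPTION: these Bazel targets are not named in the text above; treat them
# as placeholders for the repository's lint entry points.
CHECK_TARGET = "//tools/lint:check"
FIX_TARGET = "//tools/lint:lint"

# The four linters listed in this section.
KNOWN_LINTERS = ("black", "pyupgrade", "bazel", "clang")


def lint_command(fix: bool = False,
                 linters: Sequence[str] = (),
                 files: Sequence[str] = ()) -> List[str]:
    """Build an argv list for a lint check (default) or a lint fix run."""
    unknown = [name for name in linters if name not in KNOWN_LINTERS]
    if unknown:
        raise ValueError("unknown linter(s): " + ", ".join(unknown))

    cmd = ["bazel", "run", FIX_TARGET if fix else CHECK_TARGET]
    if linters or files:
        cmd.append("--")  # arguments after this are forwarded to the lint tool
        cmd.extend(linters)
    if files:
        # ASSUMPTION: a second "--" separates linter names from file paths.
        cmd.append("--")
        cmd.extend(files)
    return cmd
```

The returned list can be handed to subprocess.run, or joined with spaces to inspect the command before running it.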
Python
macOS
On macOS Catalina 10.15.7, it is possible to build tensorflow-io with the system-provided python 3.8.2. Both tensorflow and bazel are needed to do so.
NOTE: The system default python 3.8.2 on macOS 10.15.7 causes a regex installation error because of the compiler options -arch arm64 -arch x86_64 (similar to the issue mentioned in https://github.com/giampaolo/psutil/issues/1832). To work around this issue, export ARCHFLAGS="-arch x86_64" is needed to remove the arm64 build option.
NOTE: When running pytest, TFIO_DATAPATH=bazel-bin has to be passed so that python can use the shared libraries generated by the build process.
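When tests are launched from a script rather than typed at a shell prompt, the same variable can be placed in the child process environment. The sketch below is one minimal way to do that; the helper names pytest_env and run_pytest are hypothetical and not part of tensorflow-io.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional


def pytest_env(datapath: str = "bazel-bin",
               base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with TFIO_DATAPATH set to datapath."""
    env = dict(os.environ if base is None else base)
    env["TFIO_DATAPATH"] = datapath
    return env


def run_pytest(test_path: str, datapath: str = "bazel-bin") -> int:
    """Run pytest on test_path in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-s", "-v", test_path],
        env=pytest_env(datapath),
        check=False,  # report failures through the return code instead of raising
    )
    return result.returncode
```

A call such as run_pytest("tests/some_test.py"), where the path is a placeholder, runs that test file with the variable set and leaves the parent environment unchanged.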
Troubleshoot
If Xcode is installed, but $ xcodebuild -version is not displaying the expected output, you might need to enable Xcode command line with the command: $ xcode-select -s /Applications/Xcode.app/Contents/Developer.
A terminal restart might be required for the changes to take effect.
Sample output:
Linux
Development of tensorflow-io on Linux is similar to macOS. The required packages are gcc, g++, git, bazel, and python 3. However, versions of gcc or python newer than the default system-installed ones might be required.
Ubuntu 20.04
Ubuntu 20.04 requires gcc/g++, git, and python 3. The following will install dependencies and build the shared libraries on Ubuntu 20.04:
CentOS 8
The steps to build shared libraries on CentOS 8 are similar to those for Ubuntu 20.04 above, except that a different install command is used to install gcc/g++, git, unzip/which (for bazel), and python3.
CentOS 7
On CentOS 7, the default python and gcc versions are too old to build tensorflow-io's shared libraries (.so). The gcc provided by Developer Toolset and rh-python36 should be used instead. Also, libstdc++ has to be linked statically to avoid a mismatch between the libstdc++ installed on CentOS and the one expected by the newer gcc from devtoolset.
Furthermore, a special flag --//tensorflow_io/core:static_build has to be passed to Bazel in order to avoid duplication of symbols in statically linked libraries for file system plugins.
The following will install bazel, devtoolset-9, rh-python36, and build the shared libraries:
Docker
For Python development, a reference Dockerfile here can be used to build the TensorFlow I/O package (tensorflow-io) from source. Additionally, the pre-built devel images can be used as well:
A package file dist/tensorflow_io-*.whl will be generated after a build is successful.
NOTE: When working in the Python development container, an environment variable TFIO_DATAPATH is automatically set to point tensorflow-io to the shared C++ libraries built by Bazel to run pytest and build the bdist_wheel. Python setup.py can also accept --data [path] as an argument, for example python setup.py --data bazel-bin bdist_wheel.
NOTE: While the tfio-dev container gives developers an easy-to-use environment, the released whl packages are built differently due to manylinux2010 requirements. Please check the [Build Status and CI] section for more details on how the released whl packages are generated.
Python Wheels
It is possible to build python wheels after bazel build is complete with the following command:
The .whl file will be available in the dist directory. Note that the bazel binary directory bazel-bin has to be passed with the --data argument in order for setup.py to locate the necessary shared objects, as bazel-bin is outside of the tensorflow_io package directory.
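Because setup.py depends on the shared objects under the directory passed with --data, it can help to confirm that they exist before building the wheel. The sketch below is a hypothetical helper, not part of tensorflow-io, and it assumes the built libraries have file names ending in .so.

```python
import os
from typing import List


def find_shared_objects(data_dir: str = "bazel-bin",
                        suffix: str = ".so") -> List[str]:
    """Return the sorted paths of files under data_dir whose names end with suffix."""
    found = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(suffix):
                found.append(os.path.join(root, name))
    return sorted(found)
```

An empty result suggests that the bazel build has not finished or that the wrong directory was given, in which case the wheel would be built without its native libraries.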
Alternatively, source install could be done with:
with TFIO_DATAPATH=bazel-bin passed for the same reason.
Note that installing with -e is different from the above. An editable install will not install the shared objects automatically, even with TFIO_DATAPATH=bazel-bin. Instead, TFIO_DATAPATH=bazel-bin has to be passed every time the program is run after the install:
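Since a forgotten variable is easy to miss with an editable install, a program can check for it at startup and fail with a clear message. The following is a minimal sketch of such a guard; resolve_datapath is a hypothetical name, and tensorflow-io itself does not provide this function.

```python
import os
from typing import Mapping, Optional


def resolve_datapath(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the directory named by TFIO_DATAPATH, or raise if it is unusable."""
    env = os.environ if env is None else env
    datapath = env.get("TFIO_DATAPATH")
    if not datapath:
        raise RuntimeError(
            "TFIO_DATAPATH is not set; pass TFIO_DATAPATH=bazel-bin when "
            "running against an editable install"
        )
    if not os.path.isdir(datapath):
        raise RuntimeError(f"TFIO_DATAPATH={datapath!r} is not a directory")
    return datapath
```

Calling resolve_datapath() before importing tensorflow_io turns a missing variable into an explicit error message at startup.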
Testing
Some tests require launching a test container or starting a local instance of the associated tool before running. For example, to run the kafka-related tests, which start a local instance of kafka, zookeeper, and schema-registry, use:
Testing Datasets associated with tools such as Elasticsearch or MongoDB requires docker to be available on the system. In such scenarios, use:
Additionally, testing some features of tensorflow-io doesn't require you to spin up any additional tools, as the data is provided in the tests directory itself. For example, to run the tests related to parquet datasets, use:
R
We provide a reference Dockerfile here for you so that you can use the R package directly for testing. You can build it via:
Inside the container, you can start your R session, instantiate a SequenceFileDataset from an example Hadoop SequenceFile string.seq, and then use any transformation functions provided by tfdatasets package on the dataset like the following: