GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/en-snapshot/io/tutorials/genome.ipynb
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Overview

This tutorial demonstrates the tfio.genome package, which provides commonly used genomics IO functionality: reading several genomics file formats, plus some common operations for preparing the data (for example, one-hot encoding or converting Phred quality scores into probabilities).

This package uses the Google Nucleus library to provide some of the core functionality.

Setup

try:
  %tensorflow_version 2.x
except Exception:
  pass
!pip install tensorflow-io

import tensorflow_io as tfio
import tensorflow as tf

FASTQ Data

FASTQ is a common genomics file format that stores both sequence information and base quality information.
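To make the format concrete, here is a minimal pure-Python sketch of the usual four-line FASTQ record layout: an `@` header line, the sequence, a `+` separator line, and a quality string with one character per base. The `FastqRecord` type, the `parse_fastq` helper, and the sample reads are made up for illustration; they are not part of tfio.genome and do not reflect the contents of test.fastq. The sketch also assumes each record occupies exactly four lines, which not every FASTQ file guarantees.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """Hypothetical container for one FASTQ record (illustration only)."""
    identifier: str  # header text after the leading '@'
    sequence: str    # base calls, e.g. "GATTACA"
    quality: str     # one ASCII quality character per base


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up two-record sample; note the reads have different lengths.
SAMPLE = """\
@read_1
GATTACA
+
IIIIHHG
@read_2
ACGT
+
!!II
"""

records = list(parse_fastq(SAMPLE))
for record in records:
    print(record.identifier, record.sequence, record.quality)
```

Because reads can differ in length, any tensor representation of the sequences and their qualities has to cope with variable-length rows.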

First, let's download a sample FASTQ file.

# Download some sample data:
!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq

Read FASTQ Data

Now, let's use tfio.genome.read_fastq to read this file (note that a tf.data API is coming soon).

fastq_data = tfio.genome.read_fastq(filename="test.fastq")
print(fastq_data.sequences)
print(fastq_data.raw_quality)

As you can see, the returned fastq_data has two fields: fastq_data.sequences, a string tensor of all sequences in the FASTQ file (each of which can be a different length), and fastq_data.raw_quality, which holds the Phred-encoded quality of each base read in the corresponding sequence.

Quality

You can use a helper op to convert this quality information into probabilities if you are interested.

quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)
print(quality.shape)
print(quality.row_lengths().numpy())
print(quality)
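For intuition about what such a conversion involves, the sketch below applies the standard Phred relationship P = 10^(-Q/10), where Q is the quality score and P is the estimated probability that the base call is wrong. It assumes the common Phred+33 encoding, in which Q is the ASCII code of the quality character minus 33. The `phred_to_error_probabilities` function is a hypothetical pure-Python helper, not the tfio op, and whether the op uses exactly this offset is an assumption worth checking against its docstring.

```python
from typing import List

# Assumption: Phred+33 (Sanger-style) encoding, where '!' (ASCII 33) means Q = 0.
PHRED_OFFSET = 33


def phred_to_error_probabilities(quality: str) -> List[float]:
    """Map each quality character to an error probability via P = 10 ** (-Q / 10)."""
    return [10.0 ** (-(ord(ch) - PHRED_OFFSET) / 10.0) for ch in quality]


# '!' -> Q0, '+' -> Q10, '5' -> Q20, 'I' -> Q40
probabilities = phred_to_error_probabilities("!+5I")
for char, prob in zip("!+5I", probabilities):
    print(f"{char!r}: Q={ord(char) - PHRED_OFFSET:2d}  P(error)={prob:.4f}")
```

Under these assumptions a higher quality character corresponds to a lower error probability: Q10 is a 1 in 10 chance of a wrong call, Q40 is 1 in 10,000.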

One-hot encodings

You may also want to encode the genome sequence data (which consists of A, T, C, and G bases) using a one-hot encoder. There's a built-in operation that can help with this.

one_hot = tfio.genome.sequences_to_onehot(fastq_data.sequences)
print(one_hot)
print(one_hot.shape)

print(tfio.genome.sequences_to_onehot.__doc__)
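As a rough illustration of the idea behind one-hot encoding, the pure-Python sketch below maps each base to a four-element vector with a single 1. The channel order (A, C, G, T), the all-zeros vector for unrecognized symbols, and the `one_hot_encode` helper itself are assumptions made for this example; the docstring printed above is the authoritative description of what the tfio op actually does.

```python
from typing import Dict, List

# Assumed channel order A, C, G, T (illustrative; check the op's docstring).
ONE_HOT: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
# Hypothetical choice for symbols outside A/C/G/T, such as 'N'.
UNKNOWN: List[int] = [0, 0, 0, 0]


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Return one 4-element vector per base; the input is upper-cased first."""
    return [list(ONE_HOT.get(base, UNKNOWN)) for base in sequence.upper()]


encoded = one_hot_encode("GATTACA")
for base, vector in zip("GATTACA", encoded):
    print(base, vector)
```

A sequence of length n becomes an n-by-4 matrix, so a batch of reads with different lengths yields rows of different sizes.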