GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/en-snapshot/io/tutorials/genome.ipynb
Kernel: Python 3

Copyright 2020 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Overview

This tutorial demonstrates the tfio.genome package, which provides commonly used genomics I/O functionality: reading several genomics file formats and performing common data-preparation operations (for example, one-hot encoding or converting Phred quality scores into probabilities).

This package uses the Google Nucleus library to provide some of the core functionality.

Setup

In [ ]:

try:
  %tensorflow_version 2.x
except Exception:
  pass
!pip install tensorflow-io

In [ ]:

import tensorflow_io as tfio
import tensorflow as tf

FASTQ Data

FASTQ is a common genomics file format that stores both sequence information and per-base quality information.

First, let's download a sample FASTQ file.

In [ ]:

# Download some sample data:
!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq

Read FASTQ Data

Now, let's use tfio.genome.read_fastq to read this file (note: a tf.data API is coming soon).

In [ ]:

fastq_data = tfio.genome.read_fastq(filename="test.fastq")
print(fastq_data.sequences)
print(fastq_data.raw_quality)

As you can see, the returned fastq_data has two fields: fastq_data.sequences, a string tensor holding every sequence in the FASTQ file (each sequence can be a different length), and fastq_data.raw_quality, which holds the Phred-encoded quality of each base read in the corresponding sequence.
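To make the relationship between sequences and raw quality strings concrete, here is a minimal pure-Python sketch of the FASTQ record layout. It is not how tfio.genome.read_fastq is implemented. It assumes the common layout of exactly four lines per record (header starting with "@", sequence, separator starting with "+", quality string), and both the helper name parse_fastq_text and the two sample records are made up for illustration.

```python
# Minimal sketch of the four-line FASTQ record layout (illustration only,
# not the tfio implementation). The two records are made-up sample data.
SAMPLE_FASTQ = (
    "@read1\n"
    "GATTACA\n"
    "+\n"
    "IIIIHHG\n"
    "@read2\n"
    "ACGT\n"
    "+\n"
    "!!5I\n"
)


def parse_fastq_text(text):
    """Return (sequences, raw_quality) lists from FASTQ-formatted text.

    Assumes every record spans exactly four lines.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")

    sequences, raw_quality = [], []
    for i in range(0, len(lines), 4):
        header, seq, separator, qual = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Each base has exactly one quality character.
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        sequences.append(seq)
        raw_quality.append(qual)
    return sequences, raw_quality


sequences, raw_quality = parse_fastq_text(SAMPLE_FASTQ)
print(sequences)
print(raw_quality)
```

Because the reads differ in length, the two lists are ragged: each entry has its own length, and the quality string always matches the length of its sequence.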

Quality

You can use a helper op to convert this quality information into probabilities.

In [ ]:

quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)
print(quality.shape)
print(quality.row_lengths().numpy())
print(quality)
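For intuition about what such a conversion computes, the following pure-Python sketch applies the usual Phred definition: a quality score Q corresponds to an error probability P = 10^(-Q/10). It assumes the widely used Phred+33 (Sanger) encoding, where Q is the ASCII code of the quality character minus 33. The function name and the offset default are illustrative assumptions; check the encoding of your own data and the op's documentation rather than relying on this sketch.

```python
def phred_string_to_probabilities(quality, offset=33):
    """Convert one Phred-encoded quality string to per-base error probabilities.

    Assumes Phred+33 encoding by default: Q = ord(char) - offset,
    and P(error) = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(char) - offset) / 10) for char in quality]


# '!' is Q=0, '+' is Q=10, '5' is Q=20, 'I' is Q=40 under Phred+33.
probabilities = phred_string_to_probabilities("!+5I")
print(probabilities)
```

Higher quality characters map to smaller error probabilities: each increase of 10 in Q lowers the probability of a wrong base call by a factor of ten.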

One-hot encodings

You may also want to encode the genome sequence data (which consists of the bases A, T, C, and G) using a one-hot encoder. There's a built-in operation that can help with this.

In [ ]:

one_hot = tfio.genome.sequences_to_onehot(fastq_data.sequences)
print(one_hot)
print(one_hot.shape)

In [ ]:

print(tfio.genome.sequences_to_onehot.__doc__)
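The docstring printed above is the authoritative description of the op's mapping. As a rough illustration of the idea, the pure-Python sketch below encodes each base as a four-element vector. The channel order A, C, G, T, the all-zero vector for any other character, and the helper name one_hot_sequence are assumptions made for this example and may not match the op exactly.

```python
# Assumed channel order for this illustration only.
BASE_ORDER = "ACGT"


def one_hot_sequence(sequence):
    """Return a list of 4-element one-hot rows, one per base.

    Characters outside BASE_ORDER (for example 'N') become all zeros.
    """
    rows = []
    for base in sequence:
        row = [0, 0, 0, 0]
        index = BASE_ORDER.find(base)
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


encoded = one_hot_sequence("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

A sequence of length n becomes an n x 4 matrix, so a batch of reads with different lengths yields a ragged result with one matrix per read.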