Copyright 2020 The TensorFlow Authors.
Overview
This tutorial demonstrates the tfio.genome package, which provides commonly used genomics IO functionality: reading several genomics file formats and performing some common operations for preparing the data (for example, one-hot encoding or parsing Phred quality into probabilities).
This package uses the Google Nucleus library to provide some of the core functionality.
Setup
FASTQ Data
FASTQ is a common genomics file format that stores both sequence information and base quality information.
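To make the format concrete, here is a minimal plain-Python sketch of a single FASTQ record, assuming the common four-line layout (header, sequence, separator, quality). The record contents are made up for illustration, and `parse_record` is a hypothetical helper, not part of tfio.genome.

```python
# A made-up FASTQ record for illustration: four lines per read.
RECORD = """@read_1
GATTACA
+
IIIIHH#
"""


def parse_record(text):
    """Split one four-line FASTQ record into (read_id, sequence, quality)."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@"):
        raise ValueError("FASTQ header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("FASTQ separator line must start with '+'")
    # Every base has exactly one quality character.
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return header[1:], sequence, quality


read_id, sequence, quality = parse_record(RECORD)
print(read_id, sequence, quality)
```

The length check reflects the key property of the format: the quality string carries one character per base in the sequence.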
First, let's download a sample FASTQ file.
Read FASTQ Data
Now, let's use tfio.genome.read_fastq to read this file (note that a tf.data API is coming soon).
As you can see, the returned fastq_data has fastq_data.sequences, which is a string tensor of all sequences in the FASTQ file (each of which can be a different length), along with fastq_data.raw_quality, which holds Phred-encoded quality information for each base read in the sequence.
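The shape of this result can be sketched without TensorFlow. The plain-Python example below is an illustration only, not how tfio.genome.read_fastq is implemented: it writes two made-up records to a temporary file and reads them back into two parallel lists, one of sequences and one of raw quality strings. The helper name `read_fastq_lists` and the sample data are assumptions for this sketch.

```python
import os
import tempfile

# Made-up records for illustration; note the reads differ in length.
SAMPLE = "@read_1\nGATTACA\n+\nIIIIHH#\n@read_2\nACGT\n+\n!!II\n"


def read_fastq_lists(path):
    """Return parallel lists of sequences and raw quality strings."""
    sequences, raw_quality = [], []
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    # Assumes each record occupies exactly four lines.
    for i in range(0, len(lines) - 3, 4):
        sequences.append(lines[i + 1])
        raw_quality.append(lines[i + 3])
    return sequences, raw_quality


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.fastq")
    with open(path, "w") as handle:
        handle.write(SAMPLE)
    sequences, raw_quality = read_fastq_lists(path)

print(sequences)
print(raw_quality)
```

Entry i of the quality list always has the same length as entry i of the sequence list, even though lengths vary from read to read.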
Quality
You can use a helper op to convert this quality information into probabilities if you are interested.
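As a rough illustration of what such a conversion involves, the sketch below applies the standard Phred relation P = 10^(-Q/10), assuming the widely used Phred+33 (Sanger) character encoding. The function name `phred_to_error_probability` and the `offset` parameter are hypothetical, and the result is the estimated probability that a base call is wrong; the helper op in tfio.genome may differ in its details.

```python
def phred_to_error_probability(raw_quality, offset=33):
    """Convert a Phred-encoded quality string to per-base error probabilities.

    Assumes Phred+33 encoding: Q = ord(char) - 33, and P(error) = 10 ** (-Q / 10).
    """
    return [10 ** (-(ord(ch) - offset) / 10) for ch in raw_quality]


# '!' encodes Q=0, '+' encodes Q=10, '5' encodes Q=20, 'I' encodes Q=40.
probabilities = phred_to_error_probability("!+5I")
for char, p in zip("!+5I", probabilities):
    print(repr(char), p)
```

Higher quality scores map to smaller error probabilities, with each 10-point increase in Q reducing the probability by a factor of ten.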
One hot encodings
You may also want to encode the genome sequence data (which consists of A, T, C, and G bases) using a one-hot encoder. There's a built-in operation that can help with this.
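For readers unfamiliar with the idea, here is a minimal plain-Python sketch of one-hot encoding a DNA sequence. The column order A, C, G, T and the choice to map any other character to an all-zero row are assumptions made for this illustration; the built-in operation may use a different order or a different convention for unknown bases, so check its docstring before relying on a particular layout.

```python
BASES = "ACGT"  # assumed column order for this sketch


def one_hot(sequence):
    """Return one row per base, with a single 1 in that base's column."""
    rows = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        index = BASES.find(base)
        if index >= 0:  # unknown characters (e.g. 'N') stay all zeros here
            row[index] = 1
        rows.append(row)
    return rows


encoded = one_hot("GATN")
for base, row in zip("GATN", encoded):
    print(base, row)
```

Because reads can differ in length, encoding a batch of sequences this way yields a ragged structure: one matrix per read, each with four columns but a different number of rows.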