CoCalc -- genome.ipynb

GitHub Repository: tensorflow/docs-l10n
Path: blob/master/site/zh-cn/io/tutorials/genome.ipynb
Kernel: Python 3

Copyright 2020 The TensorFlow Authors.

In [ ]:

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

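Overview

This tutorial demonstrates the tfio.genome package, which provides commonly used genomics IO functionality: reading several genomics file formats, plus some common operations for preparing the data (for example, one-hot encoding or parsing Phred quality into probabilities).

This package uses the Google Nucleus library to provide some of its core functionality.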

Setup

In [ ]:

try:
  %tensorflow_version 2.x
except Exception:
  pass
!pip install tensorflow-io

In [ ]:

import tensorflow_io as tfio
import tensorflow as tf

FASTQ Data

FASTQ is a common genomics file format that stores sequence information together with basic quality information for each base.
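To make the format concrete, here is a minimal pure-Python sketch of how a FASTQ file is laid out. It assumes the common unwrapped layout, where each record is exactly four lines: an `@` header, the bases, a `+` separator, and a quality string of the same length as the bases. `FastqRecord` and `parse_fastq` are hypothetical helpers for illustration only. They are not part of tfio.genome, and the example reads are made up.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type, not part of tfio)."""
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming unwrapped four-line records."""
    # Drop blank lines and trailing newlines before grouping by four.
    rows = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(rows), 4):
        header, sequence, separator, quality = rows[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at non-empty line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads of different lengths.
example = """@read1
GATTACA
+
IIIIIII
@read2
ACGT
+
!!5I
"""

records = list(parse_fastq(example.splitlines()))
for record in records:
    print(record.name, record.sequence, record.quality)
```

Real files can be much larger or use other conventions, so in practice a dedicated reader such as tfio.genome.read_fastq is the better choice.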

First, let's download a sample fastq file.

In [ ]:

# Download some sample data:
!curl -OL https://raw.githubusercontent.com/tensorflow/io/master/tests/test_genome/test.fastq

Read FASTQ Data

Now, let's use tfio.genome.read_fastq to read this file (note that a tf.data API is coming soon).

In [ ]:

fastq_data = tfio.genome.read_fastq(filename="test.fastq")
print(fastq_data.sequences)
print(fastq_data.raw_quality)

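As you can see, the returned fastq_data has fastq_data.sequences, a string tensor of all the sequences in the fastq file (each of which can be a different size), along with fastq_data.raw_quality, which holds the Phred-encoded quality information for each base read in the sequence.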
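To see what "Phred-encoded" means for a raw quality string, the sketch below decodes one character per base into an integer score. It assumes the widely used Phred+33 convention (ASCII code minus 33). `decode_phred33` and `PHRED_OFFSET` are hypothetical names for illustration, not tfio functions, and the input string is made up.

```python
from typing import List

# Assumption: Phred+33 encoding (the Sanger / Illumina 1.8+ convention).
PHRED_OFFSET = 33


def decode_phred33(quality: str) -> List[int]:
    """Map each quality character to its integer Phred score."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


scores = decode_phred33("!5I")
print(scores)  # [0, 20, 40]
```

A higher score means the sequencer was more confident in that base call.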

Quality

If you are interested, you can use a helper op to convert this quality information into probabilities.

In [ ]:

quality = tfio.genome.phred_sequences_to_probability(fastq_data.raw_quality)
print(quality.shape)
print(quality.row_lengths().numpy())
print(quality)
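The arithmetic behind this kind of conversion can be sketched in plain Python. Under the standard Phred definition, a score Q corresponds to a base-calling error probability P = 10^(-Q/10). The sketch assumes integer scores that have already been decoded, and uses nested lists to stand in for a ragged tensor whose rows have different lengths. `phred_scores_to_probabilities` is a hypothetical helper, not the tfio implementation, and the scores are made up.

```python
from typing import List


def phred_scores_to_probabilities(rows: List[List[int]]) -> List[List[float]]:
    """Convert integer Phred scores to error probabilities, row by row."""
    return [[10.0 ** (-q / 10.0) for q in row] for row in rows]


# Two made-up reads with different lengths, like rows of a ragged tensor.
phred_rows = [[0, 10, 20], [30, 40]]
probabilities = phred_scores_to_probabilities(phred_rows)

row_lengths = [len(row) for row in probabilities]
print(row_lengths)  # [3, 2]
print(probabilities)
```

So a score of 10 means roughly a 1 in 10 chance that the base is wrong, a score of 20 roughly 1 in 100, and so on.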

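One-hot encodings

You may also want to encode the genome sequence data (which consists of A, T, C, and G bases) using a one-hot encoder. There's a built-in operation that can help with this.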

In [ ]:

print(tfio.genome.sequences_to_onehot.__doc__)
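For intuition, here is a minimal pure-Python sketch of what one-hot encoding a DNA sequence looks like. The column order (A, C, G, T) and the all-zeros vector for any other character are assumptions made for this illustration, so check the docstring printed above for the behavior of the actual op. `sequence_to_onehot` is a hypothetical helper, not part of tfio.

```python
from typing import Dict, List

# Assumption: column order A, C, G, T. Other characters map to all zeros.
ONE_HOT: Dict[str, List[int]] = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}
UNKNOWN = [0, 0, 0, 0]


def sequence_to_onehot(sequence: str) -> List[List[int]]:
    """Return one 4-element vector per base in the sequence."""
    # Copy each vector so callers cannot mutate the lookup table.
    return [list(ONE_HOT.get(base, UNKNOWN)) for base in sequence]


encoded = sequence_to_onehot("GATN")
for base, vector in zip("GATN", encoded):
    print(base, vector)
```

Each base becomes a row with a single 1, which gives a model a numeric input without implying any ordering between the bases.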
