Background
There are a growing number of RNA gene families and RNA
motifs [ 1 2 ] . Many (though not all) RNAs conserve a
base-paired RNA secondary structure. Computational analyses
of RNA sequence families are more powerful if they take
into account both primary sequence and secondary structure
consensus [ 3 4 ] .
Some excellent approaches have been developed for
database searching with RNA secondary structure consensus
patterns. Exact- and approximate-match pattern searches
(analogous to PROSITE patterns for proteins) have been
extended to allow patterns to specify long-range base
pairing constraints [ 5 6 ] . In several cases, specialized
programs have been developed to recognize specific RNA
structures [ 4 ] - for example, programs exist for
detecting transfer RNA genes [ 7 8 9 ] , group I catalytic
introns [ 10 ] , and small nucleolar RNAs [ 11 12 ] . All
of these approaches, though powerful, lack generality, and
they require expert knowledge about each particular RNA
family of interest.
In primary sequence analysis, the most useful analysis
techniques are general primary sequence alignment
algorithms with probabilistically based scoring systems -
for example, the BLAST [ 13 ] , FASTA [ 14 ] , or CLUSTALW
[ 15 ] algorithms, and the PAM [ 16 ] or BLOSUM [ 17 ]
score matrices. Unlike specialized programs, a general
alignment algorithm can be applied to find homologs of any
query sequence(s). Unlike pattern searches, which give
yes/no answers for whether a candidate sequence is a match,
a scoring system gives a meaningful score that allows
ranking candidate hits by their statistical significance.
It is of interest to develop general alignment algorithms
for RNA secondary structures.
The problem I consider here is as follows. I am given a
multiple alignment of an RNA sequence family for which I
know the consensus secondary structure. I want to search a
sequence database for homologs that significantly match the
sequence
and structure of my query. The
sequence analysis analogue is the use of profile hidden
Markov models (profile HMMs) to model multiple alignments
of conserved protein domains, and to discover new
homologues in sequence databases [ 18 19 ] . For instance,
if we had an RNA structure equivalent of the HMMER profile
HMM program suite http://hmmer.wustl.edu/it would be
possible to develop and efficiently maintain databases of
conserved RNA structures and multiple alignments, analogous
to the Pfam or SMART databases of conserved protein domains
[ 20 21 ] .
Stochastic context free grammar (SCFG) algorithms
provide a general approach to RNA structure alignment [ 22
23 24 ] . SCFGs allow the strong pairwise residue
correlations in non-pseudoknotted RNA secondary structure
to be taken into account in RNA alignments. SCFGs can be
aligned to sequences using a dynamic programming algorithm
that guarantees finding a mathematically optimal solution
in polynomial time. SCFG alignment algorithms can be
thought of as an extension of sequence alignment algorithms
(particularly those with fully probabilistic, hidden Markov
model formulations) into an additional dimension necessary
to deal with 2D RNA secondary structure.
While SCFGs provide a natural mathematical framework for
RNA secondary structure alignment problems, SCFG algorithms
have high computational complexity that has impeded their
practical application. Optimal SCFG-based structural
alignment of an RNA structure to a sequence costs
O (
N 3) memory and
O (
N 4) time for a sequence of length
N, compared to
O (
N 2) memory and time for sequence
alignment algorithms. (Corpet and Michot described a
program that implements a different general dynamic
programming algorithm for RNA alignment; their algorithm
solves the same problem but even less efficiently,
requiring
O (
N 4) memory and
O (
N 5) time [ 25 ] .) SCFG-based
alignments of small structural RNAs are feasible. Using my
COVE software
http://www.genetics.wustl.edu/eddy/software#cove, transfer
RNA alignments (~75 nucleotides) take about 0.2 cpu second
and 3 Mb of memory. Most genome centers now use an
COVE-based search program, tRNAscan-SE, for annotating
transfer RNA genes [ 9 ] . However, many larger RNAs of
interest are OUTSIDE the capabilities of the standard SCFG
alignment algorithm. Alignment of a small subunit (SSU)
ribosomal RNA sequence to the SSU rRNA consensus structure
would take about 23 GB of RAM and an hour of CPU time.
Applying SCFG methods to RNAs this large has required
clever heuristics, such as using a precalculation of
confidently predicted regions of primary sequence alignment
to strongly constrain which parts of the SCFG dynamic
programming matrix need to be calculated [ 26 ] . The steep
memory requirement remains a significant barrier to the
practicality of SCFG algorithms.
Notredame
et al. pointed specifically to this
problem [ 27 ] . They described RAGA, a program that uses a
genetic algorithm (GA) to optimize a pairwise RNA alignment
using an objective function that includes base pairing
terms. Because GAs have an O(
N ) memory requirement, RAGA can find
reasonable solutions for large RNA alignment problems,
including ribosomal RNA alignments. A different
memory-efficient approach has also been described [ 28 29 ]
. However, both approaches are approximate and cannot
guarantee a mathematically optimal solution, in contrast to
the (mathematically) optimal but more expensive dynamic
programming approaches.
Here, I introduce a dynamic programming solution to the
problem of structural alignment of large RNAs. The central
idea is a divide and conquer strategy. For linear sequence
alignment, a divide and conquer algorithm was introduced by
Hirschberg [ 30 ] , an algorithm known in the computational
biology community as the Myers/Miller algorithm [ 31 ] .
(Ironically, at the time, dynamic programming methods for
optimal sequence alignment were well known, but were
considered impractical on 1970's era computers because of
the "extreme"
O (
N 2) memory requirement.)
Myers/Miller reduces the memory complexity of a dynamic
programming sequence alignment algorithm from
O (
N 2) to
O (
N ), at the cost of a roughly
two-fold increase in CPU time. Here I show that a divide
and conquer strategy can also be applied to the RNA
structural alignment problem, greatly reducing the memory
requirement of SCFG alignments and making optimal
structural alignment of large RNAs possible.
I will strictly be dealing with the problem of aligning
a target sequence of unknown secondary structure to a query
of known RNA structure. By "secondary structure" I mean
nested (nonpseudoknotted) pairwise RNA secondary structure
interactions, primarily Watson-Crick base pairs but also
permitting noncanonical base pairs. This RNA structural
alignment problem is different from the problem of aligning
two known RNA secondary structures together [ 32 ] , and
from the problem of aligning two RNA sequences of unknown
structure together under a secondary structure-aware
scoring system [ 33 34 35 36 37 ] .
Algorithm
Prelude: the simpler case of sequence
alignment
The essential concepts of a divide and conquer
alignment algorithm are most easily understood for the
case of linear sequence alignment [ 30 31 ] .
Dynamic programming (DP) algorithms for sequence
alignment fill in an
N ×
M DP matrix of scores
F(i,j) for two sequences of lengths
N and
M (
N <
M ) [ 38 39 ] . Each score
F(i,j) is the score of the optimal
alignment of prefix
x
1 ..
x
i
of one sequence to prefix
y
1 ..
y
j
of the other. These scores are calculated
iteratively, e.g. for global (Needleman/Wunsch)
alignment:
At the end,
F(N, M) contains the score of the
optimal alignment. The alignment itself is recovered by
tracing the individual optimal steps backwards through
the matrix, starting from cell (
N,M ). The algorithm is
O(NM) in both time and memory.
If we are only interested in the score, not the
alignment itself, the whole
F matrix does not have to be kept
in memory. The iterative calculation only depends on the
current and previous row of the matrix. Keeping two rows
in memory suffices (in fact, a compulsively efficient
implementation can get away with
N + 1 cells). A score-only
calculation can be done in
O (
N ) space.
The fill stage of DP alignment algorithms may be run
either forwards and backwards. We can just as easily
calculate the optimal score
B(i, j) of the best alignment of
the
suffix i + 1..
N of sequence 1 to the suffix
j + 1..
M of sequence 2, until one obtains
B(0,0), the overall optimal score - the same number as
F(N,M).
The sum of
F(i,j) and
B(i,j) at any cell in the optimal
path through the DP matrix is also the optimal overall
alignment score. More generally,
F(i,j) + B(i,j) at any cell (
i,j ) is the score of the best
alignment that uses that cell. Therefore, since we know
the optimal alignment must pass through any given row
i somewhere, we can pick some row
i in the middle of sequence
x, run the forward calculation to
i to obtain row
F(i), run the backwards calculation
back to
i to get row
B(i), and then find argmax
j
F (
i ,
j )+
B (
i ,
j ). Now I know the optimal
alignment passes through cell (
i,j ). (For clarity, I am leaving
out details of how indels and local alignments are
handled.)
This divides the alignment into two smaller alignment
problems, and these smaller problems can themselves be
subdivided by the same trick. Thus, the complete optimal
alignment can be found by a recursive series of split
point calculations. Although this seems laborious - each
calculation is giving us only a single point in the
alignment - if we choose our split row
i to be in the middle, the size of
the two smaller DP problems is decreased by about 4-fold
at each split. A complete alignment thus costs only about
times as much CPU time as doing the alignment in a single
DP matrix calculation, but the algorithm is
O (
N ) in memory.
A standard dynamic programming alignment algorithm for
SCFGs is the Cocke-Younger-Kasami (CYK) algorithm, which
finds an optimal parse tree (e.g. alignment) for a model
and a sequence [ 24 40 41 42 ] . (CYK is usually
described in the literature as a dynamic programming
recognition algorithm for nonstochastic CFGs in Chomsky
normal form, rather than as a dynamic programming parsing
algorithm for SCFGs in any form. The use of the name
"CYK" here is therefore a little imprecise [ 24 ] .) CYK
can be run in a memory-saving "score only" mode. The DP
matrix for CYK can also be filled in two opposite
directions - either "inside" or "outside", analogous to
forward and backward DP matrix fills for linear sequence
alignment. I will refer to these algorithms as CYK/inside
and CYK/outside (or just inside and outside), but readers
familiar with SCFG algorithms should not confuse them
with the SCFG Inside and Outside algorithms [ 43 44 ]
which sum over all possible parse trees rather than
finding one optimal parse tree. I am always talking about
the CYK algorithm in this paper, and by "inside" and
"outside" I am only referring generically to the
direction of the CYK DP calculation.
The CYK/inside and CYK/outside algorithms are not as
nicely symmetrical as the forward and backward DP fills
are in sequence alignment algorithms. The splitting
procedure that one obtains does not generate identical
types of subproblems, so the divide and conquer procedure
for SCFG-based RNA alignment is not as obvious.
Definition and construction of a covariance
model
Definition of a stochastic context free
grammar
A stochastic context free grammar (SCFG) consists of
the following:
•
M different nonterminals (here
called
states ). I will use capital
letters to refer to specific nonterminals;
V and
Y will be used to refer
generically to unspecified nonterminals.
•
K different terminal symbols
(e.g. the observable alphabet, a,c,g,u for RNA). I will
use small letters
a, b to refer generically to
terminal symbols.
• a number of
production rules of the form:
V → γ, where γ can be any string
of nonterminal and/or terminal symbols, including (as a
special case) the empty string ε.
• Each production rule is associated with a
probability, such that the sum of the production
probabilities for any given nonterminal
V is equal to 1.
SCFG productions allowed in covariance
models
A covariance model is a specific repetitive "profile
SCFG" architecture consisting of groups of model states
that are associated with base pairs and single-stranded
positions in an RNA secondary structure consensus. A
covariance model has seven types of states and
production rules (Table 1).
Each overall production probability is the
independent product of an emission probability
e
v
and a transition probability
t
v
, both of which are position-dependent parameters
that depend on the state
v (analogous to hidden Markov
models). For example, a particular pair (P) state
v produces two correlated letters
a and
b (e.g. one of 16 possible base
pairs) with probability
e
v
(
a ,
b ) and transits to one of
several possible new states
Y of various types with
probability
t
v
(
Y ). A bifurcation (B) state
splits into two new start (
S ) states with probability 1.
The E state is a special case ε production that
terminates a derivation.
A CM consists of many states of these seven basic
types, each with its own emission and transition
probability distributions, and its own set of states
that it can transition to. Consensus base pairs will be
modeled by P states, consensus single stranded residues
by L and R states, insertions relative to the consensus
by more L and R states, deletions relative to consensus
by D states, and the branching topology of the RNA
secondary structure by B, S, and E states. The
procedure for starting from an input multiple alignment
and determining how many states, what types of states,
and how they are interconnected by transition
probabilities is described next.
From consensus structural alignment to guide
tree
Figure 1shows an example input file: a multiple
sequence alignment of homologous RNAs, with a line
describing the consensus RNA secondary structure. The
first step of building a CM is to produce a binary
guide tree of nodes representing
the consensus secondary structure. The guide tree is a
parse tree for the consensus structure, with nodes as
nonterminals and alignment columns as terminals.
The guide tree has eight types of nodes (Table
2).
These consensus node types correspond closely with a
CM's final state types. Each node will eventually
contain one or more states. The guide tree deals with
the consensus structure. For individual sequences, we
will need to deal with insertions and deletions with
respect to this consensus. The guide tree is the
skeleton on which we will organize the CM. For example,
a MATP node will contain a P-type state to model a
consensus base pair; but it will also contain several
other states to model infrequent insertions and
deletions at or adjacent to this pair.
The input alignment is first used to construct a
consensus secondary structure (Figure 2) that defines
which aligned columns will be ignored as non-consensus
(and later modeled as insertions relative to the
consensus), and which consensus alignment columns are
base-paired to each other. Here I assume that both the
structural annotation and the labeling of insert versus
consensus columns is given in the input file, as shown
in the line marked "[structure]" in the alignment in
Figure 1. Alternatively, automatic methods might be
employed. A consensus structure could be predicted from
comparative analysis of the alignment [ 22 45 46 ] .
The consensus columns could be chosen as those columns
with less than a certain fraction of gap symbols, or by
a maximum likelihood criterion, as is done for profile
HMM construction [ 18 24 ] .
Given the consensus structure, consensus base pairs
are assigned to MATP nodes and consensus unpaired
columns are assigned to MATL or MATR nodes. One ROOT
node is used at the head of the tree. Multifurcation
loops and/or multiple stems are dealt with by assigning
one or more BIF nodes that branch to subtrees starting
with BEGL or BEGR head nodes. (ROOT, BEGL, and BEGR
start nodes are labeled differently because they will
be expanded to different groups of states; this has to
do with avoiding ambiguous parse trees for individual
sequences, as described below.) Alignment columns that
are considered to be insertions relative to the
consensus structure are ignored at this stage.
In general there will be more than one possible
guide tree for any given consensus structure. Almost
all of this ambiguity is eliminated by three
conventions: (1) MATL nodes are always used instead of
MATR nodes where possible, for instance in hairpin
loops; (2) in describing interior loops, MATL nodes are
used before MATR nodes; and (3) BIF nodes are only
invoked where necessary to explain branching secondary
structure stems (as opposed to unnecessarily
bifurcating in single stranded sequence). One source of
ambiguity remains. In invoking a bifurcation to explain
alignment columns
i..j by two substructures on
columns
i..k and
k + 1..
j, there will be more than one
possible choice of
k if
i..j is a multifurcation loop
containing three or more stems. The choice of
k impacts the performance of the
divide and conquer algorithm; for optimal time
performance, we will want bifurcations to split into
roughly equal sized alignment problems, so I choose the
k that makes
i..k and
k + 1..
j as close to the same length as
possible.
The result of this procedure is the guide tree. The
nodes of the guide tree are numbered in preorder
traversal (e.g. a recursion of "number the current
node, visit its left child, visit its right child":
thus parent nodes always have lower indices than their
children). The guide tree corresponding to the input
multiple alignment in Figure 1is shown in Figure 2.
From guide tree to covariance model
A CM must deal with insertions and deletions in
individual sequences relative to the consensus
structure. For example, for a consensus base pair,
either partner may be deleted leaving a single unpaired
residue, or the pair may be entirely deleted;
additionally, there may be inserted nonconsensus
residues between this pair and the next pair in the
stem. Accordingly, each node in the master tree is
expanded into one or more
states in the CM as follows
(Table 3)
Here we distinguish between consensus ("M", for
"match") states and insert ("I") states. ML and IL, for
example, are both L type states with L type
productions, but they will have slightly different
properties, as described below.
The states are grouped into a
split set of 1-4 states (shown in
brackets above) and an
insert set of 0-2 insert states.
The split set includes the main consensus state, which
by convention is first. One and only one of the states
in the split set must be visited in every parse tree
(and this fact will be exploited by the divide and
conquer algorithm). The insert state(s) are not
obligately visited, and they have self-transitions, so
they will be visited zero or more times in any given
parse tree.
State transitions are then assigned as follows. For
bifurcation nodes, the B state makes obligate
transitions to the S states of the child BEGL and BEGR
nodes. For other nodes, each state in a split set has a
possible transition to every insert state in the
same node, and to every state in
the split set of the
next node. An IL state makes a
transition to itself, to the IR state in the same node
(if present), and to every state in the split set of
the next node. An IR state makes a transition to itself
and to every state in the split set of the next
node.
This arrangement of transitions guarantees that
(given the guide tree) there is unambiguously one and
only one parse tree for any given individual structure.
This is important. The algorithm will find a maximum
likelihood parse tree for a given sequence, and we wish
to interpret this result as a maximum likelihood
structure, so there must be a one to one relationship
between parse trees and secondary structures [ 47 ]
.
The final CM is an array of
M states, connected as a directed
graph by transitions
t
v
(
y ) (or probability 1 transitions
v → (
y,z ) for bifurcations) with the
states numbered such that (
y,z ) ≥
v. There are no cycles in the
directed graph other than cycles of length one (e.g.
the self-transitions of the insert states). We can
think of the CM as an array of states in which all
transition dependencies run in one direction; we can do
an iterative dynamic programming calculation through
the model states starting with the last numbered end
state
M and ending in the root state 1.
An example CM, corresponding to the input alignment of
Figure 1, is shown in Figure 3.
As a convenient side effect of the construction
procedure, it is guaranteed that the transitions from
any state are to a
contiguous set of child states,
so the transitions for state
v may be kept as an offset and a
count. For example, in Figure 3, state 12 (an MP)
connects to states 16, 17, 18, 19, 20, and 21. We can
store this as an offset of 4 to the first connected
state, and a total count of 6 connected states. We know
that the offset is the distance to the next non-split
state in the current node; we also know that the count
is equal to the number of insert states in the current
node, plus the number of split set states in the next
node. These properties make establishing the
connectivity of the CM trivial. Similarly, all the
parents of any given state are also contiguously
numbered, and can be determined analogously. We are
also guaranteed that the states in a split set are
numbered contiguously. This contiguity is exploited by
the divide and conquer implementation.
Parameterization
Using the guide tree and the final CM, each
individual sequence in the input multiple alignment can
be converted unambiguously to a CM parse tree, as shown
in Figure 4. Counts for observed state transitions and
singlet/pair emissions are then collected from these
parse trees. The observed counts are converted to
transition and emission probabilities by standard
procedures. I calculate maximum a posteriori
parameters, using Dirichlet priors.
Comparison to profile HMMs
The relationship between an SCFG and a covariance
model is analogous to the relationship of hidden Markov
models (HMMs) and profile HMMs for modeling multiple
sequence alignments [ 18 19 24 ] . A comparison may be
instructive to readers familiar with profile HMMs. A
profile HMM is a repetitive HMM architecture that
associates each consensus column of a multiple
alignment with a single type of model node - a MATL
node, in the above notation. Each node contains a
"match", "delete", and "insert" HMM state - ML, IL, and
D states, in the above notation. The profile HMM also
has special begin and end states. Profile HMMs could
therefore be thought of as a special case of CMs. An
unstructured RNA multiple alignment would be modeled by
a guide tree of all MATL nodes, and converted to an
unbifurcated CM that would essentially be identical to
a profile HMM. (The only difference is trivial; the CM
root node includes a IR state, whereas the start node
of a profile HMM does not.) All the other node types
(especially MATP, MATR, and BIF) and state types (e.g.
MP, MR, IR, and B) are SCFG augmentations necessary to
extend profile HMMs to deal with RNA secondary
structure.
The SCFG inside and outside algorithms are analogous
to the Forward and Backward algorithms for HMMs [ 24 48
] . The CYK/inside parsing algorithm is analogous to
the Viterbi HMM alignment algorithm run in the forward
direction. CYK/outside is analogous to a Viterbi DP
algorithm run in the backwards direction.
Divide and conquer algorithm
Notation
I use
r, v, w, y, and
z as indices of states in the
model, where
r ≤ (
v ,
w ,
y ) ≤
z. These indices will range from
1..
M , for a CM
G that contains
M states. refers to a subgraph of
the model, rooted at state
r and ending at state
z, for a contiguous set of states
r..z. G
r , without a subscript, refers
to a subgraph of the model rooted at state
r and ending at the highest
numbered E state descendant from state
r. The complete model is , or
G 1, or just
G.
S
v
refers to the
type of state
v ; it will be one of seven types
{D,P,L,R,S,E,B}.
C
v
is a list of children for state
v (e.g. the states that
v can transit to); it will
contain up to six contiguous indices
y with
v ≤
y ≤
M .
P
v
is a list of parents for state
v (states that could have
transited to state
v ); it will contain up to six
contiguous indices
y with 1 ≤
y ≤
v. (
P
v
parent lists should not be confused with P state
types.)
I use
g, h, i, j, k, p, and
q as indices referring to
positions in a sequence
x, where
g ≤
h ≤
p ≤
q and
i ≤
j for all subsequences of nonzero
length. These indices range from 1..
L, for a sequence of length
L . Some algorithms will also use
d to refer to a subsequence
length, where
d =
j -
i + 1 for a subsequence
x
i
..
x
j
.
The algorithms will have to account for subsequences
of zero length (because of deletions). By convention,
these will be in the off-diagonal where
j =
i - 1 or
i =
j + 1. This special case (usually
an initialization condition) is the reason for the
qualification that
i ≤
j for subsequences of
nonzero length.
The CYK/inside algorithm calculates a
three-dimensional matrix of numbers α
v
(
i ,
j ), and CYK/outside calculates
numbers β
v
(
i ,
j ). I will refer to
v (state indices) as
deck coordinates in the
three-dimensional matrices, whereas
j and
i (sequence positions) are row
and column coordinates within each deck. α
v
and β
v
refer to whole two-dimensional decks containing
scores α
v
(
i ,
j ) and β
v
(
i ,
j ) for a particular state
v. The dividing and conquering
will be done in the
v dimension, by choosing
particular decks as split points.
The CYK/inside algorithm
The CYK/inside algorithm iteratively calculates α
v
(
i ,
j ) - the log probability of the
most likely CM parse subtree rooted at state
v that generates subsequence
x
i
..
x
j
of sequence
x. The calculation initializes at
the smallest subgraphs and subsequences (e.g. subgraphs
rooted at E states, generating subsequences of length
0), and iterates outwards to progessively longer
subsequences and larger CM subgraphs.
For example, if we're calculating α
v
(
i ,
j ) and
S
v
= P (that is,
v is a pair state),
v will generate the pair
x
i
,
x
j
and transit to a new state
y (one of its possible
transitions
C
v
) which then will have to account for the smaller
subsequence
x
i +1 ..
x
j -1 . The log probability
for a particular choice of next state
y is the sum of three terms: an
emission term log
e
v
(
x
i
,
x
j
), a transition term log
t
v
(
y ), and an already calculated
solution for the smaller optimal parse tree rooted at
y , α
y
(
i + 1,
j - 1). The answer for α
v
(
i ,
j ) is the maximum over all
possible choices of child states
y that
v can transit to.
The algorithm INSIDE is as follows:
Input: A CM subgraph and
subsequence
x
g
..
x
q
.
Output: Scoring matrix decks α
r
..α
z
.
INSIDE(r,z; g,q)
for
v ←
z
down to
r
for
j ←
g - 1
to
q
for
i ←
j + 1
down to
g
d ←
j -
i + 1
if
S
v
= D or S:
α
v
(
i, j ) = [α
y
(
i, j ) + log
t
v
(
y )]
else if
S
v
= P and
d ≥ 2:
α
v
(
i, j ) = log
e
v
(
x
i
,
x
j
) + [α
y
(
i + 1,
j - 1) + log
t
v
(
y )]
else if
S
v
= L and
d ≥ 1:
α
v
(
i, j ) = log
e
v
(
x
i
) + [α
y
(
i + 1,
j ) + log
t
v
(
y )]
else if
S
v
= R and
d ≥ 1:
α
v
(
i, j ) = log
e
v
(
x
j
) + [α
y
(
i ,
j - 1) + log
t
v
(
y )]
else if
S
v
= B:
(
y ,
z ) ← left and right S children
of state
v
α
v
(
i, j ) = [α
y
(
i, k ) + α
z
(
k + 1,
j )]
else if
S
v
= E and
d = 0:
α
v
(
i ,
j ) = 0 (initializations)
else
α
v
(
i ,
j ) = -∞ (initializations)
Given a sequence
x of length
L and a CM
G of length
M , we could call INSIDE (1, M;
1, L) to align the whole model (states 1..
M ) to the whole sequence (
x
1 ..
x
L
). When INSIDE returns, α
1 (1,
L ) would contain the log
probability of the best parse of the complete sequence
with the complete model.
We do not have to keep the entire α
three-dimensional matrix in memory to calculate these
scores. As we reach higher decks α
v
in the three dimensional dynamic programming
matrix, our calculations no longer depend on certain
lower decks. A lower deck
y can be deallocated whenever all
the parent decks
P
y
that depend on it have been calculated. (The
implementation goes even further and recycles decks
when possible, saving some initialization steps and
many memory allocation calls; for example, since values
in all E decks are identical, only one E deck needs to
be calculated and that precalculated deck can be reused
whenever
S
v
=
E .)
This deallocation rule has an important property
that the divide and conquer algorithm takes advantage
of when solving smaller subproblems for CM subgraphs
rooted at some state
w. When the root state
w is an S state, the α matrix
returned by INSIDE contains only one active deck α
w
. (No lower state >
w can be reached from any state
<
w without going through
w, so all lower decks are
deallocated once deck
w is completed.) When the root
state
w is the first state in a split
set
w..y (see below for more
explanation), all (and only) the decks α
w
..α
y
are active when INSIDE returns.
In some cases we want to recover the optimal parse
tree itself, not just its score. The INSIDE τroutine is
a modified version of INSIDE. It keeps an additional
"shadow matrix" τ
v
(
i,j ). A τ
v
(
i,j ) traceback pointer either
records the index
y that maximized α
v
(
i,j ) (for state types D,S,P,L,R)
or records the split point
k that maximized α
v
(
i,j ) for a bifurcation (B)
state. The τ shadow matrix does not use the
deallocation rules - INSIDE τcan only be called for
problems small enough that they can be solved within
our available memory space. Thus the INSIDE τroutine
works by calling INSIDE in a mode that also keeps a
shadow matrix τ, and then calls a recursive traceback
starting with
v, i, j :
Input: A shadow matrix τ for CM
subgraph
G
v rooted at state
v, and subsequence
x
i
..
x
j
.
Output: An optimal parse tree
T .
TRACEBACK(v,i,j)
if
S
v
=
E :
attach
v
else if
S
v
=
S or D:
attach
v
TRACEBACK(τ
v
(
i ,
j ),
i ,
j ))
else if
S
v
= P:
attach
x
i
,
v ,
x
j
TRACEBACK(τ
v
(
i ,
j ),
i + 1,
j - 1)
else if
S
v
= L:
attach
x
i
,
v
TRACEBACK(τ
v
(
i ,
j ),
i + 1,
j )
else if
S
v
= R:
attach
v ,
x
j
TRACEBACK(τ
v
(
i ,
j ),
i ,
j - 1)
else if
S
v
= B:
(
y ,
z ) ← left and right S children
of state
v
attach
v
TRACEBACK(
y ,
i , τ
v
(
i ,
j ))
TRACEBACK(
z , τ
v
(
i ,
j ) + 1,
j )
The CYK/outside algorithm
The CYK/outside algorithm iteratively calculates β
v
(
i ,
j ), the log probability of the
most likely CM parse tree for a CM generating a
sequence
x
1 ..
x
L
excluding the optimal parse
subtree rooted at state
v that accounts for the
subsequence
x
i
..
x
j
. The calculation initializes with the entire
sequence excluded (e.g. β
1 (1,
L ) = 0), and iterates inward to
progressively shorter and shorter excluded subsequences
and smaller CM subgraphs.
A complete implementation of the CYK/outside
algorithm requires first calculating the CYK/inside
matrix α because it is needed to calculate β
v
when the parent of
v is a bifurcation [ 24 43 44 ] .
However, the divide and conquer algorithm described
here only calls OUTSIDE on
unbifurcated, linear CM
subgraphs (only the final state
z may be a B state; there are no
internal bifurcations that lead to branches in the
model). Thus the parent of a state
v is never a bifurcation, and the
implementation can therefore be streamlined as
follows:
Input: An unbifurcated CM subgraph
and subsequence
x
g
..
x
q
.
Output: Scoring matrix decks β
r
..β
z
.
OUTSIDE(r,z; g,q)
β
v
(
i ,
j ) ← - ∞ ∀
v ,
i ,
j
β
r
(
g ,
q ) ← 0
for
v ←
r + 1
to
z
for
j ←
q
down to
g - 1
for
i ←
g
to
j + 1
As with INSIDE, we do not keep the entire β matrix
in memory. A deck β
v
can be deallocated when all child decks
C
v
that depend on the values in β
v
have been calculated. This means that if the last
deck
z is a bifurcation or end state,
β
z
will be the only active allocated deck when
OUTSIDE returns. If
z is the last state in a split
set
w..z, all (and only) the split
set decks β
w
..β
z
will be active when OUTSIDE returns.
Using CYK/inside and CYK/outside to divide and
conquer
Now, for any chosen state
v, argmax
i ,
j [α
v
(
i, j ) + β
v
(
,i,j )] tells us which cell
v, i, j the optimal parse tree
passes through, conditional on using state
v in the parse. We know that any
parse tree must include all the bifurcation and start
states of the CM, so we know that the optimal alignment
must use any chosen bifurcation
state
v and its child start states
w and
y. Thus, we are guaranteed that
(when
S
v
= B and
C
v
=
w, y ):
is the optimal overall alignment score, and we also
know that
gives us a triplet that identifies three cells that
must be in the optimal alignment - (
v, i, j ), (
w, i, k ), and (
y, k + 1,
j ). This splits the remaining
problem into three smaller subproblems - an alignment
of the sequence
x
i
..
x
k
to a CM subgraph
w..y - 1, an alignment of the
sequence
x
k +1 ..
x
j
to a CM subgraph
y..M , and an alignment of the
two-piece sequence
x
1 ..
z
i -1 //
x
j +1 ..
x
L
to a CM subgraph 1..
v .
The subproblems are then themselves split, and this
splitting can continue recursively until all the
bifurcation triplets on the optimal parse tree have
been determined.
At this point the remaining alignment subproblems
might be small enough to be solved by straightforward
application of the standard CYK/inside algorithm (e.g.
INSIDE τ). However, this is not guaranteed to be the
case. A more general division strategy is needed that
does not depend on splitting at bifurcations.
For the more general strategy we take advantage of
the fact that we know that the optimal parse tree must
also include one and only one state from the split set
of each node (e.g. the non-insert states in the node).
Let
w..y be the indices of a split
set of states in the middle of the current model
subgraph. (
w..y can be at most 4 states.) We
know that
gives us a new cell (
v, i, j ) in the optimal parse
tree, and splits the problem into two smaller problems.
This strategy can be applied recursively all the way
down to single nodes, if necessary. We can therefore
guarantee that we will never need to carry out a full
CYK/inside alignment algorithm on any subproblem. The
most memory-intensive alignment problem that needs to
be solved is the very first split. The properties of
the first split determine the memory complexity of the
algorithm.
The bifurcation-dependent strategy is a special case
of this more general splitting strategy, where the B
state is the only member of its split set, and where we
also take advantage of the fact that α
v
(
i ,
j ) = max
k
α
w
(
i, k ) + α
y
(
k + 1,
j ). By carrying out the max
k
operation during the split, rather than before, we
can split the current problem into three optimal pieces
instead of just two.
If we look at the consequences of these splitting
strategies, we see we will have to deal with three
types of problems (Figure 5):
• A
generic problem means finding the
optimal alignment of a CM subgraph to a contiguous
subsequence
x
g
..
x
q
. The subgraph corresponds to a complete subtree
of the CM's guide tree - e.g. state
r is a start (S), and state
z is an end (E). may contain
bifurcations. The problem is solved in one of two ways.
If contains no bifurcations, it is solved as a wedge
problem (see below). Else, the problem is subdivided by
the bifurcation-dependent strategy: an optimal triple
(i,
k, j ) is found for a bifurcation
state
v and its children
w, y, splitting the problem into
a V problem and two generic problems.
• A
wedge problem means finding the
optimal alignment of an
unbifurcated CM subgraph to a
contiguous subsequence
x
g
..
x
q
. State
z does not have to be a start
state (S); it may be a state in a split set (MP, ML,
MR, or D). State
z is an end (E). A wedge problem
is solved by the split set-dependent strategy: an
optimal (
v, i, j ) is found, splitting the
problem into a V problem and a smaller wedge
problem.
• A
V problem consists of finding the
optimal alignment of an
unbifurcated CM subgraph to a
noncontiguous, two-piece sequence
x
g
..
x
h
//
x
p
..
x
g
, exclusive of the residues
x
h
and
x
p
(open circles in Figure 5). State
r can be a start state or any
state in a split set; the same is true for
z. A V problem is solved by a
split set-dependent strategy: an optimal (
v, i, j ) is found, splitting the
problem into two V problems.
The three recursive splitting algorithms to solve
these problems are as follows:
The generic_splitter routine
Input: A generic problem, for CM
subgraph and subsequence
x
g ..
q .
Output: An optimal parse subtree
T .
GENERIC_SPLITTER(r, z; g,q)
if no bifurcation in :
return WEDGE_SPLITTER(r,z; g,q)
else
v ← lowest numbered bifurcation
state in subgraph .
w,y ← left and right S children
of
v.
β
v
← OUTSIDE(r,w; g,q)
α
w
← INSIDE(w,y-1; g,q)
α
y
← INSIDE(y,z; g,q)
T
1 ← V_SPLITTER(r,v; g,i;
j,q)
T
2 ← GENERIC_SPLITTER(w,y-1;
i,k)
T
3 ← GENERIC_SPLITTER(y,z;
k+l,j)
Attach S state
w of
T
2 as left child of B state
v in
T
1 .
Attach S state
y of
T
3 as right child of B state
v in
T
2 .
return T
1 .
The wedge_splitter routine
Input: A wedge problem, for
unbifurcated CM subgraph and subsequence
x
g ..
q .
Output: An optimal parse subtree
T .
WEDGE_SPLITTER(r,z; g,q)
(
w ..
y ) ← a split set chosen from
middle of
(α
w
..α
y
) ← INSIDE(w,z; g,q)
(β
w
..β
y
) ← OUTSIDE(r,y; g,q)
T
1 ← V_SPLITTER(r,v; g,i;
j,q)
T
2 ← WEDGE_SPLITTER(v,z;
i,j)
Attach
T
2 to
T
1 by merging at state
v.
return T
1 .
The V_splitter routine
Input: A V problem, for
unbifurcated CM subgraph and two-part subsequence
x
g ..
h //
x
p ..
q .
Output: An optimal parse subtree
T .
V_SPLITTER(r,z; g,h; p,q)
(
w..y ) ← a split set chosen from
middle of
(α
w
..α
y
) ← VINSIDE(w,z; g,h; p,q)
(β
w
..β
y
) ← VOUTSlDE(r,y; g,h; p,q)
T
1 ← V_SPLITTER(r,v; g,i;
j,q)
T
2 ← V_SPLITTER(v,z; i,h;
p,j)
Attach
T
2 to
T
1 by merging at state
v.
return T
1 .
The vinside and voutside routines
The VINSIDE and VOUTSIDE routines are just INSIDE
and OUTSIDE, modified to deal with a two-piece
subsequence
x
g
..
x
h
//
x
p
..
x
g
instead of a contiguous sequence
x
g
..
x
q
. These modifications are fairly obvious. The
range of
i, j is restricted so that
i ≤
h and
j ≥
p. Also, VINSIDE (w, z; g, h; p,
q) initializes α
z
(
h ,
p ) = 0: that is, we know that
sequence
x
h
..
x
p
has already been accounted for by a CM parse tree
rooted at
z.
Implementation
In the description of the algorithms above, some
technical detail has been omitted - in particular, a
detailed description of efficient initialization steps,
and details of how the the dynamic programming matrices
are laid out in memory. These details are not necessary
for a high level understanding of the divide and
conquer algorithm. However, they may be necessary for
reproducing a working implementation. Commented ANSI/C
source code for a reference implementation is therefore
freely available at
http://www.genetics.wustl.edu/eddy/infernal/under a GNU
General Public License. This code has been tested on
GNU/Linux platforms.
In this codebase, the CM data structure is defined
in structs.h. The CM construction procedure is in
modelmaker. c:Handmodelmaker(). The guide tree is
constructed in HandModelmaker(). A CM is constructed
from the guide tree by cm_from_master(). Individual
parse trees are constructed using the guide tree by
transmogrify(). The divide and conquer algorithm is
implemented in smallcyk. c:CYKDivideAndConquer(), which
will recursively call a set of functions: the three
splitting routines GENERIC_SPLITTER(),
wedge_splitter(), and v_splitter(); the four alignment
engines INSIDE(), OUTSIDE(), VINSIDE(), and VOUTSIDE();
and the two traceback routines INSIDET() and
VINSIDET().
Results and discussion
Memory complexity analysis
The memory complexity of normal CYK/inside is
O (
N 2
M ), for a model of
M states and a query sequence of
N residues, since the full 3D
dynamic programming matrix is indexed
N ×
N ×
M (and since
N ∝
M , we can alternatively state the
upper bound as
O (
N 3)). The memory complexity of the
divide and conquer algorithm is
O (
N 2log
M ). The analysis that leads to
this conclusion is as follows.
For a model with no bifurcations, the divide and
conquer algorithm will never require more than 10 decks
in memory at once. In the case of two adjacent MATP
nodes, we will need six decks to store the scores for the
current node we're calculating, and four decks for the
split set of the adjacent node that we're connecting to
(and dependent upon) (Figure 3).
Bifurcations will require some number of additional
decks for start states (BEGL_S and BEGR_S) to be kept. In
INSIDE, whenever we reach a deck for a start state, we
will keep that deck in memory until we reach the parent
bifurcation state. Half the time, that will mean waiting
until another complete subgraph of the model is
calculated (e.g. the subgraph rooted at the other start
child of that bifurcation); that is, to calculate deck α
v
for a bifurcation
v, we need both decks α
w
and α
y
for its child start states
w and
y, so we have to hold on to α
y
until we reach α
w
. In turn, the subgraph rooted at
w might contain bifurcations, so
our calculation of α
w
might require additional decks to be kept. Each
start deck we reach in the INSIDE iteration means holding
one extra deck in memory, and each bifurcation we reach
means deallocating the two start decks it depends on;
therefore we can iteratively calculate the maximum number
of extra decks we will require:
x
M
← -1
for v ←
M -1
to 1
return max
v
x
v
This number depends on the topology and order of
evaluation of the states in the CM. Think of the
bifurcating structure of the CM as a binary tree numbered
in preorder traversal (e.g. left children are visited
first, and have lower indices than right children). If
this is a complete balanced tree with
B bifurcations, we will need log
2
B extra decks. If it is a maximally
unbalanced tree in which bifurcations only occur in left
children, we will need
B extra decks (all the right
children). If it is a maximally unbalanced tree in which
bifurcations only occur in right children, we will only
ever need 1 extra deck. A left-unbalanced binary tree can
be converted to a right-unbalanced binary tree just by
swapping branches. For a CM, we can't swap branches
without affecting the order of the sequence that's
generated. We can, however, get the same effect by
renumbering the CM states in a modified preorder
traversal. Instead of always visiting the left subtree
first, we visit the best subtree first, where "best"
means the choice that will optimize memory usage. This
reordering is readily calculated in
O (
M ) time (not shown; see cm.
c:CMRebalance() in the implementation). This way, we can
never do worse than the balanced case, and we will often
do better. We never need more than log
2
B extra decks. Since
B <
M, we can guarantee a
O (
N 2log
M ) bound on memory complexity.
Time complexity analysis
The time complexity of the standard algorithm is
O (
MN 2+
BN 3), for a model of
M states (
B of which are bifurcations)
aligned to a sequence of
N residues. Since
B <
M, and
M ∝
N, we can also state the upper
bound as
O (
MN 3) or
O (
N 4).
The time complexity of the divide and conquer
algorithm depends on how close each split is to dividing
a problem into equal sized subproblems. In the most ideal
case, each call to GENERIC_SPLITTER could split into
three subproblems that each contained 1/3 of the states
and residues: splitting those three subproblems would
only cost:
3 × ×
MN 3
in time, e.g. only about 1/27 the time it took to
split the first problem. Thus in an ideal case the time
requirement is almost completely dominated by the first
split, and the extra time required to do the complete
divide and conquer algorithm could be negligible. In
pathological cases, optimal splits might lead to a series
of very unequally sized problems. We never need to do
more splits than there are states in the model, so we
cannot do worse than
O (
M 2
N 3) in time.
An example of a pathological case is an RNA structure
composed of a series of multifurcation loops such that
each bifurcation leads to a small stem on one side, and
the rest of the structure on the other. In such a case,
every call to GENERIC_SPLITTER will split into a small
subproblem containing a small stem (e.g. only removing a
constant number of states and residues per split) and a
large subproblem containing all the remaining states and
sequence. This case can be avoided. It only arises
because of the decision to implement a simplified
CYK/Outside algorithm and always split at the highest
bifurcations. Better time performance could be guaranteed
if a complete CYK/Outside algorithm were implemented (at
the cost of complexity in the description and
implementation of the algorithm). This would allow us to
choose a split point in a generic problem at any state in
the CM (for instance, in the middle) regardless of its
bifurcating structure.
In practice, empirical results on a variety of real
RNAs (see below) indicate that the extra time required to
do the divide and conquer is a small constant factor. A
more complex implementation does not seem to be
necessary.
Empirical results
Six structural RNA families were chosen for empirical
evaluations of the algorithm, using available RNA
database resources - tRNA [ 49 ] , 5S ribosomal RNA [ 50
] , signal recognition particle (SRP) RNA [ 51 ] , RNase
P [ 52 ] , small subunit (SSU) ribosomal RNA [ 53 ] , and
large subunit (LSU) ribosomal RNA [ 54 ] . For each
family, a secondary structure annotated multiple
alignment of four or five example sequences was
extracted, and used to construct a CM.
The top of Table 4shows some statistics about these
alignments and CMs. The number of consensus columns in
the alignments ranges from 72 (tRNA) to 2898 (LSU rRNA);
about 55-60% are involved in consensus base pairs. The
number of CM states is consistently about 3-fold more
than the consensus alignment length, ranging from 230
states (tRNA) to 9023 (LSU). About 1/150 of the states in
each model are bifurcations. After optimal reordering of
the model states, the number of extra decks required by
the alignment algorithm is small, ranging up to 3 for SSU
and 5 for LSU rRNA. Therefore the minimum constant of 10
decks required in iterations across unbifurcated model
segments dominates the memory requirement. The memory
required for extra decks does not have much impact even
for the largest structural RNAs. (Even without optimal
reordering, the number of extra decks required for SSU
and LSU are only 7 and 9, respectively. State reordering
was only needed to assure a
O (
N 2log
M ) memory complexity bound.)
To determine the memory and CPU time requirements for
a structural alignment, one example sequence from each
family was aligned to the CM. CPU time was measured and
memory requirements were calculated for three algorithms:
(1) the full CYK/inside algorithm, but in memory-saving
score-only mode (e.g. INSIDE(1,M; 1,L); (2) the full
CYK/inside algorithm, with shadow matrix and traceback to
recover the optimal alignment (e.g. INSIDE τ(1,M; 1,L));
and (3) the divide and conquer algorithm to recover an
optimal alignment (e.g. GENERIC_SPLITTER (1,M; 1,L)). The
most important comparison is between the full CYK/inside
algorithm and the divide and conquer algorithm. The
score-only CYK/inside algorithm was included, because a
complete CYK alignment couldn't be done on SSU and LSU
rRNA (because of the steep memory requirement). In all
cases where comparison could be done, the scores and
alignments produced by these algorithms were verified to
be identical.
The results of these tests are shown in the bottom
half of Table 1. The memory required by divide and
conquer alignment ranges up to 271 MB for LSU rRNA,
compared to a prohibitive 151 GB for the standard CYK
algorithm. The extra CPU time required by the divide and
conquer is small; usually about 20% more, with a maximum
of about two-fold more for SRP-RNA.
The same results are plotted in Figure 6. Memory
requirements scale as expected:
N 3for standard CYK alignment, and
better than
N 2log
N for the divide and conquer
algorithm. Empirical CPU time requirements scale
similarly for the two algorithms (
N 3.24-
N 3.29). The observed performance
is better than the theoretical worst case of
O (
N 4). The proportion of extra time
required by divide and conquer is roughly constant over a
wide range of RNAs. The difference shown in Figure 6is
exaggerated because times are plotted for score-only CYK,
not complete CYK alignment, in order to include CPU times
for SSU and LSU rRNA. Because score-only CYK does not
keep a shadow traceback matrix nor perform the traceback,
it is about 20% faster than CYK alignment, as seen in the
data in Table 4.
Conclusions
The divide and conquer algorithm described here makes it
possible to align even the largest structural RNAs to
secondary structure consensus models, without exceeding the
available memory on current computational hardware. Optimal
SSU and LSU rRNA structural alignments can be performed in
70 MB and 270 MB of memory, respectively. Previous
structural alignment algorithms had to sacrifice
mathematical optimality to achieve ribosomal RNA
alignments.
The CPU time requirement of the alignment algorithm is
still significant, and even prohibitive for certain
important applications. However, CPU time is generally an
easier issue to deal with. A variety of simple
parallelization strategies are possible. Banded dynamic
programming algorithms (e.g. calculating only relevant
parts of the matrix) of various forms can also be explored,
including not only heuristic schemes, but also optimal
algorithms based on branch and bound ideas. (Properly
implemented, banded DP algorithms would also save
additional memory.)
The algorithm takes advantage of the structure of
covariance models (profile SCFGs), splitting in the
dimension of the states of the model rather than in the
sequence dimensions. The approach does not readily apply,
therefore, to unprofiled SCFG applications in RNA secondary
structure prediction [ 34 36 55 56 ] , where the states are
fewer and more fully interconnected. For these
applications, it would seem to be necessary to divide and
conquer in the sequence dimensions to obtain any
significant improvement in memory requirements, and it is
not immediately apparent how one would do this.
The current implementation of the algorithm is not
biologically useful. It is meant only as a testbed for the
algorithm. It outputs a raw traceback structure and
alignment score, not a standardly formatted alignment file.
Most importantly, the probability parameters for models are
calculated in a very quick and simple minded fashion, and
are far from being reasonable for producing robustly
accurate structural alignments. The next step along this
line is to produce good prior distributions for estimating
better parameters, by estimating mixture Dirichlet priors
from known RNA structures [ 57 ] . At this stage it would
not be meaningful to compare the biological alignment
accuracy of this implementation to (for instance) the
excellent performance of the RAGA genetic algorithm [ 27 ]
. A biologically useful implementation with accurate
alignment performance is of course the eventual goal of
this line of work, but is not the point of the present
paper.