Background
There are a growing number of RNA gene families and RNA motifs [1,2]. Many (though not all) RNAs conserve a base-paired RNA secondary structure. Computational analyses of RNA sequence families are more powerful if they take into account both primary sequence and secondary structure consensus [3,4].
Some excellent approaches have been developed for database searching with RNA secondary structure consensus patterns. Exact- and approximate-match pattern searches (analogous to PROSITE patterns for proteins) have been extended to allow patterns to specify long-range base pairing constraints [5,6]. In several cases, specialized programs have been developed to recognize specific RNA structures [4]; for example, programs exist for detecting transfer RNA genes [7,8,9], group I catalytic introns [10], and small nucleolar RNAs [11,12]. All of these approaches, though powerful, lack generality, and they require expert knowledge about each particular RNA family of interest.
In primary sequence analysis, the most useful analysis techniques are general primary sequence alignment algorithms with probabilistically based scoring systems; for example, the BLAST [13], FASTA [14], or CLUSTALW [15] algorithms, and the PAM [16] or BLOSUM [17] score matrices. Unlike specialized programs, a general alignment algorithm can be applied to find homologs of any query sequence(s). Unlike pattern searches, which give yes/no answers for whether a candidate sequence is a match, a scoring system gives a meaningful score that allows ranking candidate hits by their statistical significance. It is of interest to develop general alignment algorithms for RNA secondary structures.
The problem I consider here is as follows. I am given a multiple alignment of an RNA sequence family for which I know the consensus secondary structure. I want to search a sequence database for homologs that significantly match the sequence and structure of my query. The sequence analysis analogue is the use of profile hidden Markov models (profile HMMs) to model multiple alignments of conserved protein domains, and to discover new homologues in sequence databases [18,19]. For instance, if we had an RNA structure equivalent of the HMMER profile HMM program suite (http://hmmer.wustl.edu/), it would be possible to develop and efficiently maintain databases of conserved RNA structures and multiple alignments, analogous to the Pfam or SMART databases of conserved protein domains [20,21].
Stochastic context free grammar (SCFG) algorithms provide a general approach to RNA structure alignment [22,23,24]. SCFGs allow the strong pairwise residue correlations in non-pseudoknotted RNA secondary structure to be taken into account in RNA alignments. SCFGs can be aligned to sequences using a dynamic programming algorithm that guarantees finding a mathematically optimal solution in polynomial time. SCFG alignment algorithms can be thought of as an extension of sequence alignment algorithms (particularly those with fully probabilistic, hidden Markov model formulations) into an additional dimension necessary to deal with 2D RNA secondary structure.
While SCFGs provide a natural mathematical framework for RNA secondary structure alignment problems, SCFG algorithms have high computational complexity that has impeded their practical application. Optimal SCFG-based structural alignment of an RNA structure to a sequence costs O(N^3) memory and O(N^4) time for a sequence of length N, compared to O(N^2) memory and time for sequence alignment algorithms. (Corpet and Michot described a program that implements a different general dynamic programming algorithm for RNA alignment; their algorithm solves the same problem but even less efficiently, requiring O(N^4) memory and O(N^5) time [25].) SCFG-based alignments of small structural RNAs are feasible. Using my COVE software (http://www.genetics.wustl.edu/eddy/software#cove), transfer RNA alignments (~75 nucleotides) take about 0.2 CPU seconds and 3 MB of memory. Most genome centers now use a COVE-based search program, tRNAscan-SE, for annotating transfer RNA genes [9]. However, many larger RNAs of interest are outside the capabilities of the standard SCFG alignment algorithm. Alignment of a small subunit (SSU) ribosomal RNA sequence to the SSU rRNA consensus structure would take about 23 GB of RAM and an hour of CPU time. Applying SCFG methods to RNAs this large has required clever heuristics, such as using a precalculation of confidently predicted regions of primary sequence alignment to strongly constrain which parts of the SCFG dynamic programming matrix need to be calculated [26]. The steep memory requirement remains a significant barrier to the practicality of SCFG algorithms.
Notredame et al. pointed specifically to this problem [27]. They described RAGA, a program that uses a genetic algorithm (GA) to optimize a pairwise RNA alignment using an objective function that includes base pairing terms. Because GAs have an O(N) memory requirement, RAGA can find reasonable solutions for large RNA alignment problems, including ribosomal RNA alignments. A different memory-efficient approach has also been described [28,29]. However, both approaches are approximate and cannot guarantee a mathematically optimal solution, in contrast to the (mathematically) optimal but more expensive dynamic programming approaches.
Here, I introduce a dynamic programming solution to the problem of structural alignment of large RNAs. The central idea is a divide and conquer strategy. For linear sequence alignment, a divide and conquer algorithm was introduced by Hirschberg [30], an algorithm known in the computational biology community as the Myers/Miller algorithm [31]. (Ironically, at the time, dynamic programming methods for optimal sequence alignment were well known, but were considered impractical on 1970s-era computers because of the "extreme" O(N^2) memory requirement.) Myers/Miller reduces the memory complexity of a dynamic programming sequence alignment algorithm from O(N^2) to O(N), at the cost of a roughly two-fold increase in CPU time. Here I show that a divide and conquer strategy can also be applied to the RNA structural alignment problem, greatly reducing the memory requirement of SCFG alignments and making optimal structural alignment of large RNAs possible.
I will strictly be dealing with the problem of aligning a target sequence of unknown secondary structure to a query of known RNA structure. By "secondary structure" I mean nested (non-pseudoknotted) pairwise RNA secondary structure interactions, primarily Watson-Crick base pairs but also permitting noncanonical base pairs. This RNA structural alignment problem is different from the problem of aligning two known RNA secondary structures together [32], and from the problem of aligning two RNA sequences of unknown structure together under a secondary structure-aware scoring system [33,34,35,36,37].
Algorithm
Prelude: the simpler case of sequence alignment
The essential concepts of a divide and conquer alignment algorithm are most easily understood for the case of linear sequence alignment [30,31].
Dynamic programming (DP) algorithms for sequence alignment fill in an N × M DP matrix of scores F(i,j) for two sequences of lengths N and M (N < M) [38,39]. Each score F(i,j) is the score of the optimal alignment of prefix x_1..x_i of one sequence to prefix y_1..y_j of the other. These scores are calculated iteratively, e.g. for global (Needleman/Wunsch) alignment, with substitution scores σ(x_i, y_j) and a gap penalty γ:

  F(i,j) = max of:
    F(i-1, j-1) + σ(x_i, y_j)
    F(i-1, j) + γ
    F(i, j-1) + γ
At the end, F(N,M) contains the score of the optimal alignment. The alignment itself is recovered by tracing the individual optimal steps backwards through the matrix, starting from cell (N,M). The algorithm is O(NM) in both time and memory.
If we are only interested in the score, not the alignment itself, the whole F matrix does not have to be kept in memory. The iterative calculation only depends on the current and previous row of the matrix. Keeping two rows in memory suffices (in fact, a compulsively efficient implementation can get away with N + 1 cells). A score-only calculation can be done in O(N) space.
The fill stage of DP alignment algorithms may be run either forwards or backwards. We can just as easily calculate the optimal score B(i,j) of the best alignment of the suffix i+1..N of sequence 1 to the suffix j+1..M of sequence 2, until one obtains B(0,0), the overall optimal score: the same number as F(N,M).
The sum of F(i,j) and B(i,j) at any cell on the optimal path through the DP matrix is also the optimal overall alignment score. More generally, F(i,j) + B(i,j) at any cell (i,j) is the score of the best alignment that uses that cell. Therefore, since we know the optimal alignment must pass through any given row i somewhere, we can pick some row i in the middle of sequence x, run the forward calculation to i to obtain row F(i), run the backwards calculation back to i to get row B(i), and then find argmax_j F(i,j) + B(i,j). Now we know the optimal alignment passes through cell (i,j). (For clarity, I am leaving out details of how indels and local alignments are handled.)
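As an illustration of this split-point idea, here is a minimal Python sketch (my own, not the paper's implementation; the match/mismatch/gap scores are arbitrary assumptions):

import numpy as np

def score_row(x, y, match=1, mismatch=-1, gap=-2, reverse=False):
    # Needleman/Wunsch fill keeping only two rows of the matrix.
    # Forward: returns F(len(x), j) for j = 0..M.
    # Reverse: returns the backward scores B, in reversed order.
    if reverse:
        x, y = x[::-1], y[::-1]
    M = len(y)
    prev = np.arange(M + 1) * gap            # row 0: all-gap prefixes
    for i in range(1, len(x) + 1):
        cur = np.empty(M + 1)
        cur[0] = i * gap
        for j in range(1, M + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev

def split_point(x, y):
    # The optimal alignment must pass through row i somewhere; the
    # column j maximizing F(i,j) + B(i,j) locates one of its cells.
    i = len(x) // 2
    F = score_row(x[:i], y)                  # forward scores F(i, .)
    B = score_row(x[i:], y, reverse=True)    # backward scores B(i, .)
    j = int(np.argmax(F + B[::-1]))
    return i, j

print(split_point("GATTACA", "GCATGCA"))     # a cell on the optimal path

Each recursive call then realigns x[:i] to y[:j] and x[i:] to y[j:] the same way, using only two rows of memory at any time.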
This divides the alignment into two smaller alignment problems, and these smaller problems can themselves be subdivided by the same trick. Thus, the complete optimal alignment can be found by a recursive series of split point calculations. Although this seems laborious (each calculation is giving us only a single point in the alignment), if we choose our split row i to be in the middle, the size of the two smaller DP problems is decreased by about 4-fold at each split. A complete alignment thus costs only about twice as much CPU time as doing the alignment in a single DP matrix calculation, but the algorithm is O(N) in memory.
A standard dynamic programming alignment algorithm for SCFGs is the Cocke-Younger-Kasami (CYK) algorithm, which finds an optimal parse tree (e.g. alignment) for a model and a sequence [24,40,41,42]. (CYK is usually described in the literature as a dynamic programming recognition algorithm for nonstochastic CFGs in Chomsky normal form, rather than as a dynamic programming parsing algorithm for SCFGs in any form. The use of the name "CYK" here is therefore a little imprecise [24].) CYK can be run in a memory-saving "score only" mode. The DP matrix for CYK can also be filled in two opposite directions, either "inside" or "outside", analogous to forward and backward DP matrix fills for linear sequence alignment. I will refer to these algorithms as CYK/inside and CYK/outside (or just inside and outside), but readers familiar with SCFG algorithms should not confuse them with the SCFG Inside and Outside algorithms [43,44], which sum over all possible parse trees rather than finding one optimal parse tree. I am always talking about the CYK algorithm in this paper, and by "inside" and "outside" I am only referring generically to the direction of the CYK DP calculation.
The CYK/inside and CYK/outside algorithms are not as nicely symmetrical as the forward and backward DP fills are in sequence alignment algorithms. The splitting procedure that one obtains does not generate identical types of subproblems, so the divide and conquer procedure for SCFG-based RNA alignment is not as obvious.
Definition and construction of a covariance model
Definition of a stochastic context free grammar
A stochastic context free grammar (SCFG) consists of the following:

• M different nonterminals (here called states). I will use capital letters to refer to specific nonterminals; V and Y will be used to refer generically to unspecified nonterminals.

• K different terminal symbols (e.g. the observable alphabet, a,c,g,u for RNA). I will use small letters a, b to refer generically to terminal symbols.

• A number of production rules of the form V → γ, where γ can be any string of nonterminal and/or terminal symbols, including (as a special case) the empty string ε.

• Each production rule is associated with a probability, such that the sum of the production probabilities for any given nonterminal V is equal to 1.
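To make the definition concrete, here is a minimal Python sketch of an SCFG as a data structure (my illustration, not part of the paper; the toy grammar and its probabilities are invented):

import math

# Each nonterminal (state) maps to a list of (production, probability)
# pairs; productions are tuples of terminals (lowercase) and
# nonterminals (uppercase). Probabilities per nonterminal sum to 1.
toy_scfg = {
    'S': [(('a', 'S', 'u'), 0.4),    # S -> a S u  (emits a base pair)
          (('g', 'S', 'c'), 0.4),    # S -> g S c  (emits a base pair)
          (('L',), 0.2)],            # S -> L      (switch to a loop)
    'L': [(('a', 'L'), 0.7),         # L -> a L    (unpaired residue)
          ((), 0.3)],                # L -> epsilon (terminates)
}

for v, productions in toy_scfg.items():
    # The SCFG condition: production probabilities sum to 1 per state.
    assert abs(sum(p for _, p in productions) - 1.0) < 1e-9, v

# Log probability of the derivation S => aSu => aLu => aaLu => aau:
print(math.log(0.4) + math.log(0.2) + math.log(0.7) + math.log(0.3))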
SCFG productions allowed in covariance models
A covariance model is a specific repetitive "profile SCFG" architecture consisting of groups of model states that are associated with base pairs and single-stranded positions in an RNA secondary structure consensus. A covariance model has seven types of states and production rules (Table 1).
Each overall production probability is the independent product of an emission probability e_v and a transition probability t_v, both of which are position-dependent parameters that depend on the state v (analogous to hidden Markov models). For example, a particular pair (P) state v produces two correlated letters a and b (e.g. one of 16 possible base pairs) with probability e_v(a,b) and transits to one of several possible new states Y of various types with probability t_v(Y). A bifurcation (B) state splits into two new start (S) states with probability 1. The E state is a special case ε production that terminates a derivation.
A CM consists of many states of these seven basic types, each with its own emission and transition probability distributions, and its own set of states that it can transition to. Consensus base pairs will be modeled by P states, consensus single stranded residues by L and R states, insertions relative to the consensus by more L and R states, deletions relative to consensus by D states, and the branching topology of the RNA secondary structure by B, S, and E states. The procedure for starting from an input multiple alignment and determining how many states, what types of states, and how they are interconnected by transition probabilities is described next.
From consensus structural alignment to guide tree
Figure 1 shows an example input file: a multiple sequence alignment of homologous RNAs, with a line describing the consensus RNA secondary structure. The first step of building a CM is to produce a binary guide tree of nodes representing the consensus secondary structure. The guide tree is a parse tree for the consensus structure, with nodes as nonterminals and alignment columns as terminals. The guide tree has eight types of nodes (Table 2).
These consensus node types correspond closely with a CM's final state types. Each node will eventually contain one or more states. The guide tree deals with the consensus structure. For individual sequences, we will need to deal with insertions and deletions with respect to this consensus. The guide tree is the skeleton on which we will organize the CM. For example, a MATP node will contain a P-type state to model a consensus base pair; but it will also contain several other states to model infrequent insertions and deletions at or adjacent to this pair.
The input alignment is first used to construct a consensus secondary structure (Figure 2) that defines which aligned columns will be ignored as non-consensus (and later modeled as insertions relative to the consensus), and which consensus alignment columns are base-paired to each other. Here I assume that both the structural annotation and the labeling of insert versus consensus columns are given in the input file, as shown in the line marked "[structure]" in the alignment in Figure 1. Alternatively, automatic methods might be employed. A consensus structure could be predicted from comparative analysis of the alignment [22,45,46]. The consensus columns could be chosen as those columns with less than a certain fraction of gap symbols, or by a maximum likelihood criterion, as is done for profile HMM construction [18,24].
Given the consensus structure, consensus base pairs are assigned to MATP nodes and consensus unpaired columns are assigned to MATL or MATR nodes. One ROOT node is used at the head of the tree. Multifurcation loops and/or multiple stems are dealt with by assigning one or more BIF nodes that branch to subtrees starting with BEGL or BEGR head nodes. (ROOT, BEGL, and BEGR start nodes are labeled differently because they will be expanded to different groups of states; this has to do with avoiding ambiguous parse trees for individual sequences, as described below.) Alignment columns that are considered to be insertions relative to the consensus structure are ignored at this stage.
In general there will be more than one possible guide tree for any given consensus structure. Almost all of this ambiguity is eliminated by three conventions: (1) MATL nodes are always used instead of MATR nodes where possible, for instance in hairpin loops; (2) in describing interior loops, MATL nodes are used before MATR nodes; and (3) BIF nodes are only invoked where necessary to explain branching secondary structure stems (as opposed to unnecessarily bifurcating in single stranded sequence). One source of ambiguity remains. In invoking a bifurcation to explain alignment columns i..j by two substructures on columns i..k and k+1..j, there will be more than one possible choice of k if i..j is a multifurcation loop containing three or more stems. The choice of k impacts the performance of the divide and conquer algorithm; for optimal time performance, we will want bifurcations to split into roughly equal sized alignment problems, so I choose the k that makes i..k and k+1..j as close to the same length as possible.
The result of this procedure is the guide tree. The nodes of the guide tree are numbered in preorder traversal (e.g. a recursion of "number the current node, visit its left child, visit its right child": thus parent nodes always have lower indices than their children). The guide tree corresponding to the input multiple alignment in Figure 1 is shown in Figure 2.
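The numbering convention is simple to state in code. A minimal Python sketch (illustrative only; the node fields are assumptions, not the implementation's data structure):

def number_preorder(node, next_index=1):
    # Number the current node, then its left subtree, then its right
    # subtree, so every parent gets a lower index than its children.
    if node is None:
        return next_index
    node.index = next_index
    next_index = number_preorder(node.left, next_index + 1)
    return number_preorder(node.right, next_index)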
From guide tree to covariance model
A CM must deal with insertions and deletions in individual sequences relative to the consensus structure. For example, for a consensus base pair, either partner may be deleted leaving a single unpaired residue, or the pair may be entirely deleted; additionally, there may be inserted nonconsensus residues between this pair and the next pair in the stem. Accordingly, each node in the guide tree is expanded into one or more states in the CM as follows (Table 3).
Here we distinguish between consensus ("M", for "match") states and insert ("I") states. ML and IL, for example, are both L type states with L type productions, but they will have slightly different properties, as described below.
The states are grouped into a split set of 1-4 states (shown in brackets above) and an insert set of 0-2 insert states. The split set includes the main consensus state, which by convention is first. One and only one of the states in the split set must be visited in every parse tree (and this fact will be exploited by the divide and conquer algorithm). The insert state(s) are not obligately visited, and they have self-transitions, so they will be visited zero or more times in any given parse tree.
State transitions are then assigned as follows. For bifurcation nodes, the B state makes obligate transitions to the S states of the child BEGL and BEGR nodes. For other nodes, each state in a split set has a possible transition to every insert state in the same node, and to every state in the split set of the next node. An IL state makes a transition to itself, to the IR state in the same node (if present), and to every state in the split set of the next node. An IR state makes a transition to itself and to every state in the split set of the next node.
This arrangement of transitions guarantees that (given the guide tree) there is unambiguously one and only one parse tree for any given individual structure. This is important. The algorithm will find a maximum likelihood parse tree for a given sequence, and we wish to interpret this result as a maximum likelihood structure, so there must be a one to one relationship between parse trees and secondary structures [47].
The final CM is an array of M states, connected as a directed graph by transitions t_v(y) (or probability 1 transitions v → (y,z) for bifurcations) with the states numbered such that (y,z) ≥ v. There are no cycles in the directed graph other than cycles of length one (e.g. the self-transitions of the insert states). We can think of the CM as an array of states in which all transition dependencies run in one direction; we can do an iterative dynamic programming calculation through the model states, starting with the last numbered end state M and ending in the root state 1. An example CM, corresponding to the input alignment of Figure 1, is shown in Figure 3.
As a convenient side effect of the construction procedure, it is guaranteed that the transitions from any state are to a contiguous set of child states, so the transitions for state v may be kept as an offset and a count. For example, in Figure 3, state 12 (an MP) connects to states 16, 17, 18, 19, 20, and 21. We can store this as an offset of 4 to the first connected state, and a total count of 6 connected states. We know that the offset is the distance to the next non-split state in the current node; we also know that the count is equal to the number of insert states in the current node, plus the number of split set states in the next node. These properties make establishing the connectivity of the CM trivial. Similarly, all the parents of any given state are also contiguously numbered, and can be determined analogously. We are also guaranteed that the states in a split set are numbered contiguously. This contiguity is exploited by the divide and conquer implementation.
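As a sketch of this storage trick (my illustration; the field names are assumptions, not the layout in structs.h):

from dataclasses import dataclass

@dataclass
class StateConnectivity:
    child_offset: int    # distance from v to its lowest-numbered child
    num_children: int    # children are contiguously numbered

def children(v, conn):
    # Reconstruct the contiguous child list C_v from offset and count.
    first = v + conn.child_offset
    return list(range(first, first + conn.num_children))

# The example from Figure 3: state 12 (an MP) connects to states
# 16..21, stored as an offset of 4 and a count of 6.
assert children(12, StateConnectivity(4, 6)) == [16, 17, 18, 19, 20, 21]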
Parameterization
Using the guide tree and the final CM, each individual sequence in the input multiple alignment can be converted unambiguously to a CM parse tree, as shown in Figure 4. Counts for observed state transitions and singlet/pair emissions are then collected from these parse trees. The observed counts are converted to transition and emission probabilities by standard procedures. I calculate maximum a posteriori parameters, using Dirichlet priors.
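In practice this conversion reduces to the familiar pseudocount rule: add the Dirichlet prior's parameters to the observed counts and normalize. A minimal Python sketch (my illustration; the uniform prior here is an assumption, not the prior used by the implementation):

import numpy as np

def posterior_probabilities(counts, alpha):
    # Counts plus Dirichlet pseudocounts, normalized to sum to 1.
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

# Hypothetical pair-emission counts for one P state over the 16
# possible base pairs, with a uniform Dirichlet(1) prior:
counts = np.zeros(16)
counts[3] = 25.0      # say, 25 observed 'au' pairs
counts[9] = 15.0      # say, 15 observed 'gc' pairs
print(posterior_probabilities(counts, np.ones(16)))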
Comparison to profile HMMs
The relationship between an SCFG and a covariance model is analogous to the relationship of hidden Markov models (HMMs) and profile HMMs for modeling multiple sequence alignments [18,19,24]. A comparison may be instructive to readers familiar with profile HMMs. A profile HMM is a repetitive HMM architecture that associates each consensus column of a multiple alignment with a single type of model node: a MATL node, in the above notation. Each node contains a "match", "delete", and "insert" HMM state (ML, IL, and D states, in the above notation). The profile HMM also has special begin and end states. Profile HMMs could therefore be thought of as a special case of CMs. An unstructured RNA multiple alignment would be modeled by a guide tree of all MATL nodes, and converted to an unbifurcated CM that would essentially be identical to a profile HMM. (The only difference is trivial; the CM root node includes an IR state, whereas the start node of a profile HMM does not.) All the other node types (especially MATP, MATR, and BIF) and state types (e.g. MP, MR, IR, and B) are SCFG augmentations necessary to extend profile HMMs to deal with RNA secondary structure.
The SCFG inside and outside algorithms are analogous to the Forward and Backward algorithms for HMMs [24,48]. The CYK/inside parsing algorithm is analogous to the Viterbi HMM alignment algorithm run in the forward direction. CYK/outside is analogous to a Viterbi DP algorithm run in the backwards direction.
Divide and conquer algorithm
Notation
I use r, v, w, y, and z as indices of states in the model, where r ≤ (v, w, y) ≤ z. These indices will range from 1..M, for a CM G that contains M states. G_r^z refers to a subgraph of the model, rooted at state r and ending at state z, for a contiguous set of states r..z. G_r, without a superscript, refers to a subgraph of the model rooted at state r and ending at the highest numbered E state descendant from state r. The complete model is G_1^M, or G_1, or just G.
S_v refers to the type of state v; it will be one of seven types {D,P,L,R,S,E,B}. C_v is a list of children for state v (e.g. the states that v can transit to); it will contain up to six contiguous indices y with v ≤ y ≤ M. P_v is a list of parents for state v (states that could have transited to state v); it will contain up to six contiguous indices y with 1 ≤ y ≤ v. (P_v parent lists should not be confused with P state types.)
I use g, h, i, j, k, p, and q as indices referring to positions in a sequence x, where g ≤ h ≤ p ≤ q and i ≤ j for all subsequences of nonzero length. These indices range from 1..L, for a sequence of length L. Some algorithms will also use d to refer to a subsequence length, where d = j - i + 1 for a subsequence x_i..x_j.
710
of zero length (because of deletions). By convention,
711
these will be in the off-diagonal where
712
j =
713
i - 1 or
714
i =
715
j + 1. This special case (usually
716
an initialization condition) is the reason for the
717
qualification that
718
i ≤
719
j for subsequences of
720
nonzero length.
721
The CYK/inside algorithm calculates a three-dimensional matrix of numbers α_v(i,j), and CYK/outside calculates numbers β_v(i,j). I will refer to v (state indices) as deck coordinates in the three-dimensional matrices, whereas j and i (sequence positions) are row and column coordinates within each deck. α_v and β_v refer to whole two-dimensional decks containing scores α_v(i,j) and β_v(i,j) for a particular state v. The dividing and conquering will be done in the v dimension, by choosing particular decks as split points.
The CYK/inside algorithm
The CYK/inside algorithm iteratively calculates α_v(i,j): the log probability of the most likely CM parse subtree rooted at state v that generates subsequence x_i..x_j of sequence x. The calculation initializes at the smallest subgraphs and subsequences (e.g. subgraphs rooted at E states, generating subsequences of length 0), and iterates outwards to progressively longer subsequences and larger CM subgraphs.
For example, if we're calculating α_v(i,j) and S_v = P (that is, v is a pair state), v will generate the pair x_i, x_j and transit to a new state y (one of its possible transitions C_v), which then will have to account for the smaller subsequence x_{i+1}..x_{j-1}. The log probability for a particular choice of next state y is the sum of three terms: an emission term log e_v(x_i, x_j), a transition term log t_v(y), and an already calculated solution for the smaller optimal parse tree rooted at y, α_y(i+1, j-1). The answer for α_v(i,j) is the maximum over all possible choices of child states y that v can transit to.
The algorithm INSIDE is as follows:

Input: A CM subgraph G_r^z and subsequence x_g..x_q.
Output: Scoring matrix decks α_r..α_z.

INSIDE(r,z; g,q)
  for v ← z down to r
    for j ← g - 1 to q
      for i ← j + 1 down to g
        d ← j - i + 1
        if S_v = D or S:
          α_v(i,j) = max_{y ∈ C_v} [α_y(i,j) + log t_v(y)]
        else if S_v = P and d ≥ 2:
          α_v(i,j) = log e_v(x_i, x_j) + max_{y ∈ C_v} [α_y(i+1, j-1) + log t_v(y)]
        else if S_v = L and d ≥ 1:
          α_v(i,j) = log e_v(x_i) + max_{y ∈ C_v} [α_y(i+1, j) + log t_v(y)]
        else if S_v = R and d ≥ 1:
          α_v(i,j) = log e_v(x_j) + max_{y ∈ C_v} [α_y(i, j-1) + log t_v(y)]
        else if S_v = B:
          (y, z) ← left and right S children of state v
          α_v(i,j) = max_{i-1 ≤ k ≤ j} [α_y(i,k) + α_z(k+1,j)]
        else if S_v = E and d = 0:
          α_v(i,j) = 0 (initialization)
        else:
          α_v(i,j) = -∞ (initialization)
Given a sequence x of length L and a CM G of length M, we could call INSIDE(1, M; 1, L) to align the whole model (states 1..M) to the whole sequence (x_1..x_L). When INSIDE returns, α_1(1,L) would contain the log probability of the best parse of the complete sequence with the complete model.
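To make the recursion concrete, here is a compact, score-only Python sketch of CYK/inside over a toy CM (my illustration under simplifying assumptions: the dict-based state representation and the toy model's probabilities are invented, and deck deallocation is omitted):

import math

NEG = float('-inf')

def cyk_inside(states, x):
    # states[v]: dict with 'type' in {'D','P','L','R','S','E','B'},
    # 'children' (higher-numbered state indices; for B, the left and
    # right S children), 't' (child -> log transition probability),
    # and 'e' (emission -> log probability).
    # Returns alpha[v][j][i] with 1 <= i <= j+1 <= L+1; i = j+1
    # encodes the zero-length subsequence.
    L, M = len(x), len(states)
    alpha = [[[NEG] * (L + 2) for _ in range(L + 1)] for _ in range(M)]
    for v in range(M - 1, -1, -1):          # children before parents
        s = states[v]
        for j in range(0, L + 1):
            for i in range(j + 1, 0, -1):
                d = j - i + 1               # subsequence length
                if s['type'] == 'E':
                    alpha[v][j][i] = 0.0 if d == 0 else NEG
                elif s['type'] == 'B':      # split x_i..x_j at k
                    y, z = s['children']
                    alpha[v][j][i] = max(
                        alpha[y][k][i] + alpha[z][j][k + 1]
                        for k in range(i - 1, j + 1))
                else:                       # D, S, P, L, R
                    emit, ii, jj = 0.0, i, j
                    if s['type'] == 'P' and d >= 2:
                        emit, ii, jj = s['e'].get(x[i-1] + x[j-1], NEG), i + 1, j - 1
                    elif s['type'] == 'L' and d >= 1:
                        emit, ii, jj = s['e'].get(x[i-1], NEG), i + 1, j
                    elif s['type'] == 'R' and d >= 1:
                        emit, ii, jj = s['e'].get(x[j-1], NEG), i, j - 1
                    elif s['type'] in ('P', 'L', 'R'):
                        alpha[v][j][i] = NEG   # too short to emit
                        continue
                    alpha[v][j][i] = emit + max(
                        alpha[y][jj][ii] + s['t'][y] for y in s['children'])
    return alpha

# Toy model S -> P -> L -> L -> E: one pair enclosing two unpaired
# residues (all probabilities invented for illustration).
log = math.log
toy = [
    {'type': 'S', 'children': [1], 't': {1: 0.0}, 'e': {}},
    {'type': 'P', 'children': [2], 't': {2: 0.0},
     'e': {'gc': log(0.5), 'au': log(0.5)}},
    {'type': 'L', 'children': [3], 't': {3: 0.0}, 'e': {'a': 0.0}},
    {'type': 'L', 'children': [4], 't': {4: 0.0}, 'e': {'a': 0.0}},
    {'type': 'E', 'children': [], 't': {}, 'e': {}},
]
alpha = cyk_inside(toy, 'gaac')
print(alpha[0][4][1])   # log P of the best parse: log(0.5)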
We do not have to keep the entire α three-dimensional matrix in memory to calculate these scores. As we reach higher decks α_v in the three-dimensional dynamic programming matrix, our calculations no longer depend on certain lower decks. A lower deck y can be deallocated whenever all the parent decks P_y that depend on it have been calculated. (The implementation goes even further and recycles decks when possible, saving some initialization steps and many memory allocation calls; for example, since values in all E decks are identical, only one E deck needs to be calculated, and that precalculated deck can be reused whenever S_v = E.)
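This rule can be expressed as a reference count of uncomputed parents. A Python sketch (illustrative only; the real implementation also recycles decks):

def deallocation_schedule(children, order):
    # children[v]: decks that deck v depends on; order: decks in the
    # sequence they are completed (M down to 1 for CYK/inside).
    # Deck y can be freed once all parent decks P_y are computed,
    # i.e. once its count of still-uncomputed parents reaches zero.
    pending_parents = {}
    for v in order:
        for y in children.get(v, []):
            pending_parents[y] = pending_parents.get(y, 0) + 1
    schedule = {}
    for v in order:
        for y in children.get(v, []):
            pending_parents[y] -= 1
            if pending_parents[y] == 0:
                schedule[y] = v     # free deck y right after deck v
    return schedule

# Tiny example: decks 1..3 all depend on deck 4; deck 4 is freed
# only after deck 1, the last parent computed, is done.
print(deallocation_schedule({1: [4], 2: [4], 3: [4]}, [4, 3, 2, 1]))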
This deallocation rule has an important property that the divide and conquer algorithm takes advantage of when solving smaller subproblems for CM subgraphs rooted at some state w. When the root state w is an S state, the α matrix returned by INSIDE contains only one active deck, α_w. (No lower state > w can be reached from any state < w without going through w, so all lower decks are deallocated once deck w is completed.) When the root state w is the first state in a split set w..y (see below for more explanation), all (and only) the decks α_w..α_y are active when INSIDE returns.
In some cases we want to recover the optimal parse tree itself, not just its score. The INSIDEτ routine is a modified version of INSIDE. It keeps an additional "shadow matrix" τ_v(i,j). A τ_v(i,j) traceback pointer either records the index y that maximized α_v(i,j) (for state types D, S, P, L, R) or records the split point k that maximized α_v(i,j) for a bifurcation (B) state. The τ shadow matrix does not use the deallocation rules; INSIDEτ can only be called for problems small enough that they can be solved within our available memory space. Thus the INSIDEτ routine works by calling INSIDE in a mode that also keeps a shadow matrix τ, and then calls a recursive traceback starting with v, i, j:
Input: A shadow matrix τ for CM subgraph G_v rooted at state v, and subsequence x_i..x_j.
Output: An optimal parse tree T.

TRACEBACK(v,i,j)
  if S_v = E:
    attach v
  else if S_v = S or D:
    attach v
    TRACEBACK(τ_v(i,j), i, j)
  else if S_v = P:
    attach x_i, v, x_j
    TRACEBACK(τ_v(i,j), i + 1, j - 1)
  else if S_v = L:
    attach x_i, v
    TRACEBACK(τ_v(i,j), i + 1, j)
  else if S_v = R:
    attach v, x_j
    TRACEBACK(τ_v(i,j), i, j - 1)
  else if S_v = B:
    (y, z) ← left and right S children of state v
    attach v
    TRACEBACK(y, i, τ_v(i,j))
    TRACEBACK(z, τ_v(i,j) + 1, j)
The CYK/outside algorithm
The CYK/outside algorithm iteratively calculates β_v(i,j): the log probability of the most likely CM parse tree for a CM generating sequence x_1..x_L, excluding the optimal parse subtree rooted at state v that accounts for the subsequence x_i..x_j. The calculation initializes with the entire sequence excluded (e.g. β_1(1,L) = 0), and iterates inward to progressively shorter and shorter excluded subsequences and smaller CM subgraphs.
A complete implementation of the CYK/outside algorithm requires first calculating the CYK/inside matrix α, because it is needed to calculate β_v when the parent of v is a bifurcation [24,43,44]. However, the divide and conquer algorithm described here only calls OUTSIDE on unbifurcated, linear CM subgraphs (only the final state z may be a B state; there are no internal bifurcations that lead to branches in the model). Thus the parent of a state v is never a bifurcation, and the implementation can therefore be streamlined as follows:
Input: An unbifurcated CM subgraph G_r^z and subsequence x_g..x_q.
Output: Scoring matrix decks β_r..β_z.

OUTSIDE(r,z; g,q)
  β_v(i,j) ← -∞ ∀ v, i, j
  β_r(g,q) ← 0
  for v ← r + 1 to z
    for j ← q down to g - 1
      for i ← g to j + 1
        β_v(i,j) = max over parents y ∈ P_v of:
          β_y(i-1, j+1) + log e_y(x_{i-1}, x_{j+1}) + log t_y(v), if S_y = P
          β_y(i-1, j) + log e_y(x_{i-1}) + log t_y(v), if S_y = L
          β_y(i, j+1) + log e_y(x_{j+1}) + log t_y(v), if S_y = R
          β_y(i, j) + log t_y(v), if S_y = S or D
        (terms that would require x_{i-1} or x_{j+1} outside x_g..x_q are -∞)
As with INSIDE, we do not keep the entire β matrix in memory. A deck β_v can be deallocated when all child decks C_v that depend on the values in β_v have been calculated. This means that if the last deck z is a bifurcation or end state, β_z will be the only active allocated deck when OUTSIDE returns. If z is the last state in a split set w..z, all (and only) the split set decks β_w..β_z will be active when OUTSIDE returns.
Using CYK/inside and CYK/outside to divide and conquer
Now, for any chosen state v, argmax_{i,j} [α_v(i,j) + β_v(i,j)] tells us which cell (v, i, j) the optimal parse tree passes through, conditional on using state v in the parse. We know that any parse tree must include all the bifurcation and start states of the CM, so we know that the optimal alignment must use any chosen bifurcation state v and its child start states w and y. Thus, we are guaranteed that (when S_v = B and C_v = w, y):

  max_{i,j,k} [β_v(i,j) + α_w(i,k) + α_y(k+1,j)]

is the optimal overall alignment score, and we also know that

  (i, k, j) = argmax_{i,k,j} [β_v(i,j) + α_w(i,k) + α_y(k+1,j)]

gives us a triplet that identifies three cells that must be in the optimal alignment: (v, i, j), (w, i, k), and (y, k+1, j). This splits the remaining problem into three smaller subproblems: an alignment of the sequence x_i..x_k to a CM subgraph w..y-1, an alignment of the sequence x_{k+1}..x_j to a CM subgraph y..M, and an alignment of the two-piece sequence x_1..x_{i-1} // x_{j+1}..x_L to a CM subgraph 1..v.

The subproblems are then themselves split, and this splitting can continue recursively until all the bifurcation triplets on the optimal parse tree have been determined.
At this point the remaining alignment subproblems might be small enough to be solved by straightforward application of the standard CYK/inside algorithm (e.g. INSIDEτ). However, this is not guaranteed to be the case. A more general division strategy is needed that does not depend on splitting at bifurcations.
For the more general strategy we take advantage of the fact that we know that the optimal parse tree must also include one and only one state from the split set of each node (e.g. the non-insert states in the node). Let w..y be the indices of a split set of states in the middle of the current model subgraph. (w..y can be at most 4 states.) We know that

  (v, i, j) = argmax_{v ∈ w..y; i,j} [α_v(i,j) + β_v(i,j)]

gives us a new cell (v, i, j) in the optimal parse tree, and splits the problem into two smaller problems. This strategy can be applied recursively all the way down to single nodes, if necessary. We can therefore guarantee that we will never need to carry out a full CYK/inside alignment algorithm on any subproblem. The most memory-intensive alignment problem that needs to be solved is the very first split. The properties of the first split determine the memory complexity of the algorithm.
The bifurcation-dependent strategy is a special case of this more general splitting strategy, where the B state is the only member of its split set, and where we also take advantage of the fact that α_v(i,j) = max_k [α_w(i,k) + α_y(k+1,j)]. By carrying out the max_k operation during the split, rather than before, we can split the current problem into three optimal pieces instead of just two.
If we look at the consequences of these splitting strategies, we see we will have to deal with three types of problems (Figure 5):
• A generic problem means finding the optimal alignment of a CM subgraph G_r^z to a contiguous subsequence x_g..x_q. The subgraph corresponds to a complete subtree of the CM's guide tree; e.g. state r is a start (S), and state z is an end (E). G_r^z may contain bifurcations. The problem is solved in one of two ways. If G_r^z contains no bifurcations, it is solved as a wedge problem (see below). Else, the problem is subdivided by the bifurcation-dependent strategy: an optimal triple (i, k, j) is found for a bifurcation state v and its children w, y, splitting the problem into a V problem and two generic problems.

• A wedge problem means finding the optimal alignment of an unbifurcated CM subgraph G_r^z to a contiguous subsequence x_g..x_q. State r does not have to be a start state (S); it may be a state in a split set (MP, ML, MR, or D). State z is an end (E). A wedge problem is solved by the split set-dependent strategy: an optimal (v, i, j) is found, splitting the problem into a V problem and a smaller wedge problem.

• A V problem consists of finding the optimal alignment of an unbifurcated CM subgraph G_r^z to a noncontiguous, two-piece sequence x_g..x_h // x_p..x_q, exclusive of the residues x_h and x_p (open circles in Figure 5). State r can be a start state or any state in a split set; the same is true for z. A V problem is solved by a split set-dependent strategy: an optimal (v, i, j) is found, splitting the problem into two V problems.
The three recursive splitting algorithms to solve these problems are as follows:
The generic_splitter routine
Input: A generic problem, for CM subgraph G_r^z and subsequence x_g..x_q.
Output: An optimal parse subtree T.

GENERIC_SPLITTER(r,z; g,q)
  if no bifurcation in G_r^z:
    return WEDGE_SPLITTER(r,z; g,q)
  else
    v ← lowest numbered bifurcation state in subgraph G_r^z
    w, y ← left and right S children of v
    β_v ← OUTSIDE(r,w; g,q)
    α_w ← INSIDE(w,y-1; g,q)
    α_y ← INSIDE(y,z; g,q)
    (i, k, j) ← argmax [β_v(i,j) + α_w(i,k) + α_y(k+1,j)]
    T_1 ← V_SPLITTER(r,v; g,i; j,q)
    T_2 ← GENERIC_SPLITTER(w,y-1; i,k)
    T_3 ← GENERIC_SPLITTER(y,z; k+1,j)
    Attach S state w of T_2 as left child of B state v in T_1.
    Attach S state y of T_3 as right child of B state v in T_1.
    return T_1.
The wedge_splitter routine
Input: A wedge problem, for unbifurcated CM subgraph G_r^z and subsequence x_g..x_q.
Output: An optimal parse subtree T.

WEDGE_SPLITTER(r,z; g,q)
  (w..y) ← a split set chosen from the middle of G_r^z
  (α_w..α_y) ← INSIDE(w,z; g,q)
  (β_w..β_y) ← OUTSIDE(r,y; g,q)
  (v, i, j) ← argmax_{v ∈ w..y; i,j} [α_v(i,j) + β_v(i,j)]
  T_1 ← V_SPLITTER(r,v; g,i; j,q)
  T_2 ← WEDGE_SPLITTER(v,z; i,j)
  Attach T_2 to T_1 by merging at state v.
  return T_1.
The V_splitter routine
Input: A V problem, for unbifurcated CM subgraph G_r^z and two-part subsequence x_g..x_h // x_p..x_q.
Output: An optimal parse subtree T.

V_SPLITTER(r,z; g,h; p,q)
  (w..y) ← a split set chosen from the middle of G_r^z
  (α_w..α_y) ← VINSIDE(w,z; g,h; p,q)
  (β_w..β_y) ← VOUTSIDE(r,y; g,h; p,q)
  (v, i, j) ← argmax_{v ∈ w..y; i,j} [α_v(i,j) + β_v(i,j)]
  T_1 ← V_SPLITTER(r,v; g,i; j,q)
  T_2 ← V_SPLITTER(v,z; i,h; p,j)
  Attach T_2 to T_1 by merging at state v.
  return T_1.
The vinside and voutside routines
The VINSIDE and VOUTSIDE routines are just INSIDE and OUTSIDE, modified to deal with a two-piece subsequence x_g..x_h // x_p..x_q instead of a contiguous sequence x_g..x_q. These modifications are fairly obvious. The range of i, j is restricted so that i ≤ h and j ≥ p. Also, VINSIDE(w,z; g,h; p,q) initializes α_z(h,p) = 0: that is, we know that sequence x_h..x_p has already been accounted for by a CM parse tree rooted at z.
Implementation
In the description of the algorithms above, some technical detail has been omitted; in particular, a detailed description of efficient initialization steps, and details of how the dynamic programming matrices are laid out in memory. These details are not necessary for a high level understanding of the divide and conquer algorithm. However, they may be necessary for reproducing a working implementation. Commented ANSI C source code for a reference implementation is therefore freely available at http://www.genetics.wustl.edu/eddy/infernal/ under a GNU General Public License. This code has been tested on GNU/Linux platforms.
In this codebase, the CM data structure is defined in structs.h. The CM construction procedure is in modelmaker.c. The guide tree is constructed in HandModelmaker(). A CM is constructed from the guide tree by cm_from_master(). Individual parse trees are constructed using the guide tree by transmogrify(). The divide and conquer algorithm is implemented in smallcyk.c:CYKDivideAndConquer(), which recursively calls a set of functions: the three splitting routines generic_splitter(), wedge_splitter(), and v_splitter(); the four alignment engines INSIDE(), OUTSIDE(), VINSIDE(), and VOUTSIDE(); and the two traceback routines INSIDET() and VINSIDET().
Results and discussion
Memory complexity analysis
The memory complexity of normal CYK/inside is O(N^2 M), for a model of M states and a query sequence of N residues, since the full 3D dynamic programming matrix is indexed N × N × M (and since N ∝ M, we can alternatively state the upper bound as O(N^3)). The memory complexity of the divide and conquer algorithm is O(N^2 log M). The analysis that leads to this conclusion is as follows.
For a model with no bifurcations, the divide and conquer algorithm will never require more than 10 decks in memory at once. In the case of two adjacent MATP nodes, we will need six decks to store the scores for the current node we're calculating, and four decks for the split set of the adjacent node that we're connecting to (and dependent upon) (Figure 3).
Bifurcations will require some number of additional decks for start states (BEGL_S and BEGR_S) to be kept. In INSIDE, whenever we reach a deck for a start state, we will keep that deck in memory until we reach the parent bifurcation state. Half the time, that will mean waiting until another complete subgraph of the model is calculated (e.g. the subgraph rooted at the other start child of that bifurcation); that is, to calculate deck α_v for a bifurcation v, we need both decks α_w and α_y for its child start states w and y, so we have to hold on to α_y until we reach α_w. In turn, the subgraph rooted at w might contain bifurcations, so our calculation of α_w might require additional decks to be kept. Each start deck we reach in the INSIDE iteration means holding one extra deck in memory, and each bifurcation we reach means deallocating the two start decks it depends on; therefore we can iteratively calculate the maximum number of extra decks we will require:
  x_M ← -1
  for v ← M - 1 down to 1
    x_v ← x_{v+1}
    if S_v = S: x_v ← x_v + 1
    if S_v = B: x_v ← x_v - 2
  return max_v x_v
This number depends on the topology and order of evaluation of the states in the CM. Think of the bifurcating structure of the CM as a binary tree numbered in preorder traversal (e.g. left children are visited first, and have lower indices than right children). If this is a complete balanced tree with B bifurcations, we will need log_2 B extra decks. If it is a maximally unbalanced tree in which bifurcations only occur in left children, we will need B extra decks (all the right children). If it is a maximally unbalanced tree in which bifurcations only occur in right children, we will only ever need 1 extra deck. A left-unbalanced binary tree can be converted to a right-unbalanced binary tree just by swapping branches. For a CM, we can't swap branches without affecting the order of the sequence that's generated. We can, however, get the same effect by renumbering the CM states in a modified preorder traversal. Instead of always visiting the left subtree first, we visit the best subtree first, where "best" means the choice that will optimize memory usage. This reordering is readily calculated in O(M) time (not shown; see cm.c:CMRebalance() in the implementation). This way, we can never do worse than the balanced case, and we will often do better. We never need more than log_2 B extra decks. Since B < M, we can guarantee an O(N^2 log M) bound on memory complexity.
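The rebalancing computation itself is simple: recursively determine how many extra decks each branch of the bifurcation tree needs, and visit the more demanding branch first. A Python sketch of the idea behind CMRebalance() (not its actual code; the node fields are assumptions):

def extra_decks(node):
    # node: a bifurcation in the CM's binary bifurcation tree; .left
    # and .right point to the next bifurcations below it (None where
    # a branch contains no further bifurcation).
    if node is None:
        return 0
    a, b = extra_decks(node.left), extra_decks(node.right)
    # Evaluate the costlier branch first: it runs while nothing extra
    # is held; the cheaper branch then runs with one start deck held.
    node.visit_first = 'left' if a >= b else 'right'
    return max(max(a, b), min(a, b) + 1)

For a balanced tree of B bifurcations this returns a value on the order of log_2 B; for a chain in which the demanding branch is always visited first, it returns 1.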
Time complexity analysis
The time complexity of the standard algorithm is O(MN^2 + BN^3), for a model of M states (B of which are bifurcations) aligned to a sequence of N residues. Since B < M, and M ∝ N, we can also state the upper bound as O(MN^3) or O(N^4).
The time complexity of the divide and conquer algorithm depends on how close each split is to dividing a problem into equal sized subproblems. In the most ideal case, each call to GENERIC_SPLITTER could split into three subproblems that each contained 1/3 of the states and residues; splitting those three subproblems would only cost

  3 × (M/3) × (N/3)^3 = MN^3/27

in time, e.g. only about 1/27 the time it took to split the first problem. Thus in an ideal case the time requirement is almost completely dominated by the first split, and the extra time required to do the complete divide and conquer algorithm could be negligible. In pathological cases, optimal splits might lead to a series of very unequally sized problems. We never need to do more splits than there are states in the model, so we cannot do worse than O(M^2 N^3) in time.
An example of a pathological case is an RNA structure composed of a series of multifurcation loops such that each bifurcation leads to a small stem on one side, and the rest of the structure on the other. In such a case, every call to GENERIC_SPLITTER will split into a small subproblem containing a small stem (e.g. only removing a constant number of states and residues per split) and a large subproblem containing all the remaining states and sequence. This case can be avoided. It only arises because of the decision to implement a simplified CYK/outside algorithm and always split at the highest bifurcations. Better time performance could be guaranteed if a complete CYK/outside algorithm were implemented (at the cost of complexity in the description and implementation of the algorithm). This would allow us to choose a split point in a generic problem at any state in the CM (for instance, in the middle) regardless of its bifurcating structure.

In practice, empirical results on a variety of real RNAs (see below) indicate that the extra time required to do the divide and conquer is a small constant factor. A more complex implementation does not seem to be necessary.
Empirical results
Six structural RNA families were chosen for empirical evaluations of the algorithm, using available RNA database resources: tRNA [49], 5S ribosomal RNA [50], signal recognition particle (SRP) RNA [51], RNase P [52], small subunit (SSU) ribosomal RNA [53], and large subunit (LSU) ribosomal RNA [54]. For each family, a secondary structure annotated multiple alignment of four or five example sequences was extracted and used to construct a CM.
2115
alignments and CMs. The number of consensus columns in
2116
the alignments ranges from 72 (tRNA) to 2898 (LSU rRNA);
2117
about 55-60% are involved in consensus base pairs. The
2118
number of CM states is consistently about 3-fold more
2119
than the consensus alignment length, ranging from 230
2120
states (tRNA) to 9023 (LSU). About 1/150 of the states in
2121
each model are bifurcations. After optimal reordering of
2122
the model states, the number of extra decks required by
2123
the alignment algorithm is small, ranging up to 3 for SSU
2124
and 5 for LSU rRNA. Therefore the minimum constant of 10
2125
decks required in iterations across unbifurcated model
2126
segments dominates the memory requirement. The memory
2127
required for extra decks does not have much impact even
2128
for the largest structural RNAs. (Even without optimal
2129
reordering, the number of extra decks required for SSU
2130
and LSU are only 7 and 9, respectively. State reordering
2131
was only needed to assure a
2132
O (
2133
N 2log
2134
M ) memory complexity bound.)
2135
To determine the memory and CPU time requirements for a structural alignment, one example sequence from each family was aligned to the CM. CPU time was measured and memory requirements were calculated for three algorithms: (1) the full CYK/inside algorithm in memory-saving score-only mode (e.g. INSIDE(1,M; 1,L)); (2) the full CYK/inside algorithm with shadow matrix and traceback to recover the optimal alignment (e.g. INSIDEτ(1,M; 1,L)); and (3) the divide and conquer algorithm to recover an optimal alignment (e.g. GENERIC_SPLITTER(1,M; 1,L)). The most important comparison is between the full CYK/inside algorithm and the divide and conquer algorithm. The score-only CYK/inside algorithm was included because a complete CYK alignment couldn't be done on SSU and LSU rRNA (because of the steep memory requirement). In all cases where comparison could be done, the scores and alignments produced by these algorithms were verified to be identical.
The results of these tests are shown in the bottom half of Table 4. The memory required by divide and conquer alignment ranges up to 271 MB for LSU rRNA, compared to a prohibitive 151 GB for the standard CYK algorithm. The extra CPU time required by the divide and conquer is small: usually about 20% more, with a maximum of about two-fold more for SRP RNA.
The same results are plotted in Figure 6. Memory requirements scale as expected: N^3 for standard CYK alignment, and better than N^2 log N for the divide and conquer algorithm. Empirical CPU time requirements scale similarly for the two algorithms (N^3.24 to N^3.29). The observed performance is better than the theoretical worst case of O(N^4). The proportion of extra time required by divide and conquer is roughly constant over a wide range of RNAs. The difference shown in Figure 6 is exaggerated because times are plotted for score-only CYK, not complete CYK alignment, in order to include CPU times for SSU and LSU rRNA. Because score-only CYK does not keep a shadow traceback matrix nor perform the traceback, it is about 20% faster than CYK alignment, as seen in the data in Table 4.
Conclusions
The divide and conquer algorithm described here makes it possible to align even the largest structural RNAs to secondary structure consensus models, without exceeding the available memory on current computational hardware. Optimal SSU and LSU rRNA structural alignments can be performed in 70 MB and 270 MB of memory, respectively. Previous structural alignment algorithms had to sacrifice mathematical optimality to achieve ribosomal RNA alignments.
The CPU time requirement of the alignment algorithm is still significant, and even prohibitive for certain important applications. However, CPU time is generally an easier issue to deal with. A variety of simple parallelization strategies are possible. Banded dynamic programming algorithms (e.g. calculating only relevant parts of the matrix) of various forms can also be explored, including not only heuristic schemes, but also optimal algorithms based on branch and bound ideas. (Properly implemented, banded DP algorithms would also save additional memory.)
The algorithm takes advantage of the structure of covariance models (profile SCFGs), splitting in the dimension of the states of the model rather than in the sequence dimensions. The approach does not readily apply, therefore, to unprofiled SCFG applications in RNA secondary structure prediction [34,36,55,56], where the states are fewer and more fully interconnected. For these applications, it would seem to be necessary to divide and conquer in the sequence dimensions to obtain any significant improvement in memory requirements, and it is not immediately apparent how one would do this.
The current implementation of the algorithm is not biologically useful. It is meant only as a testbed for the algorithm. It outputs a raw traceback structure and alignment score, not a standardly formatted alignment file. Most importantly, the probability parameters for models are calculated in a very quick and simple-minded fashion, and are far from being reasonable for producing robustly accurate structural alignments. The next step along this line is to produce good prior distributions for estimating better parameters, by estimating mixture Dirichlet priors from known RNA structures [57]. At this stage it would not be meaningful to compare the biological alignment accuracy of this implementation to (for instance) the excellent performance of the RAGA genetic algorithm [27]. A biologically useful implementation with accurate alignment performance is of course the eventual goal of this line of work, but is not the point of the present paper.