CoCalc -- 1471-2105-3-18.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-3-18.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Background
6
        There are a growing number of RNA gene families and RNA
7
        motifs [ 1 2 ] . Many (though not all) RNAs conserve a
8
        base-paired RNA secondary structure. Computational analyses
9
        of RNA sequence families are more powerful if they take
10
        into account both primary sequence and secondary structure
11
        consensus [ 3 4 ] .
12
        Some excellent approaches have been developed for
13
        database searching with RNA secondary structure consensus
14
        patterns. Exact- and approximate-match pattern searches
15
        (analogous to PROSITE patterns for proteins) have been
16
        extended to allow patterns to specify long-range base
17
        pairing constraints [ 5 6 ] . In several cases, specialized
18
        programs have been developed to recognize specific RNA
19
        structures [ 4 ] - for example, programs exist for
20
        detecting transfer RNA genes [ 7 8 9 ] , group I catalytic
21
        introns [ 10 ] , and small nucleolar RNAs [ 11 12 ] . All
22
        of these approaches, though powerful, lack generality, and
23
        they require expert knowledge about each particular RNA
24
        family of interest.
25
        In primary sequence analysis, the most useful analysis
26
        techniques are general primary sequence alignment
27
        algorithms with probabilistically based scoring systems -
28
        for example, the BLAST [ 13 ] , FASTA [ 14 ] , or CLUSTALW
29
        [ 15 ] algorithms, and the PAM [ 16 ] or BLOSUM [ 17 ]
30
        score matrices. Unlike specialized programs, a general
31
        alignment algorithm can be applied to find homologs of any
32
        query sequence(s). Unlike pattern searches, which give
33
        yes/no answers for whether a candidate sequence is a match,
34
        a scoring system gives a meaningful score that allows
35
        ranking candidate hits by their statistical significance.
36
        It is of interest to develop general alignment algorithms
37
        for RNA secondary structures.
38
        The problem I consider here is as follows. I am given a
39
        multiple alignment of an RNA sequence family for which I
40
        know the consensus secondary structure. I want to search a
41
        sequence database for homologs that significantly match the
42
        sequence 
43
        and structure of my query. The
44
        sequence analysis analogue is the use of profile hidden
45
        Markov models (profile HMMs) to model multiple alignments
46
        of conserved protein domains, and to discover new
47
        homologues in sequence databases [ 18 19 ] . For instance,
48
        if we had an RNA structure equivalent of the HMMER profile
49
        HMM program suite http://hmmer.wustl.edu/it would be
50
        possible to develop and efficiently maintain databases of
51
        conserved RNA structures and multiple alignments, analogous
52
        to the Pfam or SMART databases of conserved protein domains
53
        [ 20 21 ] .
54
        Stochastic context free grammar (SCFG) algorithms
55
        provide a general approach to RNA structure alignment [ 22
56
        23 24 ] . SCFGs allow the strong pairwise residue
57
        correlations in non-pseudoknotted RNA secondary structure
58
        to be taken into account in RNA alignments. SCFGs can be
59
        aligned to sequences using a dynamic programming algorithm
60
        that guarantees finding a mathematically optimal solution
61
        in polynomial time. SCFG alignment algorithms can be
62
        thought of as an extension of sequence alignment algorithms
63
        (particularly those with fully probabilistic, hidden Markov
64
        model formulations) into an additional dimension necessary
65
        to deal with 2D RNA secondary structure.
66
        While SCFGs provide a natural mathematical framework for
67
        RNA secondary structure alignment problems, SCFG algorithms
68
        have high computational complexity that has impeded their
69
        practical application. Optimal SCFG-based structural
70
        alignment of an RNA structure to a sequence costs 
71
        O ( 
72
        N 3) memory and 
73
        O ( 
74
        N 4) time for a sequence of length 
75
        N, compared to 
76
        O ( 
77
        N 2) memory and time for sequence
78
        alignment algorithms. (Corpet and Michot described a
79
        program that implements a different general dynamic
80
        programming algorithm for RNA alignment; their algorithm
81
        solves the same problem but even less efficiently,
82
        requiring 
83
        O ( 
84
        N 4) memory and 
85
        O ( 
86
        N 5) time [ 25 ] .) SCFG-based
87
        alignments of small structural RNAs are feasible. Using my
88
        COVE software
89
        http://www.genetics.wustl.edu/eddy/software#cove, transfer
90
        RNA alignments (~75 nucleotides) take about 0.2 cpu second
91
        and 3 Mb of memory. Most genome centers now use an
92
        COVE-based search program, tRNAscan-SE, for annotating
93
        transfer RNA genes [ 9 ] . However, many larger RNAs of
94
        interest are OUTSIDE the capabilities of the standard SCFG
95
        alignment algorithm. Alignment of a small subunit (SSU)
96
        ribosomal RNA sequence to the SSU rRNA consensus structure
97
        would take about 23 GB of RAM and an hour of CPU time.
98
        Applying SCFG methods to RNAs this large has required
99
        clever heuristics, such as using a precalculation of
100
        confidently predicted regions of primary sequence alignment
101
        to strongly constrain which parts of the SCFG dynamic
102
        programming matrix need to be calculated [ 26 ] . The steep
103
        memory requirement remains a significant barrier to the
104
        practicality of SCFG algorithms.
105
        Notredame 
106
        et al. pointed specifically to this
107
        problem [ 27 ] . They described RAGA, a program that uses a
108
        genetic algorithm (GA) to optimize a pairwise RNA alignment
109
        using an objective function that includes base pairing
110
        terms. Because GAs have an O( 
111
        N ) memory requirement, RAGA can find
112
        reasonable solutions for large RNA alignment problems,
113
        including ribosomal RNA alignments. A different
114
        memory-efficient approach has also been described [ 28 29 ]
115
        . However, both approaches are approximate and cannot
116
        guarantee a mathematically optimal solution, in contrast to
117
        the (mathematically) optimal but more expensive dynamic
118
        programming approaches.
119
        Here, I introduce a dynamic programming solution to the
120
        problem of structural alignment of large RNAs. The central
121
        idea is a divide and conquer strategy. For linear sequence
122
        alignment, a divide and conquer algorithm was introduced by
123
        Hirschberg [ 30 ] , an algorithm known in the computational
124
        biology community as the Myers/Miller algorithm [ 31 ] .
125
        (Ironically, at the time, dynamic programming methods for
126
        optimal sequence alignment were well known, but were
127
        considered impractical on 1970's era computers because of
128
        the "extreme" 
129
        O ( 
130
        N 2) memory requirement.)
131
        Myers/Miller reduces the memory complexity of a dynamic
132
        programming sequence alignment algorithm from 
133
        O ( 
134
        N 2) to 
135
        O ( 
136
        N ), at the cost of a roughly
137
        two-fold increase in CPU time. Here I show that a divide
138
        and conquer strategy can also be applied to the RNA
139
        structural alignment problem, greatly reducing the memory
140
        requirement of SCFG alignments and making optimal
141
        structural alignment of large RNAs possible.
142
        I will strictly be dealing with the problem of aligning
143
        a target sequence of unknown secondary structure to a query
144
        of known RNA structure. By "secondary structure" I mean
145
        nested (nonpseudoknotted) pairwise RNA secondary structure
146
        interactions, primarily Watson-Crick base pairs but also
147
        permitting noncanonical base pairs. This RNA structural
148
        alignment problem is different from the problem of aligning
149
        two known RNA secondary structures together [ 32 ] , and
150
        from the problem of aligning two RNA sequences of unknown
151
        structure together under a secondary structure-aware
152
        scoring system [ 33 34 35 36 37 ] .
153
      
154
      
155
        Algorithm
156
        
157
          Prelude: the simpler case of sequence
158
          alignment
159
          The essential concepts of a divide and conquer
160
          alignment algorithm are most easily understood for the
161
          case of linear sequence alignment [ 30 31 ] .
162
          Dynamic programming (DP) algorithms for sequence
163
          alignment fill in an 
164
          N × 
165
          M DP matrix of scores 
166
          F(i,j) for two sequences of lengths
167
          
168
          N and 
169
          M ( 
170
          N < 
171
          M ) [ 38 39 ] . Each score 
172
          F(i,j) is the score of the optimal
173
          alignment of prefix 
174
          x 
175
          1 .. 
176
          x 
177
          
178
            i 
179
           of one sequence to prefix 
180
          y 
181
          1 .. 
182
          y 
183
          
184
            j 
185
           of the other. These scores are calculated
186
          iteratively, e.g. for global (Needleman/Wunsch)
187
          alignment:
188
          
189
          At the end, 
190
          F(N, M) contains the score of the
191
          optimal alignment. The alignment itself is recovered by
192
          tracing the individual optimal steps backwards through
193
          the matrix, starting from cell ( 
194
          N,M ). The algorithm is 
195
          O(NM) in both time and memory.
196
          If we are only interested in the score, not the
197
          alignment itself, the whole 
198
          F matrix does not have to be kept
199
          in memory. The iterative calculation only depends on the
200
          current and previous row of the matrix. Keeping two rows
201
          in memory suffices (in fact, a compulsively efficient
202
          implementation can get away with 
203
          N + 1 cells). A score-only
204
          calculation can be done in 
205
          O ( 
206
          N ) space.
207
          The fill stage of DP alignment algorithms may be run
208
          either forwards and backwards. We can just as easily
209
          calculate the optimal score 
210
          B(i, j) of the best alignment of
211
          the 
212
          suffix i + 1.. 
213
          N of sequence 1 to the suffix 
214
          j + 1.. 
215
          M of sequence 2, until one obtains
216
          B(0,0), the overall optimal score - the same number as 
217
          F(N,M). 
218
          The sum of 
219
          F(i,j) and 
220
          B(i,j) at any cell in the optimal
221
          path through the DP matrix is also the optimal overall
222
          alignment score. More generally, 
223
          F(i,j) + B(i,j) at any cell ( 
224
          i,j ) is the score of the best
225
          alignment that uses that cell. Therefore, since we know
226
          the optimal alignment must pass through any given row 
227
          i somewhere, we can pick some row 
228
          i in the middle of sequence 
229
          x, run the forward calculation to 
230
          i to obtain row 
231
          F(i), run the backwards calculation
232
          back to 
233
          i to get row 
234
          B(i), and then find argmax 
235
          
236
            j 
237
           
238
          F ( 
239
          i , 
240
          j )+ 
241
          B ( 
242
          i , 
243
          j ). Now I know the optimal
244
          alignment passes through cell ( 
245
          i,j ). (For clarity, I am leaving
246
          out details of how indels and local alignments are
247
          handled.)
248
          This divides the alignment into two smaller alignment
249
          problems, and these smaller problems can themselves be
250
          subdivided by the same trick. Thus, the complete optimal
251
          alignment can be found by a recursive series of split
252
          point calculations. Although this seems laborious - each
253
          calculation is giving us only a single point in the
254
          alignment - if we choose our split row 
255
          i to be in the middle, the size of
256
          the two smaller DP problems is decreased by about 4-fold
257
          at each split. A complete alignment thus costs only about
258
          times as much CPU time as doing the alignment in a single
259
          DP matrix calculation, but the algorithm is 
260
          O ( 
261
          N ) in memory.
262
          A standard dynamic programming alignment algorithm for
263
          SCFGs is the Cocke-Younger-Kasami (CYK) algorithm, which
264
          finds an optimal parse tree (e.g. alignment) for a model
265
          and a sequence [ 24 40 41 42 ] . (CYK is usually
266
          described in the literature as a dynamic programming
267
          recognition algorithm for nonstochastic CFGs in Chomsky
268
          normal form, rather than as a dynamic programming parsing
269
          algorithm for SCFGs in any form. The use of the name
270
          "CYK" here is therefore a little imprecise [ 24 ] .) CYK
271
          can be run in a memory-saving "score only" mode. The DP
272
          matrix for CYK can also be filled in two opposite
273
          directions - either "inside" or "outside", analogous to
274
          forward and backward DP matrix fills for linear sequence
275
          alignment. I will refer to these algorithms as CYK/inside
276
          and CYK/outside (or just inside and outside), but readers
277
          familiar with SCFG algorithms should not confuse them
278
          with the SCFG Inside and Outside algorithms [ 43 44 ]
279
          which sum over all possible parse trees rather than
280
          finding one optimal parse tree. I am always talking about
281
          the CYK algorithm in this paper, and by "inside" and
282
          "outside" I am only referring generically to the
283
          direction of the CYK DP calculation.
284
          The CYK/inside and CYK/outside algorithms are not as
285
          nicely symmetrical as the forward and backward DP fills
286
          are in sequence alignment algorithms. The splitting
287
          procedure that one obtains does not generate identical
288
          types of subproblems, so the divide and conquer procedure
289
          for SCFG-based RNA alignment is not as obvious.
290
        
291
        
292
          Definition and construction of a covariance
293
          model
294
          
295
            Definition of a stochastic context free
296
            grammar
297
            A stochastic context free grammar (SCFG) consists of
298
            the following:
299
            • 
300
            M different nonterminals (here
301
            called 
302
            states ). I will use capital
303
            letters to refer to specific nonterminals; 
304
            V and 
305
            Y will be used to refer
306
            generically to unspecified nonterminals.
307
            • 
308
            K different terminal symbols
309
            (e.g. the observable alphabet, a,c,g,u for RNA). I will
310
            use small letters 
311
            a, b to refer generically to
312
            terminal symbols.
313
            • a number of 
314
            production rules of the form: 
315
            V → γ, where γ can be any string
316
            of nonterminal and/or terminal symbols, including (as a
317
            special case) the empty string ε.
318
            • Each production rule is associated with a
319
            probability, such that the sum of the production
320
            probabilities for any given nonterminal 
321
            V is equal to 1.
322
          
323
          
324
            SCFG productions allowed in covariance
325
            models
326
            A covariance model is a specific repetitive "profile
327
            SCFG" architecture consisting of groups of model states
328
            that are associated with base pairs and single-stranded
329
            positions in an RNA secondary structure consensus. A
330
            covariance model has seven types of states and
331
            production rules (Table 1).
332
            Each overall production probability is the
333
            independent product of an emission probability 
334
            e 
335
            
336
              v 
337
             and a transition probability 
338
            t 
339
            
340
              v 
341
             , both of which are position-dependent parameters
342
            that depend on the state 
343
            v (analogous to hidden Markov
344
            models). For example, a particular pair (P) state 
345
            v produces two correlated letters
346
            
347
            a and 
348
            b (e.g. one of 16 possible base
349
            pairs) with probability 
350
            e 
351
            
352
              v 
353
             ( 
354
            a , 
355
            b ) and transits to one of
356
            several possible new states 
357
            Y of various types with
358
            probability 
359
            t 
360
            
361
              v 
362
             ( 
363
            Y ). A bifurcation (B) state
364
            splits into two new start ( 
365
            S ) states with probability 1.
366
            The E state is a special case ε production that
367
            terminates a derivation.
368
            A CM consists of many states of these seven basic
369
            types, each with its own emission and transition
370
            probability distributions, and its own set of states
371
            that it can transition to. Consensus base pairs will be
372
            modeled by P states, consensus single stranded residues
373
            by L and R states, insertions relative to the consensus
374
            by more L and R states, deletions relative to consensus
375
            by D states, and the branching topology of the RNA
376
            secondary structure by B, S, and E states. The
377
            procedure for starting from an input multiple alignment
378
            and determining how many states, what types of states,
379
            and how they are interconnected by transition
380
            probabilities is described next.
381
          
382
          
383
            From consensus structural alignment to guide
384
            tree
385
            Figure 1shows an example input file: a multiple
386
            sequence alignment of homologous RNAs, with a line
387
            describing the consensus RNA secondary structure. The
388
            first step of building a CM is to produce a binary 
389
            guide tree of nodes representing
390
            the consensus secondary structure. The guide tree is a
391
            parse tree for the consensus structure, with nodes as
392
            nonterminals and alignment columns as terminals.
393
            The guide tree has eight types of nodes (Table
394
            2).
395
            These consensus node types correspond closely with a
396
            CM's final state types. Each node will eventually
397
            contain one or more states. The guide tree deals with
398
            the consensus structure. For individual sequences, we
399
            will need to deal with insertions and deletions with
400
            respect to this consensus. The guide tree is the
401
            skeleton on which we will organize the CM. For example,
402
            a MATP node will contain a P-type state to model a
403
            consensus base pair; but it will also contain several
404
            other states to model infrequent insertions and
405
            deletions at or adjacent to this pair.
406
            The input alignment is first used to construct a
407
            consensus secondary structure (Figure 2) that defines
408
            which aligned columns will be ignored as non-consensus
409
            (and later modeled as insertions relative to the
410
            consensus), and which consensus alignment columns are
411
            base-paired to each other. Here I assume that both the
412
            structural annotation and the labeling of insert versus
413
            consensus columns is given in the input file, as shown
414
            in the line marked "[structure]" in the alignment in
415
            Figure 1. Alternatively, automatic methods might be
416
            employed. A consensus structure could be predicted from
417
            comparative analysis of the alignment [ 22 45 46 ] .
418
            The consensus columns could be chosen as those columns
419
            with less than a certain fraction of gap symbols, or by
420
            a maximum likelihood criterion, as is done for profile
421
            HMM construction [ 18 24 ] .
422
            Given the consensus structure, consensus base pairs
423
            are assigned to MATP nodes and consensus unpaired
424
            columns are assigned to MATL or MATR nodes. One ROOT
425
            node is used at the head of the tree. Multifurcation
426
            loops and/or multiple stems are dealt with by assigning
427
            one or more BIF nodes that branch to subtrees starting
428
            with BEGL or BEGR head nodes. (ROOT, BEGL, and BEGR
429
            start nodes are labeled differently because they will
430
            be expanded to different groups of states; this has to
431
            do with avoiding ambiguous parse trees for individual
432
            sequences, as described below.) Alignment columns that
433
            are considered to be insertions relative to the
434
            consensus structure are ignored at this stage.
435
            In general there will be more than one possible
436
            guide tree for any given consensus structure. Almost
437
            all of this ambiguity is eliminated by three
438
            conventions: (1) MATL nodes are always used instead of
439
            MATR nodes where possible, for instance in hairpin
440
            loops; (2) in describing interior loops, MATL nodes are
441
            used before MATR nodes; and (3) BIF nodes are only
442
            invoked where necessary to explain branching secondary
443
            structure stems (as opposed to unnecessarily
444
            bifurcating in single stranded sequence). One source of
445
            ambiguity remains. In invoking a bifurcation to explain
446
            alignment columns 
447
            i..j by two substructures on
448
            columns 
449
            i..k and 
450
            k + 1.. 
451
            j, there will be more than one
452
            possible choice of 
453
            k if 
454
            i..j is a multifurcation loop
455
            containing three or more stems. The choice of 
456
            k impacts the performance of the
457
            divide and conquer algorithm; for optimal time
458
            performance, we will want bifurcations to split into
459
            roughly equal sized alignment problems, so I choose the
460
            
461
            k that makes 
462
            i..k and 
463
            k + 1.. 
464
            j as close to the same length as
465
            possible.
466
            The result of this procedure is the guide tree. The
467
            nodes of the guide tree are numbered in preorder
468
            traversal (e.g. a recursion of "number the current
469
            node, visit its left child, visit its right child":
470
            thus parent nodes always have lower indices than their
471
            children). The guide tree corresponding to the input
472
            multiple alignment in Figure 1is shown in Figure 2.
473
          
474
          
475
            From guide tree to covariance model
476
            A CM must deal with insertions and deletions in
477
            individual sequences relative to the consensus
478
            structure. For example, for a consensus base pair,
479
            either partner may be deleted leaving a single unpaired
480
            residue, or the pair may be entirely deleted;
481
            additionally, there may be inserted nonconsensus
482
            residues between this pair and the next pair in the
483
            stem. Accordingly, each node in the master tree is
484
            expanded into one or more 
485
            states in the CM as follows
486
            (Table 3)
487
            Here we distinguish between consensus ("M", for
488
            "match") states and insert ("I") states. ML and IL, for
489
            example, are both L type states with L type
490
            productions, but they will have slightly different
491
            properties, as described below.
492
            The states are grouped into a 
493
            split set of 1-4 states (shown in
494
            brackets above) and an 
495
            insert set of 0-2 insert states.
496
            The split set includes the main consensus state, which
497
            by convention is first. One and only one of the states
498
            in the split set must be visited in every parse tree
499
            (and this fact will be exploited by the divide and
500
            conquer algorithm). The insert state(s) are not
501
            obligately visited, and they have self-transitions, so
502
            they will be visited zero or more times in any given
503
            parse tree.
504
            State transitions are then assigned as follows. For
505
            bifurcation nodes, the B state makes obligate
506
            transitions to the S states of the child BEGL and BEGR
507
            nodes. For other nodes, each state in a split set has a
508
            possible transition to every insert state in the 
509
            same node, and to every state in
510
            the split set of the 
511
            next node. An IL state makes a
512
            transition to itself, to the IR state in the same node
513
            (if present), and to every state in the split set of
514
            the next node. An IR state makes a transition to itself
515
            and to every state in the split set of the next
516
            node.
517
            This arrangement of transitions guarantees that
518
            (given the guide tree) there is unambiguously one and
519
            only one parse tree for any given individual structure.
520
            This is important. The algorithm will find a maximum
521
            likelihood parse tree for a given sequence, and we wish
522
            to interpret this result as a maximum likelihood
523
            structure, so there must be a one to one relationship
524
            between parse trees and secondary structures [ 47 ]
525
            .
526
            The final CM is an array of 
527
            M states, connected as a directed
528
            graph by transitions 
529
            t 
530
            
531
              v 
532
             ( 
533
            y ) (or probability 1 transitions
534
            
535
            v → ( 
536
            y,z ) for bifurcations) with the
537
            states numbered such that ( 
538
            y,z ) ≥ 
539
            v. There are no cycles in the
540
            directed graph other than cycles of length one (e.g.
541
            the self-transitions of the insert states). We can
542
            think of the CM as an array of states in which all
543
            transition dependencies run in one direction; we can do
544
            an iterative dynamic programming calculation through
545
            the model states starting with the last numbered end
546
            state 
547
            M and ending in the root state 1.
548
            An example CM, corresponding to the input alignment of
549
            Figure 1, is shown in Figure 3.
550
            As a convenient side effect of the construction
551
            procedure, it is guaranteed that the transitions from
552
            any state are to a 
553
            contiguous set of child states,
554
            so the transitions for state 
555
            v may be kept as an offset and a
556
            count. For example, in Figure 3, state 12 (an MP)
557
            connects to states 16, 17, 18, 19, 20, and 21. We can
558
            store this as an offset of 4 to the first connected
559
            state, and a total count of 6 connected states. We know
560
            that the offset is the distance to the next non-split
561
            state in the current node; we also know that the count
562
            is equal to the number of insert states in the current
563
            node, plus the number of split set states in the next
564
            node. These properties make establishing the
565
            connectivity of the CM trivial. Similarly, all the
566
            parents of any given state are also contiguously
567
            numbered, and can be determined analogously. We are
568
            also guaranteed that the states in a split set are
569
            numbered contiguously. This contiguity is exploited by
570
            the divide and conquer implementation.
571
          
572
          
573
            Parameterization
574
            Using the guide tree and the final CM, each
575
            individual sequence in the input multiple alignment can
576
            be converted unambiguously to a CM parse tree, as shown
577
            in Figure 4. Counts for observed state transitions and
578
            singlet/pair emissions are then collected from these
579
            parse trees. The observed counts are converted to
580
            transition and emission probabilities by standard
581
            procedures. I calculate maximum a posteriori
582
            parameters, using Dirichlet priors.
583
          
584
          
585
            Comparison to profile HMMs
586
            The relationship between an SCFG and a covariance
587
            model is analogous to the relationship of hidden Markov
588
            models (HMMs) and profile HMMs for modeling multiple
589
            sequence alignments [ 18 19 24 ] . A comparison may be
590
            instructive to readers familiar with profile HMMs. A
591
            profile HMM is a repetitive HMM architecture that
592
            associates each consensus column of a multiple
593
            alignment with a single type of model node - a MATL
594
            node, in the above notation. Each node contains a
595
            "match", "delete", and "insert" HMM state - ML, IL, and
596
            D states, in the above notation. The profile HMM also
597
            has special begin and end states. Profile HMMs could
598
            therefore be thought of as a special case of CMs. An
599
            unstructured RNA multiple alignment would be modeled by
600
            a guide tree of all MATL nodes, and converted to an
601
            unbifurcated CM that would essentially be identical to
602
            a profile HMM. (The only difference is trivial; the CM
603
            root node includes a IR state, whereas the start node
604
            of a profile HMM does not.) All the other node types
605
            (especially MATP, MATR, and BIF) and state types (e.g.
606
            MP, MR, IR, and B) are SCFG augmentations necessary to
607
            extend profile HMMs to deal with RNA secondary
608
            structure.
609
            The SCFG inside and outside algorithms are analogous
610
            to the Forward and Backward algorithms for HMMs [ 24 48
611
            ] . The CYK/inside parsing algorithm is analogous to
612
            the Viterbi HMM alignment algorithm run in the forward
613
            direction. CYK/outside is analogous to a Viterbi DP
614
            algorithm run in the backwards direction.
615
          
616
        
617
        
618
          Divide and conquer algorithm
619
          
620
            Notation
621
            I use 
622
            r, v, w, y, and 
623
            z as indices of states in the
624
            model, where 
625
            r ≤ ( 
626
            v , 
627
            w , 
628
            y ) ≤ 
629
            z. These indices will range from
630
            1.. 
631
            M , for a CM 
632
            G that contains 
633
            M states. refers to a subgraph of
634
            the model, rooted at state 
635
            r and ending at state 
636
            z, for a contiguous set of states
637
            
638
            r..z. G 
639
            r , without a subscript, refers
640
            to a subgraph of the model rooted at state 
641
            r and ending at the highest
642
            numbered E state descendant from state 
643
            r. The complete model is , or 
644
            G 1, or just 
645
            G. 
646
            
647
            S 
648
            
649
              v 
650
             refers to the 
651
            type of state 
652
            v ; it will be one of seven types
653
            {D,P,L,R,S,E,B}. 
654
            C 
655
            
656
              v 
657
             is a list of children for state 
658
            v (e.g. the states that 
659
            v can transit to); it will
660
            contain up to six contiguous indices 
661
            y with 
662
            v ≤ 
663
            y ≤ 
664
            M . 
665
            P 
666
            
667
              v 
668
             is a list of parents for state 
669
            v (states that could have
670
            transited to state 
671
            v ); it will contain up to six
672
            contiguous indices 
673
            y with 1 ≤ 
674
            y ≤ 
675
            v. ( 
676
            P 
677
            
678
              v 
679
             parent lists should not be confused with P state
680
            types.)
681
            I use 
682
            g, h, i, j, k, p, and 
683
            q as indices referring to
684
            positions in a sequence 
685
            x, where 
686
            g ≤ 
687
            h ≤ 
688
            p ≤ 
689
            q and 
690
            i ≤ 
691
            j for all subsequences of nonzero
692
            length. These indices range from 1.. 
693
            L, for a sequence of length 
694
            L . Some algorithms will also use
695
            
696
            d to refer to a subsequence
697
            length, where 
698
            d = 
699
            j - 
700
            i + 1 for a subsequence 
701
            x 
702
            
703
              i 
704
             .. 
705
            x 
706
            
707
              j 
708
             .
709
            The algorithms will have to account for subsequences
710
            of zero length (because of deletions). By convention,
711
            these will be in the off-diagonal where 
712
            j = 
713
            i - 1 or 
714
            i = 
715
            j + 1. This special case (usually
716
            an initialization condition) is the reason for the
717
            qualification that 
718
            i ≤ 
719
            j for subsequences of 
720
            nonzero length.
721
            The CYK/inside algorithm calculates a
722
            three-dimensional matrix of numbers α 
723
            
724
              v 
725
             ( 
726
            i , 
727
            j ), and CYK/outside calculates
728
            numbers β 
729
            
730
              v 
731
             ( 
732
            i , 
733
            j ). I will refer to 
734
            v (state indices) as 
735
            deck coordinates in the
736
            three-dimensional matrices, whereas 
737
            j and 
738
            i (sequence positions) are row
739
            and column coordinates within each deck. α 
740
            
741
              v 
742
             and β 
743
            
744
              v 
745
             refer to whole two-dimensional decks containing
746
            scores α 
747
            
748
              v 
749
             ( 
750
            i , 
751
            j ) and β 
752
            
753
              v 
754
             ( 
755
            i , 
756
            j ) for a particular state 
757
            v. The dividing and conquering
758
            will be done in the 
759
            v dimension, by choosing
760
            particular decks as split points.
761
          
762
          
763
            The CYK/inside algorithm
764
            The CYK/inside algorithm iteratively calculates α 
765
            
766
              v 
767
             ( 
768
            i , 
769
            j ) - the log probability of the
770
            most likely CM parse subtree rooted at state 
771
            v that generates subsequence 
772
            x 
773
            
774
              i 
775
             .. 
776
            x 
777
            
778
              j 
779
             of sequence 
780
            x. The calculation initializes at
781
            the smallest subgraphs and subsequences (e.g. subgraphs
782
            rooted at E states, generating subsequences of length
783
            0), and iterates outwards to progessively longer
784
            subsequences and larger CM subgraphs.
785
            For example, if we're calculating α 
786
            
787
              v 
788
             ( 
789
            i , 
790
            j ) and 
791
            S 
792
            
793
              v 
794
             = P (that is, 
795
            v is a pair state), 
796
            v will generate the pair 
797
            x 
798
            
799
              i 
800
             , 
801
            x 
802
            
803
              j 
804
             and transit to a new state 
805
            y (one of its possible
806
            transitions 
807
            C 
808
            
809
              v 
810
             ) which then will have to account for the smaller
811
            subsequence 
812
            x 
813
            
814
            i +1 .. 
815
            x 
816
            
817
            j -1 . The log probability
818
            for a particular choice of next state 
819
            y is the sum of three terms: an
820
            emission term log 
821
            e 
822
            
823
              v 
824
             ( 
825
            x 
826
            
827
              i 
828
             , 
829
            x 
830
            
831
              j 
832
             ), a transition term log 
833
            t 
834
            
835
              v 
836
             ( 
837
            y ), and an already calculated
838
            solution for the smaller optimal parse tree rooted at 
839
            y , α 
840
            
841
              y 
842
             ( 
843
            i + 1, 
844
            j - 1). The answer for α 
845
            
846
              v 
847
             ( 
848
            i , 
849
            j ) is the maximum over all
850
            possible choices of child states 
851
            y that 
852
            v can transit to.
853
            The algorithm INSIDE is as follows:
854
            
855
            Input: A CM subgraph and
856
            subsequence 
857
            x 
858
            
859
              g 
860
             .. 
861
            x 
862
            
863
              q 
864
             .
865
            
866
            Output: Scoring matrix decks α 
867
            
868
              r 
869
             ..α 
870
            
871
              z 
872
             .
873
            INSIDE(r,z; g,q)
874
            
875
            for 
876
            v ← 
877
            z 
878
            down to 
879
            r 
880
            
881
            for 
882
            j ← 
883
            g - 1 
884
            to 
885
            q 
886
            
887
            for 
888
            i ← 
889
            j + 1 
890
            down to 
891
            g 
892
            
893
            d ← 
894
            j - 
895
            i + 1
896
            
897
            if 
898
            S 
899
            
900
              v 
901
             = D or S:
902
            α 
903
            
904
              v 
905
             ( 
906
            i, j ) = [α 
907
            
908
              y 
909
             ( 
910
            i, j ) + log 
911
            t 
912
            
913
              v 
914
             ( 
915
            y )]
916
            
917
            else if 
918
            S 
919
            
920
              v 
921
             = P and 
922
            d ≥ 2:
923
            α 
924
            
925
              v 
926
             ( 
927
            i, j ) = log 
928
            e 
929
            
930
              v 
931
             ( 
932
            x 
933
            
934
              i 
935
             , 
936
            x 
937
            
938
              j 
939
             ) + [α 
940
            
941
              y 
942
             ( 
943
            i + 1, 
944
            j - 1) + log 
945
            t 
946
            
947
              v 
948
             ( 
949
            y )]
950
            
951
            else if 
952
            S 
953
            
954
              v 
955
             = L and 
956
            d ≥ 1:
957
            α 
958
            
959
              v 
960
             ( 
961
            i, j ) = log 
962
            e 
963
            
964
              v 
965
             ( 
966
            x 
967
            
968
              i 
969
             ) + [α 
970
            
971
              y 
972
             ( 
973
            i + 1, 
974
            j ) + log 
975
            t 
976
            
977
              v 
978
             ( 
979
            y )]
980
            
981
            else if 
982
            S 
983
            
984
              v 
985
             = R and 
986
            d ≥ 1:
987
            α 
988
            
989
              v 
990
             ( 
991
            i, j ) = log 
992
            e 
993
            
994
              v 
995
             ( 
996
            x 
997
            
998
              j 
999
             ) + [α 
1000
            
1001
              y 
1002
             ( 
1003
            i , 
1004
            j - 1) + log 
1005
            t 
1006
            
1007
              v 
1008
             ( 
1009
            y )]
1010
            
1011
            else if 
1012
            S 
1013
            
1014
              v 
1015
             = B:
1016
            ( 
1017
            y , 
1018
            z ) ← left and right S children
1019
            of state 
1020
            v 
1021
            α 
1022
            
1023
              v 
1024
             ( 
1025
            i, j ) = [α 
1026
            
1027
              y 
1028
             ( 
1029
            i, k ) + α 
1030
            
1031
              z 
1032
             ( 
1033
            k + 1, 
1034
            j )]
1035
            
1036
            else if 
1037
            S 
1038
            
1039
              v 
1040
             = E and 
1041
            d = 0:
1042
            α 
1043
            
1044
              v 
1045
             ( 
1046
            i , 
1047
            j ) = 0 (initializations)
1048
            
1049
              else 
1050
            
1051
            α 
1052
            
1053
              v 
1054
             ( 
1055
            i , 
1056
            j ) = -∞ (initializations)
1057
            Given a sequence 
1058
            x of length 
1059
            L and a CM 
1060
            G of length 
1061
            M , we could call INSIDE (1, M;
1062
            1, L) to align the whole model (states 1.. 
1063
            M ) to the whole sequence ( 
1064
            x 
1065
            1 .. 
1066
            x 
1067
            
1068
              L 
1069
             ). When INSIDE returns, α 
1070
            1 (1, 
1071
            L ) would contain the log
1072
            probability of the best parse of the complete sequence
1073
            with the complete model.
1074
            We do not have to keep the entire α
1075
            three-dimensional matrix in memory to calculate these
1076
            scores. As we reach higher decks α 
1077
            
1078
              v 
1079
             in the three dimensional dynamic programming
1080
            matrix, our calculations no longer depend on certain
1081
            lower decks. A lower deck 
1082
            y can be deallocated whenever all
1083
            the parent decks 
1084
            P 
1085
            
1086
              y 
1087
             that depend on it have been calculated. (The
1088
            implementation goes even further and recycles decks
1089
            when possible, saving some initialization steps and
1090
            many memory allocation calls; for example, since values
1091
            in all E decks are identical, only one E deck needs to
1092
            be calculated and that precalculated deck can be reused
1093
            whenever 
1094
            S 
1095
            
1096
              v 
1097
             = 
1098
            E .)
1099
            This deallocation rule has an important property
1100
            that the divide and conquer algorithm takes advantage
1101
            of when solving smaller subproblems for CM subgraphs
1102
            rooted at some state 
1103
            w. When the root state 
1104
            w is an S state, the α matrix
1105
            returned by INSIDE contains only one active deck α 
1106
            
1107
              w 
1108
             . (No lower state > 
1109
            w can be reached from any state
1110
            < 
1111
            w without going through 
1112
            w, so all lower decks are
1113
            deallocated once deck 
1114
            w is completed.) When the root
1115
            state 
1116
            w is the first state in a split
1117
            set 
1118
            w..y (see below for more
1119
            explanation), all (and only) the decks α 
1120
            
1121
              w 
1122
             ..α 
1123
            
1124
              y 
1125
             are active when INSIDE returns.
1126
            In some cases we want to recover the optimal parse
1127
            tree itself, not just its score. The INSIDE τroutine is
1128
            a modified version of INSIDE. It keeps an additional
1129
            "shadow matrix" τ 
1130
            
1131
              v 
1132
             ( 
1133
            i,j ). A τ 
1134
            
1135
              v 
1136
             ( 
1137
            i,j ) traceback pointer either
1138
            records the index 
1139
            y that maximized α 
1140
            
1141
              v 
1142
             ( 
1143
            i,j ) (for state types D,S,P,L,R)
1144
            or records the split point 
1145
            k that maximized α 
1146
            
1147
              v 
1148
             ( 
1149
            i,j ) for a bifurcation (B)
1150
            state. The τ shadow matrix does not use the
1151
            deallocation rules - INSIDE τcan only be called for
1152
            problems small enough that they can be solved within
1153
            our available memory space. Thus the INSIDE τroutine
1154
            works by calling INSIDE in a mode that also keeps a
1155
            shadow matrix τ, and then calls a recursive traceback
1156
            starting with 
1157
            v, i, j :
1158
            
1159
            Input: A shadow matrix τ for CM
1160
            subgraph 
1161
            G 
1162
            v rooted at state 
1163
            v, and subsequence 
1164
            x 
1165
            
1166
              i 
1167
             .. 
1168
            x 
1169
            
1170
              j 
1171
             .
1172
            
1173
            Output: An optimal parse tree 
1174
            T .
1175
            TRACEBACK(v,i,j)
1176
            
1177
            if 
1178
            S 
1179
            
1180
              v 
1181
             = 
1182
            E :
1183
            attach 
1184
            v 
1185
            
1186
            else if 
1187
            S 
1188
            
1189
              v 
1190
             = 
1191
            S or D:
1192
            attach 
1193
            v 
1194
            TRACEBACK(τ 
1195
            
1196
              v 
1197
             ( 
1198
            i , 
1199
            j ), 
1200
            i , 
1201
            j ))
1202
            
1203
            else if 
1204
            S 
1205
            
1206
              v 
1207
             = P:
1208
            attach 
1209
            x 
1210
            
1211
              i 
1212
             , 
1213
            v , 
1214
            x 
1215
            
1216
              j 
1217
             
1218
            TRACEBACK(τ 
1219
            
1220
              v 
1221
             ( 
1222
            i , 
1223
            j ), 
1224
            i + 1, 
1225
            j - 1)
1226
            
1227
            else if 
1228
            S 
1229
            
1230
              v 
1231
             = L:
1232
            attach 
1233
            x 
1234
            
1235
              i 
1236
             , 
1237
            v 
1238
            TRACEBACK(τ 
1239
            
1240
              v 
1241
             ( 
1242
            i , 
1243
            j ), 
1244
            i + 1, 
1245
            j )
1246
            
1247
            else if 
1248
            S 
1249
            
1250
              v 
1251
             = R:
1252
            attach 
1253
            v , 
1254
            x 
1255
            
1256
              j 
1257
             
1258
            TRACEBACK(τ 
1259
            
1260
              v 
1261
             ( 
1262
            i , 
1263
            j ), 
1264
            i , 
1265
            j - 1)
1266
            
1267
            else if 
1268
            S 
1269
            
1270
              v 
1271
             = B:
1272
            ( 
1273
            y , 
1274
            z ) ← left and right S children
1275
            of state 
1276
            v 
1277
            attach 
1278
            v 
1279
            TRACEBACK( 
1280
            y , 
1281
            i , τ 
1282
            
1283
              v 
1284
             ( 
1285
            i , 
1286
            j ))
1287
            TRACEBACK( 
1288
            z , τ 
1289
            
1290
              v 
1291
             ( 
1292
            i , 
1293
            j ) + 1, 
1294
            j )
1295
          
1296
          
1297
            The CYK/outside algorithm
1298
            The CYK/outside algorithm iteratively calculates β 
1299
            
1300
              v 
1301
             ( 
1302
            i , 
1303
            j ), the log probability of the
1304
            most likely CM parse tree for a CM generating a
1305
            sequence 
1306
            x 
1307
            1 .. 
1308
            x 
1309
            
1310
              L 
1311
             
1312
            excluding the optimal parse
1313
            subtree rooted at state 
1314
            v that accounts for the
1315
            subsequence 
1316
            x 
1317
            
1318
              i 
1319
             .. 
1320
            x 
1321
            
1322
              j 
1323
             . The calculation initializes with the entire
1324
            sequence excluded (e.g. β 
1325
            1 (1, 
1326
            L ) = 0), and iterates inward to
1327
            progressively shorter and shorter excluded subsequences
1328
            and smaller CM subgraphs.
1329
            A complete implementation of the CYK/outside
1330
            algorithm requires first calculating the CYK/inside
1331
            matrix α because it is needed to calculate β 
1332
            
1333
              v 
1334
             when the parent of 
1335
            v is a bifurcation [ 24 43 44 ] .
1336
            However, the divide and conquer algorithm described
1337
            here only calls OUTSIDE on 
1338
            unbifurcated, linear CM
1339
            subgraphs (only the final state 
1340
            z may be a B state; there are no
1341
            internal bifurcations that lead to branches in the
1342
            model). Thus the parent of a state 
1343
            v is never a bifurcation, and the
1344
            implementation can therefore be streamlined as
1345
            follows:
1346
            
1347
            Input: An unbifurcated CM subgraph
1348
            and subsequence 
1349
            x 
1350
            
1351
              g 
1352
             .. 
1353
            x 
1354
            
1355
              q 
1356
             .
1357
            
1358
            Output: Scoring matrix decks β 
1359
            
1360
              r 
1361
             ..β 
1362
            
1363
              z 
1364
             .
1365
            OUTSIDE(r,z; g,q)
1366
            β 
1367
            
1368
              v 
1369
             ( 
1370
            i , 
1371
            j ) ← - ∞ ∀ 
1372
            v , 
1373
            i , 
1374
            j 
1375
            β 
1376
            
1377
              r 
1378
             ( 
1379
            g , 
1380
            q ) ← 0
1381
            
1382
            for 
1383
            v ← 
1384
            r + 1 
1385
            to 
1386
            z 
1387
            
1388
            for 
1389
            j ← 
1390
            q 
1391
            down to 
1392
            g - 1
1393
            
1394
            for 
1395
            i ← 
1396
            g 
1397
            to 
1398
            j + 1
1399
            
1400
            As with INSIDE, we do not keep the entire β matrix
1401
            in memory. A deck β 
1402
            
1403
              v 
1404
             can be deallocated when all child decks 
1405
            C 
1406
            
1407
              v 
1408
             that depend on the values in β 
1409
            
1410
              v 
1411
             have been calculated. This means that if the last
1412
            deck 
1413
            z is a bifurcation or end state,
1414
            β 
1415
            
1416
              z 
1417
             will be the only active allocated deck when
1418
            OUTSIDE returns. If 
1419
            z is the last state in a split
1420
            set 
1421
            w..z, all (and only) the split
1422
            set decks β 
1423
            
1424
              w 
1425
             ..β 
1426
            
1427
              z 
1428
             will be active when OUTSIDE returns.
1429
          
1430
          
1431
            Using CYK/inside and CYK/outside to divide and
1432
            conquer
1433
            Now, for any chosen state 
1434
            v, argmax 
1435
            
1436
            i , 
1437
            j  [α 
1438
            
1439
              v 
1440
             ( 
1441
            i, j ) + β 
1442
            
1443
              v 
1444
             ( 
1445
            ,i,j )] tells us which cell 
1446
            v, i, j the optimal parse tree
1447
            passes through, conditional on using state 
1448
            v in the parse. We know that any
1449
            parse tree must include all the bifurcation and start
1450
            states of the CM, so we know that the optimal alignment
1451
            
1452
            must use any chosen bifurcation
1453
            state 
1454
            v and its child start states 
1455
            w and 
1456
            y. Thus, we are guaranteed that
1457
            (when 
1458
            S 
1459
            
1460
              v 
1461
             = B and 
1462
            C 
1463
            
1464
              v 
1465
             = 
1466
            w, y ):
1467
            
1468
            is the optimal overall alignment score, and we also
1469
            know that
1470
            
1471
            gives us a triplet that identifies three cells that
1472
            must be in the optimal alignment - ( 
1473
            v, i, j ), ( 
1474
            w, i, k ), and ( 
1475
            y, k + 1, 
1476
            j ). This splits the remaining
1477
            problem into three smaller subproblems - an alignment
1478
            of the sequence 
1479
            x 
1480
            
1481
              i 
1482
             .. 
1483
            x 
1484
            
1485
              k 
1486
             to a CM subgraph 
1487
            w..y - 1, an alignment of the
1488
            sequence 
1489
            x 
1490
            
1491
            k +1 .. 
1492
            x 
1493
            
1494
              j 
1495
             to a CM subgraph 
1496
            y..M , and an alignment of the
1497
            two-piece sequence 
1498
            x 
1499
            1 .. 
1500
            z 
1501
            
1502
            i -1 // 
1503
            x 
1504
            
1505
            j +1 .. 
1506
            x 
1507
            
1508
              L 
1509
             to a CM subgraph 1.. 
1510
            v .
1511
            The subproblems are then themselves split, and this
1512
            splitting can continue recursively until all the
1513
            bifurcation triplets on the optimal parse tree have
1514
            been determined.
1515
            At this point the remaining alignment subproblems
1516
            might be small enough to be solved by straightforward
1517
            application of the standard CYK/inside algorithm (e.g.
1518
            INSIDE τ). However, this is not guaranteed to be the
1519
            case. A more general division strategy is needed that
1520
            does not depend on splitting at bifurcations.
1521
            For the more general strategy we take advantage of
1522
            the fact that we know that the optimal parse tree must
1523
            also include one and only one state from the split set
1524
            of each node (e.g. the non-insert states in the node).
1525
            Let 
1526
            w..y be the indices of a split
1527
            set of states in the middle of the current model
1528
            subgraph. ( 
1529
            w..y can be at most 4 states.) We
1530
            know that
1531
            
1532
            gives us a new cell ( 
1533
            v, i, j ) in the optimal parse
1534
            tree, and splits the problem into two smaller problems.
1535
            This strategy can be applied recursively all the way
1536
            down to single nodes, if necessary. We can therefore
1537
            guarantee that we will never need to carry out a full
1538
            CYK/inside alignment algorithm on any subproblem. The
1539
            most memory-intensive alignment problem that needs to
1540
            be solved is the very first split. The properties of
1541
            the first split determine the memory complexity of the
1542
            algorithm.
1543
            The bifurcation-dependent strategy is a special case
1544
            of this more general splitting strategy, where the B
1545
            state is the only member of its split set, and where we
1546
            also take advantage of the fact that α 
1547
            
1548
              v 
1549
             ( 
1550
            i , 
1551
            j ) = max 
1552
            
1553
              k 
1554
             α 
1555
            
1556
              w 
1557
             ( 
1558
            i, k ) + α 
1559
            
1560
              y 
1561
             ( 
1562
            k + 1, 
1563
            j ). By carrying out the max 
1564
            
1565
              k 
1566
             operation during the split, rather than before, we
1567
            can split the current problem into three optimal pieces
1568
            instead of just two.
1569
            If we look at the consequences of these splitting
1570
            strategies, we see we will have to deal with three
1571
            types of problems (Figure 5):
1572
            • A 
1573
            generic problem means finding the
1574
            optimal alignment of a CM subgraph to a contiguous
1575
            subsequence 
1576
            x 
1577
            
1578
              g 
1579
             .. 
1580
            x 
1581
            
1582
              q 
1583
             . The subgraph corresponds to a complete subtree
1584
            of the CM's guide tree - e.g. state 
1585
            r is a start (S), and state 
1586
            z is an end (E). may contain
1587
            bifurcations. The problem is solved in one of two ways.
1588
            If contains no bifurcations, it is solved as a wedge
1589
            problem (see below). Else, the problem is subdivided by
1590
            the bifurcation-dependent strategy: an optimal triple
1591
            (i, 
1592
            k, j ) is found for a bifurcation
1593
            state 
1594
            v and its children 
1595
            w, y, splitting the problem into
1596
            a V problem and two generic problems.
1597
            • A 
1598
            wedge problem means finding the
1599
            optimal alignment of an 
1600
            unbifurcated CM subgraph to a
1601
            contiguous subsequence 
1602
            x 
1603
            
1604
              g 
1605
             .. 
1606
            x 
1607
            
1608
              q 
1609
             . State 
1610
            z does not have to be a start
1611
            state (S); it may be a state in a split set (MP, ML,
1612
            MR, or D). State 
1613
            z is an end (E). A wedge problem
1614
            is solved by the split set-dependent strategy: an
1615
            optimal ( 
1616
            v, i, j ) is found, splitting the
1617
            problem into a V problem and a smaller wedge
1618
            problem.
1619
            • A 
1620
            V problem consists of finding the
1621
            optimal alignment of an 
1622
            unbifurcated CM subgraph to a
1623
            noncontiguous, two-piece sequence 
1624
            x 
1625
            
1626
              g 
1627
             .. 
1628
            x 
1629
            
1630
              h 
1631
             // 
1632
            x 
1633
            
1634
              p 
1635
             .. 
1636
            x 
1637
            
1638
              g 
1639
             , exclusive of the residues 
1640
            x 
1641
            
1642
              h 
1643
             and 
1644
            x 
1645
            
1646
              p 
1647
             (open circles in Figure 5). State 
1648
            r can be a start state or any
1649
            state in a split set; the same is true for 
1650
            z. A V problem is solved by a
1651
            split set-dependent strategy: an optimal ( 
1652
            v, i, j ) is found, splitting the
1653
            problem into two V problems.
1654
            The three recursive splitting algorithms to solve
1655
            these problems are as follows:
1656
          
1657
          
1658
            The generic_splitter routine
1659
            
1660
            Input: A generic problem, for CM
1661
            subgraph and subsequence 
1662
            x 
1663
            
1664
            g .. 
1665
            q  .
1666
            
1667
            Output: An optimal parse subtree 
1668
            T .
1669
            GENERIC_SPLITTER(r, z; g,q)
1670
            
1671
            if no bifurcation in :
1672
            
1673
            return WEDGE_SPLITTER(r,z; g,q)
1674
            
1675
              else 
1676
            
1677
            
1678
            v ← lowest numbered bifurcation
1679
            state in subgraph .
1680
            
1681
            w,y ← left and right S children
1682
            of 
1683
            v. 
1684
            β 
1685
            
1686
              v 
1687
             ← OUTSIDE(r,w; g,q)
1688
            α 
1689
            
1690
              w 
1691
             ← INSIDE(w,y-1; g,q)
1692
            α 
1693
            
1694
              y 
1695
             ← INSIDE(y,z; g,q)
1696
            
1697
            
1698
            T 
1699
            1  ← V_SPLITTER(r,v; g,i;
1700
            j,q)
1701
            
1702
            T 
1703
            2  ← GENERIC_SPLITTER(w,y-1;
1704
            i,k)
1705
            
1706
            T 
1707
            3  ← GENERIC_SPLITTER(y,z;
1708
            k+l,j)
1709
            Attach S state 
1710
            w of 
1711
            T 
1712
            2  as left child of B state 
1713
            v in 
1714
            T 
1715
            1  .
1716
            Attach S state 
1717
            y of 
1718
            T 
1719
            3  as right child of B state 
1720
            v in 
1721
            T 
1722
            2  .
1723
            
1724
            return T 
1725
            1  .
1726
          
1727
          
1728
            The wedge_splitter routine
1729
            
1730
            Input: A wedge problem, for
1731
            unbifurcated CM subgraph and subsequence 
1732
            x 
1733
            
1734
            g .. 
1735
            q  .
1736
            
1737
            Output: An optimal parse subtree 
1738
            T .
1739
            WEDGE_SPLITTER(r,z; g,q)
1740
            ( 
1741
            w .. 
1742
            y ) ← a split set chosen from
1743
            middle of
1744
            (α 
1745
            
1746
              w 
1747
             ..α 
1748
            
1749
              y 
1750
             ) ← INSIDE(w,z; g,q)
1751
            (β 
1752
            
1753
              w 
1754
             ..β 
1755
            
1756
              y 
1757
             ) ← OUTSIDE(r,y; g,q)
1758
            
1759
            
1760
            T 
1761
            1  ← V_SPLITTER(r,v; g,i;
1762
            j,q)
1763
            
1764
            T 
1765
            2  ← WEDGE_SPLITTER(v,z;
1766
            i,j)
1767
            Attach 
1768
            T 
1769
            2  to 
1770
            T 
1771
            1  by merging at state 
1772
            v. 
1773
            
1774
            return T 
1775
            1  .
1776

1777
          
1778
          
1779
            The V_splitter routine
1780
            
1781
            Input: A V problem, for
1782
            unbifurcated CM subgraph and two-part subsequence 
1783
            x 
1784
            
1785
            g .. 
1786
            h  // 
1787
            x 
1788
            
1789
            p .. 
1790
            q  .
1791
            
1792
            Output: An optimal parse subtree 
1793
            T .
1794
            V_SPLITTER(r,z; g,h; p,q)
1795
            ( 
1796
            w..y ) ← a split set chosen from
1797
            middle of
1798
            (α 
1799
            
1800
              w 
1801
             ..α 
1802
            
1803
              y 
1804
             ) ← VINSIDE(w,z; g,h; p,q)
1805
            (β 
1806
            
1807
              w 
1808
             ..β 
1809
            
1810
              y 
1811
             ) ← VOUTSlDE(r,y; g,h; p,q)
1812
            
1813
            
1814
            T 
1815
            1  ← V_SPLITTER(r,v; g,i;
1816
            j,q)
1817
            
1818
            T 
1819
            2  ← V_SPLITTER(v,z; i,h;
1820
            p,j)
1821
            Attach 
1822
            T 
1823
            2  to 
1824
            T 
1825
            1  by merging at state 
1826
            v. 
1827
            
1828
            return T 
1829
            1  .
1830
          
1831
          
1832
            The vinside and voutside routines
1833
            The VINSIDE and VOUTSIDE routines are just INSIDE
1834
            and OUTSIDE, modified to deal with a two-piece
1835
            subsequence 
1836
            x 
1837
            
1838
              g 
1839
             .. 
1840
            x 
1841
            
1842
              h 
1843
             // 
1844
            x 
1845
            
1846
              p 
1847
             .. 
1848
            x 
1849
            
1850
              g 
1851
             instead of a contiguous sequence 
1852
            x 
1853
            
1854
              g 
1855
             .. 
1856
            x 
1857
            
1858
              q 
1859
             . These modifications are fairly obvious. The
1860
            range of 
1861
            i, j is restricted so that 
1862
            i ≤ 
1863
            h and 
1864
            j ≥ 
1865
            p. Also, VINSIDE (w, z; g, h; p,
1866
            q) initializes α 
1867
            
1868
              z 
1869
             ( 
1870
            h , 
1871
            p ) = 0: that is, we know that
1872
            sequence 
1873
            x 
1874
            
1875
              h 
1876
             .. 
1877
            x 
1878
            
1879
              p 
1880
             has already been accounted for by a CM parse tree
1881
            rooted at 
1882
            z. 
1883
          
1884
          
1885
            Implementation
1886
            In the description of the algorithms above, some
1887
            technical detail has been omitted - in particular, a
1888
            detailed description of efficient initialization steps,
1889
            and details of how the the dynamic programming matrices
1890
            are laid out in memory. These details are not necessary
1891
            for a high level understanding of the divide and
1892
            conquer algorithm. However, they may be necessary for
1893
            reproducing a working implementation. Commented ANSI/C
1894
            source code for a reference implementation is therefore
1895
            freely available at
1896
            http://www.genetics.wustl.edu/eddy/infernal/under a GNU
1897
            General Public License. This code has been tested on
1898
            GNU/Linux platforms.
1899
            In this codebase, the CM data structure is defined
1900
            in structs.h. The CM construction procedure is in
1901
            modelmaker. c:Handmodelmaker(). The guide tree is
1902
            constructed in HandModelmaker(). A CM is constructed
1903
            from the guide tree by cm_from_master(). Individual
1904
            parse trees are constructed using the guide tree by
1905
            transmogrify(). The divide and conquer algorithm is
1906
            implemented in smallcyk. c:CYKDivideAndConquer(), which
1907
            will recursively call a set of functions: the three
1908
            splitting routines GENERIC_SPLITTER(),
1909
            wedge_splitter(), and v_splitter(); the four alignment
1910
            engines INSIDE(), OUTSIDE(), VINSIDE(), and VOUTSIDE();
1911
            and the two traceback routines INSIDET() and
1912
            VINSIDET().
1913
          
1914
        
1915
      
1916
      
1917
        Results and discussion
1918
        
1919
          Memory complexity analysis
1920
          The memory complexity of normal CYK/inside is 
1921
          O ( 
1922
          N 2 
1923
          M ), for a model of 
1924
          M states and a query sequence of 
1925
          N residues, since the full 3D
1926
          dynamic programming matrix is indexed 
1927
          N × 
1928
          N × 
1929
          M (and since 
1930
          N ∝ 
1931
          M , we can alternatively state the
1932
          upper bound as 
1933
          O ( 
1934
          N 3)). The memory complexity of the
1935
          divide and conquer algorithm is 
1936
          O ( 
1937
          N 2log 
1938
          M ). The analysis that leads to
1939
          this conclusion is as follows.
1940
          For a model with no bifurcations, the divide and
1941
          conquer algorithm will never require more than 10 decks
1942
          in memory at once. In the case of two adjacent MATP
1943
          nodes, we will need six decks to store the scores for the
1944
          current node we're calculating, and four decks for the
1945
          split set of the adjacent node that we're connecting to
1946
          (and dependent upon) (Figure 3).
1947
          Bifurcations will require some number of additional
1948
          decks for start states (BEGL_S and BEGR_S) to be kept. In
1949
          INSIDE, whenever we reach a deck for a start state, we
1950
          will keep that deck in memory until we reach the parent
1951
          bifurcation state. Half the time, that will mean waiting
1952
          until another complete subgraph of the model is
1953
          calculated (e.g. the subgraph rooted at the other start
1954
          child of that bifurcation); that is, to calculate deck α 
1955
          
1956
            v 
1957
           for a bifurcation 
1958
          v, we need both decks α 
1959
          
1960
            w 
1961
           and α 
1962
          
1963
            y 
1964
           for its child start states 
1965
          w and 
1966
          y, so we have to hold on to α 
1967
          
1968
            y 
1969
           until we reach α 
1970
          
1971
            w 
1972
           . In turn, the subgraph rooted at 
1973
          w might contain bifurcations, so
1974
          our calculation of α 
1975
          
1976
            w 
1977
           might require additional decks to be kept. Each
1978
          start deck we reach in the INSIDE iteration means holding
1979
          one extra deck in memory, and each bifurcation we reach
1980
          means deallocating the two start decks it depends on;
1981
          therefore we can iteratively calculate the maximum number
1982
          of extra decks we will require:
1983
          
1984
          x 
1985
          
1986
            M 
1987
           ← -1
1988
          
1989
          for v ← 
1990
          M -1 
1991
          to 1
1992
          
1993
          
1994
          return max 
1995
          
1996
            v 
1997
           
1998
          x 
1999
          
2000
            v 
2001
           
2002
          This number depends on the topology and order of
2003
          evaluation of the states in the CM. Think of the
2004
          bifurcating structure of the CM as a binary tree numbered
2005
          in preorder traversal (e.g. left children are visited
2006
          first, and have lower indices than right children). If
2007
          this is a complete balanced tree with 
2008
          B bifurcations, we will need log 
2009
          2 
2010
          B extra decks. If it is a maximally
2011
          unbalanced tree in which bifurcations only occur in left
2012
          children, we will need 
2013
          B extra decks (all the right
2014
          children). If it is a maximally unbalanced tree in which
2015
          bifurcations only occur in right children, we will only
2016
          ever need 1 extra deck. A left-unbalanced binary tree can
2017
          be converted to a right-unbalanced binary tree just by
2018
          swapping branches. For a CM, we can't swap branches
2019
          without affecting the order of the sequence that's
2020
          generated. We can, however, get the same effect by
2021
          renumbering the CM states in a modified preorder
2022
          traversal. Instead of always visiting the left subtree
2023
          first, we visit the best subtree first, where "best"
2024
          means the choice that will optimize memory usage. This
2025
          reordering is readily calculated in 
2026
          O ( 
2027
          M ) time (not shown; see cm.
2028
          c:CMRebalance() in the implementation). This way, we can
2029
          never do worse than the balanced case, and we will often
2030
          do better. We never need more than log 
2031
          2 
2032
          B extra decks. Since 
2033
          B < 
2034
          M, we can guarantee a 
2035
          O ( 
2036
          N 2log 
2037
          M ) bound on memory complexity.
2038
        
2039
        
2040
          Time complexity analysis
2041
          The time complexity of the standard algorithm is 
2042
          O ( 
2043
          MN 2+ 
2044
          BN 3), for a model of 
2045
          M states ( 
2046
          B of which are bifurcations)
2047
          aligned to a sequence of 
2048
          N residues. Since 
2049
          B < 
2050
          M, and 
2051
          M ∝ 
2052
          N, we can also state the upper
2053
          bound as 
2054
          O ( 
2055
          MN 3) or 
2056
          O ( 
2057
          N 4).
2058
          The time complexity of the divide and conquer
2059
          algorithm depends on how close each split is to dividing
2060
          a problem into equal sized subproblems. In the most ideal
2061
          case, each call to GENERIC_SPLITTER could split into
2062
          three subproblems that each contained 1/3 of the states
2063
          and residues: splitting those three subproblems would
2064
          only cost:
2065
          3 × × 
2066
          MN 3
2067
          in time, e.g. only about 1/27 the time it took to
2068
          split the first problem. Thus in an ideal case the time
2069
          requirement is almost completely dominated by the first
2070
          split, and the extra time required to do the complete
2071
          divide and conquer algorithm could be negligible. In
2072
          pathological cases, optimal splits might lead to a series
2073
          of very unequally sized problems. We never need to do
2074
          more splits than there are states in the model, so we
2075
          cannot do worse than 
2076
          O ( 
2077
          M 2 
2078
          N 3) in time.
2079
          An example of a pathological case is an RNA structure
2080
          composed of a series of multifurcation loops such that
2081
          each bifurcation leads to a small stem on one side, and
2082
          the rest of the structure on the other. In such a case,
2083
          every call to GENERIC_SPLITTER will split into a small
2084
          subproblem containing a small stem (e.g. only removing a
2085
          constant number of states and residues per split) and a
2086
          large subproblem containing all the remaining states and
2087
          sequence. This case can be avoided. It only arises
2088
          because of the decision to implement a simplified
2089
          CYK/Outside algorithm and always split at the highest
2090
          bifurcations. Better time performance could be guaranteed
2091
          if a complete CYK/Outside algorithm were implemented (at
2092
          the cost of complexity in the description and
2093
          implementation of the algorithm). This would allow us to
2094
          choose a split point in a generic problem at any state in
2095
          the CM (for instance, in the middle) regardless of its
2096
          bifurcating structure.
2097
          In practice, empirical results on a variety of real
2098
          RNAs (see below) indicate that the extra time required to
2099
          do the divide and conquer is a small constant factor. A
2100
          more complex implementation does not seem to be
2101
          necessary.
2102
        
2103
        
2104
          Empirical results
2105
          Six structural RNA families were chosen for empirical
2106
          evaluations of the algorithm, using available RNA
2107
          database resources - tRNA [ 49 ] , 5S ribosomal RNA [ 50
2108
          ] , signal recognition particle (SRP) RNA [ 51 ] , RNase
2109
          P [ 52 ] , small subunit (SSU) ribosomal RNA [ 53 ] , and
2110
          large subunit (LSU) ribosomal RNA [ 54 ] . For each
2111
          family, a secondary structure annotated multiple
2112
          alignment of four or five example sequences was
2113
          extracted, and used to construct a CM.
2114
          The top of Table 4shows some statistics about these
2115
          alignments and CMs. The number of consensus columns in
2116
          the alignments ranges from 72 (tRNA) to 2898 (LSU rRNA);
2117
          about 55-60% are involved in consensus base pairs. The
2118
          number of CM states is consistently about 3-fold more
2119
          than the consensus alignment length, ranging from 230
2120
          states (tRNA) to 9023 (LSU). About 1/150 of the states in
2121
          each model are bifurcations. After optimal reordering of
2122
          the model states, the number of extra decks required by
2123
          the alignment algorithm is small, ranging up to 3 for SSU
2124
          and 5 for LSU rRNA. Therefore the minimum constant of 10
2125
          decks required in iterations across unbifurcated model
2126
          segments dominates the memory requirement. The memory
2127
          required for extra decks does not have much impact even
2128
          for the largest structural RNAs. (Even without optimal
2129
          reordering, the number of extra decks required for SSU
2130
          and LSU are only 7 and 9, respectively. State reordering
2131
          was only needed to assure a 
2132
          O ( 
2133
          N 2log 
2134
          M ) memory complexity bound.)
2135
          To determine the memory and CPU time requirements for
2136
          a structural alignment, one example sequence from each
2137
          family was aligned to the CM. CPU time was measured and
2138
          memory requirements were calculated for three algorithms:
2139
          (1) the full CYK/inside algorithm, but in memory-saving
2140
          score-only mode (e.g. INSIDE(1,M; 1,L); (2) the full
2141
          CYK/inside algorithm, with shadow matrix and traceback to
2142
          recover the optimal alignment (e.g. INSIDE τ(1,M; 1,L));
2143
          and (3) the divide and conquer algorithm to recover an
2144
          optimal alignment (e.g. GENERIC_SPLITTER (1,M; 1,L)). The
2145
          most important comparison is between the full CYK/inside
2146
          algorithm and the divide and conquer algorithm. The
2147
          score-only CYK/inside algorithm was included, because a
2148
          complete CYK alignment couldn't be done on SSU and LSU
2149
          rRNA (because of the steep memory requirement). In all
2150
          cases where comparison could be done, the scores and
2151
          alignments produced by these algorithms were verified to
2152
          be identical.
2153
          The results of these tests are shown in the bottom
2154
          half of Table 1. The memory required by divide and
2155
          conquer alignment ranges up to 271 MB for LSU rRNA,
2156
          compared to a prohibitive 151 GB for the standard CYK
2157
          algorithm. The extra CPU time required by the divide and
2158
          conquer is small; usually about 20% more, with a maximum
2159
          of about two-fold more for SRP-RNA.
2160
          The same results are plotted in Figure 6. Memory
2161
          requirements scale as expected: 
2162
          N 3for standard CYK alignment, and
2163
          better than 
2164
          N 2log 
2165
          N for the divide and conquer
2166
          algorithm. Empirical CPU time requirements scale
2167
          similarly for the two algorithms ( 
2168
          N 3.24- 
2169
          N 3.29). The observed performance
2170
          is better than the theoretical worst case of 
2171
          O ( 
2172
          N 4). The proportion of extra time
2173
          required by divide and conquer is roughly constant over a
2174
          wide range of RNAs. The difference shown in Figure 6is
2175
          exaggerated because times are plotted for score-only CYK,
2176
          not complete CYK alignment, in order to include CPU times
2177
          for SSU and LSU rRNA. Because score-only CYK does not
2178
          keep a shadow traceback matrix nor perform the traceback,
2179
          it is about 20% faster than CYK alignment, as seen in the
2180
          data in Table 4.
2181
        
2182
      
2183
      
2184
        Conclusions
2185
        The divide and conquer algorithm described here makes it
2186
        possible to align even the largest structural RNAs to
2187
        secondary structure consensus models, without exceeding the
2188
        available memory on current computational hardware. Optimal
2189
        SSU and LSU rRNA structural alignments can be performed in
2190
        70 MB and 270 MB of memory, respectively. Previous
2191
        structural alignment algorithms had to sacrifice
2192
        mathematical optimality to achieve ribosomal RNA
2193
        alignments.
2194
        The CPU time requirement of the alignment algorithm is
2195
        still significant, and even prohibitive for certain
2196
        important applications. However, CPU time is generally an
2197
        easier issue to deal with. A variety of simple
2198
        parallelization strategies are possible. Banded dynamic
2199
        programming algorithms (e.g. calculating only relevant
2200
        parts of the matrix) of various forms can also be explored,
2201
        including not only heuristic schemes, but also optimal
2202
        algorithms based on branch and bound ideas. (Properly
2203
        implemented, banded DP algorithms would also save
2204
        additional memory.)
2205
        The algorithm takes advantage of the structure of
2206
        covariance models (profile SCFGs), splitting in the
2207
        dimension of the states of the model rather than in the
2208
        sequence dimensions. The approach does not readily apply,
2209
        therefore, to unprofiled SCFG applications in RNA secondary
2210
        structure prediction [ 34 36 55 56 ] , where the states are
2211
        fewer and more fully interconnected. For these
2212
        applications, it would seem to be necessary to divide and
2213
        conquer in the sequence dimensions to obtain any
2214
        significant improvement in memory requirements, and it is
2215
        not immediately apparent how one would do this.
2216
        The current implementation of the algorithm is not
2217
        biologically useful. It is meant only as a testbed for the
2218
        algorithm. It outputs a raw traceback structure and
2219
        alignment score, not a standardly formatted alignment file.
2220
        Most importantly, the probability parameters for models are
2221
        calculated in a very quick and simple minded fashion, and
2222
        are far from being reasonable for producing robustly
2223
        accurate structural alignments. The next step along this
2224
        line is to produce good prior distributions for estimating
2225
        better parameters, by estimating mixture Dirichlet priors
2226
        from known RNA structures [ 57 ] . At this stage it would
2227
        not be meaningful to compare the biological alignment
2228
        accuracy of this implementation to (for instance) the
2229
        excellent performance of the RAGA genetic algorithm [ 27 ]
2230
        . A biologically useful implementation with accurate
2231
        alignment performance is of course the eventual goal of
2232
        this line of work, but is not the point of the present
2233
        paper.
2234
      
2235
    
2236
  
2237

2238
Product

Resources

Company