CoCalc -- 1471-2105-2-8.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-2-8.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Introduction
6
        Some genes produce functional noncoding RNAs (ncRNAs)
7
        instead of coding for proteins [ 1 2 ] . For protein-coding
8
        genes, we have computational genefinding tools [ 3 ] that
9
        predict novel genes in genome sequence data with reasonable
10
        efficiency [ 4 ] . For ncRNA genes, there are as yet no
11
        general genefinding algorithms. The number and diversity of
12
        ncRNA genes remains poorly understood, despite the
13
        availability of many complete genome sequences. Gene
14
        discovery methods (whether experimental or computational)
15
        typically assume that the target is a protein coding gene
16
        that produces a messenger RNA.
17
        New noncoding RNA genes continue to be discovered by
18
        less systematic means, which makes it seem likely that a
19
        systematic RNA genefinding algorithm would be of use.
20
        Recent discoveries have included RNAs involved in dosage
21
        compensation and imprinting [ 5 ] , numerous small
22
        nucleolar RNAs involved in RNA modification and processing
23
        [ 6 7 8 ] , and small riboregulatory RNAs controlling
24
        translation and/or stability of target mRNAs [ 9 10 ] .
25
        Mutations in the gene for RNase MRP are associated with
26
        cartilage-hair hypoplasia (CHH), a recessive pleiotropic
27
        human genetic disorder [ 11 ] . The CHH locus eluded
28
        positional cloning for some time; the RNase MRP gene was
29
        only detected in the completely sequenced CHH critical
30
        region because the RNase MRP sequence was already in the
31
        databases.
32
        We have previously explored one RNA genefinding approach
33
        with very limited success [ 12 ] . Maizel and coworkers [
34
        13 14 15 ] had hypothesized that biologically functional
35
        RNA structures may have more stable predicted secondary
36
        structures than would be expected for a random sequence of
37
        the same base composition. Though we could confirm some
38
        anecdotal results where this was true, we were forced to
39
        the conclusion that in general, the predicted stability of
40
        structural RNAs is not sufficiently distinguishable from
41
        the predicted stability of random sequences to use as the
42
        basis for a reliable ncRNA genefinding algorithm.
43
        Nonetheless, conserved RNA secondary structure remained our
44
        best hope for an exploitable statistical signal in ncRNA
45
        genes. We decided to consider ways of incorporating
46
        additional statistical signal using comparative sequence
47
        analysis.
48
        We were motivated by the work of Badger & Olsen [ 16
49
        ] for bacterial coding-region identification. Badger &
50
        Olsen use the BLASTN program [ 17 ] to locate genomic
51
        regions with significant sequence similarity between two
52
        related bacterial species. Their program, CRITICA, then
53
        analyzes the pattern of mutation in these ungapped, aligned
54
        conserved regions for evidence of coding structure. For
55
        example, mutations to synonymous codons get positive
56
        scores, while aligned triplets that translate to dissimilar
57
        amino acids get negative scores. (CRITICA then subsequently
58
        extends any coding-assigned ungapped seed alignments into
59
        complete open reading frames.)
60
        Here we extend the central idea of the Badger &
61
        Olsen approach to identify structural RNA regions. Our
62
        extensions include: (1) using fully probabilistic models;
63
        (2) adding a third model of pairwise alignments constrained
64
        by structural RNA evolution; (3) allowing gapped
65
        alignments; and (4) allowing for the possibility that only
66
        part of the pairwise alignment may represent a coding
67
        region or structural RNA, because a primary sequence
68
        alignment may extend into flanking noncoding or
69
        nonstructural conserved sequence. These extensions add
70
        complexity to the approach. We use probabilistic modeling
71
        methods and formal languages to guide our construction. We
72
        use "pair hidden Markov models" (pair-HMMs) (introduced in
73
        [ 18 ] ) and a "pair stochastic context free grammar"
74
        (pair-SCFG) (a natural extension of the pair-HMM idea to
75
        RNA structure) to produce three evolutionary models for
76
        "coding", "structural RNA", or "something else" (a null
77
        hypothesis). Given three probabilistic models and a
78
        pairwise sequence alignment to be tested, we can calculate
79
        the Bayesian posterior probability that an alignment should
80
        be classified as "coding", "structural RNA", or "something
81
        else".
82
        Our approach is designed to detect conserved 
83
        structural RNAs. Some ncRNA genes do
84
        not have well-conserved intramolecular secondary
85
        structures, and some conserved RNA secondary structures
86
        function as cis-regulatory regions in mRNAs rather than as
87
        independent RNA genes. We will be using the term "ncRNA
88
        gene" to refer to our prediction targets, but it must be
89
        understood that this really means a conserved RNA secondary
90
        structure that may or may not turn out to be an independent
91
        functional ncRNA gene upon further analysis.
92
      
93
      
94
        Algorithm
95
        
96
          Overview of the approach: simple, ungapped global
97
          case
98
          The key idea is to produce three probabilistic models
99
          (RNA, COD, and OTH) describing different evolutionary
100
          constraints on the pattern of mutations observed in a
101
          pairwise sequence alignment. We will first introduce toy
102
          versions of these models, for clarity.
103
          All three models use the "pair-grammar" formalism
104
          described in [ 18 ] . A standard hidden Markov model
105
          (HMM) generates a single observable sequence by emitting
106
          single residues, whereas a pair-HMM generates two aligned
107
          sequences 
108
          X, Y by emitting a pair of aligned
109
          residues at a time (or single residues in either sequence
110
          to deal with insertion and deletion).
111
          The OTH model assumes mutations occur in a simple
112
          position-independent fashion. OTH has 4 × 4 core
113
          parameters, which are the pairwise alignment
114
          probabilities 
115
          P OTH( 
116
          a, b ) - that is, the joint
117
          probabilities of emitting an alignment of a nucleotide 
118
          a in sequence 
119
          X and a nucleotide 
120
          b in sequence 
121
          Y (Table 1). OTH represents our
122
          null hypothesis. The probability of the alignment given
123
          the OTH model is just the product of the probabilities of
124
          the individual aligned positions.
125
          The COD model assumes that the aligned sequences
126
          encode homologous proteins. In a coding region, we
127
          intuitively expect to see mutations that make
128
          conservative amino acid substitutions; in particular, we
129
          expect an abundance of synonymous mutations. To capture
130
          this, COD has 64 × 64 core parameters, which are 
131
          P COD( 
132
          a 
133
          1 
134
          a 
135
          2 
136
          a 
137
          3 , 
138
          b 
139
          1 
140
          b 
141
          2 
142
          b 
143
          3 ), the probabilities of the
144
          correlated emission of two codons - that is, three
145
          nucleotides 
146
          a 
147
          1 
148
          a 
149
          2 
150
          a 
151
          3 in sequence 
152
          X , aligned to three nucleotides 
153
          b 
154
          1 
155
          b 
156
          2 
157
          b 
158
          3 in sequence 
159
          Y . (See Table 1for an example of
160
          pair codon probabilities.) The probability of the
161
          alignment given the COD model for a particular reading
162
          frame is the product of the probabilities of the
163
          individual aligned codons in that frame. Since we don't
164
          know the correct frame 
165
          a priori , the overall probability
166
          of an alignment is a sum over all six frames 
167
          f ,
168
          
169
          and we assume that all frames are a priori
170
          equiprobable in the alignment ( 
171
          P ( 
172
          f |COD) = ).
173
          The RNA model assumes that the pattern of mutation
174
          significantly conserves a homologous RNA secondary
175
          structure. Intuitively, we expect a significant abundance
176
          of pairwise-correlated mutations that preserve
177
          Watson-Crick complementarity in an (as yet unknown)
178
          structure. To capture this, the core parameters in RNA
179
          are the 16 × 16 probabilities 
180
          P RNA( 
181
          a 
182
          
183
            L 
184
           
185
          a 
186
          
187
            R 
188
           , 
189
          b 
190
          
191
            L 
192
           
193
          b 
194
          
195
            R 
196
           ) - that is, the probabilities associated with the
197
          correlated emission of one base-pair ( 
198
          a 
199
          
200
            L 
201
           
202
          a 
203
          
204
            R 
205
           ) in sequence 
206
          X aligned to a homologous base-pair
207
          ( 
208
          b 
209
          
210
            L 
211
           
212
          b 
213
          
214
            R 
215
           ) in sequence 
216
          Y (Table 1). Single stranded
217
          positions in the alignment are modeled by 
218
          P RNA( 
219
          a, b ), the same functional form as
220
          in the OTH model. For a given alignment of known
221
          structure 
222
          s , the probability 
223
          P ( | 
224
          s , RNA) is a product of terms 
225
          P RNA( 
226
          x 
227
          
228
            i 
229
           
230
          x 
231
          
232
            j 
233
           , 
234
          y 
235
          
236
            i 
237
           
238
          y 
239
          
240
            j 
241
           ) for all base paired positions 
242
          i, j and 
243
          P RNA( 
244
          x 
245
          
246
            k 
247
           , 
248
          y 
249
          
250
            k 
251
           ) for all single stranded positions 
252
          k in the alignment. Since we don't
253
          know the correct structure 
254
          a priori , the overall probability
255
          of an alignment given by the RNA model is a sum over all
256
          structures 
257
          s :
258
          
259
          But here, we cannot assume equiprobability for the
260
          various structures 
261
          s as we did for coding frames 
262
          f above; in fact, calculating 
263
          P ( 
264
          s |RNA) implies a full
265
          probabilistic model describing favorable and unfavorable
266
          RNA secondary structures. The necessary machinery for
267
          calculating this weighted sum is exactly what we
268
          developed previously for searching for significant
269
          structure in single sequences [ 12 ] . In that paper we
270
          parameterized a stochastic context-free grammar (SCFG)
271
          that incorporates a model of hairpin loops, stems,
272
          bulges, and internal loops, including stacking and
273
          loop-length distributions, making a probabilistic
274
          counterpart for the widely used MFOLD program for RNA
275
          structure prediction. The SCFG we use here is almost the
276
          same, with the difference that now we generate two
277
          aligned sequences simultaneously: i.e., a pair-SCFG. The
278
          summation over all possible structures can be done
279
          efficiently using an SCFG Inside algorithm (a dynamic
280
          programming algorithm).
281
          In Figure 1we present an example of three different
282
          alignments with different mutation patterns, and how they
283
          would be scored with the three different models.
284
          Finally, in order to classify the input alignment as
285
          RNA, COD, or OTH, we use the three likelihoods to
286
          calculate a Bayesian posterior probability, under the
287
          simple assumption that the three models are 
288
          a priori equiprobable. Alignments
289
          with high RNA posterior probabilities are interpreted as
290
          candidate ncRNA genes.
291
          For scoring purposes, it will also be useful to
292
          calculate log-odds scores in the standard manner [ 19 ]
293
          relative to a fourth model, which we will call IID. In
294
          IID, we assume the two sequences are nonhomologous
295
          independent, identically distributed sequences. The IID
296
          model has 8 parameters corresponding to the expected base
297
          compositions ofthe two sequences, 
298
          P 
299
          X ( 
300
          a ) and 
301
          P 
302
          Y ( 
303
          b ).
304
        
305
        
306
          Parameter estimation in the simple case
307
          Parameter estimation is crucial for our approach. The
308
          three models have to be calibrated to an overall similar
309
          evolutionary divergence time, and to similar base
310
          compositions. Else, one model might be artifactually
311
          favored over the others because of the degree of
312
          conservation or the base composition in an input
313
          alignment, not because of the pattern of mutation.
314
          In an ideal world, we could empirically estimate the
315
          parameters of each model using training sets of pairwise
316
          alignments culled from real RNAs, coding regions, and
317
          other conserved noncoding regions, using pairwise
318
          alignments that were all about the same percent identity.
319
          Unfortunately it is unlikely that we can amass suitably
320
          large training sets. Instead, we take a somewhat 
321
          ad hoc approach that ties the
322
          parameters of all three models as much as possible to a
323
          particular choice of a standard amino acid substitution
324
          matrix, such as BLOSUM62. We will derive joint codon
325
          probabilities from the chosen scoring matrix, then use
326
          these codon probabilities to calculate the average single
327
          nucleotide substitution probabilities in OTH, then
328
          combine these OTH substitution parameters with base-pair
329
          frequencies to obtain the parameters of the RNA model.
330
          This procedure is as follows.
331
          First the 64 × 64 codon emission probabilities 
332
          P COD( 
333
          a 
334
          1 
335
          a 
336
          2 
337
          a 
338
          3, 
339
          b 
340
          1 
341
          b 
342
          2 
343
          b 
344
          3 | 
345
          t ), for some divergence "time" 
346
          t , are derived from the chosen
347
          substitution matrix (i.e. the choice of matrix defines 
348
          t ). We make an independence
349
          assumption that the conditional probability of each codon
350
          depends only on its own encoded amino acid - i.e., it
351
          does not depend on the the other codon - so we can use
352
          the approximation
353
          
354
          for 
355
          a 
356
          1 , 
357
          a 
358
          2 , 
359
          a 
360
          3 , 
361
          b 
362
          1 , 
363
          b 
364
          2 , 
365
          b 
366
          3 ∈ { 
367
          A, C, G, U } and 
368
          A, B ∈ {amino acids}. (An example
369
          of where this independence assumption is violated: for
370
          equiprobable codon bias, our parameters will say that
371
          aligning TCA to AGT is as likely as aligning TCA to TCG
372
          because all three are Ser codons, despite the fact that
373
          the first case requires three transversions.) 
374
          P ( 
375
          A , 
376
          B | 
377
          t ) are the joint target
378
          probabilities of aligned amino acids obtained from the
379
          amino acid score matrix, such as BLOSUM62 [ 20 ] , as
380
          described by [ 19 ] . The terms 
381
          P ( 
382
          a 
383
          1 
384
          a 
385
          2 
386
          a 
387
          3 | 
388
          A ) are the probabilities of
389
          observing a particular codon given a particular amino
390
          acid; these terms can include a codon-bias model [ 21 ]
391
          and, if desired, a substitution error model to deal with
392
          error-prone sequence data. The sum over all possible
393
          amino acids in equation (3) is relevant only when a
394
          substitution error model applies, since otherwise each
395
          observed codon can only mean one possible amino acid.
396
          The 16 mutation probabilities for the OTH model are
397
          then obtained by marginalizing the corresponding
398
          codon-codon emission probabilities in equation (3), in
399
          the following way:
400
          
401
          The 16 × 16 core parameters of the RNA model are
402
          calculated by combining the OTH model (which sets the
403
          average divergence of the two sequences) with some
404
          additional parameters that specify the probability of
405
          base pairs. This involves making an independence
406
          assumption:
407
          
408
          Alternatively, we can symmetrically derive a equation
409
          in which the divergence is controlled by the mutation
410
          probability of the right position instead of the left
411
          position. We calculate the overall joint probability of
412
          the aligned base pairs as the average of these two
413
          equations:
414
          
415
          Here 
416
          P pair( 
417
          a 
418
          
419
            L 
420
           
421
          a 
422
          
423
            R 
424
           | 
425
          t ), 
426
          P pair( 
427
          b 
428
          
429
            L 
430
           
431
          b 
432
          
433
            R 
434
           | 
435
          t ) are just the probabilities of
436
          the various sorts of base pairs (GC, AU, GU) in a single
437
          RNA structure.
438
        
439
        
440
          Extension of the models to gapped local
441
          alignments
442
          
443
            The OTH Model
444
            The complete OTH model, a pair hidden Markov model,
445
            is diagrammed in Figure 5. The flanking double-circled
446
            states 
447
            F 
448
            
449
              L 
450
             , 
451
            F 
452
            
453
              R 
454
             , and 
455
            F 
456
            
457
              J 
458
             are a shorthand for a full IID model of the type
459
            in Figure 4, which allow the alignment to be flanked or
460
            interrupted by runs of unassigned (independent)
461
            residues. (In general we will use the convention that
462
            single-circled states are "single states", and
463
            double-circled states represent some "composite state"
464
            of some kind previously defined. This differs from a
465
            convention in formal languages in which double-circled
466
            states are terminal states of a finite-state automaton
467
            [ 22 ] .)
468
            The OTH model requires us to specify emission
469
            probabilities for the state XY (that emits two aligned
470
            nucleotides), and also for the X and Y states (that
471
            emit one nucleotide "aligned" to a gap character in the
472
            other sequence). The emission probabilities for state
473
            XY, 
474
            P 
475
            XY ( 
476
            a , 
477
            b | 
478
            t ), are simply the mutation
479
            probabilities 
480
            P OTH( 
481
            a , 
482
            b | 
483
            t ) of the toy ungapped OTH
484
            model, as described above. The emission probabilities
485
            for states X and Y are obtained by marginalization of
486
            the 
487
            P 
488
            XY 's:
489
            
490
          
491
          
492
            The COD Model
493
            The complete COD model, a pair hidden Markov model,
494
            is diagrammed in Figure 6. A new degree of "locality"
495
            is introduced. In addition to regions of the alignment
496
            that are better left "unaligned" (i.e. generated by the
497
            flanking states of an IID model), we want to model
498
            regions of the alignment that are not coding but still
499
            well-conserved. To model this, we add three full copies
500
            of OTH models to the core of the COD model, indicated
501
            by the symbols 
502
            O 
503
            
504
              B 
505
             , 
506
            O 
507
            
508
              E 
509
             , and 
510
            O 
511
            
512
              J 
513
             . We represent a full OTH model with:
514
            
515
            with the understanding that any arrow that goes into
516
            " 
517
            O " indicates a transition into
518
            the " 
519
            S 
520
            
521
              FL 
522
             " state of the 
523
            F 
524
            
525
              L 
526
             flanking model, and any arrow leaving " 
527
            O " emerges from the " 
528
            T 
529
            
530
              FR 
531
             " state of the 
532
            F 
533
            
534
              R 
535
             flanking model. In this way the COD model can
536
            score a coding-aligned region that is nested between
537
            other independently-aligned regions.
538
            The different COD states described in Figure 6emit
539
            correlated codon pairs, possibly with indels. To deal
540
            with BLASTN misalignments of codons and possible
541
            applications to error-prone sequence data (expressed
542
            sequence tags or low-pass genome shotgun), we model -1
543
            or +1 frameshifts (by having a probability of emitting
544
            abnormal codons of 2 or 4 nucleotides), in addition to
545
            the more expected indels of multiples of three
546
            nucleotides. (No explicit transition for 
547
            C 
548
            
549
              E 
550
             → 
551
            C 
552
            
553
              B 
554
             is necessary; the intermediate sub-model " 
555
            O 
556
            
557
              J 
558
             " has a non-emitting path that deals with
559
            consecutive codons.)
560
            Codon emission probabilities for the different
561
            coding states are derived from the joint codon
562
            probabilities 
563
            P CODgiven in equation (3) for
564
            the toy case. For incomplete codons we do the
565
            convenient marginalizations. For example,
566
            
567
            Notice that there are three different 
568
            C (3, 2) states, of which we have
569
            only described one in equation (9). Similarly there are
570
            four different 
571
            C (3, 4) states, and six
572
            different 
573
            C (2, 4) states, depending on the
574
            position of the gaps. We will represent these
575
            codon-emission probabilities in general by 
576
            P α, β( 
577
            a 
578
            1 ... 
579
            a 
580
            α , 
581
            b 
582
            1 ... 
583
            b 
584
            β ) with α, β = {0, 2, 3, 4} and a,
585
            b ∈ {A, C, G, U}.
586
          
587
          
588
            The RNA Model
589
            The complete RNA model, a pair stochastic context
590
            free grammar (pair-SCFG), is crudely diagrammed in
591
            Figure 7. The crucial SCFG machinery of the model is
592
            encapsulated in the RNA state of the diagram. This
593
            pair-SCFG, similar to the SCFG described in [ 12 ] ,
594
            has three states labeled 
595
            W , 
596
            W 
597
            
598
              B 
599
             , 
600
            V . They correspond to the 
601
            V and 
602
            W dynamic programming matrices in
603
            [ 23 ] , and to the matrices 
604
            wx , 
605
            wbx and 
606
            vx of the diagrammatic
607
            representation in [ 24 25 ] . We use the diagrams as a
608
            convenient visual representation to enumerate the
609
            configurations which we take into account in the model.
610
            State 
611
            V represents a substring
612
            (sequence fragment) in which the ends are definitely
613
            base-paired. States 
614
            W and 
615
            W 
616
            
617
              B 
618
             represent a substring in which the ends are either
619
            paired or unpaired.
620
            To extend these more or less standard RNA folding
621
            algorithm conventions from a single sequence to an
622
            aligned pair of sequences, let us introduce some
623
            vectorial notation. In this notation stands for the
624
            corresponding positions 
625
            i in sequence 
626
            X and 
627
            i' in sequence 
628
            Y . Similarly stands for the pair
629
            of nucleotides in positions 
630
            i and 
631
            i' of sequences 
632
            X and 
633
            Y respectively. With this
634
            notation, we also define and . We are going to assume
635
            that for two aligned columns and , 
636
            x 
637
            
638
              i 
639
             is base-paired to 
640
            x 
641
            
642
              j 
643
             if and only if 
644
            x 
645
            
646
            i ' is base-paired to 
647
            y 
648
            j' which is a reasonable assumption
649
            if we are trying to find commonly occurring secondary
650
            structures within an alignment of two sequences.
651
            
652
            W acts as the starting state. 
653
            W and 
654
            W 
655
            
656
              B 
657
             are essentially equivalent, but 
658
            W 
659
            
660
              B 
661
             is used exclusively for starting multiloops. The
662
            production rules for 
663
            W are (for 
664
            W 
665
            
666
              B 
667
             , replace 
668
            W by 
669
            W 
670
            
671
              B 
672
             everywhere in the recursion),
673
            
674
            The vector provides us with a compact notation to
675
            represent the three possible situations in which one
676
            nucleotide is emitted in at least one of the two
677
            sequences in the alignment. The components 
678
            e 
679
            x and 
680
            e 
681
            y take values 1 or 0 with at
682
            least one of the two being different from zero. If 
683
            e 
684
            x = 0 or 
685
            e 
686
            y = 0 we place a gap in the
687
            corresponding position in the alignment.
688
            The symbol 
689
            V represents the paired state,
690
            that is, the state we are in after emitting one pair in
691
            each sequence. The recursion for state 
692
            V is,
693
            
694
            Here the first transition corresponds to hairpin
695
            loops, and is equivalent to function 
696
            FH ( 
697
            i, j ) in [ 23 ] ; the second
698
            transition corresponds to stems, bulges, and internal
699
            loops, and is equivalent to function 
700
            FL ( 
701
            i, j, k, l ) in [ 23 ] ; the last
702
            transition corresponds to multiloops, that is, loops
703
            closed by more than two hydrogen bonds. The length of
704
            the alignments generated for those hairpin loops and
705
            bulges and internal loops is variable and depends on
706
            the number of gaps introduced. The only condition is
707
            that all nucleotides in that segment have to be used -
708
            for instance, and for the hairpin loops.
709
            The full description of the algorithms associated to
710
            this grammar is given in the Additional file. The
711
            algorithms requires two kind of emission
712
            probabilities,
713
            • , the concurrent emission of two paired nucleotide
714
            in both sequences, already introduced in equation
715
            (6).
716
            • , the concurrent emission of one unpaired
717
            nucleotide in both sequences, which are taken as the
718
            mutation probabilities in equation (4).
719
            Both types of emission probabilities have been
720
            extended to also emit gaps. For any position , we also
721
            introduce a penalty for "mutating" to a gap, and
722
            another one for "pairing" to a gap. This is a linear
723
            gap cost, and is more convenient than implementing the
724
            additional extra states that an affine gap cost would
725
            require.
726
            The vectorial notation becomes particularly
727
            important if we realigned the input sequences to the
728
            RNA model. In this paper, though, we will only be
729
            working with a special case where we hold an input
730
            (BLASTN) pairwise alignment fixed and simply score it
731
            with the RNA model; in this case, for any given vector
732
            .
733
          
734
        
735
        
736
          Transition Probabilities
737
          In all three models, one of the prices we pay for
738
          introducing realistic flexibility is that we have
739
          introduced a number of transition probability parameters,
740
          in addition to the emission probabilities presented in
741
          the ungapped case (Section 2.1). Now we have to determine
742
          the transition probabilities of the different models.
743
          Again, we want the models tuned to the same level of
744
          "gappiness", else alignments may be artifactually
745
          classified based on how gappy they are. Whereas we were
746
          able to construct divergence-matched emission
747
          probabilities for the three models in a somewhat
748
          justified fashion, we have no guiding theory for
749
          constructing divergence-matched transition
750
          probabilities.
751
          Instead, we have estimated all new transition
752
          probabilities by hand. The number of additional
753
          parameters in the most complete models is 8 for the OTH
754
          model, and 20 for the COD model and RNA models. These
755
          parameters have been optimized by studying the
756
          algorithm's discrimination ability on model generated
757
          data and random sequence alignments. More details on the
758
          type of simulated data used to set the transition
759
          probabilities of the models is given in Section 3.1. This
760
          approach to estimating the transition parameters of the
761
          models is very arbitrary, but we do not currently see a
762
          plausible alternative.
763
          The RNA model also has additional SCFG-related
764
          probability parameters to take into account length
765
          distributions of hairpin loops, bulges, and internal
766
          loops. Those parameters have been determined from a
767
          training set of aligned tRNAs and ribosomal RNAs as
768
          described previously [ 12 26 27 ] .
769
        
770
        
771
          Alignment and scoring algorithms
772
          We are given a pairwise sequence alignment , composed
773
          of 
774
          L aligned columns. We will hold the
775
          input alignment fixed. Thus, globally aligning and
776
          scoring the alignment with each of the three models could
777
          be done by straightforward extensions of standard HMM
778
          Viterbi and/or Forward algorithms and SCFG CYK and/or
779
          Inside algorithms. The OTH and COD pair-HMM alignment
780
          algorithm would cost O( 
781
          L ) in storage and time; the RNA
782
          pair-SCFG algorithm would cost O( 
783
          L 2) in storage and O( 
784
          L 3) in time [ 12 24 ] .
785
          We are interested, however, in a combination of the
786
          standard algorithms. Consider the RNA model. Recall that
787
          we need to obtain 
788
          P ( |RNA) by a summation over all
789
          possible structures, which will require the Inside
790
          algorithm rather than the CYK algorithm (which, like
791
          Viterbi for HMMs, recovers the maximum likelihood parse,
792
          i.e. structure). But we will also be interested in
793
          obtaining a maximum likelihood location of a predicted
794
          RNA within the input alignment - that is, we would like
795
          to identify the maximum likelihood position of the
796
          starting and ending nucleotides that are aligned to
797
          states in the core of the RNA model, as opposed to
798
          flanking states that are accounting for flanking
799
          nonconserved (IID) and conserved but non-RNA (OTH)
800
          nucleotides in the input alignment. This would require
801
          the CYK (maximum likelihood parse) algorithm for the RNA
802
          model, outside of the core RNA pair-SCFG state.
803
          To combine the desired features of the two algorithms,
804
          we use a trick introduced by Stormo and Haussler [ 28 ] ,
805
          perhaps most widely known for its application in the
806
          "semi-Markov model" of the (protein) genefinding program
807
          GENSCAN [ 29 ] . The basic idea is that we start with a
808
          model organized into "meta-states" (such as the 
809
          O 
810
          
811
            B 
812
           , 
813
          O 
814
          
815
            E 
816
           , 
817
          O 
818
          
819
            J 
820
           , and RNA states of the RNA model in Figure 7). Each
821
          meta-state contains its own (possibly complex and
822
          arbitrary) model of a feature (such as the pair-SCFG
823
          represented by the RNA state). The meta-states are
824
          connected to each other by transition probabilities as in
825
          an HMM. To parse and score a sequence, "feature scores"
826
          are first precomputed for the score of all possible
827
          subsequences 
828
          i..j being generated by each
829
          meta-state; then a dynamic programming algorithm is used
830
          to assemble a maximum likelihood parse of the sequence
831
          into a series of component features. Thus, we can (for
832
          instance) use a pair-SCFG Inside algorithm to precompute
833
          scores 
834
          W 
835
          
836
            ij 
837
           for the core RNA metastate of the RNA model
838
          generating the part of the alignment from 
839
          i..j summed over all possible
840
          structures, then use the Stormo/Haussler parsing
841
          algorithm to determine the optimal 
842
          i..j segment that should be
843
          assigned as structural RNA, versus assigning flanking
844
          sequence to the 
845
          O 
846
          
847
            B 
848
           , 
849
          O 
850
          
851
            E 
852
           or 
853
          O 
854
          
855
            J 
856
           meta-states describing non-RNA conserved residues
857
          and nonconserved residues.
858
          Stormo/Haussler parsing algorithms add one order of
859
          complexity both in storage and time to the underlying
860
          dynamic programming problem to which they are applied.
861
          The Forward algorithm for scoring a pair-HMM against a 
862
          fixed pairwise alignment is 
863
          O ( 
864
          L ), but since HMM dynamic
865
          programming algorithms work by iteratively calculating
866
          scores of prefixes 1.. 
867
          j of increasing length, whereas we
868
          need scores of subsections 
869
          i..j , we have to run the algorithm
870
          
871
          L times, once from each possible
872
          start point 
873
          i , making the feature scoring
874
          phase 
875
          O ( 
876
          L 2) in storage and time for both
877
          the OTH and the COD pair-HMM. (The COD pair-HMM would be 
878
          O ( 
879
          L 3) in memory, but the actual
880
          implementation uses a simplified 
881
          O ( 
882
          L ) version for the OTH meta-states
883
          included in the COD model that keeps the whole COD
884
          parsing algorithm 
885
          O ( 
886
          L 2).) The Inside algorithm for
887
          scoring a pair-SCFG against a fixed pairwise alignment is
888
          
889
          O ( 
890
          L 2) space and 
891
          O ( 
892
          L 3) time, and conveniently yields
893
          the matrix of scores we need for all subsections 
894
          i..j . Therefore the computational
895
          complexity of our complete algorithm is dominated by the
896
          Inside algorithm for scoring the core RNA state of the
897
          RNA model. (See Additional filefor more details.)
898
          In principle, we could forget about the input pairwise
899
          alignment, and allow our three models to optimally
900
          realign the input sequences. This would be desirable; it
901
          is dangerous, for example, to rely on the external
902
          sequence alignment program (e.g. BLASTN) to produce a
903
          correct secondary structural alignment of two homologous
904
          RNAs, whereas the RNA pair-SCFG, which models
905
          base-pairing correlation, would potentially produce
906
          better structural alignments. However, such an algorithm
907
          would be expensive: for two input sequences of length 
908
          m and 
909
          n respectively, scoring the RNA
910
          pair-SCFG would cost 
911
          O ( 
912
          m 2 
913
          n 2) in storage and 
914
          O ( 
915
          m 3 
916
          n 3) in time. (See the Additional
917
          filefor a detailed description of all the different
918
          algorithms, and their complexity.) Since this realignment
919
          approach is prohibitive, we rely on an assumption that
920
          the external pairwise alignment algorithm will produce
921
          alignments that are close enough to being correct for
922
          coding regions and structural RNAs, even though the
923
          external alignment program has no notion of these
924
          constraints.
925
        
926
        
927
          Bayesian score evaluation
928
          Once we have calculated the probabilities that a
929
          pairwise alignment has been generated by any one of the
930
          three models, we can classify the alignment into one of
931
          three using a posterior probability calculation:
932
          
933
          where
934
          
935
          We assume a uniform distribution for the prior
936
          probabilities 
937
          P (Model 
938
          
939
            i 
940
           ).
941
          In some figures, we use a phase diagram representation
942
          of the same information in the three posterior
943
          probabilities. We plot log-odds scores of the COD and RNA
944
          models with respect to the OTH model in an ( 
945
          x, y ) plane:
946
          
947
          We can then separate the plane into three different
948
          regions "phases" dominated by any of the three models
949
          (for example, see Figure 8). Those three phases
950
          correspond to the conditions,
951
          
952
          Points deep in one of the phases represent a higher
953
          posterior probability for a particular model, whereas
954
          points falling next to phase-transition boundaries
955
          represent situations in which the method can not clearly
956
          decide for one model or the other.
957
        
958
        
959
          Implementation
960
          This approach was implemented in ANSI C in a program
961
          called QRNA. The source code and the full set of
962
          probability parameters used in QRNA are freely available
963
          from http://www.genetics.wustl.edu/eddy/software/under
964
          the terms of the GNU General Public License. QRNA has
965
          been tested on Intel/Linux and Silicon Graphics IRIX
966
          platforms.
967
          The input alignment is given in a modified (aligned)
968
          FASTA file format. For instance the following file
969
          contains the two homologous nematode sequences shown in
970
          the BLASTN alignment in Figure 3:
971
          >F08G2
972
          
973
          CATTTCATAGTGTCACACGCGCACCCATGAGTTGTCGGCACAC-CACTCCCCACTACCCC
974
          TACCCTCTCCTCCATTCAGTATCGCTTCTTCGGCTTATTAGCTAAGATCAAAGTGTAGTA
975
          TCTGTTCTTATCGTATTAACCTACGGTATACACTCGAATGAGTGTAATAAAGGTTATATG
976
          ATTTTTGGAACCTAGGGAAGACTCGGGGCTTGCTCCGACTTCCCAAGGGTCGTCCTGGCG
977
          TTGCACTGCTGCCGGGCTCGGCCCAGTCCCCGAGGGGACAA
978
          >G42J05
979
          
980
          CATTCCATAGTGGCCGACGCGAGCCCGGTTTTTGTCGGTACATGCGCGCACC-CTACCCC
981
          CCGCGCCTCGTTCTCACCGCATCGCTTCTTCGGCTTATTAGCTAAGATCAAAGTGTAGTA
982
          TCTGTTCTTATCGTATTAACCTACGGTATGCACTCGAATGAGTGTAATAAAGGTTATATG
983
          ATTTTTGGAACCTAGGAAAGACTCGGGGCTTGCTCCGACTTTCCAAGGGTCGTCCCGGCG
984
          TTGCACTGCTGCCGGGCTCGGCCCAGTCCCTGTGGGGACAA
985
          Note the gap characters preserving the pairwise
986
          alignment. (In many cases, there would be more gap
987
          characters than in this particular example.) Multiple
988
          pairs of sequences can be added to a single fasta file,
989
          and will be scored sequentially, one pair at a time.
990
          Typing the following command line:
991
          qrna fastafile
992
          we obtain the output in the following form:
993
          >F08G2 (281)
994
          >G42J05 (281)
995
          ... [some irrelevant output not shown]...
996
          winner = RNA
997
          OTH = 152.817 COD = 129.240 RNA = 182.522
998
          logoddspostOTH = 0.000 logoddspostCOD = -23.577
999
          logoddspostRNA = 29.705
1000
          The line winner = RNA indicates that the 281 nt
1001
          alignment has been classified as a structural RNA. The
1002
          next three numbers correspond to the P( |Model) in
1003
          log-odds scores. The two non-null numbers in the second
1004
          row ("logoddspostCOD"and "logoddspostRNA") correspond to
1005
          the 2-D phase diagram scores described previously. For
1006
          this alignment, the RNA model is favored over COD and OTH
1007
          by 29.7 bits.
1008
          A scanning version of the algorithms is also
1009
          implemented. In this scanning mode a partial segment of
1010
          the alignment - a window of user-determined fixed length
1011
          - is scored. The window slides across the alignment and
1012
          each window is scored independently from the others. This
1013
          option is useful when the input alignment is long, or one
1014
          expects different types of functionalities within a given
1015
          alignment. This is the mode of the program that we use
1016
          for whole genome analysis.
1017
          Scoring a window of 200 nts takes about 14 CPU-seconds
1018
          and 8 MB of memory on 225 Mhz MIPS R10K processor of a
1019
          Silicon Graphics Origin2000. Scoring an alignment of 2
1020
          Kbases in windows of 200 nts and moving 50 nts at a time
1021
          takes about 9 minutes. Scoring the alignments generated
1022
          between the intergenic regions of 
1023
          E. coli and 
1024
          S. typhi (12, 000 alignments with
1025
          average length of about 100 nt) took about 9
1026
          CPU-hours.
1027
          Additional file
1028
          • Description of data: In this additional file we
1029
          provide a detailed description of the algorithms involved
1030
          in implementing the three probabilistic models components
1031
          of our comparative method QRNA. We give the most general
1032
          description of the scoring/parsing algorithms. We also
1033
          indicate how to obtain some simplifications that are part
1034
          of the software implementation.
1035
          Click here for file
1036
        
1037
      
1038
      
1039
        Results
1040
        
1041
          Tests on simulated data
1042
          Because all three models are fully probabilistic, we
1043
          can use them in a generative mode to sample synthetic
1044
          pairwise alignments. These simulations allow us to assess
1045
          the sensitivity and specificity of the approach on
1046
          idealized data, to get a sense of the best that the
1047
          algorithm can do. We generated 1, 000 pairwise alignments
1048
          of 200 nucleotides in length from each of the three
1049
          models. Each of these 3, 000 alignments was then scored
1050
          and classified by the program. Results are shown in
1051
          Figure 8, showing that simulated alignments are almost
1052
          always classified correctly.
1053
          We wanted to test that the classification is based on
1054
          the pattern of mutation in the alignments, not on a
1055
          spurious artifact of differing base composition, sequence
1056
          identity, or gap frequency. To do this, we randomly
1057
          shuffled each alignment by columns - preserving the
1058
          sequence identity in the alignment, while destroying any
1059
          correlations in the pattern of mutation. Figure 8shows
1060
          that shuffled alignments are classified in the OTH phase,
1061
          as expected.
1062
          These simulation experiments were iterated during the
1063
          development of the approach. They were important guide in
1064
          setting the 
1065
          ad hoc transition probabilities in
1066
          each model.
1067
          We used these simulation results from RNA-generated
1068
          and shuffled data to set crude but reasonable score
1069
          thresholds for classification of alignments as RNA. A
1070
          threshold of 1.4 bits for the RNA posterior log-odds
1071
          scores would determine a minimum error rate area with a
1072
          frequency of 0.023 false positives and 0.081 false
1073
          negatives. In whole-genome scans, we want to push the
1074
          rate of false positives down, even at the expense of
1075
          increasing the number of false negatives. To reduce the
1076
          false positive frequency to 0.005, we would need a cutoff
1077
          of 5.8 bits, which increases the false negative frequency
1078
          to 0.14. We set a cutoff of 5 bits for the remainder of
1079
          the results in this paper. These error rates are probably
1080
          somewhat pessimistic. Figure 8shows that the rate of
1081
          false positive RNA classifications of COD or
1082
          OTH-generated data is lower (about 0.001) at the 5-bit
1083
          cutoff that we set based on finding false positives in
1084
          shuffled RNA-generated alignments.
1085
        
1086
        
1087
          Tests on simulated genomes
1088
          To get a better idea of the false positive rate in
1089
          whole genome screens, where the background is dominated
1090
          by sequences other than RNAs, we used the COD and OTH
1091
          models to simulate two aligned complete "pseudobacterial"
1092
          genomes with no structural RNA genes present. The aligned
1093
          pseudo-genomes have the following characteristics [ 30 31
1094
          ] : ~2 megabases in total length, with coding regions
1095
          generated from the COD model with length distributions
1096
          distributed normally around a mean length of ~900
1097
          nucleotides, and intergenic regions generated using the
1098
          OTH model with length distributions distributed normally
1099
          around the mean length of ~100 nucleotides (thus, an
1100
          overall coding density of ~90%).
1101
          Because the parameters of the models are ultimately
1102
          dependent upon the BLOSUM62 amino acid scoring matrix,
1103
          the average percent identity of the aligned genomes was
1104
          only 41% in "coding" regions and 36% in "intergenic"
1105
          regions. This is a weakness in the simulation, because in
1106
          a real genome screen, we would be looking at alignments
1107
          in the 65-85% nucleotide identity range, as we discuss
1108
          later in the paper.
1109
          The parameters also gave a simulated pair of genomes
1110
          with an overall GC content of 47.25%. From previous
1111
          experience [ 12 ] , we expected that genomic sequences
1112
          with high GC content might tend to be misclassified as
1113
          RNAs. We therefore devised a crude way of modifying the
1114
          parameters of the models to correspond to different base
1115
          compositions, by expressing various joint probabilities
1116
          instead as a function of conditional probabilities, i.e.
1117
          for two aligned codons 
1118
          c, c' :
1119
          
1120
          P ( 
1121
          c, c' ) = 
1122
          P ( 
1123
          c|c' ) 
1124
          P ( 
1125
          c' ),     (16)
1126
          where the 
1127
          P ( 
1128
          c' ) are codon frequencies obtained
1129
          by marginalization of the joint probabilities, and the
1130
          information about the mutation rate is in the conditional
1131
          probabilities 
1132
          P ( 
1133
          c|c' ). We then can modify overall
1134
          frequencies to a different set while keeping the same
1135
          conditional probabilities. The joint probabilities are
1136
          then recalculated as:
1137
          
1138
          where is obtained as the product of the single
1139
          nucleotide frequencies for the new GC composition. This
1140
          approximation for the codon probabilities could be
1141
          refined to better reflect the actual codon bias of the
1142
          genome.
1143
          This correction of probabilities can be performed for
1144
          both codon-codon probabilities and independent mutation
1145
          probabilities. Using this modification to the COD and OTH
1146
          models, we generated two more pairs of genomes which had
1147
          overall GC contents of 57.7% and 38.9%. We then ran QRNA
1148
          using its default parameters (i.e. uncorrected for GC
1149
          composition) across these three aligned simulated genomes
1150
          in scanning mode, using a window 200 nucleotides wide,
1151
          moving 50 nucleotides at a time, and counted the number
1152
          of times a window was called RNA with a score of ≥ 5
1153
          bits. All such windows are false positives, because the
1154
          simulated genomes have no RNA component.
1155
          The observed false positive numbers for the 2 Mb
1156
          low-GC, average-GC, and high-GC simulated genomes were 8,
1157
          14, and 21 respectively, or about 4-10 per megabase of
1158
          pairwise alignment analyzed. This indicates that
1159
          specificity degrades with higher GC compositions. We
1160
          reanalyzed the high-GC genome using the high-GC parameter
1161
          set that generated it (i.e. parameters corrected for GC
1162
          composition), and saw one false positive. This indicates
1163
          that setting the parameters of the three models to be
1164
          appropriate for the GC composition of the input alignment
1165
          should improve the effectiveness of the approach;
1166
          however, our current method for doing this may be too
1167
          crude.
1168
        
1169
        
1170
          Tests on known RNAs
1171
          To test the sensitivity and specificity of our method
1172
          on real RNAs, we analyzed pairwise alignments taken from
1173
          a multiple alignment of 63 eukaryotic SRP-RNAs [ 32 ]
1174
          (also known as 7SL RNA), and a multiple alignment of 51
1175
          eukaryotic RNaseP RNAs [ 33 ] . These RNA genes were
1176
          chosen because they are independent from the set of tRNAs
1177
          and rRNAs used to train the RNA model.
1178
          We did two different types of experiments. In the
1179
          first, we used the pairwise alignments as given in the
1180
          curated multiple sequence alignment. These pairwise
1181
          alignments are an ideal case for QRNA, because they are
1182
          structurally aligned. In the second set of experiments,
1183
          we took each known RNA in turn and used it as a BLASTN
1184
          query against the rest of the RNAs, then classified all
1185
          significant alignments with QRNA. This is a more
1186
          realistic scenario for QRNA; a BLASTN primary sequence
1187
          alignment may be fragmentary and/or not entirely
1188
          structurally correct. All alignments were scored with
1189
          QRNA using default parameters.
1190
          For the first experiment, we used QRNA to score in
1191
          full (i.e. not with a scanning window) the 2, 016
1192
          different structural pairwise alignments for SRP-RNAs,
1193
          and the 1, 325 structural pairwise alignments for RNaseP
1194
          RNAs. The manually curated RNA structural alignments have
1195
          a wide range of sequence diversity that extends from 100%
1196
          to 0% pairwise identity. The number of pairwise
1197
          alignments that were classified as RNA with a score of
1198
          > 5 bits was counted, and these counts were binned by
1199
          ranges of percent identity. The fraction of alignments
1200
          classified as RNA is a measure of the sensitivity of
1201
          QRNA. To measure specificity, we randomly shuffled each
1202
          pairwise alignment by columns, which destroys the nested
1203
          RNA structure correlations but retains the percentage
1204
          identity of the alignments. Shuffled alignments that are
1205
          classified by QRNA as RNA are false positives. The
1206
          results in Table 2show that QRNA can detect about half of
1207
          the alignments as RNAs at a wide range of percent
1208
          identities; however, specificity seriously degrades for
1209
          alignments over 90% identity.
1210
          In the second set of experiments, we have taken each
1211
          single RNA gene in a given family (both for the SRP-RNA
1212
          and the RNaseP RNA families) and used it as a BLASTN
1213
          query against all genes in the same family (including
1214
          itself). We used WUBLASTN (2.0MP-WashU, 12 Feb 01
1215
          version, default parameters and scoring matrix) and
1216
          retained those alignments that were longer than 50
1217
          nucleotides, with an E-value of ≤ 0.01, and with an
1218
          overall similarity of ≥ 65%. Of the 3, 342 possible
1219
          comparisons, this produced 1, 003 alignments (586 for SRP
1220
          RNAs, and 417 for RNaseP RNAs). These were then scored by
1221
          QRNA to measure sensitivity, and then shuffled by columns
1222
          and rescored to measure specificity. Table 3shows that
1223
          specificity follows the same trend we saw in the
1224
          structural alignments, with a sharp degradation in
1225
          specificity over 90% identity. Sensitivity, however,
1226
          drops off steeply in the other direction; as percent
1227
          identity declines, sensitivity decreases.
1228
          We also analyzed the dependency of sensitivity and
1229
          specificity with the GC content of the alignments, both
1230
          for structural and BLASTN-type alignments. We observe a
1231
          similar trend for both types of alignments; both
1232
          sensitivity and specificity reach their best values for
1233
          GC contents ranging from 45% to 60%. Specificity drops
1234
          faster for high GC content alignments, which is
1235
          consistent with the fact that unstructured sequences with
1236
          high GC content tend to produce more spurious secondary
1237
          structure predictions than low GC content sequences [ 12
1238
          ] .
1239
          These results show two competing forces at play. In
1240
          order to be detected by QRNA, two RNA sequences must be
1241
          similar enough to produce a BLASTN alignment that is
1242
          reasonably correct and extensive, but they also must be
1243
          dissimilar enough to show compensatory mutations in
1244
          base-paired positions of the RNA secondary structure.
1245
          There is therefore a "sweet spot" of percent identity in
1246
          which QRNA performance is optimal. Based on these
1247
          results, we choose to analyze only BLASTN pairwise
1248
          alignments of between 65% and 85% nucleotide identity
1249
          with QRNA. However, we do not fully understand the
1250
          degradation of specificity at high percentage identities
1251
          (see Discussion).
1252
        
1253
        
1254
          Tests on a whole genome
1255
          To test QRNA performance in a realistic whole genome
1256
          screen, we used it to analyze the 
1257
          Escherichia coli genome by
1258
          comparisons to the related genome of 
1259
          Salmonella typhi . We compared QRNA
1260
          annotation to the curated annotation of known coding
1261
          genes, ncRNAs, and intergenic regions [ 34 ] . The
1262
          feature tables for version M52 of the 
1263
          E. coli genome includes 115 known
1264
          RNA genes and 4, 290 known coding genes (ORFs). The known
1265
          RNA genes include 22 rRNAs, 86 tRNAs, and 7 miscellaneous
1266
          RNAs (RNase P, for example). At least 4 other known RNA
1267
          genes [ 1 35 36 ] - csrB, oxyS, micF, and rprA - were not
1268
          present in the M52 feature table.
1269
          We split the 
1270
          E. coli genome in three different
1271
          components: 115 RNA features (a total of 40 kb, 1% of the
1272
          genome), 4290 ORF features (4090 kb, 88% of the genome),
1273
          and 2367 intergenic sequences of length ≥ 50 nt (500 kb,
1274
          11% of the genome). Each sequence was compared against
1275
          the complete 
1276
          Salmonella typhi genome (Sanger
1277
          Centre, unpublished genome data,
1278
          http://www.sanger.ac.uk/Projects/S_typhi) using WUBLASTN,
1279
          and all alignments of ≥ 50 nt with an E-value of ≤ 0.01
1280
          and a percent identity of ≥ 65% and ≤ 85% were kept. This
1281
          resulted in 354 alignments to RNAs, 4, 946 alignments to
1282
          ORFs, and 11, 509 alignments to intergenic regions. (The
1283
          large number of alignments in intergenic regions is due
1284
          to repetitive sequence families.) These alignments were
1285
          then classified by QRNA in scanning mode, scoring
1286
          overlapping windows of 200 nucleotides sliding 50
1287
          nucleotides at a time, and all windows with scores of ≥ 5
1288
          bits for one of the three models were annotated as RNA,
1289
          COD, or OTH correspondingly.
1290
          We then looked at these data in two ways. First, how
1291
          many of the known features (ncRNAs and ORFs) were
1292
          detected correctly? We counted a known feature as
1293
          "detected" as RNA or COD if it had one or more
1294
          overlapping QRNA annotations of that type. It is possible
1295
          for different parts of a long feature (especially the
1296
          ORFs) to be detected with different annotations. For the
1297
          115 known ncRNAs, 33 have one or more BLASTN alignments
1298
          to 
1299
          S. typhi in the right range, and
1300
          all 33 were annotated as RNA by QRNA; none were called
1301
          COD. For the 4290 known ORFs, 3181 had BLASTN alignments
1302
          in the right range; 2876 were called COD, 20 were called
1303
          RNA, and 184 were called both COD and RNA.
1304
          These results indicate that the sensitivity of the
1305
          program is largely dependent upon the availability of
1306
          appropriate comparative sequence data - only 29% of the
1307
          115 known RNAs were detected, but invariably (in this
1308
          case), a failure to detect an RNA resulted from the lack
1309
          of an appropriate BLASTN alignment to analyze (of 65-85%
1310
          identity). Therefore sensitivity could presumably be
1311
          improved by using multiple comparative genome sequences
1312
          at different evolutionary distances.
1313
          A second way to look at the data is from the
1314
          perspective of how many of QRNA's annotations are
1315
          correct. In a postprocessing step, any overlapping
1316
          windows with the same QRNA annotation were merged into a
1317
          longer annotated region. A total of 148 regions are
1318
          annotated in the ncRNA sequence fraction: 33 as RNA, none
1319
          as COD, and 115 as OTH. 7422 regions are annotated in the
1320
          ORF sequence fraction: 88 as RNA, 3397 as COD, and 3937
1321
          as OTH. 1974 regions are annotated in the intergenic
1322
          sequence fraction: 351 as RNA, 61 as COD, and 1562 as
1323
          OTH. Therefore QRNA annotated a total of 5614 sequence
1324
          regions as OTH, of which 3937 (70%) are actually in known
1325
          ORFs-this means we must interpret an OTH annotation as a
1326
          catch-all "don't know" category, rather than as a
1327
          conserved noncoding sequence of potential interest. QRNA
1328
          annotated a total of 3458 regions as COD, of which 3397
1329
          (98%) are in known ORFs. The other 61 COD annotated
1330
          regions could either be false positive calls, or could be
1331
          previously undetected small coding genes.
1332
          Most interestingly, QRNA annotated a total of 472
1333
          regions of 
1334
          E. coli as RNA, of which only 33
1335
          (7%) are in known RNAs. It is not possible to
1336
          definitively accept or reject the rest of these
1337
          annotations without additional experimental data. The 88
1338
          RNA annotations that overlap known ORFs may be false
1339
          positives, or may indicate cis-regulatory RNA structures
1340
          that overlap coding regions. It is intriguing that a
1341
          disproportionate number of QRNA's RNA annotations (74%,
1342
          351/472) were in the "intergenic" data fraction, which is
1343
          only 11% of the genome - which is what we would expect to
1344
          see if there were a fair number of undetected RNA
1345
          features in the genome.
1346
          We examined many of these 351 regions by eye. Four of
1347
          them are the four ncRNA genes (csrB, oxyS, micF, and
1348
          rprA) that were not included in the M52 feature table for
1349
          
1350
          E. coli . Others are repetitive
1351
          sequence families with conserved palindromic sequence,
1352
          such as BIMEs [ 37 ] . Some correspond to known
1353
          cis-regulatory RNA structures such as ρ-independent
1354
          terminators (which have an RNA stem loop structure) and
1355
          transcriptional attenuators. For about half of these
1356
          regions, we cannot exclude the possibility that they
1357
          correspond to novel RNAs, and we cannot assign a known
1358
          biological role to them without additional computational
1359
          or experimental evidence. A more in-depth QRNA screen of 
1360
          E. coli for novel ncRNAs using
1361
          multiple comparative genomes from γ-proteobacteria,
1362
          accompanied by experimental evidence that many of the
1363
          predicted RNAs are indeed novel ncRNA genes, is presented
1364
          elsewhere (E.R., R.J. Klein, T.A. Jones, and S.R.E.,
1365
          manuscript submitted).
1366
        
1367
      
1368
      
1369
        Discussion
1370
        There are a number of ways in which we could improve
1371
        QRNA. The three probabilistic models are calibrated to a
1372
        fixed evolutionary distance. We used the BLOSUM62
1373
        substitution matrix to define the fixed evolutionary
1374
        distance of our three models, and it is now quite clear
1375
        that this is the wrong distance. Our models generate
1376
        pairwise alignments of about 40% sequence identity. We
1377
        expect on theoretical grounds that this is where the models
1378
        would perform optimally on real input alignments. However,
1379
        BLASTN cannot detect RNA sequences that are this diverged.
1380
        Our evaluations indicated a sweet spot of 65%-85% identity
1381
        for QRNA to work best in its current formulation. We
1382
        suspect that we could obtain some improvement by choosing a
1383
        substitution matrix corresponding to more closely related
1384
        nucleotide sequences.
1385
        In principle QRNA may also be useful as a coding-region
1386
        genefinder. The coding model is a fully probabilistic
1387
        formalization of comparative analysis ideas used by the
1388
        genefinder CRITICA [ 16 ] , and by comparative exon finding
1389
        approaches such as the EXOFISH vertebrate/ 
1390
        Tetraodon comparison [ 38 ] and the
1391
        human/mouse comparison in [ 39 ] . In the 
1392
        E. coli whole genome screen, the
1393
        sensitivity and specificity of QRNA coding annotations seem
1394
        quite high. We have not yet attempted to optimize the
1395
        performance of QRNA for this purpose.
1396
        In terms of other QRNA improvements, it should be
1397
        advantageous to make the emission and transition parameters
1398
        of the models conditional on a parametric evolutionary
1399
        distance. We could then optimize a maximum likelihood
1400
        distance separately for each input alignment (or,
1401
        marginalize over all distances, in a more Bayesian
1402
        approach). This should widen the 65-85% alignment identity
1403
        window that QRNA works best in - in particular, by
1404
        constructing models more appropriate for nearly identical
1405
        sequences, where we currently have high false positive
1406
        rates.
1407
        It would be good to have more theory to guide how we
1408
        produce divergence-matched transition probability
1409
        parameters for the three models. We suspect our 
1410
        ad hoc estimation may be causing the
1411
        RNA model to be favored artifactually in certain cases
1412
        (less gappy alignments and longer alignments), elevating
1413
        our false positive rate.
1414
        We also made a number of simplifying independence
1415
        assumptions in trying to calculate QRNA's parameters all
1416
        from a single chosen amino acid substitution matrix. Some
1417
        of these assumptions probably reduce our performance. It
1418
        would be desirable to move towards estimating parameters
1419
        based on real datasets of aligned nucleotide sequences, if
1420
        large enough datasets could be amassed.
1421
        We are relying on BLASTN to produce approximately
1422
        correct pairwise alignments of coding regions or RNA
1423
        structures, even though BLASTN is purely a
1424
        position-independent primary sequence alignment program. We
1425
        could instead realign the two input sequences using the
1426
        pair-grammars. In principle this should increase the
1427
        performance of QRNA, particularly for more dissimilar
1428
        sequences. Unfortunately, alignment of two sequences to a
1429
        pair-SCFG is effectively the Sankoff algorithm [ 40 ] with
1430
        time and memory complexity of 
1431
        O ( 
1432
        L 6) and 
1433
        O ( 
1434
        L 4), respectively, so we will need a
1435
        more clever algorithmic strategy than straightforward
1436
        dynamic programming (if, indeed, dynamic programming RNA
1437
        structure alignment in a four-dimensional hypercube can be
1438
        called "straightforward").
1439
        Because QRNA detects conserved RNA secondary structure,
1440
        it is not expected to detect ncRNAs that apparently lack
1441
        significant intramolecular secondary structure, such as C/D
1442
        box small nucleolar RNAs [ 6 ] . Identifying novel
1443
        unstructured ncRNAs remains an entirely open problem. A
1444
        pure computational approach will probably have to identify
1445
        transcriptional signals - promoters, enhancers, and
1446
        terminators - and this remains a difficult problem,
1447
        particularly in complex genomes. Experimental screens for
1448
        novel ncRNAs may prove more fruitful for unstructured
1449
        ncRNAs. Expression arrays that pave the entire target
1450
        genome with probes can detect novel transcripts [ 41 ] ,
1451
        and cDNA libraries that enrich for small, nonpolyadenylated
1452
        RNAs can be constructed and EST sequenced [ 42 ] .
1453
        QRNA is also expected to identify 
1454
        cis -regulatory RNA structures in
1455
        mRNAs, in addition to structured ncRNA genes.
1456
        Distinguishing an ncRNA gene from a 
1457
        cis -regulatory RNA structure in an
1458
        mRNA is nontrivial in absence of experimental evidence.
1459
        This cautions against using QRNA for fully automated genome
1460
        annotation and "gene counting" exercises in the way that
1461
        protein genefinders like GENSCAN are used.
1462
        Instead, QRNA is best used as a computational screen for
1463
        candidate ncRNA genes, after which candidate loci are
1464
        further characterized both computationally and
1465
        experimentally before considering them to be "genes". Both
1466
        the data presented here and in a second paper detailing a
1467
        careful 
1468
        E. coli genome screen with
1469
        experimental verification of many novel ncRNA genes (E.R.,
1470
        R.J. Klein, T.A. Jones, and S.R.E., manuscript submitted)
1471
        indicate that QRNA can be successfully used in this role.
1472
        Although we have much we can do to improve its performance,
1473
        we believe QRNA is the first example of a generally
1474
        applicable computational genefinder for noncoding RNA
1475
        genes. We expect to be able to apply QRNA - based screens
1476
        for ncRNAs to a number of organisms as comparative sequence
1477
        data become available - including yeast, 
1478
        Caenorhabditis , 
1479
        Drosophila , human, and several
1480
        microbial systems.
1481
      
1482
      
1483
        Conclusions
1484
        We have described an algorithm that uses three different
1485
        probabilistic models (for RNA-structure-constrained,
1486
        coding-constrained, and position-independent evolution) to
1487
        examine the pattern of mutations in a pairwise sequence
1488
        alignment. The alignment is classified as RNA, coding, or
1489
        other, according to the Bayesian posterior probability of
1490
        each model. We have implemented this algorithm as a
1491
        program, QRNA, which we consider to be a prototype
1492
        structural ncRNA genefinding program.
1493
      
1494
    
1495
  
1496

1497
Product

Resources

Company