Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Introduction
6
Some genes produce functional noncoding RNAs (ncRNAs)
7
instead of coding for proteins [ 1 2 ] . For protein-coding
8
genes, we have computational genefinding tools [ 3 ] that
9
predict novel genes in genome sequence data with reasonable
10
efficiency [ 4 ] . For ncRNA genes, there are as yet no
11
general genefinding algorithms. The number and diversity of
12
ncRNA genes remains poorly understood, despite the
13
availability of many complete genome sequences. Gene
14
discovery methods (whether experimental or computational)
15
typically assume that the target is a protein coding gene
16
that produces a messenger RNA.
17
New noncoding RNA genes continue to be discovered by
18
less systematic means, which makes it seem likely that a
19
systematic RNA genefinding algorithm would be of use.
20
Recent discoveries have included RNAs involved in dosage
21
compensation and imprinting [ 5 ] , numerous small
22
nucleolar RNAs involved in RNA modification and processing
23
[ 6 7 8 ] , and small riboregulatory RNAs controlling
24
translation and/or stability of target mRNAs [ 9 10 ] .
25
Mutations in the gene for RNase MRP are associated with
26
cartilage-hair hypoplasia (CHH), a recessive pleiotropic
27
human genetic disorder [ 11 ] . The CHH locus eluded
28
positional cloning for some time; the RNase MRP gene was
29
only detected in the completely sequenced CHH critical
30
region because the RNase MRP sequence was already in the
31
databases.
32
We have previously explored one RNA genefinding approach
33
with very limited success [ 12 ] . Maizel and coworkers [
34
13 14 15 ] had hypothesized that biologically functional
35
RNA structures may have more stable predicted secondary
36
structures than would be expected for a random sequence of
37
the same base composition. Though we could confirm some
38
anecdotal results where this was true, we were forced to
39
the conclusion that in general, the predicted stability of
40
structural RNAs is not sufficiently distinguishable from
41
the predicted stability of random sequences to use as the
42
basis for a reliable ncRNA genefinding algorithm.
43
Nonetheless, conserved RNA secondary structure remained our
44
best hope for an exploitable statistical signal in ncRNA
45
genes. We decided to consider ways of incorporating
46
additional statistical signal using comparative sequence
47
analysis.
48
We were motivated by the work of Badger & Olsen [ 16
49
] for bacterial coding-region identification. Badger &
50
Olsen use the BLASTN program [ 17 ] to locate genomic
51
regions with significant sequence similarity between two
52
related bacterial species. Their program, CRITICA, then
53
analyzes the pattern of mutation in these ungapped, aligned
54
conserved regions for evidence of coding structure. For
55
example, mutations to synonymous codons get positive
56
scores, while aligned triplets that translate to dissimilar
57
amino acids get negative scores. (CRITICA then subsequently
58
extends any coding-assigned ungapped seed alignments into
59
complete open reading frames.)
60
Here we extend the central idea of the Badger &
61
Olsen approach to identify structural RNA regions. Our
62
extensions include: (1) using fully probabilistic models;
63
(2) adding a third model of pairwise alignments constrained
64
by structural RNA evolution; (3) allowing gapped
65
alignments; and (4) allowing for the possibility that only
66
part of the pairwise alignment may represent a coding
67
region or structural RNA, because a primary sequence
68
alignment may extend into flanking noncoding or
69
nonstructural conserved sequence. These extensions add
70
complexity to the approach. We use probabilistic modeling
71
methods and formal languages to guide our construction. We
72
use "pair hidden Markov models" (pair-HMMs) (introduced in
73
[ 18 ] ) and a "pair stochastic context free grammar"
74
(pair-SCFG) (a natural extension of the pair-HMM idea to
75
RNA structure) to produce three evolutionary models for
76
"coding", "structural RNA", or "something else" (a null
77
hypothesis). Given three probabilistic models and a
78
pairwise sequence alignment to be tested, we can calculate
79
the Bayesian posterior probability that an alignment should
80
be classified as "coding", "structural RNA", or "something
81
else".
82
Our approach is designed to detect conserved
83
structural RNAs. Some ncRNA genes do
84
not have well-conserved intramolecular secondary
85
structures, and some conserved RNA secondary structures
86
function as cis-regulatory regions in mRNAs rather than as
87
independent RNA genes. We will be using the term "ncRNA
88
gene" to refer to our prediction targets, but it must be
89
understood that this really means a conserved RNA secondary
90
structure that may or may not turn out to be an independent
91
functional ncRNA gene upon further analysis.
92
93
94
Algorithm
95
96
Overview of the approach: simple, ungapped global
97
case
98
The key idea is to produce three probabilistic models
99
(RNA, COD, and OTH) describing different evolutionary
100
constraints on the pattern of mutations observed in a
101
pairwise sequence alignment. We will first introduce toy
102
versions of these models, for clarity.
103
All three models use the "pair-grammar" formalism
104
described in [ 18 ] . A standard hidden Markov model
105
(HMM) generates a single observable sequence by emitting
106
single residues, whereas a pair-HMM generates two aligned
107
sequences
108
X, Y by emitting a pair of aligned
109
residues at a time (or single residues in either sequence
110
to deal with insertion and deletion).
111
The OTH model assumes mutations occur in a simple
112
position-independent fashion. OTH has 4 × 4 core
113
parameters, which are the pairwise alignment
114
probabilities
115
P OTH(
116
a, b ) - that is, the joint
117
probabilities of emitting an alignment of a nucleotide
118
a in sequence
119
X and a nucleotide
120
b in sequence
121
Y (Table 1). OTH represents our
122
null hypothesis. The probability of the alignment given
123
the OTH model is just the product of the probabilities of
124
the individual aligned positions.
125
The COD model assumes that the aligned sequences
126
encode homologous proteins. In a coding region, we
127
intuitively expect to see mutations that make
128
conservative amino acid substitutions; in particular, we
129
expect an abundance of synonymous mutations. To capture
130
this, COD has 64 × 64 core parameters, which are
131
P COD(
132
a
133
1
134
a
135
2
136
a
137
3 ,
138
b
139
1
140
b
141
2
142
b
143
3 ), the probabilities of the
144
correlated emission of two codons - that is, three
145
nucleotides
146
a
147
1
148
a
149
2
150
a
151
3 in sequence
152
X , aligned to three nucleotides
153
b
154
1
155
b
156
2
157
b
158
3 in sequence
159
Y . (See Table 1for an example of
160
pair codon probabilities.) The probability of the
161
alignment given the COD model for a particular reading
162
frame is the product of the probabilities of the
163
individual aligned codons in that frame. Since we don't
164
know the correct frame
165
a priori , the overall probability
166
of an alignment is a sum over all six frames
167
f ,
168
169
and we assume that all frames are a priori
170
equiprobable in the alignment (
171
P (
172
f |COD) = ).
173
The RNA model assumes that the pattern of mutation
174
significantly conserves a homologous RNA secondary
175
structure. Intuitively, we expect a significant abundance
176
of pairwise-correlated mutations that preserve
177
Watson-Crick complementarity in an (as yet unknown)
178
structure. To capture this, the core parameters in RNA
179
are the 16 × 16 probabilities
180
P RNA(
181
a
182
183
L
184
185
a
186
187
R
188
,
189
b
190
191
L
192
193
b
194
195
R
196
) - that is, the probabilities associated with the
197
correlated emission of one base-pair (
198
a
199
200
L
201
202
a
203
204
R
205
) in sequence
206
X aligned to a homologous base-pair
207
(
208
b
209
210
L
211
212
b
213
214
R
215
) in sequence
216
Y (Table 1). Single stranded
217
positions in the alignment are modeled by
218
P RNA(
219
a, b ), the same functional form as
220
in the OTH model. For a given alignment of known
221
structure
222
s , the probability
223
P ( |
224
s , RNA) is a product of terms
225
P RNA(
226
x
227
228
i
229
230
x
231
232
j
233
,
234
y
235
236
i
237
238
y
239
240
j
241
) for all base paired positions
242
i, j and
243
P RNA(
244
x
245
246
k
247
,
248
y
249
250
k
251
) for all single stranded positions
252
k in the alignment. Since we don't
253
know the correct structure
254
a priori , the overall probability
255
of an alignment given by the RNA model is a sum over all
256
structures
257
s :
258
259
But here, we cannot assume equiprobability for the
260
various structures
261
s as we did for coding frames
262
f above; in fact, calculating
263
P (
264
s |RNA) implies a full
265
probabilistic model describing favorable and unfavorable
266
RNA secondary structures. The necessary machinery for
267
calculating this weighted sum is exactly what we
268
developed previously for searching for significant
269
structure in single sequences [ 12 ] . In that paper we
270
parameterized a stochastic context-free grammar (SCFG)
271
that incorporates a model of hairpin loops, stems,
272
bulges, and internal loops, including stacking and
273
loop-length distributions, making a probabilistic
274
counterpart for the widely used MFOLD program for RNA
275
structure prediction. The SCFG we use here is almost the
276
same, with the difference that now we generate two
277
aligned sequences simultaneously: i.e., a pair-SCFG. The
278
summation over all possible structures can be done
279
efficiently using an SCFG Inside algorithm (a dynamic
280
programming algorithm).
281
In Figure 1we present an example of three different
282
alignments with different mutation patterns, and how they
283
would be scored with the three different models.
284
Finally, in order to classify the input alignment as
285
RNA, COD, or OTH, we use the three likelihoods to
286
calculate a Bayesian posterior probability, under the
287
simple assumption that the three models are
288
a priori equiprobable. Alignments
289
with high RNA posterior probabilities are interpreted as
290
candidate ncRNA genes.
291
For scoring purposes, it will also be useful to
292
calculate log-odds scores in the standard manner [ 19 ]
293
relative to a fourth model, which we will call IID. In
294
IID, we assume the two sequences are nonhomologous
295
independent, identically distributed sequences. The IID
296
model has 8 parameters corresponding to the expected base
297
compositions ofthe two sequences,
298
P
299
X (
300
a ) and
301
P
302
Y (
303
b ).
304
305
306
Parameter estimation in the simple case
307
Parameter estimation is crucial for our approach. The
308
three models have to be calibrated to an overall similar
309
evolutionary divergence time, and to similar base
310
compositions. Else, one model might be artifactually
311
favored over the others because of the degree of
312
conservation or the base composition in an input
313
alignment, not because of the pattern of mutation.
314
In an ideal world, we could empirically estimate the
315
parameters of each model using training sets of pairwise
316
alignments culled from real RNAs, coding regions, and
317
other conserved noncoding regions, using pairwise
318
alignments that were all about the same percent identity.
319
Unfortunately it is unlikely that we can amass suitably
320
large training sets. Instead, we take a somewhat
321
ad hoc approach that ties the
322
parameters of all three models as much as possible to a
323
particular choice of a standard amino acid substitution
324
matrix, such as BLOSUM62. We will derive joint codon
325
probabilities from the chosen scoring matrix, then use
326
these codon probabilities to calculate the average single
327
nucleotide substitution probabilities in OTH, then
328
combine these OTH substitution parameters with base-pair
329
frequencies to obtain the parameters of the RNA model.
330
This procedure is as follows.
331
First the 64 × 64 codon emission probabilities
332
P COD(
333
a
334
1
335
a
336
2
337
a
338
3,
339
b
340
1
341
b
342
2
343
b
344
3 |
345
t ), for some divergence "time"
346
t , are derived from the chosen
347
substitution matrix (i.e. the choice of matrix defines
348
t ). We make an independence
349
assumption that the conditional probability of each codon
350
depends only on its own encoded amino acid - i.e., it
351
does not depend on the the other codon - so we can use
352
the approximation
353
354
for
355
a
356
1 ,
357
a
358
2 ,
359
a
360
3 ,
361
b
362
1 ,
363
b
364
2 ,
365
b
366
3 ∈ {
367
A, C, G, U } and
368
A, B ∈ {amino acids}. (An example
369
of where this independence assumption is violated: for
370
equiprobable codon bias, our parameters will say that
371
aligning TCA to AGT is as likely as aligning TCA to TCG
372
because all three are Ser codons, despite the fact that
373
the first case requires three transversions.)
374
P (
375
A ,
376
B |
377
t ) are the joint target
378
probabilities of aligned amino acids obtained from the
379
amino acid score matrix, such as BLOSUM62 [ 20 ] , as
380
described by [ 19 ] . The terms
381
P (
382
a
383
1
384
a
385
2
386
a
387
3 |
388
A ) are the probabilities of
389
observing a particular codon given a particular amino
390
acid; these terms can include a codon-bias model [ 21 ]
391
and, if desired, a substitution error model to deal with
392
error-prone sequence data. The sum over all possible
393
amino acids in equation (3) is relevant only when a
394
substitution error model applies, since otherwise each
395
observed codon can only mean one possible amino acid.
396
The 16 mutation probabilities for the OTH model are
397
then obtained by marginalizing the corresponding
398
codon-codon emission probabilities in equation (3), in
399
the following way:
400
401
The 16 × 16 core parameters of the RNA model are
402
calculated by combining the OTH model (which sets the
403
average divergence of the two sequences) with some
404
additional parameters that specify the probability of
405
base pairs. This involves making an independence
406
assumption:
407
408
Alternatively, we can symmetrically derive a equation
409
in which the divergence is controlled by the mutation
410
probability of the right position instead of the left
411
position. We calculate the overall joint probability of
412
the aligned base pairs as the average of these two
413
equations:
414
415
Here
416
P pair(
417
a
418
419
L
420
421
a
422
423
R
424
|
425
t ),
426
P pair(
427
b
428
429
L
430
431
b
432
433
R
434
|
435
t ) are just the probabilities of
436
the various sorts of base pairs (GC, AU, GU) in a single
437
RNA structure.
438
439
440
Extension of the models to gapped local
441
alignments
442
443
The OTH Model
444
The complete OTH model, a pair hidden Markov model,
445
is diagrammed in Figure 5. The flanking double-circled
446
states
447
F
448
449
L
450
,
451
F
452
453
R
454
, and
455
F
456
457
J
458
are a shorthand for a full IID model of the type
459
in Figure 4, which allow the alignment to be flanked or
460
interrupted by runs of unassigned (independent)
461
residues. (In general we will use the convention that
462
single-circled states are "single states", and
463
double-circled states represent some "composite state"
464
of some kind previously defined. This differs from a
465
convention in formal languages in which double-circled
466
states are terminal states of a finite-state automaton
467
[ 22 ] .)
468
The OTH model requires us to specify emission
469
probabilities for the state XY (that emits two aligned
470
nucleotides), and also for the X and Y states (that
471
emit one nucleotide "aligned" to a gap character in the
472
other sequence). The emission probabilities for state
473
XY,
474
P
475
XY (
476
a ,
477
b |
478
t ), are simply the mutation
479
probabilities
480
P OTH(
481
a ,
482
b |
483
t ) of the toy ungapped OTH
484
model, as described above. The emission probabilities
485
for states X and Y are obtained by marginalization of
486
the
487
P
488
XY 's:
489
490
491
492
The COD Model
493
The complete COD model, a pair hidden Markov model,
494
is diagrammed in Figure 6. A new degree of "locality"
495
is introduced. In addition to regions of the alignment
496
that are better left "unaligned" (i.e. generated by the
497
flanking states of an IID model), we want to model
498
regions of the alignment that are not coding but still
499
well-conserved. To model this, we add three full copies
500
of OTH models to the core of the COD model, indicated
501
by the symbols
502
O
503
504
B
505
,
506
O
507
508
E
509
, and
510
O
511
512
J
513
. We represent a full OTH model with:
514
515
with the understanding that any arrow that goes into
516
"
517
O " indicates a transition into
518
the "
519
S
520
521
FL
522
" state of the
523
F
524
525
L
526
flanking model, and any arrow leaving "
527
O " emerges from the "
528
T
529
530
FR
531
" state of the
532
F
533
534
R
535
flanking model. In this way the COD model can
536
score a coding-aligned region that is nested between
537
other independently-aligned regions.
538
The different COD states described in Figure 6emit
539
correlated codon pairs, possibly with indels. To deal
540
with BLASTN misalignments of codons and possible
541
applications to error-prone sequence data (expressed
542
sequence tags or low-pass genome shotgun), we model -1
543
or +1 frameshifts (by having a probability of emitting
544
abnormal codons of 2 or 4 nucleotides), in addition to
545
the more expected indels of multiples of three
546
nucleotides. (No explicit transition for
547
C
548
549
E
550
551
C
552
553
B
554
is necessary; the intermediate sub-model "
555
O
556
557
J
558
" has a non-emitting path that deals with
559
consecutive codons.)
560
Codon emission probabilities for the different
561
coding states are derived from the joint codon
562
probabilities
563
P CODgiven in equation (3) for
564
the toy case. For incomplete codons we do the
565
convenient marginalizations. For example,
566
567
Notice that there are three different
568
C (3, 2) states, of which we have
569
only described one in equation (9). Similarly there are
570
four different
571
C (3, 4) states, and six
572
different
573
C (2, 4) states, depending on the
574
position of the gaps. We will represent these
575
codon-emission probabilities in general by
576
P α, β(
577
a
578
1 ...
579
a
580
α ,
581
b
582
1 ...
583
b
584
β ) with α, β = {0, 2, 3, 4} and a,
585
b ∈ {A, C, G, U}.
586
587
588
The RNA Model
589
The complete RNA model, a pair stochastic context
590
free grammar (pair-SCFG), is crudely diagrammed in
591
Figure 7. The crucial SCFG machinery of the model is
592
encapsulated in the RNA state of the diagram. This
593
pair-SCFG, similar to the SCFG described in [ 12 ] ,
594
has three states labeled
595
W ,
596
W
597
598
B
599
,
600
V . They correspond to the
601
V and
602
W dynamic programming matrices in
603
[ 23 ] , and to the matrices
604
wx ,
605
wbx and
606
vx of the diagrammatic
607
representation in [ 24 25 ] . We use the diagrams as a
608
convenient visual representation to enumerate the
609
configurations which we take into account in the model.
610
State
611
V represents a substring
612
(sequence fragment) in which the ends are definitely
613
base-paired. States
614
W and
615
W
616
617
B
618
represent a substring in which the ends are either
619
paired or unpaired.
620
To extend these more or less standard RNA folding
621
algorithm conventions from a single sequence to an
622
aligned pair of sequences, let us introduce some
623
vectorial notation. In this notation stands for the
624
corresponding positions
625
i in sequence
626
X and
627
i' in sequence
628
Y . Similarly stands for the pair
629
of nucleotides in positions
630
i and
631
i' of sequences
632
X and
633
Y respectively. With this
634
notation, we also define and . We are going to assume
635
that for two aligned columns and ,
636
x
637
638
i
639
is base-paired to
640
x
641
642
j
643
if and only if
644
x
645
646
i ' is base-paired to
647
y
648
j' which is a reasonable assumption
649
if we are trying to find commonly occurring secondary
650
structures within an alignment of two sequences.
651
652
W acts as the starting state.
653
W and
654
W
655
656
B
657
are essentially equivalent, but
658
W
659
660
B
661
is used exclusively for starting multiloops. The
662
production rules for
663
W are (for
664
W
665
666
B
667
, replace
668
W by
669
W
670
671
B
672
everywhere in the recursion),
673
674
The vector provides us with a compact notation to
675
represent the three possible situations in which one
676
nucleotide is emitted in at least one of the two
677
sequences in the alignment. The components
678
e
679
x and
680
e
681
y take values 1 or 0 with at
682
least one of the two being different from zero. If
683
e
684
x = 0 or
685
e
686
y = 0 we place a gap in the
687
corresponding position in the alignment.
688
The symbol
689
V represents the paired state,
690
that is, the state we are in after emitting one pair in
691
each sequence. The recursion for state
692
V is,
693
694
Here the first transition corresponds to hairpin
695
loops, and is equivalent to function
696
FH (
697
i, j ) in [ 23 ] ; the second
698
transition corresponds to stems, bulges, and internal
699
loops, and is equivalent to function
700
FL (
701
i, j, k, l ) in [ 23 ] ; the last
702
transition corresponds to multiloops, that is, loops
703
closed by more than two hydrogen bonds. The length of
704
the alignments generated for those hairpin loops and
705
bulges and internal loops is variable and depends on
706
the number of gaps introduced. The only condition is
707
that all nucleotides in that segment have to be used -
708
for instance, and for the hairpin loops.
709
The full description of the algorithms associated to
710
this grammar is given in the Additional file. The
711
algorithms requires two kind of emission
712
probabilities,
713
• , the concurrent emission of two paired nucleotide
714
in both sequences, already introduced in equation
715
(6).
716
• , the concurrent emission of one unpaired
717
nucleotide in both sequences, which are taken as the
718
mutation probabilities in equation (4).
719
Both types of emission probabilities have been
720
extended to also emit gaps. For any position , we also
721
introduce a penalty for "mutating" to a gap, and
722
another one for "pairing" to a gap. This is a linear
723
gap cost, and is more convenient than implementing the
724
additional extra states that an affine gap cost would
725
require.
726
The vectorial notation becomes particularly
727
important if we realigned the input sequences to the
728
RNA model. In this paper, though, we will only be
729
working with a special case where we hold an input
730
(BLASTN) pairwise alignment fixed and simply score it
731
with the RNA model; in this case, for any given vector
732
.
733
734
735
736
Transition Probabilities
737
In all three models, one of the prices we pay for
738
introducing realistic flexibility is that we have
739
introduced a number of transition probability parameters,
740
in addition to the emission probabilities presented in
741
the ungapped case (Section 2.1). Now we have to determine
742
the transition probabilities of the different models.
743
Again, we want the models tuned to the same level of
744
"gappiness", else alignments may be artifactually
745
classified based on how gappy they are. Whereas we were
746
able to construct divergence-matched emission
747
probabilities for the three models in a somewhat
748
justified fashion, we have no guiding theory for
749
constructing divergence-matched transition
750
probabilities.
751
Instead, we have estimated all new transition
752
probabilities by hand. The number of additional
753
parameters in the most complete models is 8 for the OTH
754
model, and 20 for the COD model and RNA models. These
755
parameters have been optimized by studying the
756
algorithm's discrimination ability on model generated
757
data and random sequence alignments. More details on the
758
type of simulated data used to set the transition
759
probabilities of the models is given in Section 3.1. This
760
approach to estimating the transition parameters of the
761
models is very arbitrary, but we do not currently see a
762
plausible alternative.
763
The RNA model also has additional SCFG-related
764
probability parameters to take into account length
765
distributions of hairpin loops, bulges, and internal
766
loops. Those parameters have been determined from a
767
training set of aligned tRNAs and ribosomal RNAs as
768
described previously [ 12 26 27 ] .
769
770
771
Alignment and scoring algorithms
772
We are given a pairwise sequence alignment , composed
773
of
774
L aligned columns. We will hold the
775
input alignment fixed. Thus, globally aligning and
776
scoring the alignment with each of the three models could
777
be done by straightforward extensions of standard HMM
778
Viterbi and/or Forward algorithms and SCFG CYK and/or
779
Inside algorithms. The OTH and COD pair-HMM alignment
780
algorithm would cost O(
781
L ) in storage and time; the RNA
782
pair-SCFG algorithm would cost O(
783
L 2) in storage and O(
784
L 3) in time [ 12 24 ] .
785
We are interested, however, in a combination of the
786
standard algorithms. Consider the RNA model. Recall that
787
we need to obtain
788
P ( |RNA) by a summation over all
789
possible structures, which will require the Inside
790
algorithm rather than the CYK algorithm (which, like
791
Viterbi for HMMs, recovers the maximum likelihood parse,
792
i.e. structure). But we will also be interested in
793
obtaining a maximum likelihood location of a predicted
794
RNA within the input alignment - that is, we would like
795
to identify the maximum likelihood position of the
796
starting and ending nucleotides that are aligned to
797
states in the core of the RNA model, as opposed to
798
flanking states that are accounting for flanking
799
nonconserved (IID) and conserved but non-RNA (OTH)
800
nucleotides in the input alignment. This would require
801
the CYK (maximum likelihood parse) algorithm for the RNA
802
model, outside of the core RNA pair-SCFG state.
803
To combine the desired features of the two algorithms,
804
we use a trick introduced by Stormo and Haussler [ 28 ] ,
805
perhaps most widely known for its application in the
806
"semi-Markov model" of the (protein) genefinding program
807
GENSCAN [ 29 ] . The basic idea is that we start with a
808
model organized into "meta-states" (such as the
809
O
810
811
B
812
,
813
O
814
815
E
816
,
817
O
818
819
J
820
, and RNA states of the RNA model in Figure 7). Each
821
meta-state contains its own (possibly complex and
822
arbitrary) model of a feature (such as the pair-SCFG
823
represented by the RNA state). The meta-states are
824
connected to each other by transition probabilities as in
825
an HMM. To parse and score a sequence, "feature scores"
826
are first precomputed for the score of all possible
827
subsequences
828
i..j being generated by each
829
meta-state; then a dynamic programming algorithm is used
830
to assemble a maximum likelihood parse of the sequence
831
into a series of component features. Thus, we can (for
832
instance) use a pair-SCFG Inside algorithm to precompute
833
scores
834
W
835
836
ij
837
for the core RNA metastate of the RNA model
838
generating the part of the alignment from
839
i..j summed over all possible
840
structures, then use the Stormo/Haussler parsing
841
algorithm to determine the optimal
842
i..j segment that should be
843
assigned as structural RNA, versus assigning flanking
844
sequence to the
845
O
846
847
B
848
,
849
O
850
851
E
852
or
853
O
854
855
J
856
meta-states describing non-RNA conserved residues
857
and nonconserved residues.
858
Stormo/Haussler parsing algorithms add one order of
859
complexity both in storage and time to the underlying
860
dynamic programming problem to which they are applied.
861
The Forward algorithm for scoring a pair-HMM against a
862
fixed pairwise alignment is
863
O (
864
L ), but since HMM dynamic
865
programming algorithms work by iteratively calculating
866
scores of prefixes 1..
867
j of increasing length, whereas we
868
need scores of subsections
869
i..j , we have to run the algorithm
870
871
L times, once from each possible
872
start point
873
i , making the feature scoring
874
phase
875
O (
876
L 2) in storage and time for both
877
the OTH and the COD pair-HMM. (The COD pair-HMM would be
878
O (
879
L 3) in memory, but the actual
880
implementation uses a simplified
881
O (
882
L ) version for the OTH meta-states
883
included in the COD model that keeps the whole COD
884
parsing algorithm
885
O (
886
L 2).) The Inside algorithm for
887
scoring a pair-SCFG against a fixed pairwise alignment is
888
889
O (
890
L 2) space and
891
O (
892
L 3) time, and conveniently yields
893
the matrix of scores we need for all subsections
894
i..j . Therefore the computational
895
complexity of our complete algorithm is dominated by the
896
Inside algorithm for scoring the core RNA state of the
897
RNA model. (See Additional filefor more details.)
898
In principle, we could forget about the input pairwise
899
alignment, and allow our three models to optimally
900
realign the input sequences. This would be desirable; it
901
is dangerous, for example, to rely on the external
902
sequence alignment program (e.g. BLASTN) to produce a
903
correct secondary structural alignment of two homologous
904
RNAs, whereas the RNA pair-SCFG, which models
905
base-pairing correlation, would potentially produce
906
better structural alignments. However, such an algorithm
907
would be expensive: for two input sequences of length
908
m and
909
n respectively, scoring the RNA
910
pair-SCFG would cost
911
O (
912
m 2
913
n 2) in storage and
914
O (
915
m 3
916
n 3) in time. (See the Additional
917
filefor a detailed description of all the different
918
algorithms, and their complexity.) Since this realignment
919
approach is prohibitive, we rely on an assumption that
920
the external pairwise alignment algorithm will produce
921
alignments that are close enough to being correct for
922
coding regions and structural RNAs, even though the
923
external alignment program has no notion of these
924
constraints.
925
926
927
Bayesian score evaluation
928
Once we have calculated the probabilities that a
929
pairwise alignment has been generated by any one of the
930
three models, we can classify the alignment into one of
931
three using a posterior probability calculation:
932
933
where
934
935
We assume a uniform distribution for the prior
936
probabilities
937
P (Model
938
939
i
940
).
941
In some figures, we use a phase diagram representation
942
of the same information in the three posterior
943
probabilities. We plot log-odds scores of the COD and RNA
944
models with respect to the OTH model in an (
945
x, y ) plane:
946
947
We can then separate the plane into three different
948
regions "phases" dominated by any of the three models
949
(for example, see Figure 8). Those three phases
950
correspond to the conditions,
951
952
Points deep in one of the phases represent a higher
953
posterior probability for a particular model, whereas
954
points falling next to phase-transition boundaries
955
represent situations in which the method can not clearly
956
decide for one model or the other.
957
958
959
Implementation
960
This approach was implemented in ANSI C in a program
961
called QRNA. The source code and the full set of
962
probability parameters used in QRNA are freely available
963
from http://www.genetics.wustl.edu/eddy/software/under
964
the terms of the GNU General Public License. QRNA has
965
been tested on Intel/Linux and Silicon Graphics IRIX
966
platforms.
967
The input alignment is given in a modified (aligned)
968
FASTA file format. For instance the following file
969
contains the two homologous nematode sequences shown in
970
the BLASTN alignment in Figure 3:
971
>F08G2
972
973
CATTTCATAGTGTCACACGCGCACCCATGAGTTGTCGGCACAC-CACTCCCCACTACCCC
974
TACCCTCTCCTCCATTCAGTATCGCTTCTTCGGCTTATTAGCTAAGATCAAAGTGTAGTA
975
TCTGTTCTTATCGTATTAACCTACGGTATACACTCGAATGAGTGTAATAAAGGTTATATG
976
ATTTTTGGAACCTAGGGAAGACTCGGGGCTTGCTCCGACTTCCCAAGGGTCGTCCTGGCG
977
TTGCACTGCTGCCGGGCTCGGCCCAGTCCCCGAGGGGACAA
978
>G42J05
979
980
CATTCCATAGTGGCCGACGCGAGCCCGGTTTTTGTCGGTACATGCGCGCACC-CTACCCC
981
CCGCGCCTCGTTCTCACCGCATCGCTTCTTCGGCTTATTAGCTAAGATCAAAGTGTAGTA
982
TCTGTTCTTATCGTATTAACCTACGGTATGCACTCGAATGAGTGTAATAAAGGTTATATG
983
ATTTTTGGAACCTAGGAAAGACTCGGGGCTTGCTCCGACTTTCCAAGGGTCGTCCCGGCG
984
TTGCACTGCTGCCGGGCTCGGCCCAGTCCCTGTGGGGACAA
985
Note the gap characters preserving the pairwise
986
alignment. (In many cases, there would be more gap
987
characters than in this particular example.) Multiple
988
pairs of sequences can be added to a single fasta file,
989
and will be scored sequentially, one pair at a time.
990
Typing the following command line:
991
qrna fastafile
992
we obtain the output in the following form:
993
>F08G2 (281)
994
>G42J05 (281)
995
... [some irrelevant output not shown]...
996
winner = RNA
997
OTH = 152.817 COD = 129.240 RNA = 182.522
998
logoddspostOTH = 0.000 logoddspostCOD = -23.577
999
logoddspostRNA = 29.705
1000
The line winner = RNA indicates that the 281 nt
1001
alignment has been classified as a structural RNA. The
1002
next three numbers correspond to the P( |Model) in
1003
log-odds scores. The two non-null numbers in the second
1004
row ("logoddspostCOD"and "logoddspostRNA") correspond to
1005
the 2-D phase diagram scores described previously. For
1006
this alignment, the RNA model is favored over COD and OTH
1007
by 29.7 bits.
1008
A scanning version of the algorithms is also
1009
implemented. In this scanning mode a partial segment of
1010
the alignment - a window of user-determined fixed length
1011
- is scored. The window slides across the alignment and
1012
each window is scored independently from the others. This
1013
option is useful when the input alignment is long, or one
1014
expects different types of functionalities within a given
1015
alignment. This is the mode of the program that we use
1016
for whole genome analysis.
1017
Scoring a window of 200 nts takes about 14 CPU-seconds
1018
and 8 MB of memory on 225 Mhz MIPS R10K processor of a
1019
Silicon Graphics Origin2000. Scoring an alignment of 2
1020
Kbases in windows of 200 nts and moving 50 nts at a time
1021
takes about 9 minutes. Scoring the alignments generated
1022
between the intergenic regions of
1023
E. coli and
1024
S. typhi (12, 000 alignments with
1025
average length of about 100 nt) took about 9
1026
CPU-hours.
1027
Additional file
1028
• Description of data: In this additional file we
1029
provide a detailed description of the algorithms involved
1030
in implementing the three probabilistic models components
1031
of our comparative method QRNA. We give the most general
1032
description of the scoring/parsing algorithms. We also
1033
indicate how to obtain some simplifications that are part
1034
of the software implementation.
1035
Click here for file
1036
1037
1038
1039
Results
1040
1041
Tests on simulated data
1042
Because all three models are fully probabilistic, we
1043
can use them in a generative mode to sample synthetic
1044
pairwise alignments. These simulations allow us to assess
1045
the sensitivity and specificity of the approach on
1046
idealized data, to get a sense of the best that the
1047
algorithm can do. We generated 1, 000 pairwise alignments
1048
of 200 nucleotides in length from each of the three
1049
models. Each of these 3, 000 alignments was then scored
1050
and classified by the program. Results are shown in
1051
Figure 8, showing that simulated alignments are almost
1052
always classified correctly.
1053
We wanted to test that the classification is based on
1054
the pattern of mutation in the alignments, not on a
1055
spurious artifact of differing base composition, sequence
1056
identity, or gap frequency. To do this, we randomly
1057
shuffled each alignment by columns - preserving the
1058
sequence identity in the alignment, while destroying any
1059
correlations in the pattern of mutation. Figure 8shows
1060
that shuffled alignments are classified in the OTH phase,
1061
as expected.
1062
These simulation experiments were iterated during the
1063
development of the approach. They were important guide in
1064
setting the
1065
ad hoc transition probabilities in
1066
each model.
1067
We used these simulation results from RNA-generated
1068
and shuffled data to set crude but reasonable score
1069
thresholds for classification of alignments as RNA. A
1070
threshold of 1.4 bits for the RNA posterior log-odds
1071
scores would determine a minimum error rate area with a
1072
frequency of 0.023 false positives and 0.081 false
1073
negatives. In whole-genome scans, we want to push the
1074
rate of false positives down, even at the expense of
1075
increasing the number of false negatives. To reduce the
1076
false positive frequency to 0.005, we would need a cutoff
1077
of 5.8 bits, which increases the false negative frequency
1078
to 0.14. We set a cutoff of 5 bits for the remainder of
1079
the results in this paper. These error rates are probably
1080
somewhat pessimistic. Figure 8shows that the rate of
1081
false positive RNA classifications of COD or
1082
OTH-generated data is lower (about 0.001) at the 5-bit
1083
cutoff that we set based on finding false positives in
1084
shuffled RNA-generated alignments.
1085
1086
1087
Tests on simulated genomes
1088
To get a better idea of the false positive rate in
1089
whole genome screens, where the background is dominated
1090
by sequences other than RNAs, we used the COD and OTH
1091
models to simulate two aligned complete "pseudobacterial"
1092
genomes with no structural RNA genes present. The aligned
1093
pseudo-genomes have the following characteristics [ 30 31
1094
] : ~2 megabases in total length, with coding regions
1095
generated from the COD model with length distributions
1096
distributed normally around a mean length of ~900
1097
nucleotides, and intergenic regions generated using the
1098
OTH model with length distributions distributed normally
1099
around the mean length of ~100 nucleotides (thus, an
1100
overall coding density of ~90%).
1101
Because the parameters of the models are ultimately
1102
dependent upon the BLOSUM62 amino acid scoring matrix,
1103
the average percent identity of the aligned genomes was
1104
only 41% in "coding" regions and 36% in "intergenic"
1105
regions. This is a weakness in the simulation, because in
1106
a real genome screen, we would be looking at alignments
1107
in the 65-85% nucleotide identity range, as we discuss
1108
later in the paper.
1109
The parameters also gave a simulated pair of genomes
1110
with an overall GC content of 47.25%. From previous
1111
experience [ 12 ] , we expected that genomic sequences
1112
with high GC content might tend to be misclassified as
1113
RNAs. We therefore devised a crude way of modifying the
1114
parameters of the models to correspond to different base
1115
compositions, by expressing various joint probabilities
1116
instead as a function of conditional probabilities, i.e.
1117
for two aligned codons
1118
c, c' :
1119
1120
P (
1121
c, c' ) =
1122
P (
1123
c|c' )
1124
P (
1125
c' ),     (16)
1126
where the
1127
P (
1128
c' ) are codon frequencies obtained
1129
by marginalization of the joint probabilities, and the
1130
information about the mutation rate is in the conditional
1131
probabilities
1132
P (
1133
c|c' ). We then can modify overall
1134
frequencies to a different set while keeping the same
1135
conditional probabilities. The joint probabilities are
1136
then recalculated as:
1137
1138
where is obtained as the product of the single
1139
nucleotide frequencies for the new GC composition. This
1140
approximation for the codon probabilities could be
1141
refined to better reflect the actual codon bias of the
1142
genome.
1143
This correction of probabilities can be performed for
1144
both codon-codon probabilities and independent mutation
1145
probabilities. Using this modification to the COD and OTH
1146
models, we generated two more pairs of genomes which had
1147
overall GC contents of 57.7% and 38.9%. We then ran QRNA
1148
using its default parameters (i.e. uncorrected for GC
1149
composition) across these three aligned simulated genomes
1150
in scanning mode, using a window 200 nucleotides wide,
1151
moving 50 nucleotides at a time, and counted the number
1152
of times a window was called RNA with a score of ≥ 5
1153
bits. All such windows are false positives, because the
1154
simulated genomes have no RNA component.
1155
The observed false positive numbers for the 2 Mb
1156
low-GC, average-GC, and high-GC simulated genomes were 8,
1157
14, and 21 respectively, or about 4-10 per megabase of
1158
pairwise alignment analyzed. This indicates that
1159
specificity degrades with higher GC compositions. We
1160
reanalyzed the high-GC genome using the high-GC parameter
1161
set that generated it (i.e. parameters corrected for GC
1162
composition), and saw one false positive. This indicates
1163
that setting the parameters of the three models to be
1164
appropriate for the GC composition of the input alignment
1165
should improve the effectiveness of the approach;
1166
however, our current method for doing this may be too
1167
crude.
1168
1169
1170
Tests on known RNAs
1171
To test the sensitivity and specificity of our method
1172
on real RNAs, we analyzed pairwise alignments taken from
1173
a multiple alignment of 63 eukaryotic SRP-RNAs [ 32 ]
1174
(also known as 7SL RNA), and a multiple alignment of 51
1175
eukaryotic RNaseP RNAs [ 33 ] . These RNA genes were
1176
chosen because they are independent from the set of tRNAs
1177
and rRNAs used to train the RNA model.
1178
We did two different types of experiments. In the
1179
first, we used the pairwise alignments as given in the
1180
curated multiple sequence alignment. These pairwise
1181
alignments are an ideal case for QRNA, because they are
1182
structurally aligned. In the second set of experiments,
1183
we took each known RNA in turn and used it as a BLASTN
1184
query against the rest of the RNAs, then classified all
1185
significant alignments with QRNA. This is a more
1186
realistic scenario for QRNA; a BLASTN primary sequence
1187
alignment may be fragmentary and/or not entirely
1188
structurally correct. All alignments were scored with
1189
QRNA using default parameters.
1190
For the first experiment, we used QRNA to score in
1191
full (i.e. not with a scanning window) the 2, 016
1192
different structural pairwise alignments for SRP-RNAs,
1193
and the 1, 325 structural pairwise alignments for RNaseP
1194
RNAs. The manually curated RNA structural alignments have
1195
a wide range of sequence diversity that extends from 100%
1196
to 0% pairwise identity. The number of pairwise
1197
alignments that were classified as RNA with a score of
1198
> 5 bits was counted, and these counts were binned by
1199
ranges of percent identity. The fraction of alignments
1200
classified as RNA is a measure of the sensitivity of
1201
QRNA. To measure specificity, we randomly shuffled each
1202
pairwise alignment by columns, which destroys the nested
1203
RNA structure correlations but retains the percentage
1204
identity of the alignments. Shuffled alignments that are
1205
classified by QRNA as RNA are false positives. The
1206
results in Table 2show that QRNA can detect about half of
1207
the alignments as RNAs at a wide range of percent
1208
identities; however, specificity seriously degrades for
1209
alignments over 90% identity.
1210
In the second set of experiments, we have taken each
1211
single RNA gene in a given family (both for the SRP-RNA
1212
and the RNaseP RNA families) and used it as a BLASTN
1213
query against all genes in the same family (including
1214
itself). We used WUBLASTN (2.0MP-WashU, 12 Feb 01
1215
version, default parameters and scoring matrix) and
1216
retained those alignments that were longer than 50
1217
nucleotides, with an E-value of ≤ 0.01, and with an
1218
overall similarity of ≥ 65%. Of the 3, 342 possible
1219
comparisons, this produced 1, 003 alignments (586 for SRP
1220
RNAs, and 417 for RNaseP RNAs). These were then scored by
1221
QRNA to measure sensitivity, and then shuffled by columns
1222
and rescored to measure specificity. Table 3shows that
1223
specificity follows the same trend we saw in the
1224
structural alignments, with a sharp degradation in
1225
specificity over 90% identity. Sensitivity, however,
1226
drops off steeply in the other direction; as percent
1227
identity declines, sensitivity decreases.
1228
We also analyzed the dependency of sensitivity and
1229
specificity with the GC content of the alignments, both
1230
for structural and BLASTN-type alignments. We observe a
1231
similar trend for both types of alignments; both
1232
sensitivity and specificity reach their best values for
1233
GC contents ranging from 45% to 60%. Specificity drops
1234
faster for high GC content alignments, which is
1235
consistent with the fact that unstructured sequences with
1236
high GC content tend to produce more spurious secondary
1237
structure predictions than low GC content sequences [ 12
1238
] .
1239
These results show two competing forces at play. In
1240
order to be detected by QRNA, two RNA sequences must be
1241
similar enough to produce a BLASTN alignment that is
1242
reasonably correct and extensive, but they also must be
1243
dissimilar enough to show compensatory mutations in
1244
base-paired positions of the RNA secondary structure.
1245
There is therefore a "sweet spot" of percent identity in
1246
which QRNA performance is optimal. Based on these
1247
results, we choose to analyze only BLASTN pairwise
1248
alignments of between 65% and 85% nucleotide identity
1249
with QRNA. However, we do not fully understand the
1250
degradation of specificity at high percentage identities
1251
(see Discussion).
1252
1253
1254
Tests on a whole genome
1255
To test QRNA performance in a realistic whole genome
1256
screen, we used it to analyze the
1257
Escherichia coli genome by
1258
comparisons to the related genome of
1259
Salmonella typhi . We compared QRNA
1260
annotation to the curated annotation of known coding
1261
genes, ncRNAs, and intergenic regions [ 34 ] . The
1262
feature tables for version M52 of the
1263
E. coli genome includes 115 known
1264
RNA genes and 4, 290 known coding genes (ORFs). The known
1265
RNA genes include 22 rRNAs, 86 tRNAs, and 7 miscellaneous
1266
RNAs (RNase P, for example). At least 4 other known RNA
1267
genes [ 1 35 36 ] - csrB, oxyS, micF, and rprA - were not
1268
present in the M52 feature table.
1269
We split the
1270
E. coli genome in three different
1271
components: 115 RNA features (a total of 40 kb, 1% of the
1272
genome), 4290 ORF features (4090 kb, 88% of the genome),
1273
and 2367 intergenic sequences of length ≥ 50 nt (500 kb,
1274
11% of the genome). Each sequence was compared against
1275
the complete
1276
Salmonella typhi genome (Sanger
1277
Centre, unpublished genome data,
1278
http://www.sanger.ac.uk/Projects/S_typhi) using WUBLASTN,
1279
and all alignments of ≥ 50 nt with an E-value of ≤ 0.01
1280
and a percent identity of ≥ 65% and ≤ 85% were kept. This
1281
resulted in 354 alignments to RNAs, 4, 946 alignments to
1282
ORFs, and 11, 509 alignments to intergenic regions. (The
1283
large number of alignments in intergenic regions is due
1284
to repetitive sequence families.) These alignments were
1285
then classified by QRNA in scanning mode, scoring
1286
overlapping windows of 200 nucleotides sliding 50
1287
nucleotides at a time, and all windows with scores of ≥ 5
1288
bits for one of the three models were annotated as RNA,
1289
COD, or OTH correspondingly.
1290
We then looked at these data in two ways. First, how
1291
many of the known features (ncRNAs and ORFs) were
1292
detected correctly? We counted a known feature as
1293
"detected" as RNA or COD if it had one or more
1294
overlapping QRNA annotations of that type. It is possible
1295
for different parts of a long feature (especially the
1296
ORFs) to be detected with different annotations. For the
1297
115 known ncRNAs, 33 have one or more BLASTN alignments
1298
to
1299
S. typhi in the right range, and
1300
all 33 were annotated as RNA by QRNA; none were called
1301
COD. For the 4290 known ORFs, 3181 had BLASTN alignments
1302
in the right range; 2876 were called COD, 20 were called
1303
RNA, and 184 were called both COD and RNA.
1304
These results indicate that the sensitivity of the
1305
program is largely dependent upon the availability of
1306
appropriate comparative sequence data - only 29% of the
1307
115 known RNAs were detected, but invariably (in this
1308
case), a failure to detect an RNA resulted from the lack
1309
of an appropriate BLASTN alignment to analyze (of 65-85%
1310
identity). Therefore sensitivity could presumably be
1311
improved by using multiple comparative genome sequences
1312
at different evolutionary distances.
1313
A second way to look at the data is from the
1314
perspective of how many of QRNA's annotations are
1315
correct. In a postprocessing step, any overlapping
1316
windows with the same QRNA annotation were merged into a
1317
longer annotated region. A total of 148 regions are
1318
annotated in the ncRNA sequence fraction: 33 as RNA, none
1319
as COD, and 115 as OTH. 7422 regions are annotated in the
1320
ORF sequence fraction: 88 as RNA, 3397 as COD, and 3937
1321
as OTH. 1974 regions are annotated in the intergenic
1322
sequence fraction: 351 as RNA, 61 as COD, and 1562 as
1323
OTH. Therefore QRNA annotated a total of 5614 sequence
1324
regions as OTH, of which 3937 (70%) are actually in known
1325
ORFs-this means we must interpret an OTH annotation as a
1326
catch-all "don't know" category, rather than as a
1327
conserved noncoding sequence of potential interest. QRNA
1328
annotated a total of 3458 regions as COD, of which 3397
1329
(98%) are in known ORFs. The other 61 COD annotated
1330
regions could either be false positive calls, or could be
1331
previously undetected small coding genes.
1332
Most interestingly, QRNA annotated a total of 472
1333
regions of
1334
E. coli as RNA, of which only 33
1335
(7%) are in known RNAs. It is not possible to
1336
definitively accept or reject the rest of these
1337
annotations without additional experimental data. The 88
1338
RNA annotations that overlap known ORFs may be false
1339
positives, or may indicate cis-regulatory RNA structures
1340
that overlap coding regions. It is intriguing that a
1341
disproportionate number of QRNA's RNA annotations (74%,
1342
351/472) were in the "intergenic" data fraction, which is
1343
only 11% of the genome - which is what we would expect to
1344
see if there were a fair number of undetected RNA
1345
features in the genome.
1346
We examined many of these 351 regions by eye. Four of
1347
them are the four ncRNA genes (csrB, oxyS, micF, and
1348
rprA) that were not included in the M52 feature table for
1349
1350
E. coli . Others are repetitive
1351
sequence families with conserved palindromic sequence,
1352
such as BIMEs [ 37 ] . Some correspond to known
1353
cis-regulatory RNA structures such as ρ-independent
1354
terminators (which have an RNA stem loop structure) and
1355
transcriptional attenuators. For about half of these
1356
regions, we cannot exclude the possibility that they
1357
correspond to novel RNAs, and we cannot assign a known
1358
biological role to them without additional computational
1359
or experimental evidence. A more in-depth QRNA screen of
1360
E. coli for novel ncRNAs using
1361
multiple comparative genomes from γ-proteobacteria,
1362
accompanied by experimental evidence that many of the
1363
predicted RNAs are indeed novel ncRNA genes, is presented
1364
elsewhere (E.R., R.J. Klein, T.A. Jones, and S.R.E.,
1365
manuscript submitted).
1366
1367
1368
1369
Discussion
1370
There are a number of ways in which we could improve
1371
QRNA. The three probabilistic models are calibrated to a
1372
fixed evolutionary distance. We used the BLOSUM62
1373
substitution matrix to define the fixed evolutionary
1374
distance of our three models, and it is now quite clear
1375
that this is the wrong distance. Our models generate
1376
pairwise alignments of about 40% sequence identity. We
1377
expect on theoretical grounds that this is where the models
1378
would perform optimally on real input alignments. However,
1379
BLASTN cannot detect RNA sequences that are this diverged.
1380
Our evaluations indicated a sweet spot of 65%-85% identity
1381
for QRNA to work best in its current formulation. We
1382
suspect that we could obtain some improvement by choosing a
1383
substitution matrix corresponding to more closely related
1384
nucleotide sequences.
1385
In principle QRNA may also be useful as a coding-region
1386
genefinder. The coding model is a fully probabilistic
1387
formalization of comparative analysis ideas used by the
1388
genefinder CRITICA [ 16 ] , and by comparative exon finding
1389
approaches such as the EXOFISH vertebrate/
1390
Tetraodon comparison [ 38 ] and the
1391
human/mouse comparison in [ 39 ] . In the
1392
E. coli whole genome screen, the
1393
sensitivity and specificity of QRNA coding annotations seem
1394
quite high. We have not yet attempted to optimize the
1395
performance of QRNA for this purpose.
1396
In terms of other QRNA improvements, it should be
1397
advantageous to make the emission and transition parameters
1398
of the models conditional on a parametric evolutionary
1399
distance. We could then optimize a maximum likelihood
1400
distance separately for each input alignment (or,
1401
marginalize over all distances, in a more Bayesian
1402
approach). This should widen the 65-85% alignment identity
1403
window that QRNA works best in - in particular, by
1404
constructing models more appropriate for nearly identical
1405
sequences, where we currently have high false positive
1406
rates.
1407
It would be good to have more theory to guide how we
1408
produce divergence-matched transition probability
1409
parameters for the three models. We suspect our
1410
ad hoc estimation may be causing the
1411
RNA model to be favored artifactually in certain cases
1412
(less gappy alignments and longer alignments), elevating
1413
our false positive rate.
1414
We also made a number of simplifying independence
1415
assumptions in trying to calculate QRNA's parameters all
1416
from a single chosen amino acid substitution matrix. Some
1417
of these assumptions probably reduce our performance. It
1418
would be desirable to move towards estimating parameters
1419
based on real datasets of aligned nucleotide sequences, if
1420
large enough datasets could be amassed.
1421
We are relying on BLASTN to produce approximately
1422
correct pairwise alignments of coding regions or RNA
1423
structures, even though BLASTN is purely a
1424
position-independent primary sequence alignment program. We
1425
could instead realign the two input sequences using the
1426
pair-grammars. In principle this should increase the
1427
performance of QRNA, particularly for more dissimilar
1428
sequences. Unfortunately, alignment of two sequences to a
1429
pair-SCFG is effectively the Sankoff algorithm [ 40 ] with
1430
time and memory complexity of
1431
O (
1432
L 6) and
1433
O (
1434
L 4), respectively, so we will need a
1435
more clever algorithmic strategy than straightforward
1436
dynamic programming (if, indeed, dynamic programming RNA
1437
structure alignment in a four-dimensional hypercube can be
1438
called "straightforward").
1439
Because QRNA detects conserved RNA secondary structure,
1440
it is not expected to detect ncRNAs that apparently lack
1441
significant intramolecular secondary structure, such as C/D
1442
box small nucleolar RNAs [ 6 ] . Identifying novel
1443
unstructured ncRNAs remains an entirely open problem. A
1444
pure computational approach will probably have to identify
1445
transcriptional signals - promoters, enhancers, and
1446
terminators - and this remains a difficult problem,
1447
particularly in complex genomes. Experimental screens for
1448
novel ncRNAs may prove more fruitful for unstructured
1449
ncRNAs. Expression arrays that pave the entire target
1450
genome with probes can detect novel transcripts [ 41 ] ,
1451
and cDNA libraries that enrich for small, nonpolyadenylated
1452
RNAs can be constructed and EST sequenced [ 42 ] .
1453
QRNA is also expected to identify
1454
cis -regulatory RNA structures in
1455
mRNAs, in addition to structured ncRNA genes.
1456
Distinguishing an ncRNA gene from a
1457
cis -regulatory RNA structure in an
1458
mRNA is nontrivial in absence of experimental evidence.
1459
This cautions against using QRNA for fully automated genome
1460
annotation and "gene counting" exercises in the way that
1461
protein genefinders like GENSCAN are used.
1462
Instead, QRNA is best used as a computational screen for
1463
candidate ncRNA genes, after which candidate loci are
1464
further characterized both computationally and
1465
experimentally before considering them to be "genes". Both
1466
the data presented here and in a second paper detailing a
1467
careful
1468
E. coli genome screen with
1469
experimental verification of many novel ncRNA genes (E.R.,
1470
R.J. Klein, T.A. Jones, and S.R.E., manuscript submitted)
1471
indicate that QRNA can be successfully used in this role.
1472
Although we have much we can do to improve its performance,
1473
we believe QRNA is the first example of a generally
1474
applicable computational genefinder for noncoding RNA
1475
genes. We expect to be able to apply QRNA - based screens
1476
for ncRNAs to a number of organisms as comparative sequence
1477
data become available - including yeast,
1478
Caenorhabditis ,
1479
Drosophila , human, and several
1480
microbial systems.
1481
1482
1483
Conclusions
1484
We have described an algorithm that uses three different
1485
probabilistic models (for RNA-structure-constrained,
1486
coding-constrained, and position-independent evolution) to
1487
examine the pattern of mutations in a pairwise sequence
1488
alignment. The alignment is classified as RNA, coding, or
1489
other, according to the Bayesian posterior probability of
1490
each model. We have implemented this algorithm as a
1491
program, QRNA, which we consider to be a prototype
1492
structural ncRNA genefinding program.
1493
1494
1495
1496
1497