Background
In the 1830s, Charles Darwin's investigation of the
Galapagos finches led to an appreciation of the structural
characteristics that varied and were conserved among the
birds in this landmark comparative study. His analysis of
the finches' structural features was the foundation for his
theory on the origin and evolution of biological species
[1]. Today, 150 years later, our understanding of cells
from a molecular perspective, in parallel with the
technological advances in nucleic acid sequencing and
computer hardware and software, affords us the opportunity
to determine and study the sequences for many genes from a
comparative perspective, followed by the computational
analysis, cataloging, and presentation of the resulting
data on the World Wide Web.

In the 1970s, Woese and Fox revisited Darwinian
evolution from a molecular sequence and structure
perspective. Their two primary objectives were to determine
phylogenetic relationships for all organisms, including
those that can only be observed with a microscope, using a
single molecular chronometer, the ribosomal RNA (rRNA), and
to predict the correct structure for an RNA molecule, given
that the number of possible structure models can be larger
than the number of elementary particles in the universe. For
the first objective, they rationalized that the origin of
species and the related issue of the phylogenetic
relationships for all organisms are encoded in the
organism's rRNA, a molecule that accounts for two-thirds of
the mass of the bacterial ribosome (ribosomal proteins
comprise the other one-third). One of their first and most
significant findings was the discovery of the third kingdom
of life, the Archaebacteria (later renamed Archaea)
[2, 3, 4]. Subsequently, the analysis of ribosomal RNA
produced the first phylogenetic tree, based on the analysis
of a single molecule, that included prokaryotes, protozoa,
fungi, plants, and animals [4]. These accomplishments
were the foundation for the subsequent revolution in
rRNA-based phylogenetic analysis, which has resulted in the
sequencing of more than 10,000 16S and 16S-like rRNA and
1,000 23S and 23S-like rRNA genes, from laboratories trying
to resolve the phylogenetic relationships for organisms
that occupy different sections of the big phylogenetic
tree.

The prediction of tRNA structure with a comparative
perspective in the 1960s [5, 6, 7, 8, 9] and subsequent
validation with tRNA crystal structures [10, 11]
established the foundation for Woese and Fox in the 1970s
to begin predicting 5S rRNA structure from the analysis of
multiple sequences. They realized that all sequences within
the same functional RNA class (in this case, 5S rRNA) will
form the same secondary and tertiary structure. Thus, for
all of the possible RNA secondary and tertiary structures
for any one RNA sequence, such as for Escherichia coli
5S rRNA, the correct structure for this sequence will be
similar to the correct secondary structure for every other
5S rRNA sequence [12, 13].

While the first complete 16S rRNA sequence was
determined for E. coli in 1978 [14], the first
covariation-based structure models were not predicted until
more 16S rRNA sequences were determined [15, 16, 17]. The
first 23S rRNA sequence was determined for E. coli in
1980 [18]; the first covariation-based structure models
were predicted the following year, once a few more complete
23S rRNA sequences were determined [19, 20, 21]. Both of
these comparative structure models were improved as the
number of sequences with different patterns of variation
increased and the covariation algorithms were able to
resolve different types and extents of covariation (see
below). Initially, the alignments of 16S and 23S rRNA
sequences were analyzed for the occurrence of G:C, A:U, or
G:U base pairs that occur within potential helices in the
16S [15, 22] and 23S [19] rRNAs. The 16S and 23S rRNA
covariation-based structure models have undergone numerous
revisions [23, 24, 25, 26, 27, 28]. Today, with a
significantly larger number of sequences and more advanced
covariation algorithms, we search for all positional
covariations, regardless of the types of pairings and the
proximity of those pairings with other paired and unpaired
nucleotides. The net result is a highly refined secondary
and tertiary covariation-based structure model for 16S and
23S rRNA. While the majority of these structure models
contain standard G:C, A:U, and G:U base-pairings arranged
into regular secondary structure helices, there were many
novel base-pairing exchanges (e.g., U:U <-> C:C;
A:A <-> G:G; G:U <-> A:C; etc.) and base pairs that form
tertiary or tertiary-like structural elements. Thus, the
comparative analysis of the rRNA sequences and structures
has resulted in the prediction of structure and the
identification of structural motifs [29].

Beyond the comparative structure analysis of the three
ribosomal RNAs and transfer RNA, several other RNAs have
been studied with this perspective. These include the group
I [30, 31, 32, 33] and II [34, 35] introns, RNase P
[36, 37, 38], telomerase RNA [39, 40], tmRNA [41], U RNA
[42], and the SRP RNA [43]. The comparative sequence
analysis paradigm has been successful in determining
structure over this wide range of RNA molecules.

Very recently, the accuracy of the ribosomal RNA
comparative structure models has been determined [Gutell
et al., manuscript in preparation]:
97-98% of the secondary and tertiary structure base pairs
predicted with covariation analysis are present in the
crystal structures for the 30S [44] and 50S [45]
ribosomal subunits. Thus, the underlying premise for
comparative analysis and our implementation of this method,
including the algorithms, the sequence alignments, and the
large collection of comparative structure models with
different structural variations for each of the different
RNA molecules (e.g., 16S and 23S rRNAs), have been
validated.

The highly refined and accurate analysis of phylogenetic
relationships and RNA structure with comparative analysis
can require very large, phylogenetically and structurally
diverse data sets that contain raw and analyzed data that
are organized for further analysis and interpretation. With
these requirements for our own analysis, and the utility of
this comparative information for the greater scientific
community, we have been assembling, organizing, analyzing,
and disseminating this comparative information. Initially,
a limited amount of sequence and comparative structure
information was available online for our 16S (and 16S-like)
[46, 47] and 23S (and 23S-like) ribosomal RNAs
[48, 49, 50, 51, 52] and the group I introns [33]. In
parallel, two other groups have been providing various
forms of ribosomal RNA sequence and structure data (the
RDP/RDP II [53, 54] and Belgium (5S/5.8S [55], small
subunit [56, 57] and large subunit [58, 59]) groups). With
significant increases in the number of sequences available
for the RNAs under study here, improved programs for the
analysis of these data, and better web presentation
software, we have established a new "Comparative RNA Web"
(CRW) Site (http://www.rna.icmb.utexas.edu/). This resource
has been available to the public since January 2000.

Results and Discussion

1. Comparative structure models

1A. Current structure models for reference organisms
The first major category, Comparative Structure
Models (http://www.rna.icmb.utexas.edu/CSI/2STR/), contains
our most recent 16S and 23S rRNA covariation-based
structure models, which were adapted from the original
Noller & Woese models (16S [15, 22] and 23S [19]
rRNA), and the structure models for 5S rRNA [12],
tRNA [5, 6, 7, 8, 9], and the group I [32] and group
II [34] introns, as determined by others. This
collection of RNA structure models was predicted with
covariation analysis, as described at the CRW Site
Methods Section (http://www.rna.icmb.utexas.edu/METHODS/)
and in several publications (see below).

Briefly, covariation analysis, a specific
application of comparative analysis (as mentioned
earlier), searches for helices and base pairs that are
conserved in different sequences that form the same
functionally equivalent molecule (e.g., tRNA sequences).
It was determined very early in this methodology that the
correct helix is the one that contains positions within
a potential helix that vary in composition while
maintaining G:C, A:U, and G:U base pairs. As more
sequences for a given molecule were determined, we
developed newer algorithms that searched for positions
in an alignment of homologous sequences that had
similar patterns of variation. This latter
implementation of the covariation analysis helped us
refine the secondary and tertiary structure models by
eliminating previously proposed base pairs that are not
underscored with positional covariation and identifying
new secondary and tertiary structure base pairs that do
have positional covariation [19, 70, 71, 72]. Our
newest covariation analysis methods associate
color-coded confidence ratings with each proposed base
pair (see reference structure diagrams and Section 2A,
"Nucleotide Frequency Tabular Display," for more
details). One exception to this is the tRNA analysis,
which was initially performed with the Mixy
chi-square-based algorithm [71], and thus the color
codes are based on that analysis.

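The two flavors of covariation analysis described above can be
illustrated with a small sketch. It is not the CRW implementation:
mutual information is used here only as a common, generic measure of
"similar patterns of variation" between two alignment columns, and the
function names and the toy alignment are hypothetical.

```python
from collections import Counter
from math import log2

# Base pairs treated as canonical in the text: G:C, A:U, and G:U (either orientation).
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def column(alignment, i):
    """Return column i (0-based) of an alignment given as equal-length strings."""
    return [seq[i] for seq in alignment]


def canonical_fraction(alignment, i, j):
    """Early-style check: fraction of sequences with G:C, A:U, or G:U at (i, j)."""
    pairs = list(zip(column(alignment, i), column(alignment, j)))
    return sum(p in CANONICAL for p in pairs) / len(pairs)


def mutual_information(alignment, i, j):
    """Generic covariation measure (in bits) between columns i and j.

    Assumption: this stands in for the newer 'similar pattern of variation'
    algorithms; the actual CRW scoring is described on the CRW Methods pages.
    """
    n = len(alignment)
    ci = Counter(column(alignment, i))
    cj = Counter(column(alignment, j))
    cij = Counter(zip(column(alignment, i), column(alignment, j)))
    mi = 0.0
    for (a, b), n_ab in cij.items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab / ((ci[a] / n) * (cj[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 3 covary, columns 1 and 2 are invariant.
toy = ["GAAC", "CAAG", "AAAU", "UAAA"]
print(canonical_fraction(toy, 0, 3), mutual_information(toy, 0, 3))
print(canonical_fraction(toy, 1, 2), mutual_information(toy, 1, 2))
```

In this toy case, columns 0 and 3 change together while always forming
a canonical pair, so both measures are high; columns 1 and 2 never
vary, so they carry no covariation signal at all.
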
When implemented properly, covariation analysis can
predict RNA structure with extreme accuracy. All of the
secondary structure base pairs and a few of the
tertiary structure base pairs predicted with
covariation analysis [5, 6, 7, 8, 9, 71, 72, 73, 74] are
present in the tRNA crystal structure [10, 11]. The
analysis of fragments of 5S rRNA [75] and the group I
intron [76] resulted in similar levels of success.
Most recently, the high-resolution crystal structures
for the 30S [44] and 50S [45] ribosomal subunits
have given us the opportunity to evaluate our rRNA
structure models. Approximately 97-98% of the 16S and
23S rRNA base pairs predicted with covariation analysis
are in these crystal structures (Gutell et al.,
manuscript in preparation). This congruency between the
comparative model and the crystal structure validates the
comparative approach, the covariation algorithms, the
accuracy of the juxtapositions of sequences in the
alignments, and the accuracy of all of the comparative
structure models presented herein and available at the
CRW Site. However, while nearly all of the base pairs
predicted with comparative analysis are present in the
crystal structure solution, some interactions in the
crystal structure, which are mostly tertiary
interactions, do not have similar patterns of variation
at the positions that interact (Gutell et al.,
manuscript in preparation). Thus, covariation analysis is
unable to predict many of the tertiary base pairings in the
crystal structure, although it does identify nearly all
of the secondary structure base pairings.

Beyond the base pairs predicted with covariation
analysis, comparative analysis has been used to predict
some structural motifs that are conserved in structure
although they do not necessarily have similar patterns
of variation at the two paired positions. Our analyses
of these motifs are available in the "Structure,
Motifs, and Folding" section of our CRW Site.

While the secondary structure models for the 16S,
23S and 5S rRNAs, group I and II introns, and tRNA are
available at the "Current Structure Models for
Reference Organisms" page, our primary focus has been
on the 16S and 23S rRNAs. Thus, some of our subsequent
analysis and interpretation will emphasize only these
two RNAs.

Each RNA structure model presented here is based
upon a single reference sequence, chosen as the most
representative for that molecule (Table 1); for
example, E. coli is the preferred choice
as the reference sequence for rRNA (5S, 16S, and 23S),
based on the early and continued research on the
structure and functions of the ribosome [77, 78].
Each of the six structure models (5S, 16S and 23S rRNA,
group I and II introns, and tRNA) in the "Current
Structure Models for Reference Organisms" page
(http://www.rna.icmb.utexas.edu/CSI/2STR/) contains six or
seven different diagrams for that molecule: Nucleotide,
Tentative, Helix Numbering, Schematic, Histogram,
Circular, and Matrix of All Possible Helices.

Nucleotide: The standard format for
the secondary structure diagrams with nucleotides
(Figures 2A, 2B, and 2C) reveals our confidence for
each base pair, as predicted by covariation analysis.
Base pairs with a red identifier ("-" for G:C and A:U
base pairs, small closed circles for G:U, large open
circles for A:G, and large closed circles for any other
base pair) have the greatest amount of covariation;
thus, we have the most confidence in these predicted
base pairs. Base pairs with a green, black, grey, or
blue identifier have progressively lower covariation
scores and are predicted due to the high percentages of
A:U + G:C and/or G:U at these positions. The most
current covariation-based E. coli 16S and 23S rRNA
secondary structure models are shown in Figures 2A, 2B,
and 2C. Note that the majority of the base pairs in the
16S and 23S rRNA have a red base pair symbol, our
highest rating. These diagrams are the culmination of
twenty years of comparative analysis. Approximately
8500 16S and 16S-like rRNA sequences and 1050 23S and
23S-like rRNA sequences were collected from all
branches of the phylogenetic tree, as shown in Section
2, "Nucleotide Frequency and Conservation Information"
and in Table 2. These sequences have been aligned and
analyzed with several covariation algorithms, as
described in more detail in the "Predicting RNA
Structure with Comparative Methods" section of the CRW
Site (http://www.rna.icmb.utexas.edu/METHODS/) and in
Section 2A. All of the secondary structure diagrams
from the "Current Structure Models for Reference
Organisms" page are available in three formats. The
first two are standard printing formats, PostScript
(http://www.adobe.com/products/postscript/main.html) and
PDF (http://www.adobe.com/products/acrobat/adobepdf.html).
The third, named "bpseq," is a simple text format that
contains the sequence, one nucleotide per line, its
position number, and the position number of the pairing
partner (or 0 if that nucleotide is unpaired in the
covariation-based structure model).

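As an illustration of how simple the "bpseq" format is to consume, the
following sketch reads such a file into a sequence and a list of base
pairs. The text above does not state the column order, so the order
used here (position, nucleotide, partner) is an assumption, as is the
decision to skip any line that does not match that shape (for example,
header lines). The function name and the seven-nucleotide example are
hypothetical.

```python
def parse_bpseq(text):
    """Parse bpseq-style text into (sequence, base_pairs).

    Assumed line layout: '<position> <nucleotide> <partner>', where a
    partner of 0 means the nucleotide is unpaired. Lines that do not fit
    this layout are ignored.
    """
    sequence = []
    partner = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or not (fields[0].isdigit() and fields[2].isdigit()):
            continue  # not a data line under the assumed layout
        pos, nt, mate = int(fields[0]), fields[1], int(fields[2])
        sequence.append(nt)
        partner[pos] = mate
    # Report each pair once, as (5' position, 3' position).
    pairs = sorted((i, j) for i, j in partner.items() if 0 < i < j)
    return "".join(sequence), pairs


example = """\
1 G 7
2 G 6
3 A 0
4 A 0
5 A 0
6 C 2
7 C 1
"""
seq, pairs = parse_bpseq(example)
print(seq, pairs)
```

The pair list produced this way is a convenient input for any analysis
that only needs positions and pairing partners.
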
Tentative: In addition to the 16S
and 23S rRNA structure models, we have also identified
some base pairs in the 16S and 23S rRNAs that have a
lower, although significant, extent of covariation.
These are considered 'tentative' and are shown on
separate 16S and 23S rRNA secondary structure diagrams
(http://www.rna.icmb.utexas.edu/CSI/2STR/). These base
pairs and base triples have fewer coordinated changes
(or positional covariations) and/or a higher number of
sequences that do not have the same pattern of
variation present at the other paired position.
Consequently, we have less confidence in these putative
interactions, in contrast with the interactions
predicted in our main structure models.

The Helix Numbering secondary structure
diagrams illustrate our system for uniquely and
unambiguously numbering each helix in an RNA molecule.
Based upon the numbering of the reference sequence,
each helix is named for the position number at the 5'
end of the 5' half of the helix. For example, the first
16S rRNA helix, which spans E. coli positions
9-13/21-25, is named "9"; the helix at positions
939-943/1340-1344 is named "939". This numbering system
is used in the Nucleotide Frequency Tabular Display
tables (see below). The Schematic versions of the
reference structure diagrams replace the nucleotides with
a line traversing the RNA backbone.

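The helix naming rule can be sketched in a few lines. The grouping of
base pairs into helices below (runs of directly stacked pairs, at least
two pairs long) is a simplifying assumption for illustration, and the
function names are hypothetical; only the naming rule itself (the 5'
position of the 5' half) comes from the text.

```python
def helices_from_pairs(pairs, min_pairs=2):
    """Group (i, j) base pairs, with i < j, into runs of stacked pairs.

    A pair (i, j) extends the current helix when the previous pair is
    (i - 1, j + 1). Runs shorter than min_pairs are dropped.
    """
    helices = []
    for i, j in sorted(pairs):
        if helices and helices[-1][-1] == (i - 1, j + 1):
            helices[-1].append((i, j))
        else:
            helices.append([(i, j)])
    return [h for h in helices if len(h) >= min_pairs]


def helix_name(helix):
    """Name a helix by the position at the 5' end of its 5' half."""
    return str(helix[0][0])


# The two 16S rRNA helices quoted in the text: 9-13/21-25 and 939-943/1340-1344.
pairs = [(9 + k, 25 - k) for k in range(5)] + [(939 + k, 1344 - k) for k in range(5)]
names = [helix_name(h) for h in helices_from_pairs(pairs)]
print(names)
```
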
The "Histogram" and "Circular" diagram formats
(http://www.rna.icmb.utexas.edu/CSI/2STR/) both abstract
the global arrangement of the base pairs. For the
histogram version (Figure 2D), the sequence is
displayed as a line from left (5') to right (3'), with
the secondary structure base pairs shown in blue above
the sequence line; below this line, tertiary structure
base pairs and base triples are shown in red and green,
respectively. The distance from the baseline to the
interaction line is proportional to the distance
between the two interacting positions within the RNA
sequence. In contrast, in the circular diagram, the
sequence is drawn clockwise (5' to 3') in a circle,
starting at the top. Secondary and tertiary base-base
interactions are shown with lines traversing the
circle, using the same coloring scheme as in the
histogram diagram. The global arrangement and
higher-order organization of the base pairs predicted
with covariation analysis are revealed in part in these
two alternative formats. The majority of the base pairs
are clustered into regular secondary structure helices,
and the majority of the helices are contained within
the boundaries of another helix, forming large
cooperative sets of nested helices. The remaining base
pairs form tertiary interactions that either span two
sets of nested helices, forming a pseudoknot, or are
involved in base triple interactions.

In the "Matrix of All Possible Helices" plot
(http://www.rna.icmb.utexas.edu/CSI/2STR/), the same RNA
sequence is extended along the X- and Y-axes, with all
potential helices that are composed of at least four
consecutive Watson-Crick (G:C and A:U) or G:U base
pairs shown below the diagonal line. The helices in the
present comparative structure model are shown above
this line. The number of potential helices is larger
than the actual number present in the
biologically-active structure (see CRW Methods,
http://www.rna.icmb.utexas.edu/METHODS/). For example,
the S. cerevisiae phenylalanine tRNA
sequence, with a length of 76 nucleotides, has 37
possible helices (as defined above); only four of these
are in the crystal structure. The E. coli 16S rRNA,
with 1542 nucleotides (nt), has nearly 15,000 possible
helices; only about 60 of these are in the crystal
structure. For the E. coli 23S rRNA (2904 nt), there
are more than 50,000 possible helices, with
approximately 100 in the crystal structure. The number
of possible secondary structure models is significantly
larger than the number of possible helices, due to the
exponential increase in the number of different
combinations of these helices. The number of different
tRNA secondary structure models is approximately
2.5 × 10^19; there are approximately 10^393 and 10^740
possible structure models for 16S and 23S rRNA,
respectively (see CRW Methods,
http://www.rna.icmb.utexas.edu/METHODS/). Covariation
analysis accurately predicted the structures of the 16S
and 23S rRNAs (see above) from this very large number
of structure models.

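A brute-force enumeration of potential helices, in the spirit of the
matrix plot described above, might look like the following sketch. The
definition of a helix (at least four consecutive G:C, A:U, or G:U
pairs) is taken from the text; counting only maximal runs and requiring
a minimum hairpin loop of three nucleotides are assumptions of this
sketch, so its counts are not expected to reproduce the numbers quoted
above. The function name and the test sequence are hypothetical.

```python
CAN_PAIR = {"GC", "CG", "AU", "UA", "GU", "UG"}


def possible_helices(seq, min_len=4, min_loop=3):
    """List maximal potential helices as (5' start, 3' end, length), 1-based.

    min_loop (assumed, not from the text) is the minimum number of
    nucleotides that must separate the two strands of a pair.
    """
    n = len(seq)

    def can_pair(i, j):
        return j - i > min_loop and seq[i] + seq[j] in CAN_PAIR

    found = []
    for i in range(n):
        for j in range(i + 1, n):
            if not can_pair(i, j):
                continue
            # Start only at the outermost pair of a run, so each run is counted once.
            if i > 0 and j < n - 1 and can_pair(i - 1, j + 1):
                continue
            length = 0
            while can_pair(i + length, j - length):
                length += 1
            if length >= min_len:
                found.append((i + 1, j + 1, length))
    return found


print(possible_helices("GGGGAAAACCCC"))
```

Even this naive quadratic scan makes the point of the paragraph above:
the list of candidate helices grows quickly with sequence length, and
only a small fraction of the candidates can be present in the real
structure.
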
1B. Evolution of the 16S and 23S rRNA comparative structure models
An analysis of the evolution of the
Noller-Woese-Gutell comparative structure models for
the 16S and 23S rRNAs is presented here
(http://www.rna.icmb.utexas.edu/CSI/EVOLUTION/) (H-1B.1).
Our objective is to categorize the improvements in
these covariation-based comparative structure models by
tabulating the presence or absence of every proposed
base pair in each version of the 16S and 23S rRNA
structure models, starting with our first 16S [15]
and 23S [19] rRNA models. Every base pair in each of
the structure models was evaluated against the growing
number and diversity of new rRNA sequences. Proposed
base pairs were taken out of the structure model when
the number of sequences without either a covariation or
a G:C, A:U, or G:U base pair was greater than our
allowed minimum threshold; the nucleotide frequencies
for those base pairs are available from the "Lousy
Base-Pair" tables that are discussed in the next
section. New base pairs were proposed when a (new)
significant covariation was identified with our newer
and more sensitive algorithms that were applied to
larger sequence alignments containing more inherent
variation (see CRW Methods,
http://www.rna.icmb.utexas.edu/METHODS/, for more
detail).

Although other comparative structure models and base
pairs were predicted by other labs, those interactions
are not included in this analysis of the improvements
in our structure models. The four main structure models
for 16S and 23S rRNA are very similar to one another.
The Brimacombe [16, 20] and Strasbourg [17, 21]
structure models were determined independently of ours,
while the De Wachter [58, 79] models were adapted from
our earlier structure models and have incorporated some
of the newer interactions proposed here.

This analysis produced two very large tables with
579 proposed 16S rRNA base pairs evaluated against six
versions of the structure model and 1001 23S rRNA base
pairs evaluated against five versions of the structure
model. Some highlights from these detailed tables are
captured in summary tables (Tables 3a and 3b, and
http://www.rna.icmb.utexas.edu/CSI/EVOLUTION/) that
compare the numbers of sequences and base pairs
predicted correctly and incorrectly for each of the
major versions of the 16S and 23S rRNA structure
models. For this analysis, the current structure model
is considered to be the correct structure; thus, values
for comparisons are referenced to the numbers of
sequences and base pairs in the current structure model
(478 base pairs and approximately 7000 sequences for
16S rRNA, and 870 base pairs and approximately 1050
sequences for 23S rRNA). Three sets of 16S and 23S rRNA
secondary structure diagrams were developed to reveal
the improvements between the current model and earlier
versions: 1) changes since the 1996 published structure
models; 2) changes since 1983 (16S rRNA) or 1984 (23S
rRNA); and 3) all previously proposed base pairs that
are not in the most current structure models
(H-1B.2).

An analysis of these tables reveals several major
conclusions from the evolution of the 16S and 23S rRNA
covariation-based structure models. First,
approximately 60% of the 16S and nearly 80% of the 23S
rRNA base pairs predicted in the initial structure
models appear in the current structure models. The
accuracy of these early models, produced from the
analysis of only two well-chosen sequences, is
remarkable. Second, the accuracy, number of secondary
and tertiary structure interactions, and complexity of
the structure models increase as the number and
diversity of sequences increase and the covariation
algorithms are improved. In addition, some base pairs
predicted in the earlier structure models were removed from
subsequent models due to the large number of exceptions
to the positional covariation at the two paired
positions. Third, the majority of the tertiary
interactions were proposed in the last few versions of
the structure models.

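The bookkeeping behind such a comparison is straightforward set
arithmetic. The sketch below treats the current model as the reference,
as the text does, and reports how many base pairs of an earlier version
survive in it. The function name and the tiny example models are
hypothetical and are not taken from Tables 3a and 3b.

```python
def compare_to_current(version_pairs, current_pairs):
    """Summarize an earlier model version against the current model.

    Both arguments are iterables of (i, j) base pairs in reference numbering.
    """
    version, current = set(version_pairs), set(current_pairs)
    kept = version & current
    return {
        "proposed": len(version),
        "in_current": len(kept),
        "removed_later": len(version - current),
        "added_later": len(current - version),
        "fraction_in_current": len(kept) / len(version) if version else 0.0,
    }


# Hypothetical toy models: three of five early pairs survive, one pair is new.
early = [(1, 10), (2, 9), (3, 8), (20, 30), (21, 29)]
current = [(1, 10), (2, 9), (3, 8), (40, 50)]
summary = compare_to_current(early, current)
print(summary)
```
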
1C. RNA structure definitions
The RNA structure models presented here are composed
of several different basic building blocks (or motifs)
that are described and illustrated at our RNA Structure
Definitions page
(http://www.rna.icmb.utexas.edu/CSI/DEFS/) (H-1C.1-2). The
nucleotides in a comparative structure model can be
either base paired or unpaired. Base paired nucleotides
can be part of either a secondary structure helix (two
or more consecutive, antiparallel and nested base
pairs) or a tertiary interaction, which is a more
heterogeneous collection of base pair interactions.
These include any non-canonical base pair (not a G:C,
A:U, or G:U; e.g., U:U), lone or single base
pairs (when both positions in a base pair are not
flanked by two nucleotides that are base paired to one
another), base pairs in a pseudoknot arrangement, and
base triples (a single nucleotide interacting with a
base pair). Each of these base pair categories has a
unique color code in the illustrations on the "RNA
Structure Definitions" page, which provides multiple
examples of each category from the 16S and 23S rRNA
structure models. In contrast to the nucleotides that
are base paired, nucleotides can also be unpaired in
the comparative structure models. Within this category,
they can be within a hairpin loop (nucleotides capping
the end of a helix), internal loop (nucleotides within
two helices), or in a multi-stem loop (nucleotides
within three or more helices).

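Two of these definitions translate directly into code: whether a pair
is canonical, and whether two pairs are nested or cross each other (the
pseudoknot arrangement). The sketch below flags every pair that crosses
at least one other pair; deciding which side of a crossing is "the"
pseudoknot is a matter of convention that this sketch does not attempt.
Function names and the example pairs are hypothetical.

```python
CANONICAL = {"GC", "CG", "AU", "UA", "GU", "UG"}


def is_canonical(a, b):
    """True for G:C, A:U, and G:U pairs in either orientation."""
    return a + b in CANONICAL


def crosses(p, q):
    """True when pairs p and q interleave (i < k < j < l) instead of nesting."""
    (i, j), (k, l) = sorted([p, q])
    return i < k < j < l


def pseudoknot_candidates(pairs):
    """Return every pair that crosses at least one other pair."""
    return sorted({p for p in pairs for q in pairs if p != q and crosses(p, q)})


# Hypothetical example: (5, 15) crosses the helix (1, 10)/(2, 9);
# the helix (20, 30)/(21, 29) is nested and crosses nothing.
pairs = [(1, 10), (2, 9), (5, 15), (20, 30), (21, 29)]
print(pseudoknot_candidates(pairs))
```
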
2. Nucleotide frequency and conservation information

2A. Nucleotide frequency tabular display
The nucleotide frequency tables appear in two
general presentation modes. In the traditional table,
the nucleotide types are displayed in the columns,
while their frequencies are shown for each alignment in
the rows. The nucleotide frequencies were determined
for single positions, base pairs, and base triples for
a subset of the RNAs in the CRW Site collection
(detailed in Table 1). Single nucleotide frequencies
are available for all individual positions, based upon
the reference sequence, for every RNA in this
collection. Base pair frequencies are presented for a)
all base pairs in the current covariation-based
structure models, b) tentative base pairs predicted
with covariation analysis, and c) base pairs previously
proposed with comparative analysis that are not
included in our current structure models due to a lack
of comparative support from the analysis with our best
covariation methods on our current alignments (named
"Lousy" base pairs). Base triples are interactions
between a base pair and a third unpaired nucleotide;
base triple frequencies are provided for a) base
triples in the current covariation-based structure
models and b) tentative base triples predicted with
covariation analysis.

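The kind of tabulation described above can be sketched as follows for
one base pair in one alignment. This is an illustration only: it
assumes that alignment columns have already been mapped to reference
positions, that gaps are written as "-", and that gapped sequences are
simply left out of the percentages. The function name and the
four-sequence alignment are hypothetical.

```python
from collections import Counter


def pair_frequencies(alignment, pos_i, pos_j):
    """Percentage of each base pair type at 1-based positions (pos_i, pos_j)."""
    counts = Counter()
    for seq in alignment:
        a, b = seq[pos_i - 1], seq[pos_j - 1]
        if "-" in (a, b):
            continue  # assumption: sequences with a gap at either position are skipped
        counts[f"{a}:{b}"] += 1
    total = sum(counts.values())
    # Most frequent pair types first, as in a frequency table row.
    return {bp: 100.0 * n / total for bp, n in counts.most_common()}


toy_alignment = ["GAAU", "GCAU", "ACCU", "UGGA"]
table_row = pair_frequencies(toy_alignment, 1, 4)
print(table_row)
```
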
For each of these frequency tables, the percentages
of each of the nucleotides are determined for multiple
alignments, where the most similar sequences are
organized into the same alignment. For the three rRNAs,
the alignments are partitioned by their phylogenetic
relationships. There is an alignment for the
nuclear-encoded rRNA for each of the three primary
lines of descent ((1) Archaea, (2) Bacteria, and (3)
Eucarya; [80]), each of the two Eucarya organelles
(no alignments yet for the 5S rRNA; (4) Chloroplasts
and (5) Mitochondria), and two larger alignments that
include all of the (6) nuclear-encoded rRNA sequences
for the Archaea, Bacteria, and Eucarya, and (7) these
three phylogenetic groups and the two Eucarya
organelles (Table 2).

For the tRNA and group I and II intron sequences,
the most similar sequences are not necessarily from
similar phylogenetic groups. Instead, the sequences
that are most similar to one another are members of
the same functional and/or structural class. The tRNA
sequences are grouped according to the amino acids that
are bound to the tRNA. Currently, only the type I tRNAs
[81] are included here; the tRNAs are collected in 19
functional subgroup alignments and one total type I
alignment. The group I and II intron alignments are
based on the structural classifications determined by
Michel (group I [32] and group II [34]) and Suh
(group IE [82]). The group I introns are split into
seven alignments: A, B, C1-2, C3, D, E, and unknown.
The group II introns are divided into the two major
subgroups, IIA and IIB (Table 2).

For the standard nucleotide frequency tables
(Highlight 2A (H-2A)), the left frame in the main frame
window ("List Frame") contains the position numbers for
the three types of tables: single bases, base pairs,
and base triples. Clicking on a position, base pair, or
base triple number will bring the detailed nucleotide
occurrence and frequency information to the main window
("Data Frame"; H-2A.1). The collective scoring data
(H-2A.2) used to predict the base pair is obtained,
where available, by clicking the "Collective Score"
link on the right-hand side of the base pair frequency
table.

As discussed in Section 1A, we have established a
confidence rating for the base pairs predicted with the
covariation analysis; a detailed explanation of the
covariation analysis methods and the confidence rating
system will be available in the Methods section of the
CRW Site (http://www.rna.icmb.utexas.edu/METHODS/). The
extent of base pair types and their mutual exchange
pattern (e.g., A:U <-> G:C) is
indicative of the covariation score. This value
increases to the maximum score as the percentage and
the amount of pure covariations (simultaneous changes
at both positions) increase in parallel with a decrease
in the number of single uncompensated changes, and the
number of times these coordinated variations occur
during the evolution of that RNA (for the rRNAs, the
number of times this covariation occurs in the
phylogenetic tree) increases. These scores are
proportional to our confidence in the accuracy of the
predicted base pair. Red, our highest confidence
rating, denotes base pairs with the highest scores and
with at least a few phylogenetic events (changes at
both paired positions during the evolution of that base
pair). The colors green, black, and grey denote base
pairs with a G:C, A:U, and/or G:U in at least 80% of
the sequences and within a potential helix that
contains at least one red base pair. Base pairs with a
green confidence rating have a good covariation score
although not as high as (or with the confidence of) a
red base pair. Black base pairs have a lower
covariation score, while grey base pairs are invariant,
or nearly so, in 98% of the sequences. Finally, blue
base pairs do not satisfy these constraints;
nevertheless, we are confident of their authenticity
due to a significant number of covariations within the
sequences in a subset of the phylogenetic tree or
because they are invariant G:C or A:U pairings in close
proximity to the end of a helix.

The covariation score for each base pair is
619
determined independently for each alignment (
620
e.g., Three Domain/Two Organelle,
621
Three Domain, Archaea,
622
etc. ). The collective score for
623
each base pair is equivalent to the highest ranking
624
score for any one of the alignments. For example, we
625
have assigned our highest confidence rating to the
626
927:1390 base pair in 16S rRNA (Figure 2C; H-2A). Note
627
that the entry for the 927:1390 base pair (H-2A) in the
628
list of base pairs in the left frame is red in the C
629
(or confidence) column. For this base pair, only the T
630
(Three Phylogenetic Domains/Two Organelle) alignment
631
has a significant covariation score (H-2A); thus, only
632
the "T" alignment name is red. Of the nearly 6000
633
sequences in the T alignment, 69% have a G:U base pair,
16.2% an A:U base pair, 6.9% a U:A base pair, and less
than 1% a G:C, C:G, U:U, or G:G base pair (H-2A.1). The
collective scoring data
637
(H-2A.2) reveals that there are 11 phylogenetic events
638
(PE) for the T alignment, while the C1+C3 score is
639
1.00, greater than the minimum value for this RNA and
640
this alignment (a more complete explanation of the
641
collective scoring method is available at CRW Methods
642
http://www.rna.icmb.utexas.edu/METHODS/). Note that the
643
928:1389 and 929:1388 base pairs are also both red.
644
Here, six of the seven alignments have significant
645
extents of covariation for both base pairs and are thus
646
red. Each of the red alignments has at least two base
pair types (e.g., G:C and A:U) that occur frequently,
at least three phylogenetic events, and C1+C3 scores
>= 1.5.
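The tabulation described above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy alignment, and the ordering of the confidence colors are assumptions made here, and the CRW scoring itself (C1+C3 scores, phylogenetic events) is not reproduced.

```python
from collections import Counter

# Hypothetical ranking of the confidence colors, best first. This ordering
# is an assumption for the sketch, not something defined by the CRW Site.
RATING_ORDER = ["red", "green", "black", "grey", "blue"]


def base_pair_frequencies(sequences, i, j):
    """Return {pair_type: percent} for alignment columns i and j (0-based)."""
    counts = Counter(f"{seq[i]}:{seq[j]}" for seq in sequences)
    total = sum(counts.values())
    return {pair: 100.0 * n / total for pair, n in counts.items()}


def collective_rating(ratings_by_alignment):
    """Pick the highest-ranking rating found in any single alignment."""
    return min(ratings_by_alignment.values(), key=RATING_ORDER.index)


# Toy alignment: four aligned sequences; columns 0 and 3 form the base pair.
toy = ["GAAU", "GCCU", "ACCU", "GAAC"]
freqs = base_pair_frequencies(toy, 0, 3)

# Hypothetical per-alignment ratings keyed by alignment name.
best = collective_rating({"T": "red", "3": "black", "A": "grey"})
```

Taking the best rating over all alignments mirrors the statement that the collective score equals the highest ranking score for any one alignment.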
651
652
653
2B. Nucleotide frequency mapped onto a
654
phylogenetic tree
655
The second presentation mode maps the same
nucleotide frequency data presented in the previous
section onto the NCBI phylogenetic tree
658
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
659
[ 60 61 ] (see Materials and Methods for details). This
660
display allows the user to navigate through the
661
phylogenetic tree and observe the nucleotide
662
frequencies for any node and all of the branches off of
663
that node. The number of nucleotide substitutions on
each branch is displayed, along with the number of
mutual changes for the base pairs and base triples.
666
Currently, only the 16S and 23S rRNA nucleotide
667
frequencies available in the first tabular presentation
668
format are mapped onto the phylogenetic tree (see Table
669
1). As shown in CRW Section 2B (H-2B), the left frame
670
in the main frame window contains the position numbers
671
for the three types of data, single bases, base pairs,
672
and base triples. Clicking on a position, base pair, or
673
base triple number will initially reveal, in the larger
674
section of the main frame, the root of the phylogenetic
675
tree, with the frequencies for the selected single
676
base, base pair, or base triple. The presentation for
677
single bases (H-2B.1) reveals the nucleotides and their
678
frequencies for all sequences at the root level,
679
followed by the nucleotides and their frequencies for
680
the Archaea, Bacteria, and Eukaryota (nuclear,
681
mitochondrial, and chloroplast). Nucleotides that occur
682
in less than 2%, 1.5%, 1%, 0.5%, 0.2%, and 0.1% of the
683
sequences can be eliminated from the screen by changing
684
the green "percentage limit" selection at the top of
685
the main frame. The number of phylogenetic levels
686
displayed on the screen can also be modulated with the
687
yellow phylogenetic level button at the top of the main
688
frame. Highlight 2B.1 displays only one level of the
689
phylogenetic tree from the point of origin, which is
690
the root level for this example. In contrast, Highlight
691
2B.2 displays four levels from the root. The number of
692
single nucleotide changes on each branch of the
693
phylogenetic tree is shown at the end of the row. For
694
single bases, this number is in black. For base pairs,
695
there are two numbers. The orange number is the count
of changes at only one of the two positions, while the
pink number is the count of mutual changes (or
covariations) that have occurred on that branch of the
tree (H-2B.2). For example, for the 16S rRNA base
700
pair 501:544, there are 65 mutual and 74 single changes
701
in total for the Archaea, Bacteria, Eucarya nuclear,
702
mitochondrial, and chloroplast. Within the Archaea,
703
there are six mutual and five single changes. Five of
704
these mutual changes are within the Euryarchaeota, and
705
four of these are within the Halobacteriales (H-2B.2).
706
The base pair types that result from a mutual change
707
(or strict covariation) are marked with an asterisk
708
("*").
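The distinction between single and mutual changes can be illustrated with a short Python sketch. It assumes that the base pair states at both ends of each branch are already known; how the CRW Site infers those states is not described in this section, and the function names are hypothetical.

```python
def classify_change(parent_pair, child_pair):
    """Classify a base-pair change on one branch.

    Pairs are 2-tuples of nucleotides, e.g. ("G", "C").
    Returns "none", "single", or "mutual".
    """
    changed = sum(p != c for p, c in zip(parent_pair, child_pair))
    return ("none", "single", "mutual")[changed]


def tally_changes(branches):
    """Count single and mutual changes over (parent_pair, child_pair) branches."""
    tally = {"single": 0, "mutual": 0}
    for parent_pair, child_pair in branches:
        kind = classify_change(parent_pair, child_pair)
        if kind != "none":
            tally[kind] += 1
    return tally


branches = [
    (("G", "C"), ("A", "U")),  # both positions change: mutual (covariation)
    (("G", "C"), ("G", "U")),  # one position changes: single
    (("A", "U"), ("A", "U")),  # no change on this branch
]
result = tally_changes(branches)
```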
709
710
711
2C. Secondary structure conservation
712
diagrams
713
Conservation secondary structure diagrams summarize
714
nucleotide frequency data by revealing the nucleotides
715
present at the most conserved positions and the
716
positions that are present in nearly all sequences in
717
the analyzed data set. The conservation information is
718
overlaid on a secondary structure diagram from a
719
sequence that is representative of the chosen group (
720
e.g., E. coli for the gamma
721
subdivision of the Proteobacteria, or
722
S. cerevisiae for the Fungi;
723
H-2C.1). All positions that are present in less than
724
95% of the sequences studied are considered variable,
725
hidden from view, and replaced by arcs. These regions
726
are labeled to show the minimum and maximum numbers of
727
nucleotides present in that region in the group under
728
study (
729
e.g., [0-179] indicates that all
730
sequences in the group contain a minimum of zero
731
nucleotides but not more than 179 nucleotides in a
732
particular variable region). The remaining positions,
733
which are present in at least 95% of the sequences, are
734
separated into four groups (H-2C.1): 1) those which are
735
conserved in 98-100% of the sequences in the group
736
(shown with red upper-case letters indicating the
737
conserved nucleotide); 2) those which are conserved in
738
90-98% of the sequences in the group (shown with red
739
lower-case letters indicating the conserved
740
nucleotide); 3) those which are conserved in 80-90% of
741
the sequences in the group (shown with large closed
742
circles); and 4) those which are conserved in less than
743
80% of the sequences in the group (shown with small
744
open circles).
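The display rules above amount to a small decision table, sketched here in Python. The function name and return labels are hypothetical, and the handling of the exact boundary values (98%, 90%, 80%) is an assumption, since the stated ranges share their endpoints.

```python
def conservation_symbol(present_pct, conserved_pct, nucleotide):
    """Map one alignment position to its symbol in a conservation diagram.

    present_pct:   percentage of sequences that have a nucleotide here
    conserved_pct: percentage of sequences with the most common nucleotide
    nucleotide:    the most common nucleotide at this position
    """
    if present_pct < 95.0:
        return "arc"                # variable region, hidden from view
    if conserved_pct >= 98.0:
        return nucleotide.upper()   # red upper-case letter
    if conserved_pct >= 90.0:
        return nucleotide.lower()   # red lower-case letter
    if conserved_pct >= 80.0:
        return "closed circle"      # large closed circle
    return "open circle"            # small open circle


symbols = [
    conservation_symbol(100.0, 99.5, "g"),
    conservation_symbol(99.0, 93.0, "A"),
    conservation_symbol(97.0, 85.0, "u"),
    conservation_symbol(96.0, 60.0, "c"),
    conservation_symbol(40.0, 99.0, "g"),
]
```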
745
Insertions relative to the reference sequence are
746
identified with a blue line to the nucleotides between
747
which the insertion occurs, and text in small blue font
748
denoting the maximum number of nucleotides that are
749
inserted and the percentage of the sequences with any
750
length insertion at that place in the conservation
751
secondary structure diagram (H-2C.1). All insertions
752
greater than five nucleotides are tabulated, in
753
addition to insertions of one to four nucleotides that
754
occur in more than 10% of the sequences analyzed for
755
that conservation diagram. Each diagram contains the
756
full NCBI phylogenetic classification
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/ for
the group.
759
Currently, there are conservation diagrams for the
760
5S, 16S, and 23S rRNA for the broadest phylogenetic
761
groups: (1) the three major phylogenetic groups and the
762
two Eucarya organelles, chloroplasts and mitochondria;
763
(2) the three major phylogenetic groups; (3) the
764
Archaea; (4) the Bacteria; (5) the Eucarya (nuclear
765
encoded); (6) the chloroplasts; and (7) the
766
mitochondria. Longer term, our goal is to generate rRNA
767
conservation diagrams for all branches of the
768
phylogenetic tree that contain a significant number of
769
sequences. Toward this end, we have generated 5S, 16S,
770
and 23S rRNA conservation diagrams for many of the
771
major phylogenetic groups within the Bacterial lineage
772
(
773
e.g., Firmicutes and
774
Proteobacteria). We will also be generating
775
conservation diagrams for the group I and II
776
introns.
777
The CRW Site conservation diagram interface (H-2C.2)
778
provides both the conservation diagrams (in PostScript
779
and PDF formats) and useful auxiliary information. The
780
display is sorted phylogenetically, with each row of
781
the table containing all available conservation
782
information for the rRNA sequences in that phylogenetic
783
group. For each of the three rRNA molecules (5S, 16S,
784
and 23S), three items are available: 1) the reference
785
structure diagram, upon which the conservation
786
information is overlaid; 2) the conservation diagram
787
itself; and 3) the number of sequences summarized in
788
the conservation diagram, which links to a
789
web-formatted list of those sequences. The lists, for
790
each sequence, contain: 1) organism name (NCBI
791
scientific name); 2) GenBank accession number; 3) cell
792
location; 4) RNA Type; 5) RNA Class; and 6) NCBI
793
phylogeny. Users who want more information about a
794
given sequence should consult the CRW RDBMS (see
795
below). An equivalent presentation for intron
796
conservation data is under development.
797
798
799
800
3. Sequence and structure data
801
802
Structure-based alignments and phylogenetic
803
analysis of RNA structure
804
Analysis of the patterns of sequence conservation
805
and variation present in RNA sequence alignments can
806
reveal phylogenetic relationships and be utilized to
807
predict RNA structure. The accuracy of the phylogenetic
808
tree and the predicted RNA structure is directly
809
dependent on the proper juxtapositioning of the
810
sequences in the alignment. These alignments are an
811
attempt to approximate the best juxtapositioning of
812
sequences that represent similar placement of
813
nucleotides in their three-dimensional structure. For
814
sequences that are very similar, the proper
815
juxtapositioning or alignment of sequences can be
816
achieved simply by aligning the obviously similar or
817
identical subsequences with one another. However, when
818
there is a significant amount of variation between the
819
sequences, it is not possible to align sequences
820
accurately or with confidence based on sequence
821
information alone. For these situations, we can
822
juxtapose those sequences that form the same secondary
823
and tertiary structure by aligning the positions that
824
form the same components of the similar structure
825
elements (
826
e.g., align the positions that
827
form the base of the helix, the hairpin loop,
828
etc. ). Given the accurate
829
prediction of the 16S and 23S rRNA secondary structures
830
from the analysis of the alignments we assembled, we
831
are now even more confident in the accuracy of the
832
positioning of the sequence positions in our
833
alignments, and the process we utilize to build
834
them.
835
836
837
Aligning new sequences
838
At this stage in our development of the sequence
839
alignments, there are well-established and distinct
840
patterns of sequence conservation and variation. From
841
the base of the phylogenetic tree, we observe regions
842
that are conserved in all of the rRNA sequences that
843
span the three phylogenetic domains and the two
844
eucaryotic organelles, the chloroplast and
845
mitochondria. Other regions of the rRNA are conserved
846
within the three phylogenetic domains although variable
847
in the mitochondria. As we proceed into the
848
phylogenetic tree, we observe positions that are
849
conserved within one phylogenetic group and different
850
at the same level in the other phylogenetic groups. For
851
example, Bacterial rRNAs have positions that are
852
conserved within all members of their group, but
853
different from the Archaea and the Eucarya
854
(nuclear-encoded). These types of patterns of
855
conservation and variation transcend all levels of the
856
phylogenetic tree and result in features in the rRNA
857
sequences and structures that are characteristic for
858
each of the phylogenetic groups at each level of the
859
phylogenetic tree (
860
e.g., level one: Bacteria,
861
Archaea, Eucarya; level two: Crenarchaeota,
862
Euryarchaeota in the Archaea; level three: gamma,
863
alpha, beta, and delta/epsilon subdivisions in the
864
Proteobacteria). Carl Woese likened the different rates
865
of evolution at the positions in the rRNA to the hands
866
on a clock [ 4 ] . The highly variable regions are
867
associated with the second hand; these can change many
868
times for each single change that occurs in the regions
869
associated with the minute hand. Accordingly, the
870
minute hand regions change many times for each single
871
change in the hour hand regions of the rRNAs. In
872
addition to the different rates of evolution, many of
873
the positions in the rRNA are dependent on one another.
874
The simplest of the dependencies, positional
875
covariation, is the basis for the prediction of the
876
same RNA structure from similar RNA sequences (see
877
Section 1A, Covariation Analysis).
878
We utilize these underlying dynamics in the
879
evolution and positional dependency of the RNA to
880
facilitate the alignment and structural analysis of the
881
RNA sequences. Our current RNA data sets contain a very
882
large and diverse set of sequences that represent all
883
sections of the major phylogenetic branches on the tree
884
of life. This data collection also contains many
885
structural variations, in addition to their conserved
886
sequence and structure core. The majority of the new
887
RNA sequences are very similar to at least one sequence
888
that has already been aligned for maximum sequence and
889
structure similarity; thus, these sequences are
890
relatively simple to align. However, some of the new
891
sequences contain subsequences that cannot be aligned
892
with any of the previously aligned sequences, due to
893
the excessive variation in these hypervariable regions.
894
For these sequences, the majority of the sequence can
895
be readily aligned with the more conserved elements,
896
followed by a manual, visual analysis of the
897
hypervariable regions. To align these hypervariable
898
regions with more confidence, we usually need several
899
more sequences with significant similarity in these
900
regions that will allow us to identify positional
901
covariation and subsequently to predict a new
902
structural element. Thus, at this stage in the
903
development of the alignments, the most conserved
904
regions (
905
i.e., hour hand regions) and
906
semi-conserved regions (
907
i.e., minute hand regions) have
908
been aligned with high confidence. The second and
909
sub-second (
910
i.e., tenth and hundredth of a
911
second) hand regions have been aligned for many of the
sequences on the branches at the ends of the
phylogenetic tree. However, some regions of the sequences
914
continue to challenge us. For example, the 545 and 1707
915
regions (
916
E. coli numbering) contain an
917
excessive amount of variation in the Eucarya
918
nuclear-encoded 23S-like rRNAs. In 1988, with only ten
Eucaryotic sequences available, these two regions could
not be well aligned, and comparative analysis could not
predict a common structure for them (see Figures 35-43
in [ 48 ] ).
922
However, once a larger number of related Eucaryotic
923
23S-like rRNA sequences was determined, we reanalyzed
924
these two regions and were able to align those regions
925
to other related organisms (
926
e.g., S. cerevisiae with
927
Schizosaccharomyces pombe,
928
Cryptococcus neoformans, Pneumocystis carinii, Candida
929
albicans, and
930
Mucor racemosus ) and predict a
931
secondary structure that is common for all of these
932
rRNAs (see Figures 3 and 6 in [ 52 ] ). While the
secondary structures for the fungal 23S-like rRNAs have
been determined in these regions, those for the animal
rRNAs have been only
935
partially solved. We still need to determine a common
936
secondary structure for the large variable-sized
937
insertions in the animal rRNAs, and this will require
938
even more animal 23S-like rRNA sequences from organisms
939
that are very closely related to the organisms for
940
which we currently have sequences.
941
942
943
A large sampling of secondary structure
944
diagrams
945
We have generated secondary structure diagrams for
946
sequences that represent the major phylogenetic groups,
947
and for those sequences that reveal the major forms of
948
sequence and structure conservation and variation. New
949
secondary structure diagrams are templated from an
950
existing secondary structure diagram and the alignment
951
of these two sequences, the sequence for the new
952
structure diagram and the sequence for the structure
953
that has been templated. The nucleotides in the new
954
sequence replace the templated sequence when they are
955
in the same position in the alignment, while positions
956
in the new sequence that are not juxtaposed with a
957
nucleotide in the templated sequence are initially left
958
unstructured. These nucleotides are then placed
959
interactively into their correct location in the
960
structure diagram with the program XRNA (Weiser &
961
Noller, University of California, Santa Cruz) and
962
base-paired when there is comparative support for that
963
pairing in the alignment; otherwise, they are left
964
unpaired.
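The first, automatic part of this templating step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used with XRNA: the function name, the use of '-' as the gap character, and the toy sequences are hypothetical, and the sketch does not check for comparative support before keeping a pair.

```python
def transfer_pairs(template_aln, new_aln, template_pairs):
    """Carry base pairs from a templated sequence over to a new sequence.

    template_aln, new_aln: equal-length aligned strings ('-' marks a gap).
    template_pairs: (i, j) base pairs, 1-based, in ungapped template numbering.
    Returns (pairs, unstructured): pairs in new-sequence numbering, and the
    new-sequence positions that have no counterpart in the template.
    """
    t_pos = n_pos = 0
    t_to_n = {}          # template position -> new-sequence position
    unstructured = []
    for t_char, n_char in zip(template_aln, new_aln):
        if t_char != "-":
            t_pos += 1
        if n_char != "-":
            n_pos += 1
            if t_char != "-":
                t_to_n[t_pos] = n_pos
            else:
                unstructured.append(n_pos)
    # Keep a pair only if both template positions align to a nucleotide.
    pairs = [(t_to_n[i], t_to_n[j]) for i, j in template_pairs
             if i in t_to_n and j in t_to_n]
    return pairs, unstructured


# Toy hairpin: template pairs 1:9, 2:8, 3:7 close a three-nucleotide loop.
template_aln = "GGCA-AAGCC"
new_aln      = "GACAUAAG-C"
pairs, unstructured = transfer_pairs(template_aln, new_aln,
                                     [(1, 9), (2, 8), (3, 7)])
```

Positions returned as unstructured correspond to the nucleotides that are initially left out of the diagram and then placed interactively.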
965
The process of generating these secondary structure
966
diagrams occurs in parallel with the development of the
967
sequence alignments. In some cases, the generation of a
968
structure diagram helps us identify problems with the
969
sequence or its alignment. For example, anomalies in
970
structural elements (in the new structure diagram) that
971
had strong comparative support in the other sequences
972
could be the result of a bad sequence or due to the
973
misalignment of sequences in the helix region. In other
974
cases, the new structure diagram reveals a possible
975
helix in a variable region that was weakly predicted
976
with comparative analysis. However, a re-inspection of
977
a few related structure diagrams revealed another
978
potential helix in this region that was then
979
substantiated from an analysis of the corresponding
980
region of the alignment. Thus, the process of
981
generating additional secondary structure diagrams
982
improves the sequence alignments and the predicted
983
structures, in addition to the original purpose for
984
these diagrams, to reveal the breadth of sequence
985
conservation and variation for any one RNA type.
986
Our goals for the "Sequence and Structure Data"
987
section of the CRW Site are to:
988
A) Align all rRNA, group I and II intron sequences
989
that are greater than 90% complete and are available at
990
GenBank;
991
B) Generate rRNA and group I/II intron secondary
992
structure diagrams for organisms that are
993
representative of a phylogenetic group or
994
representative of a type of RNA structural element. The
generation of 5S, 16S, and 23S rRNA secondary
structures from genomic sequences generally has higher
priority than that of other rRNA sequences.
998
C) Enter pertinent information for each sequence and
999
structure into our relational database management
1000
system. This computer system organizes all of our RNA
1001
sequence and structure entries, associates them with
1002
the organisms' complete NCBI phylogeny
1003
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/,
1004
and allows for the efficient retrieval of this data
1005
(see Section 4: Data Access Systems for more
1006
details).
1007
Due in part to the technological improvements in the
1008
determination of nucleic acid sequence information, the
1009
number of ribosomal RNA and group I and II intron
1010
sequences has increased significantly within the past
1011
10 years. As of December 2001, the approximate numbers
1012
of complete or nearly complete sequences and secondary
1013
structure diagrams for each of these RNAs for the major
1014
phylogenetic groups and structural categories are shown
1015
in Highlight 3A.1. At this time, the actual number of
1016
sequences that are both greater than 90% complete and
1017
available at GenBank is greater than the number in our
1018
CRW RDBMS.
1019
The sequences, alignments, and secondary structure
1020
diagrams are available from several different web
1021
pages, which are described below in Sections 3A-3D and
1022
4A-4B.
1023
1024
1025
3A. Index of available sequences and
1026
structures
1027
The top section of the "Index of Available Sequences
1028
and Structures" page (H-3A.1) reveals the numbers of
1029
available sequences for the Archaea, Bacteria, and
1030
Eucarya nuclear, mitochondrial, and chloroplast groups
1031
that are at least 90% complete and structure diagrams
1032
for the 5S, 16S, and 23S rRNAs and group I and group II
1033
introns. The remainder of the index page contains the
1034
numbers of sequences and structures for more expanded
1035
lists for each of those five phylogenetic/cell location
1036
groups. For example, the Archaea are expanded to the
1037
Crenarchaeota, Euryarchaeota, Korarchaeota, and
1038
unclassified Archaea. These counts are updated
1039
dynamically when the information in our relational
1040
database management system is revised. The numbers of
1041
sequences and structures are links that open the RDBMS
1042
"standard" output view (see below for details) for the
1043
selected target set. Secondary structure diagrams are
1044
available in PostScript, PDF, and BPSEQ (see above)
1045
formats from the structure links. The organism names in
1046
the output from these links are sorted alphabetically.
1047
The number of entries per output page is selectable
1048
(20, 50, 100, 200, or 400), with 20 set as a default.
1049
Entries not shown on the first page can be viewed by
1050
clicking on the "Next" button at the bottom left of the
1051
output page.
1052
As of December 2001, our data collection contains
1053
11,464 rRNA (5S, 16S, and 23S) and intron (group I, II,
1054
and other) sequences. The ribosomal RNAs comprise 80%
1055
of this total, and 16S rRNA represents 82% of the rRNA
1056
total; the remainder is split between the 23S and 5S
1057
rRNAs. Intron sequences comprise 20% of our total
collection, with approximately twice as many group I
introns as group II introns. Of the 406 secondary
1060
structure diagrams, the majority are for the 16S (71%)
1061
and 23S (20%) rRNAs. At this time, tRNA records are not
1062
maintained in our database system.
1063
1064
1065
3B. New secondary structure diagrams
1066
Secondary structure diagrams that have been created
1067
or modified recently are listed and available from
1068
their own page (H-3B.1). These diagrams are sorted into
1069
one of three categories: new or modified 1) in the past
1070
seven days (highlighted with red text); 2) in the past
1071
month (blue text); and 3) in the past three months
1072
(black text). Diagrams are listed alphabetically by
1073
organism name within each of the three time categories.
1074
The display also indicates the cell location and RNA
1075
Class (see below) for each diagram. The PostScript,
1076
PDF, and BPSEQ files can be viewed by clicking the
1077
appropriate radio button at the top of this page and
1078
then the links in the structure field.
1079
1080
1081
3C. Secondary structure diagram retrieval
1082
Multiple secondary structure diagrams can be
1083
downloaded from the Secondary Structure Retrieval Page
1084
(Highlight 3C.1). This system allows the user to select
1085
from organism names, phylogeny (general: Archaea,
1086
Bacteria, Eukaryota, and Virus), RNA Class (see Table
1087
4), and cell location, as well as selecting for
1088
PostScript, PDF, or BPSEQ display formats. Once these
1089
selections are made, a list of the structure diagrams
1090
that fit those criteria appears. The user may select
1091
any or all of the diagrams to be downloaded. The 23S
1092
rRNA diagrams (which appear in two halves) are
1093
presented on one line as a single unit to ensure that
1094
both halves are downloaded. The system packages the
1095
secondary structure diagrams files into a compressed
1096
tar file, which can be uncompressed with appropriate
1097
software on Macintosh, Windows, and Unix computer
1098
platforms. (Note: due to a limitation in the web server
1099
software, it is currently not possible to reliably
1100
download more than 300 structures at one time. This
1101
limitation can be avoided by subdividing large
1102
queries.)
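Subdividing a large query can be as simple as splitting the list of selected diagrams into batches that stay under the limit. The sketch below is illustrative only: the identifiers are made up, and it does not contact the CRW Site or use any of its interfaces.

```python
def batches(items, limit=300):
    """Yield consecutive slices of items, each with at most limit entries."""
    for start in range(0, len(items), limit):
        yield items[start:start + limit]


# Hypothetical identifiers for 700 selected structure diagrams.
selected = [f"structure_{n}" for n in range(1, 701)]
chunks = list(batches(selected))
sizes = [len(chunk) for chunk in chunks]
```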
1103
1104
1105
3D. Sequence alignment retrieval
1106
The Sequence Alignment Retrieval page (Highlight
1107
3D.1) provides access to the sequence alignments used
1108
in the analyses presented at the CRW Site. Sequence
1109
alignments are available in GenBank and AE2 (Macke)
1110
formats (Table 2). These alignments will be updated
1111
periodically when the number of new sequences is
1112
significant. Newer alignments might also contain
1113
refinements in the alignments of the sequences. For
1114
each alignment, there is a corresponding list of
1115
sequences, their phylogenetic placement, and other
1116
information about the sequences (see conservation list
1117
of sequences for conservation diagrams). At present,
1118
only the rRNA alignments are available; the group I and
1119
group II intron alignments will be made available in
1120
June 2002.
1121
1122
1123
3E. rRNA Introns
1124
The introns that occur in 16S and 23S rRNAs are
1125
organized into four preconfigured online tables. These
1126
tables disseminate the intron information and emphasize
1127
the major dimensions inherent in this data: 1) intron
1128
position in the rRNA, 2) intron type, 3) phylogenetic
1129
distribution, and 4) number of introns per exon
1130
gene.
1131
1132
1133
3E. rRNA Introns Table 1: Intron Position
1134
The introns in rRNA Introns Table 1 are organized by
1135
their position numbers in the 16S and 23S rRNAs. The
1136
16S and 23S rRNA position numbers are based on the
1137
E. coli rRNA reference sequence
1138
(J01695) (see Table 1). The intron occurs between the
1139
position number listed and the following position (
1140
e.g., the introns between
1141
position 516 and 517 are listed as 516). rRNA Introns
1142
Table 1 has four components.
1143
The total number of introns and the number of
1144
positions with at least one intron in 16S and 23S rRNA
1145
are shown in rRNA Introns Table 1A (see highlights
1146
below and H-3E.1). The list of all publicly available
1147
rRNA introns, sorted by the numeric order of the intron
1148
positions, is contained in rRNA Introns Table 1B. This
1149
table has nine fields: 1) rRNA type (16S or 23S); 2)
1150
the intron position; 3) the number of documented
1151
introns occurring at that position; 4) the intron types
1152
(RNA classes) for each rRNA intron position; 5) the
1153
number of introns for each intron type for each rRNA
1154
position; 6) the length variation (minimum # - maximum
1155
#) for introns in each intron type; 7) the cell
1156
location for each intron type; 8) the number of
1157
phylogenetic groups for each intron type (here,
1158
defined using the third column from rRNA Introns Table
1159
3: Phylogenetic Distribution); and 9) the organism name
1160
and accession number.
1161
These fields in rRNA Introns Table 1B (H-3E.1) allow
1162
for a natural dissemination of the introns that occur
1163
at each rRNA site. For example, of the 116 introns (as
1164
of December 2001) at position 516 in 16S rRNA, 55 of
1165
them are in the IC1 subgroup (H-3E.2); these introns
1166
range from 334-1789 nucleotides in length, all occur in
1167
the nucleus, and are distributed into four distinct
1168
phylogenetic groups. 54 of the introns at position 516
1169
are in the IE subgroup, range from 190-622 nucleotides
1170
in length, all occur in the nucleus, and are also
1171
distributed into four distinct phylogenetic groups,
1172
etc.
1173
Additional information is available in a new window
1174
for each of the values in rRNA Introns Table 1B
1175
(H-3E.3). This information is retrieved from the
1176
relational database management system (see section 4).
1177
The fields shown for each intron entry in the new window
are: 1) exon (16S or 23S rRNA); 2) intron position in
1179
the rRNA; 3) intron type (RNA class); 4) length of
1180
intron (in nucleotides); 5) cell location; 6) NCBI
1181
phylogeny; 7) organism name; 8) accession number; 9)
1182
link to structure diagram (if it is available); and 10)
1183
comment.
1184
The number of intron types per intron position is
tabulated in rRNA Introns Table 1C (H-3E.4), while the
numbers of introns at each rRNA position are ranked in
rRNA Introns Table 1D (H-3E.5). This latter table
1188
contains six fields of information for each rRNA: 1)
1189
number of introns per rRNA position; 2) number of
1190
positions with that number of introns; 3) the rRNA
1191
position numbers; 4) total number of introns (field #1
1192
× field #2); 5) the Poisson probability (see rRNA
1193
Introns Table 1D for details); and 6) the expected
1194
number of introns for each of the observed number of
1195
introns per rRNA site.
1196
The highlights from rRNA Introns Table 1 are: 1) As
1197
of December 2001, there are 1184 publicly available
1198
introns that occur in the rRNAs, with 900 in the 16S
1199
rRNA, and 284 in 23S rRNA. These introns are
1200
distributed over 152 different positions, 84 in the 16S
1201
rRNA and 68 in 23S rRNA. 2) Although 16S rRNA is
1202
approximately half the length of 23S rRNA, there are
1203
more than three times as many introns in 16S rRNA.
1204
However, this bias is due, at least in part, to the
1205
more prevalent sampling of 16S and 16S-like rRNAs for
1206
introns. 3) The sampling of introns at the intron
1207
positions is not evenly distributed (1184/152 = 7.79
1208
introns per position for a random sampling). Instead,
1209
nearly 50% (71/152) of the intron positions contain a
1210
single intron and 89% (135/152) of the intron positions
contain ten or fewer introns. In contrast, 59%
1212
(681/1163) of the introns are located at 9% of the
1213
intron positions and the three intron positions with
1214
the most introns (943, 516, and 1516 in 16S rRNA)
1215
contain 361, or 31% (361/1163), of the rRNA introns. 4)
1216
rRNA Introns Table 1D compares the observed
1217
distribution of rRNA introns with the Poisson
1218
distribution for the observed number of introns. The
Poisson distribution,
$P(x) = e^{-\mu} \mu^{x} (x!)^{-1}$, where μ is the mean
frequency of introns for positions in a particular exon
and x is the target number of introns present at a
particular position, allows the calculation of expected
numbers of positions containing a particular number of
introns. Based upon the observed
1230
raw numbers of introns in the 16S and 23S rRNAs, we
1231
expect to see no positions in 16S rRNA containing more
1232
than five introns and no positions in 23S rRNA
1233
containing more than three introns. However,
1234
thirty-five rRNA positions fall into one of those two
1235
categories. We also see both more positions without
1236
introns and fewer positions containing only one or two
1237
introns than expected. This observed distribution of
1238
rRNA introns among the available insertion positions is
1239
extremely unlikely to occur by chance. 5) While a
1240
single intron type occurs at the majority of the intron
1241
positions, several positions have more than one intron
1242
type. A few of the positions that deserve special
1243
attention have IC1 and IE introns at the same position
1244
(16S rRNA positions 516 and 1199, and 23S rRNA position
1245
2563). The 16S rRNA position 788 has several examples
1246
each of IC1, IIB, and I introns.
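The Poisson expectation in highlight 4 can be reproduced approximately with the sketch below. It assumes that the number of available positions equals the length of the E. coli reference rRNAs (1542 nucleotides for 16S and 2904 for 23S); these lengths are not given in this section, and rRNA Introns Table 1D may define the number of positions differently. The function names are hypothetical.

```python
import math


def poisson(x, mu):
    """Poisson probability P(x) = e^(-mu) * mu^x / x!."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


def expected_positions(n_positions, n_introns, x):
    """Expected number of positions that contain exactly x introns."""
    mu = n_introns / n_positions
    return n_positions * poisson(x, mu)


def expected_positions_above(n_positions, n_introns, threshold):
    """Expected number of positions with more than threshold introns."""
    mu = n_introns / n_positions
    at_or_below = sum(poisson(x, mu) for x in range(threshold + 1))
    return n_positions * (1.0 - at_or_below)


# Assumed E. coli reference lengths (not stated in this section).
LEN_16S = 1542
LEN_23S = 2904

# Intron counts from the text: 900 in 16S rRNA and 284 in 23S rRNA.
tail_16s = expected_positions_above(LEN_16S, 900, 5)
tail_23s = expected_positions_above(LEN_23S, 284, 3)
```

Under these assumptions both tail expectations are well below one position, which agrees with the statement that no 16S rRNA position is expected to hold more than five introns and no 23S rRNA position more than three.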

3E. rRNA Introns Table 2: Intron Type
The introns are organized by intron type, as defined
above, in rRNA Introns Table 2 (H-3E.6). The frequency
of 16S and 23S rRNA exons, non-rRNA exons, number of
intron positions in the 16S and 23S rRNA, cell
locations, and number of phylogenetic groups for each
intron type are tabulated. The highlights of this table
are: 1) Of the 1184 known rRNA introns, 980 (83%) are
group I, 21 (2%) are group II introns, and the
remaining 183 (15%) are unclassified (see below). While
only 2% of the rRNA introns are group II, 62%
(728/1180) of the non-rRNA introns are group II. Like
the group II introns, nearly all of the IC3 introns
occur outside the rRNAs. 2) The majority of the
rRNA group I introns (851/980 = 87%) fall into one of
three subgroups: I (276 introns), IC1 (415 introns),
and IE (160 introns). 3) As noted earlier, there are
three times as many 16S rRNA group I introns as 23S
rRNA group I introns (753 vs. 227). 4) Among the three
cellular organelles in eucaryotes, 1010 introns (85%)
occur in the nucleus, 133 (11%) in the mitochondria,
and 41 (4%) in the chloroplasts. 5) The subgroups IC1,
IC3 and IE are only present in the nucleus, while the
IA, IB, IC2, ID, and II subgroups occur almost
exclusively in chloroplasts and/or mitochondria.

The 183 introns described in rRNA Introns Table 2 as
"Unclassified" merit special attention. None of these
introns falls into either the group I or the group II
category; however, two notable groups of introns
are included within the "Unclassified" category. The
first is a series of 43 introns occurring in Archaeal
rRNAs (the Archaeal introns). Thirty-one of the known
Archaeal introns are found in 16S rRNA and the
remaining twelve are from 23S rRNA exons. The Archaeal
introns range in length from 24 to 764 nucleotides,
with an average length of 327 nucleotides. The second
group contains 121 spliceosomal introns found in fungal
rRNAs. Of these, 92 are from 16S rRNA and 29
are from 23S rRNA; the lengths of these introns range
from 49 to 292 nucleotides. A future version of this
database will include both of these groups as separate,
distinct entries. Both the Archaeal and spliceosomal
introns occur only in nuclear rRNA genes and tend to
occur at unique sites; the lone exception is the
spliceosomal intron from Dibaeis baeomyces nuclear 23S
rRNA position 787, a position where a group IIB intron
occurs in mitochondrial Marchantia polymorpha rRNA. The
Unclassified group contains 21 introns that do not fall
into any of the four previously discussed categories
(group I, group II, Archaeal, or spliceosomal),
including all four mitochondrial introns in this
group.

rRNA Introns Table 2 expands the presentation by
providing links to twenty additional tables (H-3E.7),
each of which provides expanded information about a
specific intron type. The organism name, exon, intron
position, cell location, and complete phylogeny are
accessible for each intron from these tables. These
online tables are dynamically updated daily as
information about new introns is made available.

3E. rRNA Introns Table 3: Phylogenetic Distribution
The distribution of introns on the phylogenetic tree
is tabulated in rRNA Introns Table 3A (H-3E.8) and 3B
(H-3E.9). rRNA Introns Table 3A reveals the ratio of
the number of rRNA introns per rRNA gene for the
nuclear, chloroplast, and mitochondrial encoded RNAs
for the major phylogenetic groups. The most noteworthy
distributions are: 1) The majority (96%) of the rRNA
introns occur in Eucarya, followed by the Archaea, and
the Bacteria. 2) Only one rRNA intron has been
documented in the Bacteria; due to the large number of
rRNA gene sequences that have been determined, the
ratio of rRNA introns per rRNA gene is essentially zero
for the Bacteria. 3) The frequency of introns in
Archaea rRNAs is higher, with 43 examples documented as
of December 2001. Within the Archaea, there is a higher
ratio of rRNA introns in the Desulfurococcales and
Thermoproteales subbranches in the Crenarchaeota
branch. 4) For the three primary phylogenetic groups,
the highest ratio of rRNA introns per rRNA gene is for
the Eucarya, and for the phylogenetic groups within the
Eucarya that have significant numbers of rRNA
sequences, the ratio is highest in the fungi. Here, the
ratios of rRNA introns per rRNA gene are similar
between the nucleus and mitochondria (1.34 for the
nucleus, 1.20 for the mitochondria). A significant
number of rRNA introns occurs in the plants, with
similar ratios of rRNA intron/rRNA gene for the
nucleus, chloroplast, and mitochondria (0.36 for the
nucleus, 0.38 for the chloroplast, and 0.34 for the
mitochondria). In sharp contrast with the fungi and
plants, only one intron has been documented in an
animal rRNA, occurring within the Calliphora vicina
nuclear-encoded 23S-like rRNA (GenBank accession number
K02309).

Each of the two special "Unclassified" rRNA intron
groups has a specific phylogenetic bias. Archaeal rRNA
introns, which have unique sequence and structural
characteristics [ 83 ] , have not yet been observed
within the Euryarchaeota or Korarchaeota; in fact, no
non-Archaeal introns have been found in Archaea rRNAs
to date. Spliceosomal rRNA introns have only been
reported in 31 different genera in the Ascomycota [ 84
] . rRNA Introns Table 3A also presents the numbers of
(complete or nearly so) rRNA sequences in the same
phylogenetic groups in order to address the question of
sampling bias. Two important caveats to this data must
be considered. First, the numbers of rRNA sequences are
an underestimate, since many rRNA introns are published
with only short flanking exon sequences and do not meet
the 90% completeness criterion for inclusion in this
rRNA sequence count. The second caveat is that many
rRNA sequences contain multiple introns (see rRNA
Introns Table 4 and related discussion, below, for more
information). Of the 51 phylogenetic group/cell
location combinations shown in rRNA Introns Table 3
that may contain rRNA introns, 15 (29%) have an
intron:rRNA sequence ratio greater than 1.0, indicating
a bias toward introns within those groups. Introns are
comparatively rare within the 26 (51%) groups that have
a ratio below 0.3; ten of these 26 groups contain no
known rRNA introns. Ten (20%) of the groups have
intermediate ratios (between 0.3 and 1.0).
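
The three ratio classes used above (greater than 1.0, below 0.3, and intermediate) can be sketched in Python as follows. The function name `ratio_class`, the class labels, and the intron and sequence counts in `counts` are hypothetical and serve only to illustrate the binning; the final lines recompute the percentages quoted in the text from the group totals (15, 26, and 10 of 51).

```python
from collections import Counter


def ratio_class(n_introns: int, n_sequences: int) -> str:
    """Bin an intron:rRNA sequence ratio into the three classes in the text."""
    if n_sequences <= 0:
        raise ValueError("at least one rRNA sequence is required")
    ratio = n_introns / n_sequences
    if ratio > 1.0:
        return "biased toward introns"
    if ratio < 0.3:
        return "introns rare"
    return "intermediate"


# Hypothetical (phylogenetic group, cell location) -> (introns, sequences).
counts = {
    ("Fungi", "Nuc"): (134, 100),
    ("Plants", "Chl"): (38, 100),
    ("Animals", "Nuc"): (1, 400),
}
tally = Counter(ratio_class(i, s) for i, s in counts.values())
print(dict(tally))

# Group totals quoted in the text: 15, 26, and 10 of 51 combinations.
quoted = {"biased toward introns": 15, "introns rare": 26, "intermediate": 10}
percentages = {k: round(100 * v / 51) for k, v in quoted.items()}
print(percentages)
```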

A more detailed phylogenetic distribution is
available in rRNA Introns Table 3B (H-3E.10). The first
three fields contain levels 2, 3, and 4 of the NCBI
phylogeny, followed by fields for the genus of the
organism, cell location, exon (16S or 23S rRNA), and
intron type. Each of these classifications includes a
link to the complete details (organism name, phylogeny,
cell location, exon, intron position, intron number,
accession number, and structure diagram (when
available)) for the intron sequences in that group.

3E. rRNA Introns Table 4: Number of Introns per Exon
rRNA Introns Table 4 presents the number of introns
per rRNA gene (H-3E.11). While more than 80% of the
documented rRNA genes do not have an intron, 646 16S
and 182 23S rRNAs have at least one intron.
Approximately 75% (623) of these genes have a single
intron, 15% (127) have two introns, 5% (40) have
three, 2.5% (20) have four, and 1% (11) have five; the
six-, seven-, and eight-intron classes each contain two
rRNA genes, and one rRNA gene has nine introns.

To determine the amount of bias in the distribution
of introns among their exon sequences, the Poisson
distribution (here, μ is the mean frequency of introns
for a particular exon and x is the target number of
introns per rRNA gene) has been used to calculate the
number of rRNA sequences expected to contain a given
number of introns (rRNA Introns Table 4). Based upon
this data, no rRNA sequences are expected to contain
four or more introns; in fact, we see 38 sequences that
contain these large numbers of introns. The observed
numbers of sequences exceed the expected values for all
but one category: fewer rRNAs contain only one intron
than expected.
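
As a rough check of this reasoning, the Python sketch below compares the observed number of rRNA genes with four or more introns against a Poisson expectation. The `observed` counts are the ones quoted in the text; the total number of rRNA genes (`N_GENES`) is not given here, so the value used is an assumption chosen only to be consistent with "more than 80%" of genes lacking introns, and the function names are hypothetical.

```python
import math

# Observed numbers of rRNA genes containing x introns, as quoted in the text.
observed = {1: 623, 2: 127, 3: 40, 4: 20, 5: 11, 6: 2, 7: 2, 8: 2, 9: 1}


def poisson_pmf(x: int, mu: float) -> float:
    """Poisson probability P(x) = e^(-mu) * mu^x / x!."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


def expected_with_at_least(k: int, mu: float, n_genes: int) -> float:
    """Expected number of genes with k or more introns under a Poisson model."""
    below_k = sum(poisson_pmf(x, mu) for x in range(k))
    return n_genes * (1.0 - below_k)


N_GENES = 5000  # ASSUMPTION: total rRNA genes, with and without introns

genes_with_introns = sum(observed.values())
total_introns = sum(x * n for x, n in observed.items())
mu = total_introns / N_GENES  # mean number of introns per rRNA gene

observed_4plus = sum(n for x, n in observed.items() if x >= 4)
expected_4plus = expected_with_at_least(4, mu, N_GENES)
print(f"genes with introns: {genes_with_introns}")
print(f"observed with >= 4 introns: {observed_4plus}")
print(f"expected with >= 4 introns: {expected_4plus:.2f}")
```

Under this assumed total, the expectation is below one gene, whereas 38 are observed, which mirrors the comparison drawn from rRNA Introns Table 4.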

The two molecules (16S and 23S rRNA) show a
differing trend with respect to cell location for those
sequences containing large numbers of introns. In 16S
rRNA, only nuclear genes (ten) have been observed to
contain five or more introns; indeed, of the 57 genes
containing three or more introns, only two are not
nuclear (both of these are mitochondrial). In 23S rRNA,
the trend is both opposite and weaker; of the thirteen
rRNA sequences containing four or more introns, five
are nuclear (containing five introns), with four
chloroplast and four mitochondrial genes comprising the
remaining eight sequences.

rRNA Introns Table 4 provides access to seventeen
additional tables (H-3E.12), which each present the
complete information for every intron within a
particular class (e.g., 16S rRNA genes containing
two introns), grouped by their exons. As with the other
online tables, this information will be updated daily
to reflect new intron sequences that are added to this
database.

The final components of the "rRNA Introns" page are
16S and 23S rRNA secondary structure diagrams that show
the locations for all of the known rRNA introns
(H-3E.13). The information collected here on the "rRNA
Introns" page is the basis for two detailed analyses
that will be published elsewhere: 1) the spatial
distribution of introns on the three-dimensional
structure of the 16S and 23S rRNA (Jackson et al.,
manuscript in preparation); and 2) the statistical
analysis of the distribution of introns on the rRNA
(Bhattacharya et al., manuscript in preparation).

3F. Group I/II Intron distributions
For the CRW Site project, we collect group I and II
introns and all other introns that occur in the
ribosomal RNA. The "Intron Distribution Data" page
contains three tables that compare intron types,
phylogeny, exon, and cell location.

Intron Distribution Table 1 maps "Intron Type" vs.
"Phylogeny" (and "Cell Location;" H-3F.1). Group I and
II intron data are highlighted with yellow and blue
backgrounds, respectively. The phylogenetic divisions
are also split into the three possible cellular
locations (nuclear, chloroplast, and mitochondria). A
few of the highlights are:
1) The Eukaryota contain the majority (2218 / 2349 =
94%) of the introns in the CRW RDBMS. 2) The Archaea
have 42 introns that have unique characteristics and
are called "Archaeal introns." 3) Group I introns are
present in eukaryotes (nuclear-, chloroplast-, and
mitochondrial-encoded genes) and in Bacteria. Group II
introns have only been observed in Bacteria and in
Eukaryotic chloroplast and mitochondrial genes.

Intron Distribution Table 2 shows "Intron Type" vs.
"Exon" (and "Cell Location;" H-3F.2). Again, group I
and II intron data are highlighted with yellow and blue
backgrounds, respectively. In this table, the exon
types are split into the three possible cellular
locations (nuclear, chloroplast, and mitochondria). As
of December 2001, the most obvious trend is that the
exons with the most Group I introns are 16S rRNA (900),
leucine tRNA (337), 23S rRNA (284), ribosomal protein
S16 (214), and ribosomal protein L16 (152).

Intron Distribution Table 3 compartmentalizes the
intron data by "Phylogeny" and "Exon" (and "Cell
Location;" H-3F.3). In this table, color is used to
highlight the three phylogenetic domains (Archaea in
yellow, Bacteria in blue, and Eukaryota in green). As
in Intron Distribution Table 2, the exon types are
split into the three possible cellular locations
(nuclear, chloroplast, and mitochondria).

Each of these three tables is dynamically created
from a specific series of RDBMS queries on a daily
basis. As of December 2001, links connecting to the
specific RDBMS results are not available.
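
The kind of cross-tabulation these three tables present (for example, "Intron Type" vs. "Phylogeny" split by "Cell Location") can be sketched in a few lines of Python. This is not the RDBMS query used by the CRW Site; the record layout, the field names, and the function `cross_tabulate` are assumptions made for illustration, and the records are invented.

```python
from collections import Counter

# Hypothetical intron records; the field names are illustrative only.
records = [
    {"intron_type": "I", "domain": "Eukaryota", "cell_location": "Nuc"},
    {"intron_type": "I", "domain": "Eukaryota", "cell_location": "Nuc"},
    {"intron_type": "I", "domain": "Bacteria", "cell_location": "Nuc"},
    {"intron_type": "II", "domain": "Eukaryota", "cell_location": "Mit"},
    {"intron_type": "II", "domain": "Eukaryota", "cell_location": "Chl"},
]


def cross_tabulate(rows, row_key, column_keys):
    """Count records for each (row value, column values) combination."""
    table = Counter()
    for row in rows:
        column = tuple(row[key] for key in column_keys)
        table[(row[row_key], column)] += 1
    return table


table = cross_tabulate(records, "intron_type", ("domain", "cell_location"))
for (intron_type, column), n in sorted(table.items()):
    print(intron_type, column, n)
```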

4. Data access systems

4A. RDBMS (Standard)
The "Standard" interface is the most fundamental of
our interfaces to the CRW RDBMS information. While the
restricted, specialized interface to the RDBMS
information in Section 3A requires minimal instruction
to use, the standard interface, with its ability to
cull out all arrangements of information from the
different fields with sophisticated search queries and
output field sortings, requires a quick lesson for its
operation. The selection process has three stages: 1)
selection of attribute fields to display; 2)
determination of values for the search; and 3)
adjustment of the output field sort order.

A detailed explanation for each of the attributes is
available from the links to the attribute names. This
information is shown in the right frame. Additional
examples of this system are available online.

Step 1. At the onset, the user selects the fields to
be displayed on the screen and then clicks the "Go"
button. While the user can select the individual fields
(e.g., "Organism" or "Phylogeny"),
for most applications the "QrRNA" (query rRNA),
"Qintron" (query intron), or "All" options will
automatically select the appropriate fields that are
most important for searching for ribosomal RNA or group
I and II intron entries.

Step 2. Select values for the fields or attributes.
The acceptable values for the attributes in our RDBMS
system are shown on the main frame of the query page
(for list- and button-driven fields) or, for text input
fields, can be determined with the "V" (values) button
on the right side of the main frame; the results are
displayed in the right frame (see Figure 3 and
H-4A.1).
• The values for cellular location are Chl
(chloroplast), Cya (cyanelle), Mit (mitochondria), Nuc
(nuclear), and Vir (viral); each can be selected by
simply checking the box to the left of its name.
• The values for the attributes RNA Type, ORF (open
reading frame), Secondary Structures (entries
with/without secondary structure diagrams),
Results/Page, and Color Display are also displayed on
the main frame, and can be selected by clicking the
appropriate box or button.
The values for other attributes such as RNA Class,
Sequence Length, and Exon can be determined by
selecting one or more of the values in the scroll box.
The values for these attributes can also be found by
clicking the "V" button associated with each attribute.
For example, clicking on the "Exon" "V" button will
reveal, in the right frame, all of the exons that are
contained in our database. The same exons are present
in the scroll box.
• The values displayed for any one attribute are
dependent on the settings of the other attributes. For
example, when only rRNA is selected for the "RNA Type,"
then there are no values for "Exon." All of the
possible exon values are displayed when "Intron" is the
selected "RNA Type," while only a subset of the
possible exon values are shown when Mit (mitochondria)
is the selected "Cell Location." Note: no selection for
an attribute signifies to this system that all of the
values are possible.
The values present in our database for the
attributes "Organism," "Phylogeny" (except for the
first level - Archaea, Bacteria, and Eukaryota - that
can be selected from the main frame), "Common Name"
(except for the first level: "Animals,"
"Fungi&Plants," "Protists"), "Accession Number,"
"Intron Position," and "Comment" can only be observed
in the right frame after clicking the "V" button.
• The values selected with the mouse in the right
frame will appear in the appropriate attribute
field.
• The values for each attribute are dependent on the
settings for the other attributes. For example, if
there are many values for the "Organism" field,
selecting Archaea in the "Phylogeny" field will reduce
the number of names in the "Organism" field to just
those that are in this phylogenetic group.
• The number of possible values for an attribute can
also be constrained by entering only part of a value in
the field. For example, typing 'Esch' in the "Organism"
field will output several organism names that contain
'Escherichia' when the "V" button is clicked. Typing
"coli" in this field will list all organism names that
contain "coli," as either part of a name or a complete
word.
• Note that the system is case sensitive for all
fields except "Common Name." The text 'esch' in the
same "Organism" field will not output 'Escherichia' in
the right frame.
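
The partial-value behavior described in the last two points can be illustrated with a short Python sketch. It assumes plain substring matching, which is how the text describes the results; the function `matching_values` and the list of organism names are hypothetical and do not reflect the actual RDBMS implementation.

```python
def matching_values(partial, values, case_sensitive=True):
    """Return the values that contain the partial text entered by the user."""
    if case_sensitive:
        return [value for value in values if partial in value]
    lowered = partial.lower()
    return [value for value in values if lowered in value.lower()]


# Hypothetical contents of the "Organism" attribute.
organisms = [
    "Escherichia coli",
    "Escherichia fergusonii",
    "Campylobacter coli",
    "Homo sapiens",
]

print(matching_values("Esch", organisms))   # case-sensitive match
print(matching_values("esch", organisms))   # no match: wrong case
print(matching_values("coli", organisms))   # matches part of a name
print(matching_values("esch", organisms, case_sensitive=False))
```

The `case_sensitive=False` option corresponds to the behavior described for the "Common Name" field.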

The "Phylogeny" field with the values frame on the
right was developed to allow the user to navigate
through the phylogenetic tree. The information for the
"Phylogeny" and "Common Name" fields is downloaded from
the NCBI (see Materials and Methods; this information
is downloaded daily to assure that we have the most
current version of this data). There are two general
modes of operation.

For mode one, you can systematically navigate
through the phylogenetic tree to the selected goal
point. For example, to get to the last phylogenetic
group that contains Homo sapiens and gorillas, the
user would click on the "Eukaryota" phylogeny button,
then click on the "Fungi/Metazoa group" link in the
right frame, followed by the "Metazoa," "Eumetazoa,"
"Bilateria," "Coelomata," "Deuterostomia," "Chordata,"
"Craniata," "Vertebrata," "Gnathostomata,"
"Teleostomi," "Euteleostomi," "Sarcopterygii,"
"Tetrapoda," "Amniota," "Mammalia," "Theria,"
"Eutheria," "Primates," "Catarrhini," and "Hominidae"
links. The phylogenetic group Hominidae contains the
genera Gorilla, Pan (chimpanzees), Pongo, and Homo (see
Figure 3, H-4A.1, and H-4A.2). This type of navigation
is useful when you know the links that will get you to
the desired goal point; otherwise, mode two can help
you jump to the appropriate node in the phylogenetic
tree.

For the second mode, you type all or part of the
name of an organism or phylogenetic group that is close
to the phylogenetic node you want. For example, type
"Homo sapiens" in the "Phylogeny" field and press the
"V" button in the "Phylogeny" field. The right frame
will display a few names; from these, select "Homo
sapiens." The right frame now contains the entire
phylogenetic path from the base of the tree to Humans
(Figure 3 and H-4A.1).
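
Both modes amount to walking a tree of phylogenetic groups. The Python sketch below stores the lineage named above as a child-to-parent map and recovers the path from the base of the tree as well as the last group shared by two organisms. The data structure and the function names (`path_from_root`, `last_common_group`) are assumptions for illustration; the CRW RDBMS obtains its phylogeny from the NCBI and may represent it quite differently.

```python
# Lineage quoted in the text, from the "Eukaryota" button down to Hominidae.
LINEAGE = [
    "Eukaryota", "Fungi/Metazoa group", "Metazoa", "Eumetazoa", "Bilateria",
    "Coelomata", "Deuterostomia", "Chordata", "Craniata", "Vertebrata",
    "Gnathostomata", "Teleostomi", "Euteleostomi", "Sarcopterygii",
    "Tetrapoda", "Amniota", "Mammalia", "Theria", "Eutheria", "Primates",
    "Catarrhini", "Hominidae",
]

# Child -> parent map (an illustrative representation, not the NCBI schema).
parent = {child: par for par, child in zip(LINEAGE, LINEAGE[1:])}
for genus in ("Gorilla", "Pan", "Pongo", "Homo"):
    parent[genus] = "Hominidae"
parent["Homo sapiens"] = "Homo"


def path_from_root(node):
    """Return the list of groups from the base of the tree down to node."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def last_common_group(a, b):
    """Return the deepest group that contains both a and b."""
    common = None
    for x, y in zip(path_from_root(a), path_from_root(b)):
        if x != y:
            break
        common = x
    return common


print(" > ".join(path_from_root("Homo sapiens")))
print(last_common_group("Homo sapiens", "Gorilla"))
```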

The "Common Name" attribute can also help identify
organism names in the CRW RDBMS. As with the phylogeny
operation, two general modes for determining the values
are available. For the first, the user would type the
presumed common name in the "Common Name" field, and
click the "V" button. A few general examples are: worm,
fish, cat, dog, and human. More specific examples are:
common earthworm (Lumbricus terrestris), European
polecat (Mustela putorius), and duckbill
platypus (Ornithorhynchus anatinus). These
names must be in the "Common Name" database for the
sequence entry to be identified with this method. In
contrast, the second mode is intended to identify
larger groups of organisms. The three buttons in the
"Common Name" field ("Animals," "Fungi&Plants,"
"Protists;" H-4A.3) each reveal various low-level
common names in the right frame that are arranged in a
pseudo-phylogenetic structure. For example, a few of
the lower animals (sponges, flatworms, etc.) are listed
when the "Animals" button is pressed, in addition to the
Protostomia, Deuterostomia, and organisms nested within
these groups (Arthropoda, chordates, vertebrates,
Mammals, etc.; H-4A.3). Accordingly, the
"Fungi&Plants" and "Protists" buttons reveal the
major groups of organisms within their respective
groups. For the latter mode of operation, the user
selects one of these common names, such as "Mammals."
The phylogeny for this group then appears in the same
right frame (cellular organisms, Eukaryota,
Fungi/Metazoa group, Metazoa, Eumetazoa, Bilateria,
Coelomata, Deuterostomia, Chordata, Craniata,
Vertebrata, Gnathostomata, Teleostomi, Euteleostomi,
Sarcopterygii, Tetrapoda, Amniota, Mammalia), along
with the two phylogenetic groups within the Mammals
(Mammalia), Prototheria and Theria. Another example is
the common name "Mosses" in the Fungi&Plants.
Selecting "Mosses" brings up the phylogeny for the
Bryophyta. Note that these common names (i.e.,
"mammals" or "mosses") do
not appear in the common name field in the output for
the sequence entries that are within the Mammalian or
Bryophyta phylogenetic groups. Thus, the common name
field could be very useful to identify organisms and
phylogenetically related organisms when you don't know
their genus/species organism name or the phylogeny for
that group of organisms.

Step 3. The last, critical step before submitting a
query is to select the sort order for the attributes in
the output. While a query will yield the same number of
results with any sort order, the choice of sort order
can make answering questions easier. Take, for example,
a search for all Eucarya rRNA entries. By default, the
entries are sorted alphabetically first by their
phylogenetic classification, followed by organism name,
cell location, and last by their RNA class. The sort
orders <phylogeny, organism name, cell location, and
RNA class> (the default) and <organism name,
RNA class, phylogeny, and cell location> produce
significantly different orders and overall arrangements
for the same set of entries (see online examples); the
second sorting is more useful when searching for a
particular organism, since its exact location on the
phylogenetic tree may not be known to the user. The
output page (H-4A.2) reveals the search strategy and
attribute sort order at the bottom of the page. The
default sort order for the attributes is shown on the
"S" (or sort) buttons on the right side of the main
frame (Figure 3 and H-4A.1). The sort order is changed
by simply clicking the "S" buttons in the order the
attributes are to be sorted. The resulting sort order
for the attributes is shown in the small text box to
the left of each attribute's S button; alternatively,
you can type numbers into these boxes to set the sort
order. The alphabetical/numerical order for any
attribute can be reversed (z -> a, high number ->
low number) by checking the box in the "R" (or reverse)
column to the right of the Sort buttons. Finally, the
sortings can be reset to the default values by clicking
the "Sort Reset" button at the top of the query
page.
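
The effect of the "S" (sort) and "R" (reverse) controls can be mimicked with a multi-key sort. The Python sketch below is one way to do this, using repeated stable sorts from the least to the most significant attribute; the entries, the attribute names, and the function `sort_entries` are hypothetical, and nothing here is meant to describe how the CRW RDBMS performs its sorting internally.

```python
from operator import itemgetter


def sort_entries(entries, sort_order):
    """Sort by several attributes.

    sort_order is a list of (attribute, reverse) pairs, most significant
    first. Python's sort is stable, so sorting by the least significant
    attribute first and the most significant last gives the combined order.
    """
    result = list(entries)
    for attribute, reverse in reversed(sort_order):
        result.sort(key=itemgetter(attribute), reverse=reverse)
    return result


# Hypothetical output entries.
entries = [
    {"phylogeny": "Eukaryota; Metazoa", "organism": "Homo sapiens",
     "cell_location": "Nuc", "rna_class": "16S"},
    {"phylogeny": "Eukaryota; Fungi", "organism": "Saccharomyces cerevisiae",
     "cell_location": "Mit", "rna_class": "23S"},
    {"phylogeny": "Eukaryota; Metazoa", "organism": "Homo sapiens",
     "cell_location": "Mit", "rna_class": "16S"},
    {"phylogeny": "Eukaryota; Metazoa", "organism": "Drosophila melanogaster",
     "cell_location": "Nuc", "rna_class": "23S"},
]

DEFAULT_ORDER = [("phylogeny", False), ("organism", False),
                 ("cell_location", False), ("rna_class", False)]
ORGANISM_FIRST = [("organism", False), ("rna_class", False),
                  ("phylogeny", False), ("cell_location", False)]

by_default = sort_entries(entries, DEFAULT_ORDER)
by_organism = sort_entries(entries, ORGANISM_FIRST)
reversed_organism = sort_entries(entries, [("organism", True)] + ORGANISM_FIRST[1:])

for entry in by_organism:
    print(entry["organism"], entry["cell_location"], entry["rna_class"])
```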

Before submitting the query, a few attributes
deserve more attention.
• Secondary Structures: a comparative secondary
structure model has been developed for more than 400 of
the sequence entries (see Section 3). The 'secondary
structure' attribute near the bottom of the query page
is an option to output all sequence and structure
entries, only those entries with a secondary structure,
or entries without a secondary structure diagram.
• Results/Page: the number of entries per output
page can be modulated. While the system defaults to 50
entries per page, the maximum number of entries per
output page can be set to 20, 100, 200, or 400. The
user can scroll to those entries that do not appear on
the first page by selecting the "Next" button on the
left bottom frame in the output window and use the
"Previous" button in the same frame to move toward the
first page, as necessary.
• Color Display: to help distinguish the organism
names on the output pages, the entries have the same
color when the organism names are the same. The colors
(pink and white) alternate for changes in the organism
names in the output entries.
• Group ID and Group Class: these two attributes are
currently not fully functional; thus, we do not
encourage their use at this time.
• RNA Type/Class: currently, we do not have data
entries for the following RNA Types and Classes: mRNA,
tRNA, SnRNA, and Other.
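
The "Color Display" rule above (same color while the organism name stays the same, alternate when it changes) is simple to express in code. The following Python sketch is an illustration only; the function `row_colors` is hypothetical, and the choice of pink as the first color is an assumption, since the text does not say which color comes first.

```python
def row_colors(organism_names, colors=("pink", "white")):
    """Assign a background color to each output entry.

    The color switches whenever the organism name differs from the
    name in the preceding entry.
    """
    assigned = []
    index = 0
    previous = None
    for name in organism_names:
        if previous is not None and name != previous:
            index = (index + 1) % len(colors)
        assigned.append(colors[index])
        previous = name
    return assigned


# Hypothetical sorted output, one organism name per entry.
names = ["Escherichia coli", "Escherichia coli", "Homo sapiens",
         "Zea mays", "Zea mays"]
print(list(zip(names, row_colors(names))))
```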

After clicking the submit button at the top or
bottom of the query page, a new window will open. This
window distributes the results into three frames
(H-4A.2). The main frame contains the sequence and
structure entries that satisfy the search query. The
frame in the lower left indicates the number of entries
shown in the window and the entry numbers currently
shown, and, if necessary, contains buttons to scroll to
the next or previous set of entries. The third frame at
the bottom middle-right displays the total number of
entries that satisfy the query, the search strategy and
the sort order for this query.

The three formats for the secondary structure
diagrams, PostScript, PDF, and BPSEQ (see Section 1A
and the online help from the "Secondary Structure" and
"StrDiags" links on the RDBMS query and results pages)
can be retrieved from the results window. The system
defaults to PostScript when the secondary structure
link is clicked; PDF or BPSEQ files can be obtained
instead from the structure link by selecting the
corresponding radio button at the top left section of
the main frame. An explanation of the structure link
names (d.5, d.16, d.235, d.233, b.I1, and a.I2) and the
longer names that are associated with the downloaded
structure files is also available online.
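
For readers who retrieve BPSEQ files, the Python sketch below shows one way to read the base-pair information. It assumes the three-column BPSEQ layout (position, nucleotide, and the position of the pairing partner, with 0 for an unpaired nucleotide) and skips any line that does not fit that pattern, such as header lines. The function `parse_bpseq` and the tiny example record, including its header lines, are invented for illustration; Section 1A and the online help remain the reference for the format.

```python
def parse_bpseq(text):
    """Return (sequence, pairs) from BPSEQ-style text.

    pairs maps each paired position to the position of its partner.
    Lines that are not 'position nucleotide partner' are ignored.
    """
    bases = []
    pairs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        if not (fields[0].isdigit() and fields[2].isdigit()):
            continue
        position, base, partner = int(fields[0]), fields[1], int(fields[2])
        bases.append(base)
        if partner != 0:
            pairs[position] = partner
    return "".join(bases), pairs


# Invented example: a small hairpin with two base pairs.
EXAMPLE = """Filename: example.bpseq
Organism: hypothetical
1 G 8
2 C 7
3 A 0
4 U 0
5 U 0
6 A 0
7 G 2
8 C 1
"""

sequence, pairs = parse_bpseq(EXAMPLE)
print(sequence)
print(pairs)
```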

The GenBank accession number for each entry is a
link to a new window that retrieves the specified entry
from NCBI. Sequence entries with more than one GenBank
number contain an "m" to the right of the accession
number. Clicking the "m" link opens a new window with
all of the GenBank numbers associated with this
sequence.

Each entry is associated with an NCBI phylogeny
listing that can be retrieved in a new window by
clicking the "m" button in the Phylogeny column. This
listing also contains the known common names associated
with each level of the phylogenetic tree (H-4A.4). The
phylogeny for all of the entries in the results window
is available in a new window when the "M" button in the
header line of the phylogeny field is clicked.

4B. RDBMS (PhyloBrowser)
The PhyloBrowser interface to the CRW RDBMS was
developed to facilitate the identification and
retrieval of sequence and structure entries that are
associated with specific phylogenetic groups. While the
Standard interface will reveal all sequence entries for
any one phylogenetic group, it does not show the
phylogenetic groups that do not have the requested
sequences; the PhyloBrowser interface displays the
entire phylogenetic tree, including those branches that
do not have corresponding entries. This interface is
based on the Taxonomy Browser developed by NCBI
(http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/) and
uses the NCBI taxonomy database [ 60 61 ] . Here, we
describe the PhyloBrowser interface, ways to navigate
through the phylogenetic data, and how to retrieve RNA
information using this system.

The PhyloBrowser uses three frames (Figure 4 and
H-4B.1). At the bottom of the page is the Results Frame
(white background), which displays the selected portion
of the phylogenetic tree and any RNA information. In
the upper left is the Selection Frame (pink
background), where the user can select the phylogenetic
and RNA information shown in the Results Frame. Help is
provided in the Help Frame, at the upper right (blue
background).

Starting at the root, the entire phylogenetic tree
can be navigated with this system. The base
phylogenetic level name is shown in green. The number
of phylogenetic levels displayed (below the base level)
can be modulated from one (the default) to five levels
using the "Display Phylogenetic Levels" control in the
Selection Frame. The phylogenetic level number for each
group is shown in red preceding the phylogenetic group
name, and common name information, where available, is
shown in black text in parentheses after the group
name. Each phylogenetic group name is a link that
reveals additional phylogenetic levels (Figure 4 and
H-4B.1), allowing the user to navigate onto the
branches of the phylogenetic tree.
1828
In addition to this mode of transversing the
1829
phylogenetic tree, starting at the root and knowing the
1830
pathway to the desired end point, this system has the
1831
facility to jump to specific places in the phylogenetic
1832
tree. The user can enter a partial or complete
1833
scientific or common name in the white text field in
1834
the lower, purple-colored panel of the Selection Frame
1835
(
1836
e.g., "human;" see H-4B.2). Once
1837
the appropriate scientific or common name radio button
1838
is set, different names that satisfy the user-entered
1839
text can be viewed in the Results Frame by checking the
1840
"View" box. Clicking the appropriate name in the
1841
Results Frame will enter that name into the text field;
1842
unchecking the "View" check box and clicking "Submit"
1843
will reveal the phylogenetic branch for this organism
1844
(H-4B.3).
1845
To navigate toward the root of the phylogenetic
1846
tree, click the "Parents" button in the Selection
1847
Frame. This will open a new window with the complete
1848
NCBI phylogeny from the root to the level of the
1849
organism of interest. This window (H-4B.4) also reveals
1850
the phylogenetic level number and common names. Simply
1851
clicking on a node name in this window (
1852
e.g., the "Eutheria" node in
1853
H-4B.4) will reveal this section of the phylogenetic
1854
tree in the Results Frame.
RNA information can be mapped onto the phylogenetic tree in the Results Frame at any time. In the white panel in the Selection Frame, the user can choose to view six RNA types (5S, 16S, and 23S rRNA; group I, II, and other introns) from five cellular locations (chloroplast, cyanelle, mitochondria, nucleus, and viral) by checking the boxes to the left of the desired selections. After clicking the white "Submit" button, all entries that satisfy the RNA type and cell location selections are mapped onto the phylogenetic tree in the Results Frame (H-4B.3). There, the numbers of sequences and structure diagrams available in our CRW RDBMS are shown in brackets adjacent to each phylogenetic group name at all levels of the tree; the format of this information for each individual RNA type is: [cell location, # sequences/# structures, cell location, # sequences/# structures, ...]. The RNA types are indicated in different colors (rRNA: 5S, green; 16S, red; 23S, blue; introns: group I, black; II, brown; other intron types, magenta) and the cell locations are abbreviated (N, nucleus; M, mitochondria; C, chloroplast; Y, cyanelle; V, viral). These values in brackets link to the Standard RDBMS results page, as described in the previous section, and allow the user to view the available sequence and structure information. The PhyloBrowser page (H-4B.3) reveals the "Homo sapiens" phylogenetic group with the number of sequences and structures available in our CRW RDBMS for the RNA types (e.g., 16S and group I introns) that are present in the selected cell locations (e.g., Chl, Mit, Nuc).
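The bracketed annotation format described above can be illustrated with a small parser. This is only a sketch: the exact text rendered on the PhyloBrowser page is not reproduced in this article, so the sample string, the `Count` type, and the `parse_counts` function are assumptions made for illustration. Only the single-letter cell location codes are taken from the text.

```python
from typing import List, NamedTuple

# Single-letter cell location codes as listed in the text.
LOCATIONS = {
    "N": "nucleus",
    "M": "mitochondria",
    "C": "chloroplast",
    "Y": "cyanelle",
    "V": "viral",
}


class Count(NamedTuple):
    location: str
    sequences: int
    structures: int


def parse_counts(annotation: str) -> List[Count]:
    """Parse '[loc, #seq/#struct, loc, #seq/#struct, ...]' into records.

    The input layout is an assumed plain-text rendering of the format
    described in the article, not the site's actual markup.
    """
    inner = annotation.strip().strip("[]")
    tokens = [t.strip() for t in inner.split(",") if t.strip()]
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating location and count tokens")
    counts = []
    # Tokens alternate: location code, then 'sequences/structures'.
    for code, pair in zip(tokens[0::2], tokens[1::2]):
        seqs, structs = pair.split("/")
        counts.append(Count(LOCATIONS[code], int(seqs), int(structs)))
    return counts
```

For instance, a hypothetical annotation such as "[N, 12/3, M, 4/1]" would yield one record for the nucleus and one for mitochondria.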
Additional documentation for the use of this page is available from the PhyloBrowser page. A short description is displayed in the top-right frame when the mouse is placed over each of the attributes ("Molecule," "Cell Location," "Phylogenetic Levels," "Go to Parents," "Query," and "Acknowledgement"). Additional information for each of these attributes is then displayed in a new window by clicking on either the attribute link or the additional information link in the top-right frame (Figure 4 and H-4B.1).
4C. RNA Structure Query System
Currently, we are unable to reliably and accurately predict an RNA structure from its underlying sequence, due in part to the lack of more fundamental RNA structure rules that relate families of RNA sequences with specific RNA structural elements. Given this limitation, we have utilized comparative analysis to determine the RNA structure that is common to a set of functionally and structurally equivalent sequences. This analysis, as mentioned earlier, is very accurate: nearly 98% of the basepairings in our 16S and 23S rRNA comparative structure models are present in the high-resolution crystal structures for the 30S [44] and 50S [45] ribosomal subunits. In the process of predicting these comparative structure models, we have determined a large number of 5S, 16S, and 23S rRNA and group I intron comparative structure models from sequences that are representative of all types of structural variation and conservation. Thus, with the correct rRNA structure models and a large sampling of structurally diverse structure models, we now want to decipher more relationships between RNA sequences and RNA structural elements. Toward this end, we developed a system for the identification of biases in short sequences associated with simple structural elements in our set of comparative structure models. The first set of examples reveals a sampling of structure-based sequence biases. Recently, we utilized this system to identify and quantitate the following biases for adenosines in the Bacterial 16S and 23S rRNA covariation-based structure models [63]: 1) approximately 2/3 of the adenosines are unpaired; 2) more than 50% of the 3' ends of loops in the 16S and 23S rRNA have an A; 3) there is a bias for adenosines to be adjacent to other adenosines (66% of these are at two unpaired positions, and 15% of these are at paired/unpaired junctions); and 4) the majority of the As at the 3' end of loops are adjacent to a paired G. These results were discerned with this system and are shown in part in Figure 5 and H-4C.
This RNA sequence/structure query system has three primary input fields to be selected by the user: the RNA type, the phylogenetic group/cell location, and the nucleotide/structural element. The options for each of these fields are listed in Table 5. The system currently supports four RNA types (5S, 16S, and 23S rRNAs, and group I introns) and five phylogenetic groups/cell locations (Bacteria, Archaea, Eucarya nuclear-encoded, mitochondrial, and chloroplast). Any combination and number of RNA types and phylogenetic/cell location groups can be selected, although at least one RNA type and one phylogenetic/cell location group must be selected. The bacterial 16S and 23S rRNAs were selected for the examples in Figure 5 and H-4C. Five nucleotide categories are searchable: single nucleotides, (two) adjacent nucleotides, base pairs, three nucleotides, and four nucleotides. Each category can be searched against a defined set of structural elements, as outlined in Table 5. The structural elements for these nucleotide categories are based on 1) positions that are paired and unpaired and 2) positions at the center or 5' and 3' ends of helices and loops.
The sorting function dynamically ranks the nucleotide patterns. For any of the selected structural elements, the resulting output lists the most frequent nucleotide pattern first, followed by the other patterns in descending order of frequency. For the "A Story" example mentioned earlier, adenosine is the most frequent nucleotide at unpaired positions (42.64%), followed by G (23.6%), U (21.27%), and C (12.49%) (Figure 5 and H-4C.1). These values are contained in the orange columns, and reveal the percentages for each of the nucleotides within each of the structural elements listed (i.e., paired, unpaired, etc.). This same figure reveals that 53.5% of the 3' ends of loops contain an A. The unpaired to paired ratio is shown in yellow in Figure 5 and H-4C.1; this ratio is greatest for adenosines, where the value is nearly two (i.e., there are two unpaired adenosines for every A that is paired), and lowest for C, where fewer than three out of ten cytosines are unpaired. In contrast with the percentage values in the orange boxes, which reveal the percentage of nucleotides within each structural element, the percentages in the green boxes reveal the distribution of each nucleotide among the different structural elements. For example, 33.76% of the adenosines are paired, while 66.24% are unpaired. In contrast, 77.71% of the C's are paired and only 22.29% of the C's are unpaired.
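The three quantities discussed above (the share of each nucleotide within a structural element, the distribution of each nucleotide between paired and unpaired positions, and the unpaired to paired ratio) can be sketched in a few lines. This is an illustrative sketch, not the CRW implementation: it assumes that a structure is supplied in dot-bracket notation (where "." marks an unpaired position), and the function name and toy input are hypothetical.

```python
from collections import Counter


def nucleotide_stats(seq, structure):
    """Summarize paired/unpaired usage for each nucleotide.

    Assumes dot-bracket notation: '.' is unpaired, any other symbol
    is paired. This notation is an assumption of the sketch.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    counts = Counter()
    for nt, symbol in zip(seq.upper(), structure):
        state = "unpaired" if symbol == "." else "paired"
        counts[(nt, state)] += 1
    total_unpaired = sum(
        n for (_, state), n in counts.items() if state == "unpaired"
    )
    stats = {}
    for nt in "ACGU":
        paired = counts[(nt, "paired")]
        unpaired = counts[(nt, "unpaired")]
        total = paired + unpaired
        stats[nt] = {
            # Share of all unpaired positions occupied by this nucleotide.
            "pct_of_unpaired_positions": (
                100.0 * unpaired / total_unpaired if total_unpaired else None
            ),
            # Share of this nucleotide's occurrences that are unpaired.
            "pct_of_nt_unpaired": 100.0 * unpaired / total if total else None,
            # Unpaired to paired ratio; undefined when nothing is paired.
            "unpaired_to_paired": unpaired / paired if paired else None,
        }
    return stats


# Toy hairpin for illustration only (not real rRNA data).
toy = nucleotide_stats("GAAGAAACUC", "((.(...)))")
```

With the percentages reported in the text, the same ratio for adenosine would be 66.24 / 33.76, or roughly 1.96, which matches the "nearly two" value described above.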
The most common adjacent nucleotides in any structural environment in the Bacterial 16S and 23S rRNAs are GG (9.86%; H-4C.2), while in loops the most common dinucleotides are AA (19.2%; H-4C.3), followed by GA (13.35%), UA (9.821%), AU (6.703%), etc. The most frequent adjacent nucleotides at the 3'loop-5'helix junction are AG (24.99%; H-4C.4), followed by AC (13.28%), GG (8.28%), etc. For the adjacent AA sequences, nearly 75% occur in loops, while approximately 12% occur in helices, another 12% occur at the 3'loop-5'helix junction, and less than 5% occur in 3'helix-5'loop junctions. Thus, these analyses of single and adjacent nucleotides reveal several strong biases in the distribution of nucleotides in different structural environments.
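A ranking of adjacent nucleotides by structural environment, as described above, might be computed along the following lines. This is a simplified sketch under stated assumptions: structures are given in dot-bracket notation, two adjacent paired positions are treated as "helix" even if they belong to different helices, and the function names are hypothetical rather than part of the CRW system.

```python
from collections import Counter, defaultdict


def environment(left, right):
    """Classify a pair of adjacent positions from dot-bracket symbols."""
    left_unpaired = left == "."
    right_unpaired = right == "."
    if left_unpaired and right_unpaired:
        return "loop"
    if not left_unpaired and not right_unpaired:
        return "helix"
    # Unpaired followed by paired: 3' end of a loop meets 5' end of a helix.
    return "3'loop-5'helix" if left_unpaired else "3'helix-5'loop"


def rank_dinucleotides(seq, structure):
    """Return, per environment, dinucleotides ranked by percentage."""
    by_env = defaultdict(Counter)
    for i in range(len(seq) - 1):
        env = environment(structure[i], structure[i + 1])
        by_env[env][seq[i:i + 2].upper()] += 1
    ranked = {}
    for env, counter in by_env.items():
        total = sum(counter.values())
        # most_common() sorts from most to least frequent pattern.
        ranked[env] = [
            (pattern, 100.0 * n / total) for pattern, n in counter.most_common()
        ]
    return ranked
```

Applied to a large set of structure models, the first entry of each ranked list would correspond to the most frequent pattern reported for that environment.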
The top section of the output page (Figure 5 and H-4C.1) displays the types of data (RNA molecules and phylogenetic/cell location groups) that were selected and analyzed. This section also reveals the number of structure models that were analyzed; 175 16S and 71 23S rRNA structure models were analyzed in Figure 5 and H-4C.
A few of the other biases in the distribution of nucleotide patterns that were determined by applying this sequence/structure query system to our comparative structure models are displayed in Table 6. A more detailed accounting of this information is available online.
Auxiliary components of the CRW site
In addition to the sections described above, the CRW Site also includes online appendices to work published elsewhere. The "Structure, Motifs, and Folding" section presently contains three RNA motif projects ("U-Turn" [62], "A Story" [63], and "AA.AG@helix.ends" [64]) and two RNA folding projects ("16S rRNA Folding" [65] and "23S rRNA Folding" [66]). In the "Phylogenetic Structure Analysis" section, additional information for three publications is available: "Mollusk Mitochondria" [67], "Polytoma Leucoplasts" [68], and "Algal Introns" [69].
Conclusions
Nearly 10 years ago, our initial goal for our RNA web page was to disseminate some of the comparative information we collected and analyzed for our prediction of 16S and 23S rRNA structure with comparative analysis. With dramatic increases in the number of ribosomal RNA sequences, we developed a relational database system to organize basic information about each sequence and structure entry, to maintain an inventory of our collection, and to retrieve any one or set of entries that satisfy the conditions of a search. In parallel, given the significant advancements in computational and networking hardware and software, our need for more detailed and quantitative comparative information for each RNA molecule under study, and our interest in studying more RNA molecules beyond 16S and 23S rRNA, we have greatly expanded our web site and named it the "Comparative RNA Web" (CRW) Site.
The major types of information available for each RNA molecule are:
1) the current comparative RNA structure model;
2) nucleotide and base pair frequency tables for all positions in the reference structure;
3) secondary structure conservation diagrams that reveal the extent of conservation in the RNA sequence and structure;
4) representative secondary structure diagrams for organisms from phylogenetic groups that span the phylogenetic tree and reveal the major forms of structural variation;
5) a semi-complete/partial collection of publicly available sequences that are 90% or more complete; and
6) sequence alignments.
At this time, we maintain the most current comparative sequence and structure information about the 16S and 23S rRNA. The other RNA molecules we maintain (5S rRNA, tRNA, and group I and II introns) are not as advanced at the time of this writing.
Our future aims for the CRW Site are to: 1) maintain a complete collection of sequences in our database management system for each of the RNAs under study; 2) once or twice a year, release new sequence alignments that contain A) improvements (if necessary) in the positioning of the sequences that are associated with similar structural elements, and B) increases in the number of aligned sequences; 3) generate more secondary structure diagrams for sequences that span the phylogenetic tree and reveal all forms of structural variation; 4) generate more secondary structure conservation diagrams and nucleotide and base pair frequency tables for more phylogenetic groups (e.g., Fungi: Basidiomycota, Ascomycota, and Zygomycota); 5) update the structure models when warranted by the analysis; 6) update current nucleotide and base pair frequency tables when the alignments they are derived from have been updated, and generate more frequency tables for more phylogenetic groups (see "4)" above); 7) add new types of comparative RNA sequence/structure information and new modes of presenting the data; and 8) analyze more types of RNA molecules from a comparative perspective, and present these data in the same formats utilized for the RNA molecules currently supported.
Materials and Methods
Sequence collection
The majority of the sequence alignments presented at the CRW Site were assembled in the Gutell laboratory. The alignments that were based on another laboratory's initial effort and enlarged and refined for the CRW project are: 1) the prokaryotic (Archaea and Bacteria) alignments for 16S rRNA [85]; 2) the 5S rRNA alignments [55]; and 3) the tRNA alignments [81]. The group I and II intron alignments were originally based upon sequences collected by Michel [32, 34].
New rRNA and intron sequences were found by searching the nucleic acid sequence database at GenBank using the NCBI Entrez system (http://www.ncbi.nlm.nih.gov/Entrez/) at least once per week with appropriate search criteria (e.g., "rrna" [Feature key] and "intron" [Feature key] to find introns that occur in rRNA). While the majority of the RNA sequences of importance to this database are available online at GenBank, a few sequences are only available in the literature (e.g., the Urospora penicilliformis intron [86]) or in a thesis; these sequences were manually entered into the appropriate sequence alignment. A few sequences were found in GenBank with the sequence similarity searching program BLAST [87]. At this time, we are only trying to identify sequences that are more than 90% complete, since sequences that are less than 90% complete are not currently retrieved with the CRW RDBMS.
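As an illustration of the kind of search described above, the sketch below assembles a query term from feature keys and builds a request URL for NCBI's present-day E-utilities ESearch service. This is not the procedure used for this work, which relied on the Entrez web interface. The endpoint, the "nucleotide" database name, and the function names are assumptions of the sketch, and no network request is made.

```python
from urllib.parse import urlencode

# Assumed endpoint: the E-utilities ESearch service, which is distinct
# from the Entrez web interface cited in the text.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def feature_key_query(*feature_keys):
    """Combine feature keys into one Entrez-style query term."""
    return " AND ".join(f'"{key}"[Feature key]' for key in feature_keys)


def esearch_url(term, retmax=100):
    """Build (but do not send) an ESearch request URL for the term."""
    params = {"db": "nucleotide", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Introns that occur in rRNA, following the example in the text.
term = feature_key_query("rrna", "intron")
url = esearch_url(term)
```

Running such a query on a weekly schedule and comparing the returned identifiers against those already in the collection would approximate the routine described above.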
Deviations in GenBank entries
The majority of GenBank entries contain accurate annotations of the RNAs. However, some GenBank entries deviate from this norm in a variety of ways. In some entries, the presence of the rRNA was not annotated and the rRNA was found by searching for short sequences that are characteristic of that rRNA (a few examples). Sometimes, intron sequences were not annotated and were discovered during the alignment of the corresponding rRNA exons (e.g., the unannotated intron in the uncultured archaeon SAGMA-B 16S rRNA (AB050206) and many Fungi, including AF401965 [88]). Other GenBank entries contain incorrect annotations for the RNAs; the boundaries may be misidentified by a small or large number of nucleotides.
RNA sequence alignment and classification of intron sequences
Alignment and determination of intron-exon boundaries
The sequence alignments used for this analysis are maintained by us at the University of Texas; these alignments, containing all publicly available sequences used in the analysis, are or will be available from the CRW Site (http://www.rna.icmb.utexas.edu; Table 2). rRNA, Type 1 tRNA, and intron sequences were manually aligned to maximize sequence and structural identity using the alignment editor AE2 (T. Macke, Scripps Clinic, San Diego, CA). The rRNA alignments are sorted by phylogeny and cell location; the intron alignments are sorted by subgroup, exon, insertion point (for rRNA introns), and phylogeny; and the tRNA alignments are sorted by aminoacyl type and phylogeny. Alignment of the rRNA exons (when available) between closely related sequences provided an independent evaluation of the intron-exon borders for each intron-containing rRNA sequence; the large number of rRNA sequences in our collection and the high level of sequence conservation at intron insertion points provide great confidence in this evaluation.
Classification of introns
Group I and II intron sequences were classified into one of the structural subgroups defined by Michel [32, 34] or the more recently determined subgroup IE [82] based upon sequence and structural homology to previously aligned sequences. Uncertainties in these assignments come from two main sources. First, some introns are referred to in rRNA GenBank entries without the intron sequence being provided; in these cases, we represent the intron as having length "NSEQ" (No SEQuence information) and accept the authors' major intron classification (e.g., group I or group II) but not the specific intron type (e.g., if an author classified an intron as IA1 and did not publish the sequence, our system designates its type as "I"). Second, in some cases we do have sequence information but cannot fully classify the intron with confidence; here, we provide the most plausible classification. The classifications "I" and "II," respectively, are group I and II introns of undefined subtype. An intron described as "IB" has the characteristic features of the IB subgroup but cannot be subclassified as IB1, IB2, IB3, or IB4. Those introns that do not belong to either group I or group II are generally classified as "Unknown" in the "RNA Class" field (see Section 4A and Table 5); included in this category are the Archaeal and spliceosomal introns. At present, the Archaeal and spliceosomal introns are identified with the phrases "Archaeal" and "spliceosomal," respectively, in the Comment field of the RDBMS; a standard designation for these introns will be added to a future version of the system. Although the introns in our collection have been judiciously placed into intron subgroups and these placements are approximately correct, they will be reanalyzed to ensure the accurate assignment of subgroups.
Identification of unannotated or misannotated introns, with examples
Some examples of introns that were identified or clarified by the alignment process are: 1) Aureoumbra lagunensis (U40258; the intron was annotated as an insertion); 2) Exophiala dermatitidis (X78481; the intron was not annotated); and 3) Chara sp. Qiu 96222 (AF191800; the intron annotations were shifted approximately 15 positions toward the 5' end of the rRNA sequence).
About TBD and NSEQ
Information that could not be determined either from the GenBank entries or by using these methods is represented in the RDBMS as TBD (To Be Determined). When a sequence is known to exist but is not available (for example, when an intron is inferred from an rRNA GenBank entry), the sequence length and percent completeness are instead represented as NSEQ (No SEQuence), to show that the sequence itself is not available.
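The TBD and NSEQ conventions, together with the rule of keeping only an author's major intron group when no sequence was published, can be sketched as follows. The function names, the field names, and the regular expression used to recognize class labels are hypothetical; the actual curation for this work was largely manual.

```python
import re

TBD = "TBD"    # To Be Determined
NSEQ = "NSEQ"  # No SEQuence available


def major_group(author_class):
    """Reduce a label such as 'IA1' or 'IIB' to its major group.

    The accepted label pattern is an assumption of this sketch.
    """
    match = re.fullmatch(r"(II|I)(?:[A-Z]\d*)?", author_class.strip())
    return match.group(1) if match else "Unknown"


def intron_fields(author_class, sequence=None, curated_class=None):
    """Build illustrative RDBMS field values for one intron entry."""
    if sequence is None:
        # Intron reported without its sequence: keep only the major
        # group and mark length and completeness as NSEQ.
        return {
            "rna_class": major_group(author_class),
            "length": NSEQ,
            "percent_complete": NSEQ,
        }
    return {
        # With a sequence, the class comes from curation, or is TBD.
        "rna_class": curated_class or TBD,
        "length": len(sequence),
        # Completeness needs a reference alignment, so it starts as TBD.
        "percent_complete": TBD,
    }
```

For example, an intron reported by its authors as IA1 without a published sequence would be stored with class "I" and with NSEQ in both the length and completeness fields.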
Database System
Contents of the RDBMS (general and intron-specific)
The relational database management system (RDBMS) available from the Comparative RNA Web Site (http://www.rna.icmb.utexas.edu) described in this work utilizes the MySQL engine (http://www.mysql.com/). The system contains vital statistics for each sequence (Table 4). The primary fields are: 1) organism name; 2) complete phylogeny; 3) cell location; 4) RNA type (general category; e.g., rRNA or intron); 5) RNA class (more detailed identification; e.g., 16S or IC1); 6) GenBank Accession Number (linked to GenBank); and 7) secondary structure diagrams for selected sequences. Intron-specific data stored in the system are the exon, intron number (index for multiple introns from a single exon), intron position (for rRNA introns only: the E. coli (GenBank Accession Number J01695) equivalent position number immediately before the intron), and open reading frame presence. Note that only sequences that are at least 90% complete are made available through this system. The majority of these data are manually entered into the database system; one exception is the complete NCBI phylogeny database [60, 61], which is automatically downloaded and incorporated into this system daily so that all RDBMS entries appear using the current NCBI scientific name for a given organism. Changes to the RDBMS phylogeny data are identified automatically during the incorporation process and then updated manually. Any changes made to the data become available to the public on the next day.
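The fields listed above suggest a table layout along the following lines. This is only a sketch: it uses Python's built-in sqlite3 module in place of MySQL so that it is self-contained, and every table name, column name, and sample row is a hypothetical placeholder rather than the actual CRW schema or data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sequence_entry (
    id INTEGER PRIMARY KEY,
    organism_name TEXT NOT NULL,
    phylogeny TEXT,
    cell_location TEXT,
    rna_type TEXT,            -- general category, e.g. rRNA or intron
    rna_class TEXT,           -- detailed class, e.g. 16S or IC1
    genbank_accession TEXT,
    percent_complete REAL,
    has_structure_diagram INTEGER DEFAULT 0
);
CREATE TABLE intron_detail (
    entry_id INTEGER REFERENCES sequence_entry(id),
    exon TEXT,
    intron_number INTEGER,    -- index among introns in one exon
    intron_position INTEGER,  -- E. coli equivalent position
    has_orf INTEGER
);
"""


def public_entries(conn, rna_class, min_complete=90.0):
    """Return entries of one RNA class that are at least 90% complete."""
    rows = conn.execute(
        "SELECT organism_name, genbank_accession FROM sequence_entry "
        "WHERE rna_class = ? AND percent_complete >= ? "
        "ORDER BY organism_name",
        (rna_class, min_complete),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Placeholder rows for illustration only; these are not real records.
conn.executemany(
    "INSERT INTO sequence_entry (organism_name, cell_location, rna_type, "
    "rna_class, genbank_accession, percent_complete) VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Placeholder organism A", "N", "rRNA", "16S", "PLACEHOLDER1", 98.5),
        ("Placeholder organism B", "M", "rRNA", "16S", "PLACEHOLDER2", 72.0),
    ],
)
```

Here the 90% completeness rule is applied at query time, so that less complete entries could remain in the inventory without being shown to users. Whether the CRW system filters at query time or at load time is not stated in the text.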
Secondary Structure and Conservation Diagrams
Secondary structure and conservation diagrams were developed entirely or in part with the interactive graphics program XRNA (Weiser & Noller, University of California, Santa Cruz). The PostScript files output by XRNA were converted into PDF using ghostscript (version 7.00; http://www.cs.wisc.edu/~ghost/index.htm).
Computer details
Hardware and software used
The Comparative RNA Web Site (http://www.rna.icmb.utexas.edu) is hosted on a Sun Microsystems Enterprise 250 dual-processor server. Apache web server version 1.3.20, from the Apache Software Foundation (http://www.apache.org/), provides the site's connectivity interface. The MySQL database (version 3.23.29; http://www.mysql.com/) provides the RDBMS functions. Web site statistics are collected using webalizer (version 2.01; http://www.mrunix.net/webalizer/).
Authentication system
The Comparative RNA Web Site has instituted an authorization system for its users. Information is collected to assist in web server administration and error tracking. On their initial visit, users select a username, provide a current email address (for verification purposes), and review the terms and conditions for use of the CRW Site. An email containing a validation URL for that account is then sent to the provided email address. At this URL, the user may provide additional information; the system then emails an initial password to the user at the selected email account. The user then has the two pieces of information (username and password) necessary to log in and use the CRW Site. Once logged in, the user may change the password and update the user information at any time.
URL rewriting
We strongly encourage all users to access the Comparative RNA Web Site using its main address, http://www.rna.icmb.utexas.edu/, rather than through specific URLs. As the site grows, specific pages may be moved, changed, or deleted. In addition, more specific URLs may not include the navigation system for the site, giving the user a suboptimal experience of the site as a whole. Therefore, the system is configured to route an initial request for a more specific URL to an introductory page, which offers users access to the main page and a selection of specific URLs.
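The routing behavior described above can be expressed as a simple decision rule. The sketch below is an assumption about the logic only: the actual site is served by Apache and its configuration is not given in the text, and the paths and function name here are hypothetical.

```python
MAIN_PAGE = "/"
INTRO_PAGE = "/intro"  # hypothetical path for the introductory page


def route(path, visited_before):
    """Decide which page to serve for a request.

    A first request for a deep URL is sent to the introductory page;
    requests for the main page, and all later requests, pass through.
    """
    if path in (MAIN_PAGE, INTRO_PAGE) or visited_before:
        return path
    return INTRO_PAGE
```

How the server recognizes a returning visitor (for example, through a session established at login) is left open here, since the text does not specify it.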
List of abbreviations
CRW = Comparative RNA Web
NCBI = National Center for Biotechnology Information
nt = nucleotide
RDBMS = Relational Database Management System
URL = Uniform Resource Locator