Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
Accurate computational protein function analysis is an
7
important way of extracting value from primary sequence
8
data. Due to the large amount of data, automated systems
9
seem unavoidable (at least for initial, prioritizing
10
steps). Such efforts are complicated, for a variety of
11
reasons. Many proteins belong to large families, as
12
suggested by Dayhoff [ 1 ] . Such families are often
13
composed of subfamilies related to each other by gene
14
duplication events. For example, Ingram [ 2 ] showed that
15
human α, β, and γ chains of hemoglobins are related to each
16
other by gene duplications. Gene duplication allows one
17
copy to assume a new biological role through mutation,
18
while the other copy preserves the original functionality [
19
3 4 ] . Hence, subfamilies often differ in their biological
20
functionality yet still exhibit a high degree of sequence
21
similarity.
22
Other complications in functional analysis include:
23
ignoring the multi-domain organization of proteins; error
24
propagation caused by transfer of information from
25
previously erroneously annotated sequences; insufficient
26
masking of low complexity regions; and alternative splicing
27
[ 5 ] .
28
Typically, automated sequence function analysis relies
29
on pairwise sequence similarity and programs such as BLAST
30
[ 6 ] or FASTA [ 7 ] . Annotating a sequence by
31
transferring annotation from its most similar sequence(s)
32
tends to produce overly specific annotation. In contrast,
33
analyses using profile search algorithms such as HMMER
34
http://hmmer.wustl.edu/and Pfam [ 8 ] classify sequences
35
too generally. They recognize that a query sequence belongs
36
to a certain family (or, to be more precise, indicate which
37
domain(s) the query is likely to contain), but do not
38
subclassify the sequence.
39
At least two scenarios can cause misleading predictions
40
when using pairwise sequence similarity alone for
41
annotation: (i) not having a known annotated representative
42
of the correct subfamily because incomplete sequence
43
databases and/or gene loss (Figure 1), and (ii) unequal
44
rates of evolution (Figure 2). The case of trying to
45
annotate the first (or only) representative of a novel
46
subfamily is of particular interest. Pairwise similarity
47
based methods alone cannot recognize that a new sequence
48
does not belong in any currently known subfamily (e.g.
49
"orphan" G-protein coupled receptors), because every
50
sequence is most similar to something. In contrast, when
51
constructing a phylogenetic tree, this case is easy to
52
observe (as illustrated in Figure 1). A human annotator can
53
use phylogenetic tree analysis to place a new sequence in
54
the subfamily structure of a gene tree of known sequences.
55
This approach was called "phylogenomics" by Eisen [ 9 ] .
56
It would be desirable to automate this procedure, but the
57
best automated methods for subfamily annotation, such as
58
the COGs database [ 10 ] , are clustering methods that do
59
not directly use phylogenetic analysis.
60
It is infeasible to completely automate functional
61
analysis, because it is impossible to precisely define what
62
protein "function" means. However, a principle of
63
phylogenomics is that orthologous sequences (that diverged
64
by speciation) are more likely to conserve protein function
65
than paralogous sequences (that diverged by gene
66
duplication). Orthology and paralogy are well defined and
67
can be inferred from gene and species trees. One useful and
68
automatable phylogenomics approach would be as follows: if
69
a novel sequence has orthologs, annotation can be
70
transferred from them (as in best BLAST analysis); if there
71
are no orthologs, the sequence is classified as just a
72
family member (as in Pfam/InterPro analysis) and flagged as
73
possibly the first representative of a novel subfamily. At
74
the core of such approaches stands therefore the
75
distinction between orthologs and paralogs, and hence the
76
ability to discriminate between duplication and speciation
77
events on a gene tree. Various efficient algorithms to
78
infer gene duplications on a gene tree by comparing it to a
79
species tree have been described (for example: by
80
Eulenstein [ 11 ] , and by Zhang [ 12 ] ). We developed a
81
simple algorithm (named SDI for Speciation Duplication
82
Inference) that appears to solve this problem even more
83
efficiently on realistic data sets, though it has an
84
asymptotic worst-case running time that is less favorable [
85
13 ] .
86
In practice, phylogenetic trees are unreliable. Errors
87
in trees will produce spurious inferred duplications. This
88
is obviously problematic if duplications are to be used as
89
indicators of potential functional changes. Therefore,
90
instead of determining the orthologs of a query sequence on
91
just one gene tree, inference could be performed over
92
bootstrap resampled gene trees [ 14 15 ] to estimate of the
93
reliability of the assignments. Here we describe and test a
94
procedure - RIO (for Resampled Inference of Orthologs) -
95
which allows to perform such analyses in an automated
96
fashion. We present results of using RIO to analyze a plant
97
(
98
A. thaliana [ 16 ] ) and an animal
99
(the nematode
100
C. elegans [ 17 ] ) proteome.
101
102
103
Algorithm
104
105
Definitions
106
Orthologs are defined as two genes that diverged by a
107
speciation event. Paralogs are defined as two genes that
108
diverged by a duplication event [ 18 ] . Other concepts
109
derived from gene trees can be useful for functional
110
prediction. We introduce and justify three such concepts
111
("super-orthologs", "ultra-paralogs", and
112
"subtree-neighbors"):
113
Careless use of orthology relationships without
114
examining the tree itself can lead to incorrect
115
annotations. In the example shown in Figure 3A, the human
116
query sequence has two orthologous sequences in wheat.
117
These two wheat sequences are related to each other by a
118
gene duplication and one (or even both) of them might
119
have undergone functional modification after their
120
divergence. Given a procedure that gave a list of
121
orthologues for the human gene query, such situations
122
should be revealed by only partial (or complete absence
123
of) agreement between the annotations of the wheat
124
orthologs. Now consider the situation in Figure 3B. This
125
is trickier, since in this case only one ortholog will be
126
reported for the query sequence, but it will be just as
127
dangerous to transfer annotation. We do not attempt to
128
solve this problem (the solution is careful manual
129
analysis of the gene tree) but an automated procedure can
130
warn that this situation might be present. For this
131
purpose we introduce the concept of
132
"super-orthologs":
133
134
Definition 1. Given a rooted gene
135
tree with duplication or speciation assigned to each of
136
its internal nodes, two sequences are super-orthologous
137
if and only if each internal node on their connecting
138
path represents a speciation event.
139
Hence, the query sequences in Figure 3have no
140
super-orthologs. In contrast, the rat, mouse, and wheat
141
sequences in Figure 1Aare super-orthologs pf the human
142
query sequence. By definition, the super-orthologs of a
143
given sequence are a subset of its orthologs.
144
Certain sequences underwent multiple recent
145
duplications, resulting in large species specific
146
sequence families, such as the
147
C. elegans seven-transmembrane
148
proteins acting as odorant and chemosensory receptors [
149
19 20 ] . For query sequences belonging to such sequence
150
families, orthologs (if present) are less effective for
151
predicting specific information. In these cases, paralogs
152
of the same (sub) family might be more informative for
153
functional prediction (as long as the duplications indeed
154
happened "late" in evolutionary times). To formalize
155
this, we introduce the concept of "ultra-paralogs":
156
157
Definition 2. Given a rooted gene
158
tree with duplication or speciation assigned to each of
159
its internal nodes, two sequences are ultra-paralogous if
160
and only if the smallest subtree containing them both
161
contains only internal nodes representing
162
duplications.
163
Figure 4illustrates the concept of ultra-paralogs. It
164
follows from definition 2 that two ultra-paralogous
165
sequences must occur in the same species.
166
Often, researchers construct a gene tree and then
167
informally use "subtrees" (clades) to make inferences
168
about sequences (without regard to duplications and
169
speciations). We introduce this concept into our
170
procedure as well, formalized as "subtree-neighbors"
171
(illustrated in Figure 5):
172
173
Definition 3. Given a completely
174
binary and rooted gene tree, the
175
k -subtree-neighbors of a sequence
176
q are defined as all sequences
177
derived from the
178
k -level parent node of
179
q , except
180
q itself (the level of
181
q itself is 0,
182
q 's parent is 1, and so
183
forth).
184
Subtree-neighbors can be useful if there is (partial)
185
agreement among their annotations (for example: if the
186
subtree-neighbors of a query are NAD +-dependent
187
isocitrate dehydrogenase and NADP +-dependent isocitrate
188
dehydrogenase we can suppose that the query is likely to
189
be a isocitrate dehydrogenase, but it is not possible to
190
determine whether it is dependent on NAD +or NADP +). If
191
the subtree-neighbors lack any agreement in their
192
annotations a useful inference is not possible (see [ 9 ]
193
for a more detailed discussion). Furthermore, orthologs
194
that are not also subtree-neighbors can be misleading
195
(for a more detailed discussion of this, see below, and
196
see Figures 10and 11for examples).
197
198
199
The RIO procedure
200
This basic RIO procedure is as follows. For a simple
201
example with only four bootstrap resamples, see Figure
202
6.
203
We use the Pfam protein family database [ 8 ] as a
204
source of high quality curated multiple sequence
205
alignments and profile HMMs (Hidden Markov Models, see [
206
21 ] for a review), as well as programs from the HMMER
207
package http://hmmer.wustl.edu/. RIO can easily be
208
adapted to work with different sources of alignments and
209
different alignment programs. For tree reconstruction,
210
the neighbor joining (NJ) algorithm [ 22 ] is used, since
211
it is reasonably fast, can handle alignments of large
212
numbers of sequences, and does not assume a molecular
213
clock. NJ recreates the correct additive tree as long as
214
the input distances are additive [ 23 ] , and is
215
effective even if additivity is only approximated [ 24 ]
216
.
217
218
Input: A query protein sequence
219
Q with unknown function.
220
A curated multiple alignment
221
A from the Pfam database for the
222
protein family that
223
Q belongs to (as determined by
224
hmmpfam from the HMMER package).
225
A profile HMM
226
H for the protein family that Q
227
belongs to.
228
229
Output: A list (as in Figure 7) of
230
proteins orthologous to
231
Q , sorted according to a bootstrap
232
confidence value (based on orthology, super-orthology, or
233
subtree-neighborings).
234
Optional: A gene tree based on the multiple alignment
235
A and the query
236
Q annotated with orthology
237
bootstrap confidence values for the query
238
Q .
239
240
Procedure:
241
242
243
1. Query sequence
244
Q is aligned to the existing
245
alignment
246
A (using hmmalign from the HMMER
247
package and the Pfam profile HMM
248
H ).
249
250
2. The alignment is bootstrap
251
resampled
252
x times (usually,
253
x = 100).
254
255
3. Maximum likelihood pairwise
256
distance matrices are calculated for each of the
257
x multiple alignments using a model
258
of amino acid substitution (for example, BLOSUM [ 25 ] or
259
Dayhoff PAM [ 26 ] ).
260
261
4. An unrooted phylogenetic tree is
262
inferred for each of the
263
x multiple alignments by neighbor
264
joining [ 22 ] , resulting in
265
x gene trees. Each tree is rooted
266
by a modified version of our SDI algorithm [ 13 ] that
267
minimized the number of duplications postulated (this is
268
discussed in more detail later).
269
270
5. For each of the
271
x rooted gene trees: For each node
272
it is inferred whether it represents a duplication or a
273
speciation event by comparing the gene tree to a trusted
274
species tree.
275
276
6. For each sequence
277
s in the gene tree (except
278
Q ): Count the number of gene trees
279
where
280
s is orthologous to
281
Q (see Figure 6for an illustration
282
of steps 5. and 6.). Bootstrap confidence values for
283
super-orthologies, ultra-paralogies and subtree-neighbors
284
are calculated analogously.
285
286
287
Precalculation of pairwise distances for increased
288
time efficiency
289
The most time consuming step in the procedure
290
described above is the calculation of pairwise distances.
291
[The time complexity is O(
292
xLN 2),
293
N being the number of sequences,
294
L being their length, and
295
x being the number of bootstrap
296
resamples. On an average Intel processor the wall clock
297
time for 100 bootstrapped datasets of a typical Pfam
298
multiple alignment is in the range of hours.]
299
Since the query sequence is aligned to stable Pfam
300
alignments, it is possible to precalculate the pairwise
301
distances for each alignment and store the results. Then,
302
when RIO is being used to analyze a query sequence, only
303
the distances of the query to each sequence in the Pfam
304
alignment have to be calculated. This step becomes thus
305
O(
306
xLN ) instead of O(
307
xLN 2).
308
To do this correctly, the aligned query sequence has
309
to be bootstrap resampled in exactly the same way as was
310
used for precalculating the pairwise distances of the
311
Pfam alignment. For this purpose, bootstrap positions
312
(e.g. which aligned columns from the Pfam alignment were
313
chosen in a particular bootstrap sample) are saved to a
314
file. With this file it is possible to bootstrap the new
315
alignment of N+1 sequences (Pfam alignment plus query
316
sequence) in precisely the same manner, so the NxN
317
precalculated distances are valid for the (N+1)x(N+1)
318
distance matrix. The alignment method must also guarantee
319
that the original Pfam multiple alignment remains
320
unchanged when the query sequence is aligned to it. This
321
requires specially prepared Pfam full alignments and
322
profile HMMs that are created with the HMMER software as
323
follows:
324
325
Input: Original Pfam full alignment
326
A .
327
328
Output: "aln" file containing
329
RIO-ready full alignment
330
"hmm" file containing a RIO-ready profile HMM
331
"nbd" file containing pairwise distances
332
"bsp" file bootstrap positions file
333
"pwd" file containing pairwise distances for bootstrap
334
resampled alignment
335
336
1. Remove sequences from species not
337
in RIO's master species tree from alignment
338
A . If
339
A does not contain enough sequences
340
(<6), abort.
341
342
2. Run hmmbuild -o A' on
343
A , using the same options as were
344
used to build the original Pfam HMM for
345
A , resulting in alignment
346
A' . (HMMER's construction
347
procedure slightly modifies the input alignment in ways
348
that are usually unimportant, but which matter to
349
bootstrapping in RIO.) Keep A' as the "aln" file.
350
351
3. Run hmmbuild with "--hand" option
352
on
353
A' , resulting in HMM
354
H' (using the same options as were
355
used to build the original HMM for
356
A ). Calibrate
357
H' with hmmcalibrate and keep as
358
"hmm" file.
359
360
4. Remove non-consensus (insert)
361
columns from
362
A' (these are annotated by HMMER),
363
resulting in alignment
364
A" .
365
366
5. Calculate pairwise distances for
367
A" , resulting in the "nbd" file
368
(non-bootstrapped distances).
369
370
6. Bootstrap resample the columns of
371
A" , resulting in the "bsp" file
372
(bootstrap positions file).
373
374
7. Calculate pairwise distances for
375
bootstrapped
376
A" , resulting in the "pwd"
377
file.
378
379
380
Rooting of gene trees
381
The concept of speciation and duplication is only
382
meaningful on rooted gene trees, but the neighbor joining
383
algorithm infers unrooted trees. We use a simple
384
parsimony criterion for rooting. Gene trees are rooted on
385
each branch, resulting in 2
386
N -3 differently rooted trees for a
387
gene tree of
388
N sequences. For each of these, the
389
number of inferred duplications is determined. From the
390
trees with a minimal number of duplications (if there is
391
more than one) the tree with the shortest total height is
392
chosen as the rooted tree. Empirical studies on gene
393
trees based on 1750 Pfam alignments show that about 60%
394
of trees rooted in such a way have their root in the same
395
position that direct midpoint rooting [ 27 ] would place
396
it.
397
Naively performing a full duplication/speciation
398
analysis on each of 2
399
N -3 differently rooted trees
400
results in a overall time complexity of O(
401
N 2) or worse, but this can be
402
avoided. For the purpose of the following discussion it
403
is assumed that our SDI algorithm for
404
speciation/duplication inference is employed, but the
405
idea applies to all algorithms based on a mapping
406
function M defined as follows [ 28 ] :
407
408
Definition 4. Let
409
G be the set of nodes in a rooted
410
binary gene tree and
411
S the set of nodes in a rooted
412
binary species tree. For any node
413
g ∈
414
G , let γ (
415
g ) be the set of species in which
416
occur the extant genes descendant from
417
g . For any node
418
s ∈
419
S , let σ (
420
s ) be the set of species in the
421
external nodes descendant from
422
s . For any
423
g ∈
424
G , let M(
425
g ) ∈
426
S be the smallest (lowest) node in
427
S satisfying γ (
428
g ) ⊆ σ (M(
429
g )).
430
Duplications are then defined using M(
431
g ) as follows:
432
433
Definition 5. Let
434
g
435
1 and
436
g
437
2 be the two child nodes of an
438
internal node
439
g of a rooted binary gene tree
440
G . Node
441
g is a duplication if and only if
442
M(
443
g ) = M(
444
g
445
1 ) or M(
446
g ) = M(
447
g
448
2 ).
449
The main task of most algorithms for duplication
450
inference is the calculation of M. After M has been
451
calculated for any rooted gene tree
452
G it is possible to explore
453
different root placements without having to recalculate M
454
for every node of
455
G . As long as the root is moved
456
one node at the time, M has to be recalculated only for
457
two nodes: the one node which was child 1 (if the new
458
root is placed on a branch originating from child 1 of
459
the previous root) or child 2 (otherwise) of the previous
460
root, as well as for the new root itself. Hence, two
461
postorder traversal steps (child 1 or 2 of the old root,
462
then the new root) in the SDI algorithm are all that is
463
needed. The new sum of duplications is determined by
464
keeping track of the change in duplication/speciation
465
status in the two recalculated nodes as well as in the
466
previous root. Performing this over the whole gene tree
467
(some nodes will be visited twice) it is possible to
468
explore all possible root placements and calculate the
469
resulting duplications in practically linear time. The
470
pseudocode algorithm is as follows:
471
472
473
Algorithm for speciation duplication inference
474
combined with rooting
475
476
Input : binary gene tree
477
G , rooted binary species tree
478
S .
479
480
Output:
481
G with "duplication" or
482
"speciation" assigned to each internal node and rooted in
483
such a way that the sum of duplications is minimized.
484
485
SDIunrooted(G, S)
486
487
root gene tree
488
G at the midpoint of any
489
branch;
490
set
491
B = getBranchesInOrder(
492
G );
493
SDIse(
494
G ,
495
S ) (see [ 13 ] );
496
for each branch
497
b in
498
B :
499
set
500
n
501
1 = child 1 of root of
502
G ;
503
set
504
n
505
2 = child 2 of root of
506
G ;
507
root
508
G at the midpoint of branch
509
b ;
510
updateM(
511
n
512
1 ,
513
n
514
2 ,
515
G );
516
if (sum of duplications in
517
G <
518
d
519
min ):
520
set
521
d
522
min =
523
d ;
524
set
525
G
526
dmin =
527
G ;
528
return
529
G
530
dmin ;
531
532
updateM(
533
n
534
1 ,
535
n
536
2 ,
537
G )
538
539
set
540
r = root of
541
G ;
542
if (child 1 of
543
r ==
544
n
545
1 || child 2 of
546
r ==
547
n
548
1 ):
549
calculateMforNode(
550
n
551
1 );
552
else:
553
calculateMforNode(
554
n
555
2 );
556
calculateMforNode(
557
r );
558
559
calculateMforNode(
560
n )
561
562
if (
563
n != external):
564
set
565
a = M(child 1 of
566
n );
567
set
568
b = M(child 2 of
569
n );
570
while (
571
a !=
572
b ):
573
if (
574
a >
575
b ):
576
set
577
a = parent of
578
a ;
579
else:
580
set
581
b = parent of
582
b ;
583
set M(
584
n ) =
585
a ;
586
if (M(
587
n ) == M(child 1 of
588
n ) || M(
589
n ) == M(child 2 of
590
n ):
591
592
n is duplication;
593
else:
594
595
n is speciation;
596
597
getBranchesInOrder(
598
G )
599
600
set
601
n = root of
602
G ;
603
set
604
i = 0;
605
while !(
606
n == root && indicator of
607
n == 2):
608
if (
609
n != external && indicator
610
of
611
n != 2):
612
if (indicator of
613
n == 0):
614
set indicator of
615
n = 1;
616
set
617
n = child 1 of
618
n ;
619
else:
620
set indicator of
621
n = 2;
622
set
623
n = child 2 of
624
n ;
625
if (parent of
626
n != root):
627
set B [
628
i ] = branch connecting
629
n and parent of
630
n ;
631
else:
632
set B [
633
i ] = branch connecting child 1 of
634
root and child 2 of root;
635
set
636
i =
637
i + 1;
638
else:
639
if (parent of
640
n != root &&
641
n != external):
642
set B [
643
i ] = branch connecting
644
n and parent of
645
n ;
646
set
647
i =
648
i + 1;
649
set
650
n = parent of
651
n ;
652
return
653
B ;
654
655
656
Master species tree
657
Duplication inference requires a species tree. For
658
this purpose, a single completely binary master species
659
tree was compiled manually, containing 249 of the most
660
commonly encountered species in Pfam (spanning Archaea,
661
Bacteria, and Eukaryotes). This tree is based mainly on
662
information from Maddison's "Tree of Life" project
663
http://tolweb.org/tree/phylogeny.html, NCBI's taxonomy
664
database
665
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html,
666
the "Deep Green" project
667
http://ucjeps.berkeley.edu/bryolab/greenplantpage.html,
668
and [ 29 30 31 32 ] . This master tree groups nematodes
669
and arthropods into a clade of ecdysozoans (molting
670
animals) as proposed by Aguinaldo [ 29 ] , a
671
classification which is still controversial. The tree is
672
available in NHX format [ 33 ] at
673
http://www.genetics.wustl.edu/eddy/forester/tree_of_life_bin_1-4.nhx.
674
675
676
677
Implementation
678
RIO is implemented in a Perl pipeline of several
679
software programs as follows. Alignment of the query
680
sequence is done programs from the HMMER package
681
http://hmmer.wustl.edu/. Bootstrapping is performed by a
682
bespoke C program. Maximum likelihood pairwise distances
683
are calculated using BLOSUM matrices [ 25 ] by a modified
684
version of TREE-PUZZLE [ 34 ] . Neighbor joining trees are
685
calculated by a modified version of NEIGHBOR from the
686
PHYLIP package [ 35 ]
687
http://evolution.genetics.washington.edu/phylip.html.
688
Rooting and duplication inference are accomplished by
689
"SDIunrooted" - a Java implementation of our SDI algorithm
690
which incorporates various methods for rooting (see above).
691
The actual counting of orthologs is performed by methods of
692
the Java class "RIO". These programs, with the exception of
693
HMMER, are part of the FORESTER package and are available
694
under the GNU license at
695
http://www.genetics.wustl.edu/eddy/forester/.
696
In order to run RIO locally, the following packages and
697
databases need to be present: HMMER, the Pfam database [ 8
698
] , the SWISS-PROT and TrEMBL databases [ 36 ] .
699
RIO is also available as an analysis webserver at
700
http://www.rio.wustl.edu/. The pairwise distance and tree
701
calculations are parallelized in this version (currently,
702
ten 1.26 GHz Pentium III processors are being used).
703
704
705
Results and Discussion
706
707
Precalculation of pairwise distances
708
Pairwise distances to be used in RIO analyses were
709
calculated using the "full" alignments (as opposed to the
710
smaller curated "seed" alignments) from Pfam 6.6 (August
711
2001, 3071 families, [ 8 ] ). Sequences from species not
712
present in the master species tree were removed from the
713
alignments. For computational efficiency reasons,
714
alignments that still contained more than 600 sequences
715
were further pruned; sequences not originating from
716
SWISS-PROT were discarded, and sequences from certain
717
mammals were excluded (mouse, rabbit, hamsters, goat, all
718
primates except human), since mammals are likely to be
719
oversampled in most Pfam families. For some extremely
720
large families [immunoglobulin domain (PF00047), protein
721
kinase domain (PF00069), collagen triple helix repeat
722
(PF01391), and rhodopsin-type 7 transmembrane receptor
723
(PF00001)], all mammalian sequences except those from
724
human and rat were excluded.
725
Alignments of average length <30 amino acids
726
(<40 for zinc finger domains) or with <6 sequences
727
were not analyzed, because of lack of phylogenetic
728
signal. For all other families, pairwise distances for
729
100 bootstrap samples were prepared. Following the above
730
rules, pairwise distances were precalculated for 2384
731
alignments from a total of 3071 in Pfam 6.6 (75
732
alignments were too short and 612 alignments contained
733
less than six sequences from species in the master
734
species tree).
735
736
737
Phylogenomic analyses of the A. thalianaand C.
738
elegansproteomes
739
In order to get an estimate of the effectiveness of
740
this implementation of automated phylogenomics, we used
741
the RIO procedure to analyze the
742
A. thaliana [ 16 ] and
743
C. elegans [ 17 ] proteomes.
744
The input for RIO consists of a query protein sequence
745
together with a Pfam alignment for a protein family that
746
the query belongs to. Before RIO could be applied we
747
therefore had to determine the matching domains for each
748
protein in the
749
A. thaliana and
750
C. elegans proteomes. For proteins
751
composed of different domains, a RIO analysis is
752
performed for each domain individually.
753
The source for protein sequences were:
754
ATH1.pep.03202001, a flatfile database of 25,579
755
A. thaliana amino acid sequences
756
(hypothetical, predicted and experimentally verified)
757
that have been identified as part of the Arabidopsis
758
Genome Initiative (AGI)
759
http://www.arabidopsis.org/info/agi.html, and wormpep 43,
760
a flatfile database of 19,730
761
C. elegans amino acid sequences
762
http://www.sanger.ac.uk/Projects/C_elegans/wormpep/.
763
The program hmmpfam (version 2.2 g) from the HMMER
764
package was used to search each protein sequence in
765
ATH1.pep.03202001 and wormpep 43 against Pfam 6.6. Only
766
domains with a score above the so-called Pfam gathering
767
cutoff were reported ("cut_ga" option) in order to
768
include only confident domain assignments.
769
The sum of domains assigned to the 25,579
770
A. thaliana protein sequences was
771
17,847 (counting multiple copies of the same domain in
772
one protein as one). 12,431 sequences matched one domain
773
(containing possibly multiple copies of this one domain).
774
1,982 sequences matched two different domains (containing
775
possibly multiple copies of both). 453 sequences matched
776
three or more different domains (containing possibly
777
multiple copies of each). Therefore, a total of 14,866
778
(58%) sequences from ATH1.pep.03202001 could be assigned
779
to one or more Pfam families.
780
Similarly, a sum of 12,314 domains was assigned to the
781
19,769
782
C. elegans protein sequences. 7,698
783
sequences matched one domain, 1,632 matched two different
784
domains, and 388 matched three or more different domains.
785
Thus, 9,718 (49%) sequences from wormpep 43 could be
786
assigned to one or more Pfam families.
787
RIO was then used to analyze each protein sequence
788
matching one or more Pfam families. The results from
789
these analyses can be found at
790
http://www.genetics.wustl.edu/eddy/forester/rio_analyses/.
791
The approximate time requirement was between two and
792
three weeks, performed on eight Pentium III 800 Mhz
793
processors.
794
795
796
How many sequences can be analyzed with RIO?
797
The first question we asked was simply how many
798
sequences can be analyzed with RIO. For an overview, see
799
Table 1. From the 17,847
800
A. thaliana domain sequences
801
matching a Pfam family, 14,905 (84%) could be analyzed
802
with RIO using the precalculated distances. 2859 (16%)
803
domain sequences were not analyzed because the
804
corresponding Pfam alignments were either too short or
805
did not contain enough sequences (as described above). 83
806
(0.5%) domain sequences were not analyzed because the
807
E-value for the match to their profile HMM was below the
808
threshold of 0.01. This represents a second filtering
809
step for preventing analyzing false domain assignments
810
(besides only analyzing domain sequences which score
811
above the gathering cutoff in the domain analysis). (RIO
812
performs a preprocessing step before aligning the query
813
sequence to a Pfam alignment, in which the program
814
hmmsearch is used to trim the query sequence by searching
815
it with the appropriate profile HMM. If the resulting
816
E-value was below 0.01 no analysis was performed.)
817
Multiple copies of the same domain in certain sequences
818
result in a sum of individual analyses larger then the
819
number of analyzed domain sequences. In case of
820
A. thaliana this number was
821
17,940.
822
Correspondingly, from the 12,314
823
C. elegans domain sequences
824
matching a Pfam family, 11,287 (92%) could be analyzed
825
with RIO using the precalculated distances. 901 (7%)
826
domain sequences were not analyzed because the
827
corresponding Pfam alignments were either too short or
828
did not contain enough sequences. 53 (0.4%) domain
829
sequences were not analyzed because the E-value for the
830
match to their profile HMM was below the threshold of
831
0.01. In addition, we did not analyze the 73
832
C. elegans sequences matching the
833
immunoglobulin family (PF00047), because we considered
834
the phylogenetic signal in this alignment to be
835
questionable. Furthermore, most of the 73 sequences
836
contain multiple copies of the immunoglobulin domain (for
837
example, CE08028 contains 48 immunoglobulin domains) and
838
we therefore worried that the results from this family
839
might skew our overall results. The sum of RIO analyses
840
was 14,740.
841
Thus, a little less than half of each proteome can be
842
analyzed by RIO. The most important factor is whether a
843
protein sequence has a match to a Pfam domain family.
844
845
846
RIO analysis of lactate/malate dehydrogenase family
847
members
848
In order to test whether RIO performs well on an
849
"easy" case, RIO was used to analyze lactate/malate
850
dehydrogenase family members both in
851
A. thaliana and
852
C. elegans . L-Lactate and malate
853
dehydrogenases are members of the same protein family
854
(represented in Pfam as ldh for the NAD-binding domain
855
and ldh_C for the alpha/beta C-terminal domain), yet they
856
catalyze different reactions. L-lactate dehydrogenase (EC
857
1.1.1.27) catalyzes the following reaction: (S)-lactate +
858
NAD += pyruvate + NADH [ 37 ] . Malate dehydrogenase
859
(NAD) (EC 1.1.1.37) catalyzes: (S)-malate + NAD +=
860
oxaloacetate + NADH [ 38 ] . NADP-dependent malate
861
dehydrogenase (EC 1.1.1.82) utilizes NADP +as cofactor
862
instead of NAD + [ 39 40 ] . According to the Pfam domain
863
analysis described above, the
864
A. thaliana proteome contains ten
865
lactate/malate dehydrogenase family members, whereas the
866
C. elegans proteome contains three.
867
(In addition,
868
C. elegans also contains two
869
putative members of a second lactate/malate dehydrogenase
870
family [ 41 ] , ldh_2, which are not discussed here.) The
871
RIO output for the
872
A. thaliana protein F12M16_14
873
analyzed against the ldh domain alignment is shown as an
874
example in Figure 7. The results are summarized in Tables
875
2and 3. Complete RIO output files (as well as NHX [ 33 ]
876
tree files) are avaliable, herefor
877
A. thaliana and at herefor
878
c.elegans . In all cases,
879
distinction between malate dehydrogenase (NAD) and
880
lactate dehydrogenase is unquestionable and in accordance
881
with existing annotations and BLAST results irrespective
882
which domain (ldh or ldh_C) was used for the RIO analysis
883
(which implies that no domain swapping occurred over long
884
evolutionary times). Furthermore, the same results are
885
achieved whether only the top 1 sequence (the one with
886
the highest orthology value, shown in Tables 2and 3) or
887
the top 10 sequences are used to transfer annotation
888
from. The only likely NADP-dependent malate dehydrogenase
889
is the
890
A. thaliana sequence MCK7_20. For
891
some query sequences, the top orthology values are low.
892
Yet, all subtree-neighborings above 50% exhibit consensus
893
at distinguishing between malate and lactate
894
dehydrogenase. In contrast, a finer distinction (e.g.
895
between mitochondrial and cytoplasmic malate
896
dehydrogenase) proves more problematic. While there is no
897
case of actual conflict between the existing annotation
898
and the RIO results, in many cases there is no compelling
899
evidence in the RIO results to confirm the finer
900
distinctions in the existing annotations. Obviously, the
901
resolution power of RIO is limited by the given
902
annotations and by the number (or even presence) of
903
sequences for each sub(sub)family.
904
905
906
Sequences with no orthologs in the current
907
databases
908
Next, we determined the distribution of the top
909
orthology bootstrap values. The sequence with the top
910
orthology bootstrap value is the one that is most likely
911
to be the true ortholog of the query. If the top
912
orthology bootstrap value is low, then the query sequence
913
is likely to have no ortholog in the Pfam alignment.
914
These results are summarized in Table 4. For example, for
915
2252
916
A. thaliana query sequences, at
917
least one sequence was orthologous in at least 95 out of
918
100 resampled trees. In contrast, for 930
919
A. thaliana query sequences, no
920
sequence was orthologous in more than five out of 100
921
bootstrapped trees. For query sequences with more than
922
one copy of the same domain, each copy had to meet the
923
conditions individually in order for the whole query
924
sequence being counted to be below or above the
925
threshold.
926
We do not think it is possible at this stage to
927
determine reliable threshold values for "true orthologs"
928
or "absence of orthologs". Such thresholds are very
929
likely to be different for different Pfam families since
930
families vary in the phylogenetic signal their alignment
931
contains. Some sequences that are very likely to be true
932
933
orthologs nonetheless exhibit marginal orthology
934
bootstrap values (in the range of 70% or even lower).
935
We focused on sequences that appeared to have no
936
orthologs (<5% bootstrap), since these would be cases
937
where a RIO analysis might be most able to correct overly
938
specific annotations that might be transferred based
939
solely on sequence similarity (as illustrated in Figure
940
1). An example for this is the
941
A. thaliana sequence F28P22_13.
942
(Files related to this analysis are avaliable, here.)
943
This sequence is a zinc-binding dehydrogenase (Pfam:
944
adh_zinc, PF00107). F28P22_13 has been annotated in
945
ATH1.pep.03202001 "as putative cinnamyl-alcohol
946
dehydrogenase", based on sequence similarity (its top 10
947
BLAST matches are all cinnamyl-alcohol dehydrogenases
948
with E-values in the range of 10 -94if analyzed against
949
all non-redundant GenBank CDS
950
translations+PDB+SwissProt+PIR+PRF on Jan 2, 2002).
951
Cinnamyl-alcohol dehydrogenase (EC 1.1.1.195) catalyzes
952
the following reaction: cinnamyl alcohol + NADP +=
953
cinnamaldehyde + NADPH (but it can also act on coniferyl
954
alcohol, sinapyl alcohol and 4-coumaryl alcohol) in the
955
flavonoid, stilbene and lignin biosynthesis pathways [ 40
956
42 ] . According to the RIO analysis, F28P22_13 has no
957
orthologs (see Figure 8for the corresponding tree and
958
Figure 9for the RIO output). Furthermore its
959
subtree-neighbors above 90%, cinnamyl-alcohol
960
dehydrogenases and NADP-dependent alcohol dehydrogenases
961
(EC 1.1.1.2), exhibit only partial annotation agreement
962
(namely that of some type of NADP-dependent alcohol
963
dehydrogenase, but not EC 1.1.1.2 or EC 1.1.1.195).
964
Hence, F28P22_13 is likely to be a (possibly novel) type
965
of NADP-dependent alcohol dehydrogenase (other than EC
966
1.1.1.2), possibly a novel type of cinnamyl-alcohol
967
dehydrogenase.
968
One might expect that each query sequence that appears
969
to have no orthologs is connected with scenario similar
970
to the one described above for F28P22_13. Yet, this is
971
clearly not the case, for the following reasons: (i) Gene
972
duplications might not be followed by functional
973
modification (many Pfam families are composed of
974
sequences which have all the same function, at least at
975
the resolution of the current annotation). (ii) Some Pfam
976
families are composed solely of sequences originating
977
from closely related (or the same) species (such as
978
PF02362, the B3 DNA binding domain of higher plants). For
979
such families, query sequences from the same species
980
group are expected to have low orthology values. In such
981
cases the concept of subtree-neighbors and ultra-paralogs
982
is more useful than orthologs. (iii) Erroneous RIO
983
results caused by an insufficient phylogenetic signal
984
(due to short sequences, for example) can lead to low
985
orthology values. For this reason, RIO also outputs the
986
average bootstrap value for the consensus tree to give
987
the user a hint about the amount of phylogenetic signal
988
in the alignment used.
989
990
991
Inconsistency between orthology bootstrap values
992
and sequence similarity
993
We were next interested in the number of sequences in
994
the two proteomes for which the orthology bootstrap
995
values do not correspond to sequence similarity (Table
996
5). Such disagreements could be caused by the situation
997
illustrated in Figure 2. To determine these numbers, we
998
used the following rules. Two thresholds for orthology
999
bootstrap values were chosen:
1000
O , the minimum for being an
1001
ortholog (e.g. 90%) and
1002
N , the maximum for not being an
1003
ortholog (e.g. 10%). Furthermore, a maximal ratio
1004
R for the distance of the query to
1005
non-orthologs to the distance of the query to orthologs
1006
was chosen (e.g. 0.5). In order for being counted as
1007
exhibiting disagreement between the orthology bootstrap
1008
values and sequence similarity a query sequence had to
1009
fulfill the following two conditions: (i) it must have a
1010
least one ortholog with bootstrap orthology value above
1011
or equal to
1012
O , and (ii)
1013
all sequences in the alignment with
1014
bootstrap orthology values above
1015
N , must have distance ratios
1016
smaller or equal to
1017
R for at least one sequence with
1018
bootstrap orthology lower or equal to
1019
N . Sequences from the following
1020
species were ignored in this analysis (since they were
1021
the species of the query sequence or related to it):
1022
A. thaliana proteome:
1023
Rosidae (
1024
A. thaliana, Pisum sativum ,
1025
Glycine max ,
1026
Cucurbita maxima ,
1027
Cucumis sativus ,
1028
Brassica campestris ,
1029
Brassica napus ,
1030
Citrus unshiu ,
1031
Citrus sinensis ,
1032
Theobroma cacao ,
1033
Gossypium hirsutum );
1034
C. elegans proteome: nematodes (
1035
C. elegans ,
1036
Caenorhabditis briggsae ,
1037
Haemonchus contortus ,
1038
Ascaris suum ).
1039
Manual inspection of the RIO output leads to the
1040
following, somewhat unexpected, conclusion. In many cases
1041
a discrepancy between orthology bootstrap values and
1042
sequence similarity is caused by orthologs in only
1043
phylogenetically distant (relatively to the query
1044
sequence) species. This can lead to errors if functional
1045
annotation is blindly transferred from these orthologs to
1046
the query. As an example of this, the results of
1047
analyzing the
1048
A. thaliana O-methyltransferase
1049
F16P17_38 are shown in Figures 10and 11. (Complete files
1050
are at here.) Even though the F16P17_38 sequence is
1051
orthologous to the bacterial hydroxyneurosporene
1052
methyltransferases (EC 2.1.1.-) [ 43 ] it would be
1053
dangerous to annotate it as such. A more reasonable
1054
annotation for this query would be to annotate it based
1055
on subtree-neighbors and hence call it a plant
1056
O-methyltransferase. An indication of this problem
1057
(besides a discrepancy between orthology bootstrap values
1058
and sequence similarity) is the meeting of the following
1059
three conditions: A query sequence has (i) likely
1060
orthologs and (ii) likely subtree-neighbors in other
1061
species than the query itself, yet (iii) there is no
1062
significant overlap between the orthologs and the
1063
subtree-neighbors.
1064
We were unable to find convincing examples in the
1065
C. elegans and
1066
A. thaliana proteomes where wrong
1067
sequence similarity based annotations might be caused by
1068
unequal rates of evolution (as illustrated in Figure 2).
1069
This is not to say that such cases do not exist in those
1070
two proteomes, but they are likely to be quite rare.
1071
Similarly to the issues described in the previous
1072
section, the detection of such examples is complicated by
1073
the fact that for many cases in which a discrepancy
1074
between orthology bootstrap values and sequence
1075
similarity exists, all sequences in the Pfam alignment
1076
appear to have to same function, the Pfam family is
1077
lineage specific, or the annotations are too
1078
poor/confusing to make any kind of inference.
1079
1080
1081
1082
Conclusions
1083
RIO is a procedure for automated phylogenomics. The RIO
1084
procedure appears to be particularly useful for the
1085
detection of first representatives of novel protein
1086
subfamilies. Sequence similarity based methods can be
1087
misleading in these cases since every query is always "most
1088
similar to something", whereas RIO can detect the absence
1089
of orthologs.
1090
Storm, Sonnhammer, and colleagues have recently
1091
developed similar ideas and procedures in a program called
1092
ORTHOSTRAPPER [ 44 45 ] . One distinction between the two
1093
approaches is that ORTHOSTRAPPER's orthology determination
1094
procedure does not employ a species tree for duplication
1095
inference; it uses a heuristic based on sequence similarity
1096
rather than a formally correct phylogenetic means of
1097
inferring orthology. Another distinction is that
1098
ORTHOSTRAPPER uses uncorrected observed mismatches as a
1099
sequence distance measure, rather than estimating
1100
evolutionary distances. In general, RIO brings more of the
1101
power of known phylogenetic inference algorithms to bear on
1102
the problem of proteomic annotation.
1103
Super-orthology is a very stringent criterion. If a
1104
query sequence is likely to have super-orthologs, they
1105
represent an excellent source to transfer functional
1106
annotation from. In contrast, the absence of
1107
super-orthologs does not imply that a function for a query
1108
sequence cannot be inferred (in the two proteomes analyzed
1109
in this work, most sequences appear to have no
1110
super-orthologs in Pfam 6.6).
1111
Ultra-paralogs are sequences in the same species as the
1112
query and are likely to be the result of recent
1113
duplications and therefore might not have yet undergone
1114
much functional divergence. Operationally, splice variants
1115
can also be thought of as ultra-paralogs (at least as long
1116
as protein sequences are considered).
1117
Subtree-neighbors have two uses: (i) If the
1118
subtree-neighbors of the query sequence exhibit (partial)
1119
agreement in their functional annotations, the elements in
1120
which they agree might be used to infer a (partial)
1121
function for the query. This is useful for query sequences
1122
that are appear to have no orthologs in the current
1123
databases. (ii) For query sequences that do have orthologs,
1124
absence of overlap between the sequences considered
1125
orthologous and those which appear to be subtree-neighbors
1126
raises a red flag, indicating that the orthologs are in
1127
phylogenetically distant species relative to the query.
1128
Transferring annotation from such orthologs is risky. In
1129
this case, subtree-neighbors are a more reliable source to
1130
transfer annotation from.
1131
RIO outputs warnings if the distance of the query
1132
sequence to other sequences is unusually short or long,
1133
relative to other branch lengths on the tree. The
1134
usefulness of this was not investigated in this work.
1135
A RIO procedure based on Pfam alignments analyzes each
1136
protein domain individually since Pfam is protein family
1137
database based on individual domains [ 8 ] . In some
1138
respects, it would be preferable to analyze whole protein
1139
sequences, but well curated databases of complete protein
1140
alignments are not available (to our knowledge). However,
1141
domain-by-domain analysis is not necessarily
1142
disadvantageous. Due to domain shuffling many proteins are
1143
mosaic proteins, composed of domains with different
1144
evolutionary histories [ 46 47 ] . For such proteins it
1145
makes much sense to analyze each domain individually.
1146
Furthermore, mosaic proteins from sufficiently distant
1147
species might be impossible to be aligned over more than
1148
one domain at the time, since they are unlikely to exhibit
1149
the same domain organization. The same is true for multiple
1150
copies of the same domain in protein: Each of them is
1151
analyzed individually (such proteins often differ in their
1152
number of domain copies and could therefore not be aligned
1153
from end to end for the whole family).
1154
In general, the concept of "annotation consensus" is
1155
very important in this work (for example consensus between
1156
subtree-neighbors, or between subtree-neighbors and
1157
orthologs). We have employed this notion loosely. A useful
1158
future extension would be to incorporate automated
1159
annotation consensus detection into RIO. This would include
1160
annotation of internal nodes of a gene tree with a
1161
"biological function". Automated consensus detection is
1162
trivial for a highly formalized notation system, such as EC
1163
numbers (the consensus of EC 1.1.1.3 and EC 1.1.1.23 is EC
1164
1.1.1, a oxidoreductase acting on the CH-OH group of donors
1165
with NAD +or NADP +as acceptor [ 40 ] ). Obviously, it is
1166
much more difficult to analyze natural language annotations
1167
in the same manner. Perhaps this could be accomplished by
1168
utilizing the set of structured vocabularies of the Gene
1169
Ontology (GO) project [ 48 ]
1170
http://www.geneontology.org/.
1171
1172
1173
1174
1175