CoCalc -- 1471-2105-3-14.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-3-14.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Background
6
        Accurate computational protein function analysis is an
7
        important way of extracting value from primary sequence
8
        data. Due to the large amount of data, automated systems
9
        seem unavoidable (at least for initial, prioritizing
10
        steps). Such efforts are complicated, for a variety of
11
        reasons. Many proteins belong to large families, as
12
        suggested by Dayhoff [ 1 ] . Such families are often
13
        composed of subfamilies related to each other by gene
14
        duplication events. For example, Ingram [ 2 ] showed that
15
        human α, β, and γ chains of hemoglobins are related to each
16
        other by gene duplications. Gene duplication allows one
17
        copy to assume a new biological role through mutation,
18
        while the other copy preserves the original functionality [
19
        3 4 ] . Hence, subfamilies often differ in their biological
20
        functionality yet still exhibit a high degree of sequence
21
        similarity.
22
        Other complications in functional analysis include:
23
        ignoring the multi-domain organization of proteins; error
24
        propagation caused by transfer of information from
25
        previously erroneously annotated sequences; insufficient
26
        masking of low complexity regions; and alternative splicing
27
        [ 5 ] .
28
        Typically, automated sequence function analysis relies
29
        on pairwise sequence similarity and programs such as BLAST
30
        [ 6 ] or FASTA [ 7 ] . Annotating a sequence by
31
        transferring annotation from its most similar sequence(s)
32
        tends to produce overly specific annotation. In contrast,
33
        analyses using profile search algorithms such as HMMER
34
        http://hmmer.wustl.edu/and Pfam [ 8 ] classify sequences
35
        too generally. They recognize that a query sequence belongs
36
        to a certain family (or, to be more precise, indicate which
37
        domain(s) the query is likely to contain), but do not
38
        subclassify the sequence.
39
        At least two scenarios can cause misleading predictions
40
        when using pairwise sequence similarity alone for
41
        annotation: (i) not having a known annotated representative
42
        of the correct subfamily because incomplete sequence
43
        databases and/or gene loss (Figure 1), and (ii) unequal
44
        rates of evolution (Figure 2). The case of trying to
45
        annotate the first (or only) representative of a novel
46
        subfamily is of particular interest. Pairwise similarity
47
        based methods alone cannot recognize that a new sequence
48
        does not belong in any currently known subfamily (e.g.
49
        "orphan" G-protein coupled receptors), because every
50
        sequence is most similar to something. In contrast, when
51
        constructing a phylogenetic tree, this case is easy to
52
        observe (as illustrated in Figure 1). A human annotator can
53
        use phylogenetic tree analysis to place a new sequence in
54
        the subfamily structure of a gene tree of known sequences.
55
        This approach was called "phylogenomics" by Eisen [ 9 ] .
56
        It would be desirable to automate this procedure, but the
57
        best automated methods for subfamily annotation, such as
58
        the COGs database [ 10 ] , are clustering methods that do
59
        not directly use phylogenetic analysis.
60
        It is infeasible to completely automate functional
61
        analysis, because it is impossible to precisely define what
62
        protein "function" means. However, a principle of
63
        phylogenomics is that orthologous sequences (that diverged
64
        by speciation) are more likely to conserve protein function
65
        than paralogous sequences (that diverged by gene
66
        duplication). Orthology and paralogy are well defined and
67
        can be inferred from gene and species trees. One useful and
68
        automatable phylogenomics approach would be as follows: if
69
        a novel sequence has orthologs, annotation can be
70
        transferred from them (as in best BLAST analysis); if there
71
        are no orthologs, the sequence is classified as just a
72
        family member (as in Pfam/InterPro analysis) and flagged as
73
        possibly the first representative of a novel subfamily. At
74
        the core of such approaches stands therefore the
75
        distinction between orthologs and paralogs, and hence the
76
        ability to discriminate between duplication and speciation
77
        events on a gene tree. Various efficient algorithms to
78
        infer gene duplications on a gene tree by comparing it to a
79
        species tree have been described (for example: by
80
        Eulenstein [ 11 ] , and by Zhang [ 12 ] ). We developed a
81
        simple algorithm (named SDI for Speciation Duplication
82
        Inference) that appears to solve this problem even more
83
        efficiently on realistic data sets, though it has an
84
        asymptotic worst-case running time that is less favorable [
85
        13 ] .
86
        In practice, phylogenetic trees are unreliable. Errors
87
        in trees will produce spurious inferred duplications. This
88
        is obviously problematic if duplications are to be used as
89
        indicators of potential functional changes. Therefore,
90
        instead of determining the orthologs of a query sequence on
91
        just one gene tree, inference could be performed over
92
        bootstrap resampled gene trees [ 14 15 ] to estimate of the
93
        reliability of the assignments. Here we describe and test a
94
        procedure - RIO (for Resampled Inference of Orthologs) -
95
        which allows to perform such analyses in an automated
96
        fashion. We present results of using RIO to analyze a plant
97
        ( 
98
        A. thaliana [ 16 ] ) and an animal
99
        (the nematode 
100
        C. elegans [ 17 ] ) proteome.
101
      
102
      
103
        Algorithm
104
        
105
          Definitions
106
          Orthologs are defined as two genes that diverged by a
107
          speciation event. Paralogs are defined as two genes that
108
          diverged by a duplication event [ 18 ] . Other concepts
109
          derived from gene trees can be useful for functional
110
          prediction. We introduce and justify three such concepts
111
          ("super-orthologs", "ultra-paralogs", and
112
          "subtree-neighbors"):
113
          Careless use of orthology relationships without
114
          examining the tree itself can lead to incorrect
115
          annotations. In the example shown in Figure 3A, the human
116
          query sequence has two orthologous sequences in wheat.
117
          These two wheat sequences are related to each other by a
118
          gene duplication and one (or even both) of them might
119
          have undergone functional modification after their
120
          divergence. Given a procedure that gave a list of
121
          orthologues for the human gene query, such situations
122
          should be revealed by only partial (or complete absence
123
          of) agreement between the annotations of the wheat
124
          orthologs. Now consider the situation in Figure 3B. This
125
          is trickier, since in this case only one ortholog will be
126
          reported for the query sequence, but it will be just as
127
          dangerous to transfer annotation. We do not attempt to
128
          solve this problem (the solution is careful manual
129
          analysis of the gene tree) but an automated procedure can
130
          warn that this situation might be present. For this
131
          purpose we introduce the concept of
132
          "super-orthologs":
133
          
134
          Definition 1. Given a rooted gene
135
          tree with duplication or speciation assigned to each of
136
          its internal nodes, two sequences are super-orthologous
137
          if and only if each internal node on their connecting
138
          path represents a speciation event.
139
          Hence, the query sequences in Figure 3have no
140
          super-orthologs. In contrast, the rat, mouse, and wheat
141
          sequences in Figure 1Aare super-orthologs pf the human
142
          query sequence. By definition, the super-orthologs of a
143
          given sequence are a subset of its orthologs.
144
          Certain sequences underwent multiple recent
145
          duplications, resulting in large species specific
146
          sequence families, such as the 
147
          C. elegans seven-transmembrane
148
          proteins acting as odorant and chemosensory receptors [
149
          19 20 ] . For query sequences belonging to such sequence
150
          families, orthologs (if present) are less effective for
151
          predicting specific information. In these cases, paralogs
152
          of the same (sub) family might be more informative for
153
          functional prediction (as long as the duplications indeed
154
          happened "late" in evolutionary times). To formalize
155
          this, we introduce the concept of "ultra-paralogs":
156
          
157
          Definition 2. Given a rooted gene
158
          tree with duplication or speciation assigned to each of
159
          its internal nodes, two sequences are ultra-paralogous if
160
          and only if the smallest subtree containing them both
161
          contains only internal nodes representing
162
          duplications.
163
          Figure 4illustrates the concept of ultra-paralogs. It
164
          follows from definition 2 that two ultra-paralogous
165
          sequences must occur in the same species.
166
          Often, researchers construct a gene tree and then
167
          informally use "subtrees" (clades) to make inferences
168
          about sequences (without regard to duplications and
169
          speciations). We introduce this concept into our
170
          procedure as well, formalized as "subtree-neighbors"
171
          (illustrated in Figure 5):
172
          
173
          Definition 3. Given a completely
174
          binary and rooted gene tree, the 
175
          k -subtree-neighbors of a sequence 
176
          q are defined as all sequences
177
          derived from the 
178
          k -level parent node of 
179
          q , except 
180
          q itself (the level of 
181
          q itself is 0, 
182
          q 's parent is 1, and so
183
          forth).
184
          Subtree-neighbors can be useful if there is (partial)
185
          agreement among their annotations (for example: if the
186
          subtree-neighbors of a query are NAD +-dependent
187
          isocitrate dehydrogenase and NADP +-dependent isocitrate
188
          dehydrogenase we can suppose that the query is likely to
189
          be a isocitrate dehydrogenase, but it is not possible to
190
          determine whether it is dependent on NAD +or NADP +). If
191
          the subtree-neighbors lack any agreement in their
192
          annotations a useful inference is not possible (see [ 9 ]
193
          for a more detailed discussion). Furthermore, orthologs
194
          that are not also subtree-neighbors can be misleading
195
          (for a more detailed discussion of this, see below, and
196
          see Figures 10and 11for examples).
197
        
198
        
199
          The RIO procedure
200
          This basic RIO procedure is as follows. For a simple
201
          example with only four bootstrap resamples, see Figure
202
          6.
203
          We use the Pfam protein family database [ 8 ] as a
204
          source of high quality curated multiple sequence
205
          alignments and profile HMMs (Hidden Markov Models, see [
206
          21 ] for a review), as well as programs from the HMMER
207
          package http://hmmer.wustl.edu/. RIO can easily be
208
          adapted to work with different sources of alignments and
209
          different alignment programs. For tree reconstruction,
210
          the neighbor joining (NJ) algorithm [ 22 ] is used, since
211
          it is reasonably fast, can handle alignments of large
212
          numbers of sequences, and does not assume a molecular
213
          clock. NJ recreates the correct additive tree as long as
214
          the input distances are additive [ 23 ] , and is
215
          effective even if additivity is only approximated [ 24 ]
216
          .
217
          
218
          Input: A query protein sequence 
219
          Q with unknown function.
220
          A curated multiple alignment 
221
          A from the Pfam database for the
222
          protein family that 
223
          Q belongs to (as determined by
224
          hmmpfam from the HMMER package).
225
          A profile HMM 
226
          H for the protein family that Q
227
          belongs to.
228
          
229
          Output: A list (as in Figure 7) of
230
          proteins orthologous to 
231
          Q , sorted according to a bootstrap
232
          confidence value (based on orthology, super-orthology, or
233
          subtree-neighborings).
234
          Optional: A gene tree based on the multiple alignment 
235
          A and the query 
236
          Q annotated with orthology
237
          bootstrap confidence values for the query 
238
          Q .
239
          
240
            Procedure: 
241
          
242
          
243
          1. Query sequence 
244
          Q is aligned to the existing
245
          alignment 
246
          A (using hmmalign from the HMMER
247
          package and the Pfam profile HMM 
248
          H ).
249
          
250
          2. The alignment is bootstrap
251
          resampled 
252
          x times (usually, 
253
          x = 100).
254
          
255
          3. Maximum likelihood pairwise
256
          distance matrices are calculated for each of the 
257
          x multiple alignments using a model
258
          of amino acid substitution (for example, BLOSUM [ 25 ] or
259
          Dayhoff PAM [ 26 ] ).
260
          
261
          4. An unrooted phylogenetic tree is
262
          inferred for each of the 
263
          x multiple alignments by neighbor
264
          joining [ 22 ] , resulting in 
265
          x gene trees. Each tree is rooted
266
          by a modified version of our SDI algorithm [ 13 ] that
267
          minimized the number of duplications postulated (this is
268
          discussed in more detail later).
269
          
270
          5. For each of the 
271
          x rooted gene trees: For each node
272
          it is inferred whether it represents a duplication or a
273
          speciation event by comparing the gene tree to a trusted
274
          species tree.
275
          
276
          6. For each sequence 
277
          s in the gene tree (except 
278
          Q ): Count the number of gene trees
279
          where 
280
          s is orthologous to 
281
          Q (see Figure 6for an illustration
282
          of steps 5. and 6.). Bootstrap confidence values for
283
          super-orthologies, ultra-paralogies and subtree-neighbors
284
          are calculated analogously.
285
        
286
        
287
          Precalculation of pairwise distances for increased
288
          time efficiency
289
          The most time consuming step in the procedure
290
          described above is the calculation of pairwise distances.
291
          [The time complexity is O( 
292
          xLN 2), 
293
          N being the number of sequences, 
294
          L being their length, and 
295
          x being the number of bootstrap
296
          resamples. On an average Intel processor the wall clock
297
          time for 100 bootstrapped datasets of a typical Pfam
298
          multiple alignment is in the range of hours.]
299
          Since the query sequence is aligned to stable Pfam
300
          alignments, it is possible to precalculate the pairwise
301
          distances for each alignment and store the results. Then,
302
          when RIO is being used to analyze a query sequence, only
303
          the distances of the query to each sequence in the Pfam
304
          alignment have to be calculated. This step becomes thus
305
          O( 
306
          xLN ) instead of O( 
307
          xLN 2).
308
          To do this correctly, the aligned query sequence has
309
          to be bootstrap resampled in exactly the same way as was
310
          used for precalculating the pairwise distances of the
311
          Pfam alignment. For this purpose, bootstrap positions
312
          (e.g. which aligned columns from the Pfam alignment were
313
          chosen in a particular bootstrap sample) are saved to a
314
          file. With this file it is possible to bootstrap the new
315
          alignment of N+1 sequences (Pfam alignment plus query
316
          sequence) in precisely the same manner, so the NxN
317
          precalculated distances are valid for the (N+1)x(N+1)
318
          distance matrix. The alignment method must also guarantee
319
          that the original Pfam multiple alignment remains
320
          unchanged when the query sequence is aligned to it. This
321
          requires specially prepared Pfam full alignments and
322
          profile HMMs that are created with the HMMER software as
323
          follows:
324
          
325
          Input: Original Pfam full alignment 
326
          A .
327
          
328
          Output: "aln" file containing
329
          RIO-ready full alignment
330
          "hmm" file containing a RIO-ready profile HMM
331
          "nbd" file containing pairwise distances
332
          "bsp" file bootstrap positions file
333
          "pwd" file containing pairwise distances for bootstrap
334
          resampled alignment
335
          
336
          1. Remove sequences from species not
337
          in RIO's master species tree from alignment 
338
          A . If 
339
          A does not contain enough sequences
340
          (<6), abort.
341
          
342
          2. Run hmmbuild -o A' on 
343
          A , using the same options as were
344
          used to build the original Pfam HMM for 
345
          A , resulting in alignment 
346
          A' . (HMMER's construction
347
          procedure slightly modifies the input alignment in ways
348
          that are usually unimportant, but which matter to
349
          bootstrapping in RIO.) Keep A' as the "aln" file.
350
          
351
          3. Run hmmbuild with "--hand" option
352
          on 
353
          A' , resulting in HMM 
354
          H' (using the same options as were
355
          used to build the original HMM for 
356
          A ). Calibrate 
357
          H' with hmmcalibrate and keep as
358
          "hmm" file.
359
          
360
          4. Remove non-consensus (insert)
361
          columns from 
362
          A' (these are annotated by HMMER),
363
          resulting in alignment 
364
          A" .
365
          
366
          5. Calculate pairwise distances for 
367
          A" , resulting in the "nbd" file
368
          (non-bootstrapped distances).
369
          
370
          6. Bootstrap resample the columns of 
371
          A" , resulting in the "bsp" file
372
          (bootstrap positions file).
373
          
374
          7. Calculate pairwise distances for
375
          bootstrapped 
376
          A" , resulting in the "pwd"
377
          file.
378
        
379
        
380
          Rooting of gene trees
381
          The concept of speciation and duplication is only
382
          meaningful on rooted gene trees, but the neighbor joining
383
          algorithm infers unrooted trees. We use a simple
384
          parsimony criterion for rooting. Gene trees are rooted on
385
          each branch, resulting in 2 
386
          N -3 differently rooted trees for a
387
          gene tree of 
388
          N sequences. For each of these, the
389
          number of inferred duplications is determined. From the
390
          trees with a minimal number of duplications (if there is
391
          more than one) the tree with the shortest total height is
392
          chosen as the rooted tree. Empirical studies on gene
393
          trees based on 1750 Pfam alignments show that about 60%
394
          of trees rooted in such a way have their root in the same
395
          position that direct midpoint rooting [ 27 ] would place
396
          it.
397
          Naively performing a full duplication/speciation
398
          analysis on each of 2 
399
          N -3 differently rooted trees
400
          results in a overall time complexity of O( 
401
          N 2) or worse, but this can be
402
          avoided. For the purpose of the following discussion it
403
          is assumed that our SDI algorithm for
404
          speciation/duplication inference is employed, but the
405
          idea applies to all algorithms based on a mapping
406
          function M defined as follows [ 28 ] :
407
          
408
          Definition 4. Let 
409
          G be the set of nodes in a rooted
410
          binary gene tree and 
411
          S the set of nodes in a rooted
412
          binary species tree. For any node 
413
          g ∈ 
414
          G , let γ ( 
415
          g ) be the set of species in which
416
          occur the extant genes descendant from 
417
          g . For any node 
418
          s ∈ 
419
          S , let σ ( 
420
          s ) be the set of species in the
421
          external nodes descendant from 
422
          s . For any 
423
          g ∈ 
424
          G , let M( 
425
          g ) ∈ 
426
          S be the smallest (lowest) node in 
427
          S satisfying γ ( 
428
          g ) ⊆ σ (M( 
429
          g )).
430
          Duplications are then defined using M( 
431
          g ) as follows:
432
          
433
          Definition 5. Let 
434
          g 
435
          1 and 
436
          g 
437
          2 be the two child nodes of an
438
          internal node 
439
          g of a rooted binary gene tree 
440
          G . Node 
441
          g is a duplication if and only if
442
          M( 
443
          g ) = M( 
444
          g 
445
          1 ) or M( 
446
          g ) = M( 
447
          g 
448
          2 ).
449
          The main task of most algorithms for duplication
450
          inference is the calculation of M. After M has been
451
          calculated for any rooted gene tree 
452
          G it is possible to explore
453
          different root placements without having to recalculate M
454
          for every node of 
455
          G . As long as the root is moved
456
          one node at the time, M has to be recalculated only for
457
          two nodes: the one node which was child 1 (if the new
458
          root is placed on a branch originating from child 1 of
459
          the previous root) or child 2 (otherwise) of the previous
460
          root, as well as for the new root itself. Hence, two
461
          postorder traversal steps (child 1 or 2 of the old root,
462
          then the new root) in the SDI algorithm are all that is
463
          needed. The new sum of duplications is determined by
464
          keeping track of the change in duplication/speciation
465
          status in the two recalculated nodes as well as in the
466
          previous root. Performing this over the whole gene tree
467
          (some nodes will be visited twice) it is possible to
468
          explore all possible root placements and calculate the
469
          resulting duplications in practically linear time. The
470
          pseudocode algorithm is as follows:
471
        
472
        
473
          Algorithm for speciation duplication inference
474
          combined with rooting
475
          
476
          Input : binary gene tree 
477
          G , rooted binary species tree 
478
          S .
479
          
480
          Output: 
481
          G with "duplication" or
482
          "speciation" assigned to each internal node and rooted in
483
          such a way that the sum of duplications is minimized.
484
          
485
            SDIunrooted(G, S) 
486
          
487
          root gene tree 
488
          G at the midpoint of any
489
          branch;
490
          set 
491
          B = getBranchesInOrder( 
492
          G );
493
          SDIse( 
494
          G , 
495
          S ) (see [ 13 ] );
496
          for each branch 
497
          b in 
498
          B :
499
          set 
500
          n 
501
          1 = child 1 of root of 
502
          G ;
503
          set 
504
          n 
505
          2 = child 2 of root of 
506
          G ;
507
          root 
508
          G at the midpoint of branch 
509
          b ;
510
          updateM( 
511
          n 
512
          1 , 
513
          n 
514
          2 , 
515
          G );
516
          if (sum of duplications in 
517
          G < 
518
          d 
519
          min ):
520
          set 
521
          d 
522
          min = 
523
          d ;
524
          set 
525
          G 
526
          dmin = 
527
          G ;
528
          return 
529
          G 
530
          dmin ;
531
          
532
            updateM( 
533
            n 
534
            1 , 
535
            n 
536
            2 , 
537
            G ) 
538
          
539
          set 
540
          r = root of 
541
          G ;
542
          if (child 1 of 
543
          r == 
544
          n 
545
          1 || child 2 of 
546
          r == 
547
          n 
548
          1 ):
549
          calculateMforNode( 
550
          n 
551
          1 );
552
          else:
553
          calculateMforNode( 
554
          n 
555
          2 );
556
          calculateMforNode( 
557
          r );
558
          
559
            calculateMforNode( 
560
            n ) 
561
          
562
          if ( 
563
          n != external):
564
          set 
565
          a = M(child 1 of 
566
          n );
567
          set 
568
          b = M(child 2 of 
569
          n );
570
          while ( 
571
          a != 
572
          b ):
573
          if ( 
574
          a > 
575
          b ):
576
          set 
577
          a = parent of 
578
          a ;
579
          else:
580
          set 
581
          b = parent of 
582
          b ;
583
          set M( 
584
          n ) = 
585
          a ;
586
          if (M( 
587
          n ) == M(child 1 of 
588
          n ) || M( 
589
          n ) == M(child 2 of 
590
          n ):
591
          
592
          n is duplication;
593
          else:
594
          
595
          n is speciation;
596
          
597
            getBranchesInOrder( 
598
            G ) 
599
          
600
          set 
601
          n = root of 
602
          G ;
603
          set 
604
          i = 0;
605
          while !( 
606
          n == root && indicator of 
607
          n == 2):
608
          if ( 
609
          n != external && indicator
610
          of 
611
          n != 2):
612
          if (indicator of 
613
          n == 0):
614
          set indicator of 
615
          n = 1;
616
          set 
617
          n = child 1 of 
618
          n ;
619
          else:
620
          set indicator of 
621
          n = 2;
622
          set 
623
          n = child 2 of 
624
          n ;
625
          if (parent of 
626
          n != root):
627
          set B [ 
628
          i ] = branch connecting 
629
          n and parent of 
630
          n ;
631
          else:
632
          set B [ 
633
          i ] = branch connecting child 1 of
634
          root and child 2 of root;
635
          set 
636
          i = 
637
          i + 1;
638
          else:
639
          if (parent of 
640
          n != root && 
641
          n != external):
642
          set B [ 
643
          i ] = branch connecting 
644
          n and parent of 
645
          n ;
646
          set 
647
          i = 
648
          i + 1;
649
          set 
650
          n = parent of 
651
          n ;
652
          return 
653
          B ;
654
        
655
        
656
          Master species tree
657
          Duplication inference requires a species tree. For
658
          this purpose, a single completely binary master species
659
          tree was compiled manually, containing 249 of the most
660
          commonly encountered species in Pfam (spanning Archaea,
661
          Bacteria, and Eukaryotes). This tree is based mainly on
662
          information from Maddison's "Tree of Life" project
663
          http://tolweb.org/tree/phylogeny.html, NCBI's taxonomy
664
          database
665
          http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html,
666
          the "Deep Green" project
667
          http://ucjeps.berkeley.edu/bryolab/greenplantpage.html,
668
          and [ 29 30 31 32 ] . This master tree groups nematodes
669
          and arthropods into a clade of ecdysozoans (molting
670
          animals) as proposed by Aguinaldo [ 29 ] , a
671
          classification which is still controversial. The tree is
672
          available in NHX format [ 33 ] at
673
          http://www.genetics.wustl.edu/eddy/forester/tree_of_life_bin_1-4.nhx.
674
        
675
      
676
      
677
        Implementation
678
        RIO is implemented in a Perl pipeline of several
679
        software programs as follows. Alignment of the query
680
        sequence is done programs from the HMMER package
681
        http://hmmer.wustl.edu/. Bootstrapping is performed by a
682
        bespoke C program. Maximum likelihood pairwise distances
683
        are calculated using BLOSUM matrices [ 25 ] by a modified
684
        version of TREE-PUZZLE [ 34 ] . Neighbor joining trees are
685
        calculated by a modified version of NEIGHBOR from the
686
        PHYLIP package [ 35 ]
687
        http://evolution.genetics.washington.edu/phylip.html.
688
        Rooting and duplication inference are accomplished by
689
        "SDIunrooted" - a Java implementation of our SDI algorithm
690
        which incorporates various methods for rooting (see above).
691
        The actual counting of orthologs is performed by methods of
692
        the Java class "RIO". These programs, with the exception of
693
        HMMER, are part of the FORESTER package and are available
694
        under the GNU license at
695
        http://www.genetics.wustl.edu/eddy/forester/.
696
        In order to run RIO locally, the following packages and
697
        databases need to be present: HMMER, the Pfam database [ 8
698
        ] , the SWISS-PROT and TrEMBL databases [ 36 ] .
699
        RIO is also available as an analysis webserver at
700
        http://www.rio.wustl.edu/. The pairwise distance and tree
701
        calculations are parallelized in this version (currently,
702
        ten 1.26 GHz Pentium III processors are being used).
703
      
704
      
705
        Results and Discussion
706
        
707
          Precalculation of pairwise distances
708
          Pairwise distances to be used in RIO analyses were
709
          calculated using the "full" alignments (as opposed to the
710
          smaller curated "seed" alignments) from Pfam 6.6 (August
711
          2001, 3071 families, [ 8 ] ). Sequences from species not
712
          present in the master species tree were removed from the
713
          alignments. For computational efficiency reasons,
714
          alignments that still contained more than 600 sequences
715
          were further pruned; sequences not originating from
716
          SWISS-PROT were discarded, and sequences from certain
717
          mammals were excluded (mouse, rabbit, hamsters, goat, all
718
          primates except human), since mammals are likely to be
719
          oversampled in most Pfam families. For some extremely
720
          large families [immunoglobulin domain (PF00047), protein
721
          kinase domain (PF00069), collagen triple helix repeat
722
          (PF01391), and rhodopsin-type 7 transmembrane receptor
723
          (PF00001)], all mammalian sequences except those from
724
          human and rat were excluded.
725
          Alignments of average length <30 amino acids
726
          (<40 for zinc finger domains) or with <6 sequences
727
          were not analyzed, because of lack of phylogenetic
728
          signal. For all other families, pairwise distances for
729
          100 bootstrap samples were prepared. Following the above
730
          rules, pairwise distances were precalculated for 2384
731
          alignments from a total of 3071 in Pfam 6.6 (75
732
          alignments were too short and 612 alignments contained
733
          less than six sequences from species in the master
734
          species tree).
735
        
736
        
737
          Phylogenomic analyses of the A. thalianaand C.
738
          elegansproteomes
739
          In order to get an estimate of the effectiveness of
740
          this implementation of automated phylogenomics, we used
741
          the RIO procedure to analyze the 
742
          A. thaliana [ 16 ] and 
743
          C. elegans [ 17 ] proteomes.
744
          The input for RIO consists of a query protein sequence
745
          together with a Pfam alignment for a protein family that
746
          the query belongs to. Before RIO could be applied we
747
          therefore had to determine the matching domains for each
748
          protein in the 
749
          A. thaliana and 
750
          C. elegans proteomes. For proteins
751
          composed of different domains, a RIO analysis is
752
          performed for each domain individually.
753
          The source for protein sequences were:
754
          ATH1.pep.03202001, a flatfile database of 25,579 
755
          A. thaliana amino acid sequences
756
          (hypothetical, predicted and experimentally verified)
757
          that have been identified as part of the Arabidopsis
758
          Genome Initiative (AGI)
759
          http://www.arabidopsis.org/info/agi.html, and wormpep 43,
760
          a flatfile database of 19,730 
761
          C. elegans amino acid sequences
762
          http://www.sanger.ac.uk/Projects/C_elegans/wormpep/.
763
          The program hmmpfam (version 2.2 g) from the HMMER
764
          package was used to search each protein sequence in
765
          ATH1.pep.03202001 and wormpep 43 against Pfam 6.6. Only
766
          domains with a score above the so-called Pfam gathering
767
          cutoff were reported ("cut_ga" option) in order to
768
          include only confident domain assignments.
769
          The sum of domains assigned to the 25,579 
770
          A. thaliana protein sequences was
771
          17,847 (counting multiple copies of the same domain in
772
          one protein as one). 12,431 sequences matched one domain
773
          (containing possibly multiple copies of this one domain).
774
          1,982 sequences matched two different domains (containing
775
          possibly multiple copies of both). 453 sequences matched
776
          three or more different domains (containing possibly
777
          multiple copies of each). Therefore, a total of 14,866
778
          (58%) sequences from ATH1.pep.03202001 could be assigned
779
          to one or more Pfam families.
780
          Similarly, a sum of 12,314 domains was assigned to the
781
          19,769 
782
          C. elegans protein sequences. 7,698
783
          sequences matched one domain, 1,632 matched two different
784
          domains, and 388 matched three or more different domains.
785
          Thus, 9,718 (49%) sequences from wormpep 43 could be
786
          assigned to one or more Pfam families.
787
          RIO was then used to analyze each protein sequence
788
          matching one or more Pfam families. The results from
789
          these analyses can be found at
790
          http://www.genetics.wustl.edu/eddy/forester/rio_analyses/.
791
          The approximate time requirement was between two and
792
          three weeks, performed on eight Pentium III 800 Mhz
793
          processors.
794
        
795
        
796
          How many sequences can be analyzed with RIO?
797
          The first question we asked was simply how many
798
          sequences can be analyzed with RIO. For an overview, see
799
          Table 1. From the 17,847 
800
          A. thaliana domain sequences
801
          matching a Pfam family, 14,905 (84%) could be analyzed
802
          with RIO using the precalculated distances. 2859 (16%)
803
          domain sequences were not analyzed because the
804
          corresponding Pfam alignments were either too short or
805
          did not contain enough sequences (as described above). 83
806
          (0.5%) domain sequences were not analyzed because the
807
          E-value for the match to their profile HMM was below the
808
          threshold of 0.01. This represents a second filtering
809
          step for preventing analyzing false domain assignments
810
          (besides only analyzing domain sequences which score
811
          above the gathering cutoff in the domain analysis). (RIO
812
          performs a preprocessing step before aligning the query
813
          sequence to a Pfam alignment, in which the program
814
          hmmsearch is used to trim the query sequence by searching
815
          it with the appropriate profile HMM. If the resulting
816
          E-value was below 0.01 no analysis was performed.)
817
          Multiple copies of the same domain in certain sequences
818
          result in a sum of individual analyses larger then the
819
          number of analyzed domain sequences. In case of 
820
          A. thaliana this number was
821
          17,940.
822
          Correspondingly, from the 12,314 
823
          C. elegans domain sequences
824
          matching a Pfam family, 11,287 (92%) could be analyzed
825
          with RIO using the precalculated distances. 901 (7%)
826
          domain sequences were not analyzed because the
827
          corresponding Pfam alignments were either too short or
828
          did not contain enough sequences. 53 (0.4%) domain
829
          sequences were not analyzed because the E-value for the
830
          match to their profile HMM was below the threshold of
831
          0.01. In addition, we did not analyze the 73 
832
          C. elegans sequences matching the
833
          immunoglobulin family (PF00047), because we considered
834
          the phylogenetic signal in this alignment to be
835
          questionable. Furthermore, most of the 73 sequences
836
          contain multiple copies of the immunoglobulin domain (for
837
          example, CE08028 contains 48 immunoglobulin domains) and
838
          we therefore worried that the results from this family
839
          might skew our overall results. The sum of RIO analyses
840
          was 14,740.
841
          Thus, a little less than half of each proteome can be
842
          analyzed by RIO. The most important factor is whether a
843
          protein sequence has a match to a Pfam domain family.
844
        
845
        
846
          RIO analysis of lactate/malate dehydrogenase family
847
          members
848
          In order to test whether RIO performs well on an
849
          "easy" case, RIO was used to analyze lactate/malate
850
          dehydrogenase family members both in 
851
          A. thaliana and 
852
          C. elegans . L-Lactate and malate
853
          dehydrogenases are members of the same protein family
854
          (represented in Pfam as ldh for the NAD-binding domain
855
          and ldh_C for the alpha/beta C-terminal domain), yet they
856
          catalyze different reactions. L-lactate dehydrogenase (EC
857
          1.1.1.27) catalyzes the following reaction: (S)-lactate +
858
          NAD += pyruvate + NADH [ 37 ] . Malate dehydrogenase
859
          (NAD) (EC 1.1.1.37) catalyzes: (S)-malate + NAD +=
860
          oxaloacetate + NADH [ 38 ] . NADP-dependent malate
861
          dehydrogenase (EC 1.1.1.82) utilizes NADP +as cofactor
862
          instead of NAD + [ 39 40 ] . According to the Pfam domain
863
          analysis described above, the 
864
          A. thaliana proteome contains ten
865
          lactate/malate dehydrogenase family members, whereas the 
866
          C. elegans proteome contains three.
867
          (In addition, 
868
          C. elegans also contains two
869
          putative members of a second lactate/malate dehydrogenase
870
          family [ 41 ] , ldh_2, which are not discussed here.) The
871
          RIO output for the 
872
          A. thaliana protein F12M16_14
873
          analyzed against the ldh domain alignment is shown as an
874
          example in Figure 7. The results are summarized in Tables
875
          2and 3. Complete RIO output files (as well as NHX [ 33 ]
876
          tree files) are avaliable, herefor 
877
          A. thaliana and at herefor 
878
          c.elegans . In all cases,
879
          distinction between malate dehydrogenase (NAD) and
880
          lactate dehydrogenase is unquestionable and in accordance
881
          with existing annotations and BLAST results irrespective
882
          which domain (ldh or ldh_C) was used for the RIO analysis
883
          (which implies that no domain swapping occurred over long
884
          evolutionary times). Furthermore, the same results are
885
          achieved whether only the top 1 sequence (the one with
886
          the highest orthology value, shown in Tables 2and 3) or
887
          the top 10 sequences are used to transfer annotation
888
          from. The only likely NADP-dependent malate dehydrogenase
889
          is the 
890
          A. thaliana sequence MCK7_20. For
891
          some query sequences, the top orthology values are low.
892
          Yet, all subtree-neighborings above 50% exhibit consensus
893
          at distinguishing between malate and lactate
894
          dehydrogenase. In contrast, a finer distinction (e.g.
895
          between mitochondrial and cytoplasmic malate
896
          dehydrogenase) proves more problematic. While there is no
897
          case of actual conflict between the existing annotation
898
          and the RIO results, in many cases there is no compelling
899
          evidence in the RIO results to confirm the finer
900
          distinctions in the existing annotations. Obviously, the
901
          resolution power of RIO is limited by the given
902
          annotations and by the number (or even presence) of
903
          sequences for each sub(sub)family.
904
        
905
        
906
          Sequences with no orthologs in the current
907
          databases
908
          Next, we determined the distribution of the top
909
          orthology bootstrap values. The sequence with the top
910
          orthology bootstrap value is the one that is most likely
911
          to be the true ortholog of the query. If the top
912
          orthology bootstrap value is low, then the query sequence
913
          is likely to have no ortholog in the Pfam alignment.
914
          These results are summarized in Table 4. For example, for
915
          2252 
916
          A. thaliana query sequences, at
917
          least one sequence was orthologous in at least 95 out of
918
          100 resampled trees. In contrast, for 930 
919
          A. thaliana query sequences, no
920
          sequence was orthologous in more than five out of 100
921
          bootstrapped trees. For query sequences with more than
922
          one copy of the same domain, each copy had to meet the
923
          conditions individually in order for the whole query
924
          sequence being counted to be below or above the
925
          threshold.
926
          We do not think it is possible at this stage to
927
          determine reliable threshold values for "true orthologs"
928
          or "absence of orthologs". Such thresholds are very
929
          likely to be different for different Pfam families since
930
          families vary in the phylogenetic signal their alignment
931
          contains. Some sequences that are very likely to be true
932

933
          orthologs nonetheless exhibit marginal orthology
934
          bootstrap values (in the range of 70% or even lower).
935
          We focused on sequences that appeared to have no
936
          orthologs (<5% bootstrap), since these would be cases
937
          where a RIO analysis might be most able to correct overly
938
          specific annotations that might be transferred based
939
          solely on sequence similarity (as illustrated in Figure
940
          1). An example for this is the 
941
          A. thaliana sequence F28P22_13.
942
          (Files related to this analysis are avaliable, here.)
943
          This sequence is a zinc-binding dehydrogenase (Pfam:
944
          adh_zinc, PF00107). F28P22_13 has been annotated in
945
          ATH1.pep.03202001 "as putative cinnamyl-alcohol
946
          dehydrogenase", based on sequence similarity (its top 10
947
          BLAST matches are all cinnamyl-alcohol dehydrogenases
948
          with E-values in the range of 10 -94if analyzed against
949
          all non-redundant GenBank CDS
950
          translations+PDB+SwissProt+PIR+PRF on Jan 2, 2002).
951
          Cinnamyl-alcohol dehydrogenase (EC 1.1.1.195) catalyzes
952
          the following reaction: cinnamyl alcohol + NADP +=
953
          cinnamaldehyde + NADPH (but it can also act on coniferyl
954
          alcohol, sinapyl alcohol and 4-coumaryl alcohol) in the
955
          flavonoid, stilbene and lignin biosynthesis pathways [ 40
956
          42 ] . According to the RIO analysis, F28P22_13 has no
957
          orthologs (see Figure 8for the corresponding tree and
958
          Figure 9for the RIO output). Furthermore its
959
          subtree-neighbors above 90%, cinnamyl-alcohol
960
          dehydrogenases and NADP-dependent alcohol dehydrogenases
961
          (EC 1.1.1.2), exhibit only partial annotation agreement
962
          (namely that of some type of NADP-dependent alcohol
963
          dehydrogenase, but not EC 1.1.1.2 or EC 1.1.1.195).
964
          Hence, F28P22_13 is likely to be a (possibly novel) type
965
          of NADP-dependent alcohol dehydrogenase (other than EC
966
          1.1.1.2), possibly a novel type of cinnamyl-alcohol
967
          dehydrogenase.
968
          One might expect that each query sequence that appears
969
          to have no orthologs is connected with scenario similar
970
          to the one described above for F28P22_13. Yet, this is
971
          clearly not the case, for the following reasons: (i) Gene
972
          duplications might not be followed by functional
973
          modification (many Pfam families are composed of
974
          sequences which have all the same function, at least at
975
          the resolution of the current annotation). (ii) Some Pfam
976
          families are composed solely of sequences originating
977
          from closely related (or the same) species (such as
978
          PF02362, the B3 DNA binding domain of higher plants). For
979
          such families, query sequences from the same species
980
          group are expected to have low orthology values. In such
981
          cases the concept of subtree-neighbors and ultra-paralogs
982
          is more useful than orthologs. (iii) Erroneous RIO
983
          results caused by an insufficient phylogenetic signal
984
          (due to short sequences, for example) can lead to low
985
          orthology values. For this reason, RIO also outputs the
986
          average bootstrap value for the consensus tree to give
987
          the user a hint about the amount of phylogenetic signal
988
          in the alignment used.
989
        
990
        
991
          Inconsistency between orthology bootstrap values
992
          and sequence similarity
993
          We were next interested in the number of sequences in
994
          the two proteomes for which the orthology bootstrap
995
          values do not correspond to sequence similarity (Table
996
          5). Such disagreements could be caused by the situation
997
          illustrated in Figure 2. To determine these numbers, we
998
          used the following rules. Two thresholds for orthology
999
          bootstrap values were chosen: 
1000
          O , the minimum for being an
1001
          ortholog (e.g. 90%) and 
1002
          N , the maximum for not being an
1003
          ortholog (e.g. 10%). Furthermore, a maximal ratio 
1004
          R for the distance of the query to
1005
          non-orthologs to the distance of the query to orthologs
1006
          was chosen (e.g. 0.5). In order for being counted as
1007
          exhibiting disagreement between the orthology bootstrap
1008
          values and sequence similarity a query sequence had to
1009
          fulfill the following two conditions: (i) it must have a
1010
          least one ortholog with bootstrap orthology value above
1011
          or equal to 
1012
          O , and (ii) 
1013
          all sequences in the alignment with
1014
          bootstrap orthology values above 
1015
          N , must have distance ratios
1016
          smaller or equal to 
1017
          R for at least one sequence with
1018
          bootstrap orthology lower or equal to 
1019
          N . Sequences from the following
1020
          species were ignored in this analysis (since they were
1021
          the species of the query sequence or related to it): 
1022
          A. thaliana proteome: 
1023
          Rosidae ( 
1024
          A. thaliana, Pisum sativum , 
1025
          Glycine max , 
1026
          Cucurbita maxima , 
1027
          Cucumis sativus , 
1028
          Brassica campestris , 
1029
          Brassica napus , 
1030
          Citrus unshiu , 
1031
          Citrus sinensis , 
1032
          Theobroma cacao , 
1033
          Gossypium hirsutum ); 
1034
          C. elegans proteome: nematodes ( 
1035
          C. elegans , 
1036
          Caenorhabditis briggsae , 
1037
          Haemonchus contortus , 
1038
          Ascaris suum ).
1039
          Manual inspection of the RIO output leads to the
1040
          following, somewhat unexpected, conclusion. In many cases
1041
          a discrepancy between orthology bootstrap values and
1042
          sequence similarity is caused by orthologs in only
1043
          phylogenetically distant (relatively to the query
1044
          sequence) species. This can lead to errors if functional
1045
          annotation is blindly transferred from these orthologs to
1046
          the query. As an example of this, the results of
1047
          analyzing the 
1048
          A. thaliana O-methyltransferase
1049
          F16P17_38 are shown in Figures 10and 11. (Complete files
1050
          are at here.) Even though the F16P17_38 sequence is
1051
          orthologous to the bacterial hydroxyneurosporene
1052
          methyltransferases (EC 2.1.1.-) [ 43 ] it would be
1053
          dangerous to annotate it as such. A more reasonable
1054
          annotation for this query would be to annotate it based
1055
          on subtree-neighbors and hence call it a plant
1056
          O-methyltransferase. An indication of this problem
1057
          (besides a discrepancy between orthology bootstrap values
1058
          and sequence similarity) is the meeting of the following
1059
          three conditions: A query sequence has (i) likely
1060
          orthologs and (ii) likely subtree-neighbors in other
1061
          species than the query itself, yet (iii) there is no
1062
          significant overlap between the orthologs and the
1063
          subtree-neighbors.
1064
          We were unable to find convincing examples in the 
1065
          C. elegans and 
1066
          A. thaliana proteomes where wrong
1067
          sequence similarity based annotations might be caused by
1068
          unequal rates of evolution (as illustrated in Figure 2).
1069
          This is not to say that such cases do not exist in those
1070
          two proteomes, but they are likely to be quite rare.
1071
          Similarly to the issues described in the previous
1072
          section, the detection of such examples is complicated by
1073
          the fact that for many cases in which a discrepancy
1074
          between orthology bootstrap values and sequence
1075
          similarity exists, all sequences in the Pfam alignment
1076
          appear to have to same function, the Pfam family is
1077
          lineage specific, or the annotations are too
1078
          poor/confusing to make any kind of inference.
1079
        
1080
      
1081
      
1082
        Conclusions
1083
        RIO is a procedure for automated phylogenomics. The RIO
1084
        procedure appears to be particularly useful for the
1085
        detection of first representatives of novel protein
1086
        subfamilies. Sequence similarity based methods can be
1087
        misleading in these cases since every query is always "most
1088
        similar to something", whereas RIO can detect the absence
1089
        of orthologs.
1090
        Storm, Sonnhammer, and colleagues have recently
1091
        developed similar ideas and procedures in a program called
1092
        ORTHOSTRAPPER [ 44 45 ] . One distinction between the two
1093
        approaches is that ORTHOSTRAPPER's orthology determination
1094
        procedure does not employ a species tree for duplication
1095
        inference; it uses a heuristic based on sequence similarity
1096
        rather than a formally correct phylogenetic means of
1097
        inferring orthology. Another distinction is that
1098
        ORTHOSTRAPPER uses uncorrected observed mismatches as a
1099
        sequence distance measure, rather than estimating
1100
        evolutionary distances. In general, RIO brings more of the
1101
        power of known phylogenetic inference algorithms to bear on
1102
        the problem of proteomic annotation.
1103
        Super-orthology is a very stringent criterion. If a
1104
        query sequence is likely to have super-orthologs, they
1105
        represent an excellent source to transfer functional
1106
        annotation from. In contrast, the absence of
1107
        super-orthologs does not imply that a function for a query
1108
        sequence cannot be inferred (in the two proteomes analyzed
1109
        in this work, most sequences appear to have no
1110
        super-orthologs in Pfam 6.6).
1111
        Ultra-paralogs are sequences in the same species as the
1112
        query and are likely to be the result of recent
1113
        duplications and therefore might not have yet undergone
1114
        much functional divergence. Operationally, splice variants
1115
        can also be thought of as ultra-paralogs (at least as long
1116
        as protein sequences are considered).
1117
        Subtree-neighbors have two uses: (i) If the
1118
        subtree-neighbors of the query sequence exhibit (partial)
1119
        agreement in their functional annotations, the elements in
1120
        which they agree might be used to infer a (partial)
1121
        function for the query. This is useful for query sequences
1122
        that are appear to have no orthologs in the current
1123
        databases. (ii) For query sequences that do have orthologs,
1124
        absence of overlap between the sequences considered
1125
        orthologous and those which appear to be subtree-neighbors
1126
        raises a red flag, indicating that the orthologs are in
1127
        phylogenetically distant species relative to the query.
1128
        Transferring annotation from such orthologs is risky. In
1129
        this case, subtree-neighbors are a more reliable source to
1130
        transfer annotation from.
1131
        RIO outputs warnings if the distance of the query
1132
        sequence to other sequences is unusually short or long,
1133
        relative to other branch lengths on the tree. The
1134
        usefulness of this was not investigated in this work.
1135
        A RIO procedure based on Pfam alignments analyzes each
1136
        protein domain individually since Pfam is protein family
1137
        database based on individual domains [ 8 ] . In some
1138
        respects, it would be preferable to analyze whole protein
1139
        sequences, but well curated databases of complete protein
1140
        alignments are not available (to our knowledge). However,
1141
        domain-by-domain analysis is not necessarily
1142
        disadvantageous. Due to domain shuffling many proteins are
1143
        mosaic proteins, composed of domains with different
1144
        evolutionary histories [ 46 47 ] . For such proteins it
1145
        makes much sense to analyze each domain individually.
1146
        Furthermore, mosaic proteins from sufficiently distant
1147
        species might be impossible to be aligned over more than
1148
        one domain at the time, since they are unlikely to exhibit
1149
        the same domain organization. The same is true for multiple
1150
        copies of the same domain in protein: Each of them is
1151
        analyzed individually (such proteins often differ in their
1152
        number of domain copies and could therefore not be aligned
1153
        from end to end for the whole family).
1154
        In general, the concept of "annotation consensus" is
1155
        very important in this work (for example consensus between
1156
        subtree-neighbors, or between subtree-neighbors and
1157
        orthologs). We have employed this notion loosely. A useful
1158
        future extension would be to incorporate automated
1159
        annotation consensus detection into RIO. This would include
1160
        annotation of internal nodes of a gene tree with a
1161
        "biological function". Automated consensus detection is
1162
        trivial for a highly formalized notation system, such as EC
1163
        numbers (the consensus of EC 1.1.1.3 and EC 1.1.1.23 is EC
1164
        1.1.1, a oxidoreductase acting on the CH-OH group of donors
1165
        with NAD +or NADP +as acceptor [ 40 ] ). Obviously, it is
1166
        much more difficult to analyze natural language annotations
1167
        in the same manner. Perhaps this could be accomplished by
1168
        utilizing the set of structured vocabularies of the Gene
1169
        Ontology (GO) project [ 48 ]
1170
        http://www.geneontology.org/.
1171
      
1172
    
1173
  
1174

1175
Product

Resources

Company