        During the last few years, we have seen enormous strides in our ability to sequence
        genomes, and the information that has poured out of these sequences is quite astonishing.
        With more than 150 complete genome sequences now available and many laboratories rushing
        into microarray analysis, proteomic initiatives, and even systems biology, it seems an
        appropriate time to consider not just the opportunities those sequences present, but also
        their shortcomings. By far the most serious problem is the quality and degree of
        completeness of the annotation of those genomes. Most troublesome are the large numbers of
        open reading frames that have been identified by computer programs, but remain labeled as a
        “conserved hypothetical protein” when they occur in more than one genome or simply a
        “hypothetical protein” when they appear unique to the genome in question. Between them,
        these two categories of annotated open reading frames often represent more than half of the
        potential protein-coding regions of a genome.

        These annotations highlight just one portion of our ignorance about the information
        content of genomes and our lack of fundamental knowledge about the function of so many of
        the building blocks of cells. Unless we rectify this situation, it is likely to undermine
        many of the other “-omic” efforts currently underway. Here I advocate a rather
        straightforward approach to address this problem, focused initially on the bacterial
        genomes. In contrast to the numerous proposals for big science initiatives to understand
        the fundamental workings of biological organisms, I propose a small science, relatively
        low-tech approach that could have a dramatic payoff. A relatively small investment could
        yield a massive amount of information that would greatly enhance our current efforts to use
        genomic approaches to study life.

        Initial Proposal
        The initial proposal is directed at deciphering the role of the “hypothetical proteins”
        encoded in the microbial genomes and would involve a community-wide approach to determine
        the function of these hypotheticals based on solid, old-fashioned biochemistry. The essence
        of the idea is to undertake an interdisciplinary effort that couples our current
        bioinformatics capabilities to predict protein function with a directed exploration by
        experimental laboratories to test those predictions. I would encourage a consortium of
        bioinformaticians to produce a list of all of the conserved hypothetical proteins that are
        found in multiple genomes, to carry out the best possible bioinformatics analysis, and then
        to offer those proteins to the biochemical community as potential targets for research into
        their function. To energize laboratories with appropriate expertise to participate in this
        community-wide effort, I suggest that a special program be set up by one or more of the
        funding agencies so that laboratories undertaking the investigation of any particular
        protein receive a small grant upfront as a supplement to an existing grant. Upon completion
        of the project and the identification of the function, they would receive a further
        supplement to that grant as a reward. In this way, one might hope to rally some of the best
        biochemical talent and apply it to this problem of determining function for a wide range of
        new proteins. The cost of such an operation could be quite minimal, and the bureaucracy and
        review process could be equally simple. Here is a case where a modest infusion of funds
        could greatly enhance our ability to annotate both existing and new genome sequences and
        ensure that our current investments in genomic sequences yield the richest biological
        harvest possible. There are two key steps in the proposed plan.

        Key Steps
        The first step is to encourage some bioinformaticians with appropriate expertise in the
        functional annotation of genomes to form a consortium and undertake the assembly of a list
        of prime targets for which an experimental demonstration of function would be most
        valuable. Three general classes of such genes come to mind: (1) The conserved hypothetical
        genes. These belong to the set of genes that have orthologs in many other genomes, but for
        which no function has been experimentally determined in any case. A recent success among
        such genes is illustrated in Box 1. (2) The hypothetical genes. These form the set of genes
        that are predicted to be protein coding, but that lack similar genes in any other organism
        in GenBank. They, too, have no assigned function. (3) The misannotated genes. These genes
        are ones for which a function has been assigned, but for which there is a good reason to
        believe the annotation is incorrect.

        These sets of targets would be combined and arranged into a prioritized list in which
        each was accompanied by the best assessment of potential function. The priorities would be
        based on which genes were most likely to prove broadly informative. For instance, a
        conserved hypothetical gene that occurred in most genomes would be of higher priority than
        one that had only two orthologs. The list would be on a public Web site where these targets
        and the predicted functions could be examined and modified by alternative or additional
        predictions from other groups to guide future experimentation. As function was derived,
        that information could be presented and the target removed from the main list.
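
        The prioritized list described above can be sketched in a few lines of Python. This is
        only an illustration under stated assumptions: the Target record, its field names, the
        gene identifiers, and the use of a raw ortholog count as the sole ranking criterion are
        all hypothetical and are not part of the proposal itself, which leaves the exact
        criteria to the consortium.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Target:
    gene_id: str                # hypothetical identifier, for illustration only
    category: str               # "conserved_hypothetical", "hypothetical", or "misannotated"
    genomes_with_ortholog: int  # number of other genomes containing an ortholog
    predictions: List[str] = field(default_factory=list)


def prioritize(targets: List[Target]) -> List[Target]:
    """Order targets so the most widely conserved genes come first."""
    return sorted(targets, key=lambda t: t.genomes_with_ortholog, reverse=True)


def add_prediction(target: Target, prediction: str) -> None:
    """Record an alternative or additional prediction from another group."""
    if prediction not in target.predictions:
        target.predictions.append(prediction)


def resolve(targets: List[Target], gene_id: str) -> List[Target]:
    """Drop a target from the main list once its function has been determined."""
    return [t for t in targets if t.gene_id != gene_id]


# Made-up entries; the counts and predictions are placeholders, not real data.
targets = [
    Target("orfA", "conserved_hypothetical", 2, ["putative kinase"]),
    Target("orfB", "conserved_hypothetical", 120, ["putative methyltransferase"]),
    Target("orfC", "hypothetical", 0),
]

ranked = prioritize(targets)
add_prediction(ranked[0], "putative RNA-modifying enzyme")
remaining = resolve(ranked, "orfA")

for rank, t in enumerate(ranked, start=1):
    print(rank, t.gene_id, t.genomes_with_ortholog, t.predictions)
```

        A single sort key keeps the sketch short. A real list would presumably weigh other
        factors as well, such as the strength of the functional prediction.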

        The second step would be to invite experimentalists to peruse the list and find those
        potential genes whose protein products might lie within their realm of expertise so that
        they could use their experimental knowledge and reagents to quickly test for function.
        Initially, I would advocate allowing laboratory teams to pick and choose among the list and
        sign up to study just one of these open reading frames. I would recommend allowing one
        laboratory per open reading frame in the initial stages. A laboratory wishing to sign up
        would generate a short document highlighting why its expertise might be suitable for a
        particular protein. A one-page proposal should suffice, with no experimental plan demanded.
        At this point, a small panel could choose among competing efforts and the laboratory chosen
        would be given a small grant and up to six months to carry out its analysis. If it was
        successful in delineating the function of its target protein, a paper would be written
        and submitted for peer review. If the paper was accepted for publication, then an
        additional sum would be allocated as a supplement to the laboratory's existing grant. If,
        after six months, a laboratory had not managed to delineate the function, it would submit a
        short report describing the approaches it had tried, with the results of its
        analyses. This would be posted on the public Web site and that target would then become
        open for analysis by other laboratories, under the same conditions as before.
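
        The life cycle of a single target under this scheme can be modeled as a small state
        machine. The sketch below is an assumption-laden illustration, not a specification: the
        TargetRecord class, the status names, the laboratory names, the dates, and the choice of
        182 days to stand in for "up to six months" are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

# "Up to six months", approximated here as 182 days (an assumption).
ANALYSIS_WINDOW = timedelta(days=182)


@dataclass
class TargetRecord:
    gene_id: str
    status: str = "open"            # open -> claimed -> solved, or back to open
    lab: Optional[str] = None       # one laboratory per open reading frame
    deadline: Optional[date] = None
    reports: List[str] = field(default_factory=list)

    def claim(self, lab: str, start: date) -> None:
        """The panel assigns the target to one laboratory and starts the clock."""
        if self.status != "open":
            raise ValueError(f"{self.gene_id} is not open for sign-up")
        self.status = "claimed"
        self.lab = lab
        self.deadline = start + ANALYSIS_WINDOW

    def paper_accepted(self) -> None:
        """Function delineated and published; the target leaves the main list."""
        if self.status != "claimed":
            raise ValueError(f"{self.gene_id} has no laboratory working on it")
        self.status = "solved"

    def window_expired(self, report: str) -> None:
        """No function found in time: post the report and reopen the target."""
        if self.status != "claimed":
            raise ValueError(f"{self.gene_id} has no laboratory working on it")
        self.reports.append(f"{self.lab}: {report}")
        self.status, self.lab, self.deadline = "open", None, None


# Hypothetical walk-through: the first laboratory runs out of time, the second succeeds.
record = TargetRecord("orfB")
record.claim("Lab 1", date(2004, 1, 1))
record.window_expired("approaches tried and their results")
record.claim("Lab 2", date(2004, 7, 1))
record.paper_accepted()
print(record.status, record.lab, record.reports)
```

        Keeping the posted reports on the record means a later laboratory can see what has
        already been tried before it signs up.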

        While the initial list of target genes should probably be based on a well-studied and
        experimentally tractable organism such as Escherichia coli, I would not demand that the
        biochemical experiments be done on the E. coli gene. Any of the orthologs would do, so long
        as the similarity was sufficiently strong to give high expectations that function would be
        conserved. In fact, for a laboratory that happened to be already working on one of the
        homologs, this program might provide an added bonus and greatly speed its work. I would
        also encourage both biochemical and genetic approaches, since one can never be certain when
        one method might be better than another. The list would, of course, also include conserved
        genes not found in E. coli, but commonly distributed in other genomes. In particular, I
        would make a pitch for including all genes in Mycoplasma genitalium, which, as the
        free-living organism with the fewest genes, might be the most suitable as a model system
        for in-depth understanding of its biology.

        The Importance of Community
        This proposal for experimental attack on hypothetical genes is really a very traditional
        approach that becomes large-scale simply because of the parallel nature of the
        implementation. It resembles the successful approach used by the Europeans to achieve the
        complete sequence of the Saccharomyces cerevisiae genome (Goffeau et al. 1996). The
        results would significantly increase our functional knowledge of the genes within the
        microbial genomes thus far sequenced. Such annotation would be immediately applicable
        across orthologs and could dramatically improve the value of the sequenced genomes. This,
        in turn, would facilitate our ability to annotate new genomes as they appear. The proposal
        also reinforces the notion that the overwhelming value of bioinformatics is to generate
        hypotheses that can be tested experimentally. By enabling the community to join in this
        effort, we would also demonstrate that science really is the collaborative enterprise that
        requires all of our contributions, not just a select few. Finally, if this initiative
        succeeds, it would serve as a suitable model from which to begin the more daunting task of
        trying to annotate the functions of the complex eukaryotic genomes, such as the human
        genome.