During the last few years, we have seen enormous strides in our abilities to sequence
genomes, and the information that has poured out of these sequences is quite astonishing.
With more than 150 complete genome sequences now available and many laboratories rushing
into microarray analysis, proteomic initiatives, and even systems biology, it seems an
appropriate time to consider not just the opportunities those sequences present, but also
their shortcomings. By far the most serious problem is the quality and degree of
completeness of the annotation of those genomes. Most troublesome are the large numbers of
open reading frames that have been identified by computer programs, but remain labeled as a
“conserved hypothetical protein” when they occur in more than one genome or simply a
“hypothetical protein” when they appear unique to the genome in question. Between them,
these two categories of annotated open reading frames often represent more than half of the
potential protein-coding regions of a genome.
These annotations highlight just one portion of our ignorance about the information
content of genomes and our lack of fundamental knowledge about the function of so many of
the building blocks of cells. Unless we rectify this situation, it is likely to undermine
many of the other “-omic” efforts currently underway. Here I advocate a rather
straightforward approach to address this problem, focused initially on the bacterial
genomes. In contrast to the numerous proposals for big science initiatives to understand
the fundamental workings of biological organisms, I propose a small science, relatively
low-tech approach that could have a dramatic payoff. A relatively small investment could
yield a massive amount of information that would greatly enhance our current efforts to use
genomic approaches to study life.


Initial Proposal
The initial proposal is directed at deciphering the role of the “hypothetical proteins”
encoded in the microbial genomes and would involve a community-wide approach to determine
the function of these hypotheticals based on solid, old-fashioned biochemistry. The essence
of the idea is to undertake an interdisciplinary effort that couples our current
bioinformatics capabilities to predict protein function with a directed exploration by
experimental laboratories to test those predictions. I would encourage a consortium of
bioinformaticians to produce a list of all of the conserved hypothetical proteins that are
found in multiple genomes, to carry out the best possible bioinformatics analysis, and then
to offer those proteins to the biochemical community as potential targets for research into
their function. To energize laboratories with appropriate expertise to participate in this
community-wide effort, I suggest that a special program be set up by one or more of the
funding agencies so that laboratories undertaking the investigation of any particular
protein receive a small grant upfront as a supplement to an existing grant. Upon completion
of the project and the identification of the function, they would receive a further
supplement to that grant as a reward. In this way, one might hope to rally some of the best
biochemical talent and apply it to this problem of determining function for a wide range of
new proteins. The cost of such an operation could be quite minimal, and the bureaucracy and
review process could be equally simple. Here is a case where a modest infusion of funds
could greatly enhance our ability to annotate both existing and new genome sequences and
ensure that our current investments in genomic sequences yield the richest biological
harvest possible. There are two key steps in the proposed plan.


Key Steps
The first step is to encourage some bioinformaticians with appropriate expertise in the
functional annotation of genomes to form a consortium and undertake the assembly of a list
of prime targets for which an experimental demonstration of function would be most
valuable. Three general classes of such genes come to mind: (1) The conserved hypothetical
genes. These belong to the set of genes that have orthologs in many other genomes, but for
which no function has been experimentally determined in any case. A recent success among
such genes is illustrated in Box 1. (2) The hypothetical genes. These form the set of genes
that are predicted to be protein coding, but that lack similar genes in any other organism
in GenBank. They, too, have no assigned function. (3) The misannotated genes. These genes
are ones for which a function has been assigned, but for which there is a good reason to
believe the annotation is incorrect.
These sets of targets would be combined and arranged into a prioritized list in which
each was accompanied by the best assessment of potential function. The priorities would be
based on which genes were most likely to prove broadly informative. For instance, a
conserved hypothetical gene that occurred in most genomes would be of higher priority than
one that had only two orthologs. The list would be on a public Web site where these targets
and the predicted functions could be examined and modified by alternative or additional
predictions from other groups to guide future experimentation. As function was derived,
that information could be presented and the target removed from the main list.
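
The prioritized list described above could be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields, the category labels, the gene identifiers, and the ortholog counts are all hypothetical, and ranking purely by the number of genomes carrying an ortholog is just one possible reading of "most likely to prove broadly informative".

```python
from dataclasses import dataclass


@dataclass
class Target:
    """One entry on the public target list (all field names are hypothetical)."""
    gene_id: str                 # made-up identifier, for illustration only
    category: str                # "conserved_hypothetical", "hypothetical", or "misannotated"
    genomes_with_ortholog: int   # number of other genomes carrying an ortholog
    predicted_function: str = "unknown"
    resolved: bool = False       # True once a function has been demonstrated


def prioritize(targets):
    """Return unresolved targets, most widely conserved first.

    Resolved targets are dropped, mirroring the idea that a target leaves
    the main list once its function has been derived.
    """
    open_targets = [t for t in targets if not t.resolved]
    # Sort by descending ortholog count; break ties by gene_id for stable output.
    return sorted(open_targets, key=lambda t: (-t.genomes_with_ortholog, t.gene_id))


# Invented example data.
targets = [
    Target("orfB", "conserved_hypothetical", 2),
    Target("orfC", "hypothetical", 0),
    Target("orfA", "conserved_hypothetical", 120),
    Target("orfD", "misannotated", 45, resolved=True),
]

ranked = [t.gene_id for t in prioritize(targets)]
print(ranked)
```

A real list would presumably weigh other evidence as well, such as the confidence of the predicted function, so the sort key here should be read as a placeholder.
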
The second step would be to invite experimentalists to peruse the list and find those
potential genes whose protein products might lie within their realm of expertise so that
they could use their experimental knowledge and reagents to quickly test for function.
Initially, I would advocate allowing laboratory teams to pick and choose among the list and
sign up to study just one of these open reading frames. I would recommend allowing one
laboratory per open reading frame in the initial stages. A laboratory wishing to sign up
would generate a short document highlighting why its expertise might be suitable for a
particular protein. A one-page proposal should suffice, with no experimental plan demanded.
At this point, a small panel could choose among competing efforts and the laboratory chosen
would be given a small grant and up to six months to carry out its analysis. If it was
successful in delineating the function of its target protein, a paper would be written
and submitted for peer review. If the paper was accepted for publication, then an
additional sum would be allocated as a supplement to the laboratory's existing grant. If,
after six months, a laboratory had not managed to delineate the function, it would submit a
short report describing the approaches that had been tried, with the results of its
analyses. This would be posted on the public Web site and that target would then become
open for analysis by other laboratories, under the same conditions as before.
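
The sign-up cycle described above (one laboratory per open reading frame, up to six months, then either a publication or a posted report and release of the target) can be read as a small state machine. The sketch below is one hypothetical way to model it. The class, its method names, the status strings, and the 182-day figure standing in for "six months" are assumptions made for illustration, not part of the proposal.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

# "Up to six months" is taken here as 182 days; the exact count is an assumption.
ANALYSIS_WINDOW = timedelta(days=182)


@dataclass
class TargetRecord:
    """Tracks one open reading frame through the claim cycle (hypothetical model)."""
    gene_id: str
    status: str = "open"              # "open" -> "claimed" -> "solved", or back to "open"
    lab: Optional[str] = None
    claimed_on: Optional[date] = None
    reports: List[str] = field(default_factory=list)

    def claim(self, lab: str, today: date) -> None:
        # Only one laboratory per open reading frame in the initial stages.
        if self.status != "open":
            raise ValueError(f"{self.gene_id} is not open for sign-up")
        self.status, self.lab, self.claimed_on = "claimed", lab, today

    def is_overdue(self, today: date) -> bool:
        return self.status == "claimed" and today - self.claimed_on > ANALYSIS_WINDOW

    def publish(self) -> None:
        # Function delineated and paper accepted: the target is finished.
        if self.status != "claimed":
            raise ValueError(f"{self.gene_id} has no active claim")
        self.status = "solved"

    def release(self, report: str) -> None:
        # No function found: post the short report and reopen the target.
        if self.status != "claimed":
            raise ValueError(f"{self.gene_id} has no active claim")
        self.reports.append(f"{self.lab}: {report}")
        self.status, self.lab, self.claimed_on = "open", None, None


# Invented walk-through: the first lab runs out of time, the second succeeds.
rec = TargetRecord("orfA")
rec.claim("Lab 1", date(2004, 1, 1))
overdue = rec.is_overdue(date(2004, 8, 1))
rec.release("approaches tried and their results")
rec.claim("Lab 2", date(2004, 8, 2))
rec.publish()
print(rec.status, rec.lab, len(rec.reports))
```

Keeping the posted reports on the record, rather than discarding them when the target reopens, reflects the idea that the next laboratory should be able to see what has already been tried.
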
While the initial list of target genes should probably be based on a well-studied and
experimentally tractable organism such as Escherichia coli, I would not demand that the
biochemical experiments be done on the E. coli gene. Any of the orthologs would do, so long
as the similarity was sufficiently strong to give high expectations that function would be
conserved. In fact, for a laboratory that happened to be already working on one of the
homologs, this program might provide an added bonus and greatly speed its work. I would
also encourage both biochemical and genetic approaches, since one can never be certain when
one method might be better than another. The list would, of course, also include conserved
genes not found in E. coli, but commonly distributed in other genomes. In particular, I
would make a pitch for including all genes in Mycoplasma genitalium, which, as the
free-living organism with the fewest genes, might be the most suitable as a model system
for in-depth understanding of its biology.


The Importance of Community
This proposal for experimental attack on hypothetical genes is really a very traditional
approach that becomes large-scale simply because of the parallel nature of the
implementation. It resembles the successful approach used by the Europeans to achieve the
complete sequence of the Saccharomyces cerevisiae genome (Goffeau et al. 1996). The
results would significantly increase our functional knowledge of the genes within the
microbial genomes thus far sequenced. Such annotation would be immediately applicable
across orthologs and could dramatically improve the value of the sequenced genomes. This,
in turn, would facilitate our ability to annotate new genomes as they appear. The proposal
also reinforces the notion that the overwhelming value of bioinformatics is to generate
hypotheses that can be tested experimentally. By enabling the community to join in this
effort, we would also demonstrate that science really is the collaborative enterprise that
requires all of our contributions, not just a select few. Finally, if this initiative
succeeds, it would serve as a suitable model from which to begin the more daunting task of
trying to annotate the functions of the complex eukaryotic genomes, such as the human
genome.