Background

Computational methods for whole genome studies

Comparative genomics and more specialized fields, such as comparative
virology, involve the comparison of DNA sequences, genes and genomes
[ 12 13 14 ]. Recent rapid data acquisition is allowing the analysis of
whole genome sequences, especially the smaller genomes such as those of
mitochondria and chloroplasts [ 15 16 17 ], as well as the larger
bacterial genomes [ 18 19 ] and large tracts of eukaryotic chromosomes,
especially from related organisms [ 12 13 14 20 21 22 23 ]. These
studies include the determination of the order of genes, i.e.,
co-linearity [ 24 25 ], the location of synteny [ 26 27 28 ] and the
identification of clusters of orthologous genes (COGs) between two
genomes [ 21 22 23 ]. Along similar lines, it should be extremely useful
to locate, identify and catalog the sets of "core" genes common to these
genomes, which may otherwise be related, semi-related or unrelated in
other respects. These global views allow for a deeper understanding of
one organism in the context of another, especially with regard to their
genomic contents. In addition, the comparison of multiple genomes and
the identification of related genes and "core" genes can lead to insight
into the structure and function of genes and genomes [ 4 ]. This is very
useful in genome annotation and also in the identification and
characterization of functions for "newly found" putative genes.

Identification of "core" genes from small whole genomes is useful and
complements other data derived from these genomes. Small genomes include
those from viruses [ 3 ], mitochondria [ 14 15 ] and chloroplasts
[ 16 ]. The increasing importance of the large amount of DNA sequence
data recently collected from these small genomes is reflected in the
better understanding of their biology [ 3 4 12 13 14 ] and in the
upsurge of publications analyzing these genomes and the organisms to
which they belong [ 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ]. Genome
co-linearity, gene clustering and homolog identification are three
global genome analyses that are important in many fields of research,
including the resolution of phylogenetic and evolutionary relationships
[ 15 16 17 ].

Results and Discussion

Description of CoreGenes

CoreGenes is written in Java and incorporates the 'setdb' and 'BLASTP'
programs from the WU-BLAST package of Washington University,
http://BLAST.wustl.edu. The basis of this iterative comparison rests on
the BLASTP algorithm [ 29 ]. A flowchart of the process is illustrated
in Figure 1. This software allows for the identification,
characterization, cataloging and visualization of putatively essential
"core" genes in sets of two to five genomes in a user-friendly GUI
environment. A table with additional content information is generated
from the analyses. CoreGenes has been validated with representative
genomes from several families of viruses, as well as mitochondrion and
chloroplast genomes. In these examples, it locates and identifies
putatively related genes directly and gene clustering indirectly. The
gene similarities reported by CoreGenes warrant further and closer
inspection, given that a high BLAST score between two genes does not
always imply an orthologous relationship [ 30 ]. In other words, the
user should perform rigorous phylogenetic analysis of each set of
homologous genes to determine true orthology. Using a high threshold
value in CoreGenes will, however, increase the chance of retrieving
orthologous genes.

One obvious application is to use this tool as a step in the
characterization of an "alphabet" of putatively essential "core" genes
in a set of closely related genomes, such as a collection of poxvirus
genomes (unpublished data).

CoreGenes graphical user interface

The CoreGenes GUI contains three levels of data input/output, starting
with an interface for the entry of two to five genomes via GenBank
accession numbers (Figure 2) and ending with a display of the
corresponding protein of interest as archived in the NCBI database.
Contained within the top-level GUI (Figure 2) is an entry field for up
to five genome sequences.

Nota bene: entering mistyped GenBank accession numbers (for example,
with transposed characters) will result in error messages. It is
preferable to use the recent versions of GenBank accession numbers,
i.e., those prefixed with "NC_".
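
Because a mistyped accession number only surfaces as an error after
submission, a quick format check before entry can help. The following is
a minimal sketch, not part of CoreGenes: the function name is
hypothetical, and the pattern (the "NC_" prefix followed by six digits
and an optional version suffix) is an assumption based only on the
accession numbers quoted in this paper. It checks the form of the
string, not whether the record exists in GenBank.

```python
import re

# Assumed pattern: "NC_" + six digits, optionally ".<version>".
# This mirrors the accession numbers listed in this paper and is an
# illustration only; it is not taken from CoreGenes itself.
_NC_PATTERN = re.compile(r"NC_\d{6}(\.\d+)?")


def looks_like_nc_accession(text: str) -> bool:
    """Return True if text has the assumed NC_ accession shape."""
    return _NC_PATTERN.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for candidate in ["NC_000932", "CN_000932", "NC_00932"]:
        print(candidate, looks_like_nc_accession(candidate))
```

A check of this kind catches transposed prefixes and dropped digits, but
a well-formed number that points at the wrong genome would still pass.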

Once the program is initiated, the respective genome data are downloaded
from the GenBank database (Figure 1). These genome sequence data are
subsequently parsed into protein-coding sequences (as annotated in the
GenBank database) and are converted by CoreGenes into
"GeneOrder2.0-FASTA" format [ 7 8 29 31 ]. Comparisons are performed and
the results are presented in tabular format in the subsequent GUI. Each
gene has a hyperlink to its entry in the NCBI database.

Data mining algorithm

BLASTP protein similarity analyses [ 29 ] between the reference sequence
and the first query sequence are performed sequentially, with each query
protein compared individually to the entire protein database of the
reference genome. This is similar to the algorithm for the GeneOrder
analyses [ 7 8 ]. If the alignment score between the reference protein
and a query protein meets or exceeds a defined similarity threshold
number, then the proteins are paired and their accession numbers stored.
A consensus map of related genes is generated and stored. Hierarchical
comparisons with additionally entered query genomes, up to four in
total, are performed in each session.
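
The pairing step described above can be sketched as follows. This is an
illustration under stated assumptions, not the CoreGenes source: the
BLASTP scores are assumed to be already computed and supplied as a
dictionary keyed by (reference ID, query ID), and the function name and
identifiers are hypothetical. Every reference/query pair whose score
meets or exceeds the threshold is kept.

```python
from typing import Dict, List, Tuple

# Hypothetical input: precomputed BLASTP alignment scores,
# keyed by (reference_protein_id, query_protein_id).
Scores = Dict[Tuple[str, str], float]


def pair_proteins(reference_ids: List[str],
                  query_ids: List[str],
                  scores: Scores,
                  threshold: float = 75.0) -> List[Tuple[str, str]]:
    """Pair proteins whose score meets or exceeds the threshold."""
    pairs = []
    for q in query_ids:              # each query protein individually...
        for r in reference_ids:      # ...against every reference protein
            score = scores.get((r, q))
            if score is not None and score >= threshold:
                pairs.append((r, q))
    return pairs


if __name__ == "__main__":
    demo_scores = {("refA", "q1"): 210.0,
                   ("refB", "q1"): 40.0,
                   ("refB", "q2"): 75.0}
    print(pair_proteins(["refA", "refB"], ["q1", "q2"], demo_scores))
```

The default of 75 matches the default threshold described in this paper;
a score exactly equal to the threshold counts as a match.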

In detail, the process continues as the data for query genome number 2
are retrieved from GenBank and treated as described above, i.e., this
set of proteins is compared against the first consensus set of paired
genes formed between the reference genome and query genome number 1. A
second consensus set of related genes is generated and stored. Query
genome numbers 3 and 4 are analyzed iteratively and separately in an
analogous manner. A caveat is that if query genome number 1 does not
have a match to the reference genome, then a subsequent match between
query genome number 2 and the original reference genome (i.e., a
possible true related gene) will be discarded. In other words,
hierarchical matches must occur between the reference genome, query
genome number 1 and query genome number 2 in order for CoreGenes to
identify BLAST matches between the reference genome and query genome
number 2. A visual presentation of this is shown in Figure 3 (top
panel), where the genomes are aligned with the reference genome serving
as the "x-axis." Genes from query genomes that have the desired BLAST
matches are arrayed vertically above the reference genome. This display,
despite the shortcoming that further analysis of a gene terminates when
there is no match between two successive genomes, is useful as a simple
map of the order of genes contained in the reference genome. It also
serves as a quick, simple survey of the set of genomes in terms of BLAST
matches.
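
The hierarchical behavior and its caveat can be illustrated with a short
sketch. It is a simplification under stated assumptions and not the
CoreGenes implementation: each query genome is summarized by the set of
reference genes it matched at or above the threshold, and the consensus
is narrowed one query genome at a time. The function name and gene
identifiers are hypothetical.

```python
from typing import List, Set


def hierarchical_core(reference_genes: List[str],
                      hits_per_query: List[Set[str]]) -> List[str]:
    """Narrow the consensus set one query genome at a time.

    hits_per_query[i] is the (assumed, precomputed) set of reference
    genes matched by query genome i+1, in the order genomes were entered.
    """
    consensus = list(reference_genes)
    for hits in hits_per_query:
        # A gene survives only if it is still in the consensus AND is
        # matched by the current query genome.
        consensus = [gene for gene in consensus if gene in hits]
    return consensus


if __name__ == "__main__":
    reference = ["g1", "g2", "g3"]
    query1_hits = {"g1", "g3"}
    query2_hits = {"g1", "g2"}   # g2 matches here but was dropped earlier
    print(hierarchical_core(reference, [query1_hits, query2_hits]))
```

In this example the hypothetical gene g2 is matched by query genome 2
but not by query genome 1, so it never reaches the final consensus; this
is the discarded match described in the caveat above.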

However, permutations of the five genomes must be analyzed in order to
collect the comprehensive set of putatively related core genes. With
five genomes to be queried, this task is daunting to perform manually.
It would also be useful to generate a table of genes that bin across
only two, three or four genomes. This is being addressed actively, and
it is anticipated that such a comprehensive table, including rows with
matches across only two, three or four genomes, will be made available
in the near future. Meanwhile, upon completion of the above algorithm, a
table containing the extracted GenBank data and summarizing the "core"
genes within the queried genomes is generated (Figure 3, bottom panel).
The columns of this table can also be exported via "cut and paste" into
Microsoft Excel and Word to generate publication-quality figures.
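
To give a sense of why running the permutations by hand is daunting, the
sketch below enumerates every entry order for five genomes, treating the
first genome in each order as the reference. The accession numbers are
the five mitochondrion genomes listed in this paper; the function name
is hypothetical, and whether every ordering is actually required is not
stated in the text, so the count should be read as an upper bound.

```python
from itertools import permutations
from typing import List, Tuple


def entry_orders(accessions: List[str]) -> List[Tuple[str, ...]]:
    """Return every ordering; the first item is taken as the reference."""
    return list(permutations(accessions))


if __name__ == "__main__":
    genomes = ["NC_001807", "NC_001323", "NC_001328",
               "NC_001709", "NC_001326"]
    orders = entry_orders(genomes)
    print(len(orders))   # 5! = 120 possible entry orders
    print(orders[0])     # the order in which the genomes were listed
```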

Accession numbers of each gene and very brief descriptions are presented
in each individual block within this matrix, as extracted directly from
the GenBank database. Each individual gene is hyperlinked from this
table to the NCBI website to allow the investigator to view the unique
GenBank file for the gene of interest.

Using CoreGenes

Similarity ranges

Contained in the top-level window is a field to define the minimum
protein similarity score (i.e., the "BLASTP" threshold score). This can
be either the default ("75") or a user-defined value. Score ranges are
related to the similarities of the proteins being queried [ 7 8 ]. For
reference, the three similarity ranges that can be defined for running
GeneOrder2.0 are highest ("A"), high ("B") and low ("C") [ 7 8 ]. The
BLASTP threshold score ranges are as follows: "A" is [200, ∞), "B" is
[100, 200) and "C" is [75, 100). Genes with matches in the "A" range are
true homologs, while those in the "B" range are likely related and those
in the "C" range require visual validation of the level of identity in
order to ensure a true match. Related gene matching values for CoreGenes
are also defined in this manner. Caveat: it is always recommended that
BLAST matches be scrutinized, as reports have suggested that the closest
BLAST match is often not the nearest neighbor [ 30 ].
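
The three ranges are half-open intervals, so a score that sits exactly
on a boundary belongs to the higher range. A minimal sketch of that
classification follows; the function name is hypothetical and returning
None for scores below 75 is an assumption made for illustration.

```python
from typing import Optional


def similarity_range(score: float) -> Optional[str]:
    """Map a BLASTP score to range "A", "B" or "C" as defined above."""
    if score >= 200:
        return "A"      # [200, infinity): highest
    if score >= 100:
        return "B"      # [100, 200): high
    if score >= 75:
        return "C"      # [75, 100): low, needs visual validation
    return None         # below the default threshold (assumed handling)


if __name__ == "__main__":
    for value in (250, 200, 150, 100, 75, 60):
        print(value, similarity_range(value))
```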

Examples of CoreGenes analyses

This tool has been validated with analyses of several diverse virus,
chloroplast and mitochondrion genomes. For example, a set of four
chloroplast genomes (Figures 2 and 3) and a set of five mitochondrion
genomes (data not shown) from evolutionarily divergent sets of organisms
were run independently to demonstrate the power and capabilities of
CoreGenes. Shown in Figure 3 is an output from one of these analyses.
With the BLASTP threshold score set at "75," the "core" genes are
cataloged and displayed with brief identifying information from the
GenBank database. Sixty-one "core" genes were cataloged from the set of
chloroplast genomes (data not shown). The chloroplast genomes are as
follows: Arabidopsis thaliana, NC_000932; Nicotiana tabacum, NC_001879;
Oryza sativa, NC_001320; and Chlorella vulgaris, NC_001865. The
mitochondrion genomes are as follows: Homo sapiens (NC_001807), Gallus
gallus (NC_001323), Caenorhabditis elegans (NC_001328), Drosophila
melanogaster (NC_001709) and Schizosaccharomyces pombe (NC_001326). An
analysis was also performed with a mixture of mitochondrion and
chloroplast genomes. Interestingly, several putatively related genes
were detected in this particular analysis (data not shown).

Additional validations

In addition to the aforementioned chloroplast and mitochondrion genomes,
and of more interest to our research group, CoreGenes has been validated
with virus genomes ranging in size from 35 kb to 330 kb (data not
shown). Specifically, it has been run with combinations and permutations
of adenovirus genomes, ca. 35 kb (NC_001405, NC_001406, NC_002067,
NC_001454, NC_001460, NC_000942, NC_001813 and NC_002501), poxvirus
genomes, ca. 250 kb (NC_001559, NC_001266, NC_001266, NC_003027,
NC_001132, NC_001731 and NC_002642), and other viruses of varying sizes:
ca. 150 kb (e.g., baculoviruses: Helicoverpa armigera
nucleopolyhedrovirus G4, NC_002654, and Lymantria dispar
nucleopolyhedrovirus, NC_001973) and ca. 330 kb (Paramecium bursaria
Chlorella virus 1, NC_000852).

A group of three chordopoxviruses (vaccinia virus, NC_001559; Molluscum
contagiosum virus, NC_001731; and fowlpox virus, NC_001266) and two
entomopoxviruses (Melanoplus sanguinipes entomopoxvirus, NC_001993, and
Amsacta moorei entomopoxvirus, NC_002520) was analyzed with CoreGenes.
With related genomes such as these, the data can also be used as a
predictive tool for the elucidation of an "alphabet" of essential genes,
especially in combination with "wet bench" analyses such as the
characterization of temperature-sensitive mutants of, for example,
poxviruses (data not shown).

Limitations

Server Connectivity

CoreGenes run time is a function of the network connections. If one
party, such as the NCBI server, is experiencing heavy traffic or is down
due to technical difficulties, then the application will stall and the
run will be unsuccessful. Sets of orthopoxviruses, ca. 250 kb, take
approximately 25 minutes to run on a PowerMac G3 running Mac OS 9.0 and
Netscape Communicator 6.1. Larger genomes are currently problematic due
to the computational speed, the NCBI server and/or the user's connection
timing out. This issue is being addressed.

Some network "firewalls" may be incompatible with this software, causing
the connections to terminate prematurely. In that case the error message
"An internal error has occurred. Please try again later
java.lang.NullPointerException." will be displayed. Entering incorrect
accession numbers may give this same message. CoreGenes has, however,
been run successfully on university and public library terminals with
internet access. These organizations do not seem to have the same
"firewall" needs or concerns as other organizations.

Platform Limitations

CoreGenes has been validated on several different platforms and with
different web browsers: Macintosh (Explorer 4.5 and Netscape 6.1), PC
(Explorer 5.0 and Netscape 4.08), SGI (Netscape) and SUN (Netscape)
workstations. There are compatibility issues between CoreGenes and
Netscape 4.7 and below on the Macintosh. This problem appears to lie in
the Java applet included with the earlier versions of Netscape for
Macintosh; using Netscape 6.1 surmounts it. Moving an Apple-supplied
"JAVA Accelerator for PowerPC" into the "extensions" folder may allow
earlier versions of Netscape to run this program. Printing the CoreGenes
applet-generated graph may be problematic due to an applet
incompatibility; capturing the graph as a "screenshot" on the PC or Mac
platform and printing it independently circumvents this.

Run times vary from 1 minute and 21 seconds for a set of five adenovirus
genomes (ca. 35 kb) to 40 minutes for a set of five poxvirus genomes
(ca. 250 kb). Currently, if there are multiple requests, the computation
may take much longer as the requests are queued. This inconvenience is
due to the server hosting the software and is being addressed. Depending
on the hardware, some local servers may time out while waiting for the
request to be processed, which will result in an error message stating
that "The attempt to load 'servlet' failed." Adjusting "preference"
settings in the local web browser may rectify this problem. Immediate
goals for improvement include an option to have results e-mailed back to
the user. We expect additional improvements in both speed and
responsiveness when we upgrade our server hardware and rewrite some of
the CoreGenes software to accommodate the larger megabase-sized genomes.

Software Limitations

Only the NCBI database can be searched at this time; in other words,
only GenBank accession numbers can be used. If an accession number is
entered incorrectly, then an error message will be displayed, e.g., "The
attempt to load 'servlet' failed." Improvements to this software will
include an additional field for entering proprietary and non-GenBank
genome data, similar to an option developed for GeneOrder2.0 [ 8 ].

Conclusions

CoreGenes fits into the niche for GUI-based interactive computational
tools [ 1 2 3 4 5 6 7 8 9 10 ] that enhance the visualization of DNA
sequence data, especially in the context of genome comparisons. It meets
a critical need for tool sets containing global "whole genome" analysis
tools. As noted earlier, small genomes are still of great interest to
many researchers. This tool is a base to expand upon, for example, to
build more robust, elegant and complementary "whole genome"
computational tools. Although CoreGenes successfully expedites the
determination of "core" genes during the simultaneous comparison of
several small whole genomes, it will likely be succeeded by improved
software to compare and analyze much larger genomes, especially in the
megabase range. This feature is being pursued with urgency. One known
current limitation in analyzing larger genomes is computational, e.g.,
hardware; this will be addressed shortly. Increasingly powerful
workstations acting as servers will allow the much more computationally
intensive comparisons of megabase-sized genomes. Nevertheless, this
version of CoreGenes is very useful and fills a current unmet need in
genome analyses, that of collecting related genes in a family of
genomes. In addition to stimulating the development of similar tools,
CoreGenes will itself continue to be improved. We plan to support this
version of CoreGenes actively, updating it with improvements and
additional features, as well as to work on a more robust, faster
version.