Background
Computational methods for whole genome
studies
Comparative genomics and more specialized fields such
as comparative virology, etc., involve the comparison of
DNA sequences, genes and genomes [ 12 13 14 ] . Recent
rapid data acquisition is allowing the analyses of whole
genome sequences, especially the smaller genomes such as
mitochondria and chloroplasts [ 15 16 17 ] , as well as
the larger bacterial genomes [ 18 19 ] and large tracts
of eukaryotic chromosomes, especially from related
organisms [ 12 13 14 20 21 22 23 ] . These studies
include the determination of the order of genes,
i.e., co-linearity [ 24 25 ] , the
location of synteny [ 26 27 28 ] and the identification
of clusters of orthologous genes [cog] between two
genomes [ 21 22 23 ] . Along similar lines of thought, it
should be extremely useful to locate, identify and
catalog the sets of "core" genes common to these
genomes-genomes which otherwise may be related or
semi-related or unrelated in other respects. These global
views allow for a deeper understanding of one organism in
the context of another, especially in regards to their
genomic contents. In addition, the comparison of multiple
genomes and the identification of related genes and
"core" genes can lead to insight into the structure and
function of genes and genomes [ 4 ] . This is very useful
in genome annotations and also in the identification and
characterization of functions for "newly found" putative
genes.
Identification of "core" genes from small whole
genomes is useful and complements other data derived from
these genomes. Small genomes include those from viruses [
3 ] , mitochondria [ 14 15 ] and chloroplasts [ 16 ] .
The increasing importance of the large amount of DNA
sequence data recently collected from these small genomes
is reflected in the better understanding of their biology
[ 3 4 12 13 14 ] and in the upsurge of publications
analyzing these genomes and the organisms to which they
belong [ 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ] .
Genome co-linearity, gene clustering and homolog
identification are three global genome analyses which are
important in many fields of research, including resolving
phylogenetic and evolutionary relationships [ 15 16 17 ]
.
Results and Discussion
Description of CoreGenes
CoreGenes is written in JAVA-based programming
incorporating the 'setdb' and 'BLASTP' programs from the
WU-BLAST package of Washington University,
http://BLAST.wustl.edu. The basis of this iterative
comparison rests on the BLASTP algorithm [ 29 ] . A
flowchart of the processes is illustrated in Figure 1.
This software allows for the identification,
characterization, catalog and visualization of putatively
essential "core" genes in sets of two to five genomes in
a user-friendly GUI environment. A table with additional
content information is generated from the analyses.
CoreGenes has been validated with representative genomes
from several families of viruses, as well as
mitochondrion and chloroplast genomes. In these examples,
it locates and identifies putatively related genes
directly and gene clustering indirectly. In light of the
similarities of certain genes generated by CoreGenes, one
may ponder their relationships upon further and closer
inspection, given that the high BLAST scores between two
genes do not always imply an orthologous relationship [
30 ] . In other words, the complexity of these BLAST
scores suggests that the user should perform rigorous
phylogenetic analysis of each set of homologous genes to
determine true orthology. Though if the user uses a high
threshold value while using GeneCore, s/he will increase
the chances to retrieve orthologous genes.
One obvious application is to use this tool as a step
in the characterization of an "alphabet" of putatively
essential "core" genes in a set of closely related
genomes such as from a collection of poxvirus genomes (
unpublished data ).
CoreGenes graphical user interface
The CoreGenes GUI contains three levels of data
input/output, starting with an interface for the entry of
two to five genomes via GenBank accession numbers (Figure
2) and ending with a display of the corresponding protein
of interest as archived in the NCBI database. Contained
within the top-level GUI (Figure 2) is an entry field for
up to five genome sequences.
Nota Bene, entering GenBank
accession numbers with dyslexic renderings will result in
"error messages." It is preferable to use the recent
versions of GenBank accession numbers, i.e., prefixed
with "NC_..."
Once the program is initiated, the respective genome
data are downloaded from the GenBank database (Figure 1).
These genome sequence data are subsequently parsed into
protein-coding sequences [as annotated in the GenBank
database] and are converted by CoreGenes into
"GeneOrder2.0-FASTA" format [ 7 8 29 31 ] . Comparisons
are performed and the results are presented in a tabular
format in the subsequent GUI. Each gene has a hyperlink
to its entry in the NCBI database.
Data mining algorithm
BLASTP protein similarity analyses [ 29 ] between the
reference sequence and the first query sequence are
performed sequentially, with each query protein compared
individually to the entire protein database of the
reference genome. This is similar to the algorithm for
the GeneOrder analyses [ 7 8 ] . If the alignment score
between the reference protein and a query protein meets
or exceeds a defined similarity threshold number, then
the proteins are paired and their accession numbers
stored. A consensus map of related genes is generated and
stored. Hierarchical comparisons with additionally
entered query genomes, up to four in total, are performed
in each session.
In detail, the process continues as query genome
number 2 data are retrieved from GenBank and treated as
described above,
i.e., this set of proteins is
compared against the first consensus set of paired genes
formed between the reference genome and query genome
number 1. A second consensus set of related genes is
generated and stored. Query genome numbers 3 and 4 are
iteratively and separately analyzed in an analogous
manner. A caveat is that if query genome number 1 does
not have a match to the reference genome, then a
subsequent query genome number 2 match to the original
reference genome (
i.e., possible true related gene)
will be discarded. In other words, hierarchical matches
must occur between the reference genome, query genome
number 1 and query genome number 2 in order for CoreGenes
to identify BLAST matches between the reference genome
and the query genome number 2. A visual presentation of
this is shown in Figure 3(top panel), where the genomes
are aligned with the reference genome serving as the
"x-axis." Genes from query genomes that have the desired
BLAST matches are arrayed vertically above the reference
genome. This, despite its shortcoming of terminating a
further analysis should there be no match between the two
immediate genomes, is useful as a simple map of the order
of genes contained in the reference genome. It also
serves as a quick simple survey of the set of genomes in
terms of BLAST matches.
However, permutations of the five genomes must be
analyzed in order to collect the comprehensive set of
putatively related core genes. Given the five genomes to
be queried, this task is daunting manually. Of course it
would be useful to generate a table of genes that bin
across only 2, 3 or 4 genomes. This is being addressed
actively. It is anticipated that this comprehensive table
of genes including rows with matches across only two,
three or four genomes will be made available in the near
future. Meanwhile, upon the completion of the above
algorithm, a table containing the extracted GenBank data
and summarizing the "core" genes within the queried
genomes is generated (Figure 3bottom panel). The columns
of this table can also be exported via "cut and paste"
into Microsoft Excel and Word programs to generate
publication quality figures.
Accession numbers of each gene and very brief
descriptions are presented in each individual block
within this matrix, as extracted directly from the
GenBank database. Each individual gene is hyperlinked
from this table to the NCBI website to allow the
investigator an opportunity to view the unique GenBank
file for the gene of interest.
Using CoreGenes
Similarity ranges
Contained in the top-level window is a field to
define the minimum protein similarity score (
i. e., "BLASTP" threshold score).
These can be either the default ("75") or a
user-defined value. Score ranges are related to the
similarities of the proteins being queried [ 7 8 ] .
For reference, the three similarity ranges that can be
defined for running GeneOrder2.0 are highest ("A"),
high ("B") and low ("C") [ 7 8 ] . The BLASTP threshold
score ranges for each are as follows: "A" is defined
from [200-∞), "B" is defined from [100-200) and "C" is
defined from [75-100). Genes with matches in the "A"
range are true homologs, while those in the "B" range
are likely related and those in the "C" range require
visual validation of the level of identity in order to
ensure a true match. Related gene matching values for
CoreGenes are also defined in this manner. Caveat: it
is always recommended that the results between two
BLAST matches be scrutinized as reports have suggested
that the closest BLAST match is often not the nearest
neighbor [ 30 ] .
Examples of CoreGenes analyses
This tool has been validated with analyses of
several diverse virus, chloroplast and mitochondrion
genomes. For example, a set of four chloroplast genomes
(Figures 2and 3) and a set of five mitochondrion
genomes (data not shown) from evolutionary divergent
sets of organisms were run independently to demonstrate
the power and capabilities of CoreGenes. Shown in
Figure 3is an output from one of these analyses. With
the BLASTP threshold score set at "75," the "core"
genes are cataloged and displayed with brief
identifying information from the GenBank database.
Sixty-one "core" genes were cataloged from the set of
chloroplast genomes (data not shown). The genomes are
as follows:
Arabidopsis thaliana , NC_000932;
Nicotiana tabacum , NC_001879;
Oryza sativa , NC_001320; and
Chlorella vulgaris , NC_001865.
Mitochondrion genomes are as follows:
Homo sapiens (NC_001807),
Gallus gallus (NC_001323),
Caenorhabditis
elegans (NC_001328),
Drosophila
melanogaster (NC_001709) and
Schizosaccharomyces
pombe (NC_001326). An analysis was also performed
with a mixture of mitochondrion and chloroplast
genomes. Interestingly, several putatively related
genes were detected in this particular analysis (data
not shown).
Additional validations
In addition to the aforementioned chloroplast and
mitochondrion genomes, and of more interest to our
research group, CoreGenes has been validated with virus
genomes ranging in size from 35 kb to 330 kb (data not
shown). Specifically, it has been run with combinations
and permutations of adenovirus genomes,
ca. 35 kb (NC_001405, NC_001406,
NC_002067, NC_001454, NC_001460, NC_000942, NC_001813
and NC_002501, poxvirus genomes, ca. 250 kb NC_001559,
NC_001266, NC_001266, NC_003027, NC_001132, NC_001731
and NC_002642), and other viruses of varying sizes:
ca. 150 kb (
e.g., baculoviruses:
Heliocoverpa armigera
nucleopolyhedrovirus G4 , NC_002654 and
Lymantria dispar
nucleopolyhedrovirus NC_001973) and
ca. 330 kb (
Paramecium bursaria Chlorella virus
1 , NC_000852).
A group of three chordopox viruses (vaccinia
NC_001559,
Molluscum contagiosum virus
NC_001731, and fowlpox virus NC_001266) and two
entomopox viruses (
Melanoplus sanguinipes
entomopoxvirus NC_001993 and
Amsacta moorei
entomopoxvirus NC_002520) was analyzed with
CoreGenes. With related genomes such as these, the data
can also be used as a predictive tool for the
elucidation of an "alphabet" of essential genes
especially in collaboration with "wet bench" analyses
such as the characterization of temperature sensitive
mutants, for example, poxviruses (data not shown).
Limitations
Server Connectivity
CoreGenes run time is a function of the network
connections. If one party, such as the NCBI server, is
experiencing heavy traffic or is down due to technical
difficulties, then the application will stall and be
unsuccessful. Sets of orthopoxviruses,
ca. 250 kb, take approximately 25
minutes to run on a PowerMac G3 running Mac OS 9.0 and
Netscape Communicator 6.1. Larger genomes are currently
problematic due to the computational speed, the NCBI
server and/or the user's connection timing out. This
issue is being addressed.
Some network "firewalls" may be incompatible with
this software, causing the connections to terminate
prematurely. An error message "An internal error has
occurred. Please try again later java lang.NullPointer
Exception." will be displayed. Also, entering incorrect
accession numbers may give this same message.
Alternatively, CoreGenes has been run successfully on
university and public library terminals with internet
access. These organizations do not seem to have the
"firewall" needs/concerns as other organizations.
Platform Limitations
CoreGenes has been validated with several different
platforms and also with different web browsers:
Macintosh (Explorer 4.5 and Netscape 6.1), PC (Explorer
5.0 and Netscape 4.08), SGI (Netscape) and SUN
(Netscape) workstations. There are compatibility issues
between CoreGenes and Macintosh (Netscape 4.7 and
below). Using Netscape 6.1 surmounts these problems.
This problem appears to lie in the JAVA applet included
with the earlier version of Netscape for Macintosh.
Moving an Apple-supplied "JAVA Accelerator for PowerPC"
into the "extensions" folder may allow earlier versions
of Netscape to run this program. Printing the CoreGenes
applet-generated graph may be problematic due to an
applet incompatibility; capturing the graph as a
"screenshot" via the PC and the Mac platforms and
printing independently circumvents this.
Run times vary from 1 minute and 21 seconds for a
set of five adenovirus genomes (
ca. 35 kb) to 40 minutes for a
set of five poxvirus genomes (
ca. 250 kb). Currently, if there
are multiple requests, the computation may take much
longer as the requests are queued. This inconvenience
is being addressed and is due to the server hosting the
software. Depending on the hardware, some local servers
may time out during this period while waiting for this
request to be processed, which will result in an error
message stating that "The attempt to load 'servlet'
failed." Adjusting "preference" settings on the local
web browser may rectify this problem. Immediate goals
of improvement include an option to have results
e-mailed back to the user. We expect that there will be
additional improvements in both speed and response
issues when we upgrade our server hardware and rewrite
some of the CoreGenes software to accommodate the
larger megabase-sized genomes.
Software Limitations
Only the NCBI database can be searched at this time;
in other words, only GenBank accession numbers can be
used. If there is an operator error in entering the
number correctly, then an error message will be
displayed,
e.g., "The attempt to load
'servlet' failed." Improvements to this software will
include providing an additional field to enter
proprietary and non-GenBank genome data, similar to an
option developed for GeneOrder2.0 [ 8 ] .
Conclusions
CoreGenes fits into the niche for GUI-based interactive
computational tools [ 1 2 3 4 5 6 7 8 9 10 ] that enhance
the visualization of DNA sequence data, especially in the
context of genome comparisons. It meets a critical need for
tool sets containing global "whole genome" analyses tools.
As noted earlier, small genomes are still of great interest
to many researchers. This tool is a base to expand upon,
for example, to build more robust, elegant and
complementary "whole genome" computational tools. Although
CoreGenes successfully expedites the determination of
"core" genes during the comparisons of several small whole
genomes simultaneously, it will likely be succeeded by
improved software to compare and analyze even much larger
genomes, especially in the megabase range. This feature is
being pursued with urgency. One known current limitation in
analyzing larger genomes is computational,
e. g., hardware; this will be
addressed shortly. Increasingly powerful workstations to
act as servers will allow the much more computationally
intensive comparisons of megabase-sized genomes. However,
this version of CoreGenes is very useful and fills a
current unmet need in genome analyses, that of collecting
related genes in a family of genomes. In addition to
stimulating the development of similar tools, CoreGenes
will allow continuing improvements to it. We plan to
support aggressively this version of CoreGenes, updating
with improvements and additional features, as well as to
work on a more robust faster version.