Background
        
          Computational methods for whole genome
          studies
          Comparative genomics and more specialized fields, such
          as comparative virology, involve the comparison of
          DNA sequences, genes and genomes [ 12 13 14 ] . Recent
          rapid data acquisition is allowing the analyses of whole
          genome sequences, especially the smaller genomes such as
          mitochondria and chloroplasts [ 15 16 17 ] , as well as
          the larger bacterial genomes [ 18 19 ] and large tracts
          of eukaryotic chromosomes, especially from related
          organisms [ 12 13 14 20 21 22 23 ] . These studies
          include the determination of the order of genes, 
          i.e., co-linearity [ 24 25 ] , the
          location of synteny [ 26 27 28 ] and the identification
          of clusters of orthologous genes [cog] between two
          genomes [ 21 22 23 ] . Along similar lines of thought, it
          should be extremely useful to locate, identify and
          catalog the sets of "core" genes common to these
          genomes, which may otherwise be related,
          semi-related or unrelated in other respects. These global
          views allow for a deeper understanding of one organism in
          the context of another, especially with regard to their
          genomic contents. In addition, the comparison of multiple
          genomes and the identification of related genes and
          "core" genes can lead to insight into the structure and
          function of genes and genomes [ 4 ] . This is very useful
          in genome annotations and also in the identification and
          characterization of functions for "newly found" putative
          genes.
          Identification of "core" genes from small whole
          genomes is useful and complements other data derived from
          these genomes. Small genomes include those from viruses [
          3 ] , mitochondria [ 14 15 ] and chloroplasts [ 16 ] .
          The increasing importance of the large amount of DNA
          sequence data recently collected from these small genomes
          is reflected in the better understanding of their biology
          [ 3 4 12 13 14 ] and in the upsurge of publications
          analyzing these genomes and the organisms to which they
          belong [ 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ] .
          Genome co-linearity, gene clustering and homolog
          identification are three global genome analyses which are
          important in many fields of research, including resolving
          phylogenetic and evolutionary relationships [ 15 16 17 ]
          .
        
      
      
        Results and Discussion
        
          Description of CoreGenes
          CoreGenes is written in Java and incorporates the
          'setdb' and 'BLASTP' programs from the
          WU-BLAST package of Washington University,
          http://BLAST.wustl.edu. The iterative
          comparison is based on the BLASTP algorithm [ 29 ] . A
          flowchart of the processes is illustrated in Figure 1.
          This software allows for the identification,
          characterization, cataloging and visualization of putatively
          essential "core" genes in sets of two to five genomes in
          a user-friendly GUI environment. A table with additional
          content information is generated from the analyses.
          CoreGenes has been validated with representative genomes
          from several families of viruses, as well as
          mitochondrion and chloroplast genomes. In these examples,
          it locates and identifies putatively related genes
          directly and gene clustering indirectly. The relationships
          among genes grouped by CoreGenes warrant closer
          inspection, because a high BLAST score between two
          genes does not always imply an orthologous relationship [
          30 ] . The user should therefore perform rigorous
          phylogenetic analysis of each set of homologous genes to
          determine true orthology. Using a high threshold value
          in CoreGenes will, however, increase
          the chances of retrieving orthologous genes.
          One obvious application is to use this tool as a step
          in the characterization of an "alphabet" of putatively
          essential "core" genes in a set of closely related
          genomes such as from a collection of poxvirus genomes ( 
          unpublished data ).
        
        
          CoreGenes graphical user interface
          The CoreGenes GUI contains three levels of data
          input/output, starting with an interface for the entry of
          two to five genomes via GenBank accession numbers (Figure
          2) and ending with a display of the corresponding protein
          of interest as archived in the NCBI database. Contained
          within the top-level GUI (Figure 2) is an entry field for
          up to five genome sequences. 
          Note that mistyped GenBank
          accession numbers (for example, with transposed digits)
          will result in error messages. It is preferable to use the recent
          versions of GenBank accession numbers, i.e., those prefixed
          with "NC_".
          Once the program is initiated, the respective genome
          data are downloaded from the GenBank database (Figure 1).
          These genome sequence data are subsequently parsed into
          protein-coding sequences [as annotated in the GenBank
          database] and are converted by CoreGenes into
          "GeneOrder2.0-FASTA" format [ 7 8 29 31 ] . Comparisons
          are performed and the results are presented in a tabular
          format in the subsequent GUI. Each gene has a hyperlink
          to its entry in the NCBI database.
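
          The conversion step described above can be illustrated with a
          minimal sketch. The exact layout of the "GeneOrder2.0-FASTA"
          format is not specified here, so the sketch writes plain FASTA
          records instead; the function name, the record tuples and the
          protein identifiers are hypothetical and are not taken from
          CoreGenes.

```python
from textwrap import wrap
from typing import Iterable, Tuple

# Hypothetical record: (protein_id, description, amino_acid_sequence).
Record = Tuple[str, str, str]


def to_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Format protein records as plain FASTA text (an assumed stand-in
    for the GeneOrder2.0-FASTA format, whose details are not given)."""
    lines = []
    for protein_id, description, sequence in records:
        # Header line: '>' followed by the identifier and a description.
        lines.append(f">{protein_id} {description}".rstrip())
        # Sequence lines are wrapped to a fixed width.
        lines.extend(wrap(sequence, width))
    return "\n".join(lines) + "\n"


example = [("P1", "example protein", "M" * 70)]
print(to_fasta(example))
```

          Each annotated protein-coding sequence becomes one record, so
          that the resulting file can serve as a protein database for
          the BLASTP comparisons.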
        
        
          Data mining algorithm
          BLASTP protein similarity analyses [ 29 ] between the
          reference sequence and the first query sequence are
          performed sequentially, with each query protein compared
          individually to the entire protein database of the
          reference genome. This is similar to the algorithm for
          the GeneOrder analyses [ 7 8 ] . If the alignment score
          between the reference protein and a query protein meets
          or exceeds a defined similarity threshold number, then
          the proteins are paired and their accession numbers
          stored. A consensus map of related genes is generated and
          stored. Hierarchical comparisons with additionally
          entered query genomes, up to four in total, are performed
          in each session.
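
          The pairing rule above, in which a pair is kept when its score
          meets or exceeds the threshold, can be sketched as follows.
          This is an illustration under stated assumptions rather than
          the CoreGenes implementation: the hit tuples, identifiers and
          function name are hypothetical, and the scores are assumed to
          have been parsed from BLASTP output already.

```python
from typing import Dict, List, Tuple

# Hypothetical hit: (query_protein, reference_protein, blastp_score).
Hit = Tuple[str, str, float]


def pair_proteins(hits: List[Hit], threshold: float = 75.0) -> Dict[str, List[str]]:
    """Map each reference protein to the query proteins whose score
    meets or exceeds the similarity threshold."""
    pairs: Dict[str, List[str]] = {}
    for query_id, ref_id, score in hits:
        if score >= threshold:  # "meets or exceeds" the threshold
            pairs.setdefault(ref_id, []).append(query_id)
    return pairs


hits = [
    ("q1", "refA", 210.0),  # kept
    ("q2", "refA", 60.0),   # below the default threshold, dropped
    ("q3", "refB", 75.0),   # exactly at the threshold, kept
]
print(pair_proteins(hits))  # {'refA': ['q1'], 'refB': ['q3']}
```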
          In detail, the process continues as query genome
          number 2 data are retrieved from GenBank and treated as
          described above, 
          i.e., this set of proteins is
          compared against the first consensus set of paired genes
          formed between the reference genome and query genome
          number 1. A second consensus set of related genes is
          generated and stored. Query genome numbers 3 and 4 are
          iteratively and separately analyzed in an analogous
          manner. A caveat is that if query genome number 1 does
          not have a match to the reference genome, then a
          subsequent query genome number 2 match to the original
          reference genome ( 
          i.e., possible true related gene)
          will be discarded. In other words, hierarchical matches
          must occur between the reference genome, query genome
          number 1 and query genome number 2 in order for CoreGenes
          to identify BLAST matches between the reference genome
          and the query genome number 2. A visual presentation of
          this is shown in Figure 3 (top panel), where the genomes
          are aligned with the reference genome serving as the
          "x-axis." Genes from query genomes that have the desired
          BLAST matches are arrayed vertically above the reference
          genome. Although the analysis of a gene terminates as soon
          as there is no match between two successive genomes, this
          display is useful as a simple map of the order
          of genes contained in the reference genome. It also
          serves as a quick, simple survey of the set of genomes in
          terms of BLAST matches.
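
          The hierarchical behavior and its caveat can be made concrete
          with a small sketch. It is simplified in that matches are
          keyed by reference gene only, whereas the description above
          compares each new query against the stored consensus set; the
          function name and gene labels are hypothetical.

```python
from typing import Dict, List


def hierarchical_core(reference_genes: List[str],
                      matches_per_query: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Keep a reference gene only while every successive query genome
    has a match to it; matches_per_query is in order of entry."""
    core: Dict[str, List[str]] = {gene: [] for gene in reference_genes}
    for matches in matches_per_query:
        # A gene dropped at one step is never reconsidered later.
        core = {gene: partners + [matches[gene]]
                for gene, partners in core.items()
                if gene in matches}
    return core


reference = ["r1", "r2", "r3"]
query1 = {"r1": "a1", "r2": "a2"}   # r3 has no match in query genome 1
query2 = {"r1": "b1", "r3": "b3"}   # r3 matches here, but is already gone
print(hierarchical_core(reference, [query1, query2]))  # {'r1': ['a1', 'b1']}
```

          In this example the match between r3 and query genome 2 is
          discarded because query genome 1 had no match to r3, which is
          the caveat noted above.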
          However, permutations of the five genomes must be
          analyzed in order to collect the comprehensive set of
          putatively related core genes. With five genomes to
          be queried, this task is daunting to perform manually. It
          would also be useful to generate a table of genes that are
          shared across only two, three or four genomes; this is being
          addressed actively, and such a comprehensive table is
          expected to be made available in the near
          future. Meanwhile, upon the completion of the above
          algorithm, a table containing the extracted GenBank data
          and summarizing the "core" genes within the queried
          genomes is generated (Figure 3, bottom panel). The columns
          of this table can also be exported via "cut and paste"
          into Microsoft Excel and Word programs to generate
          publication quality figures.
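
          To give a sense of why the permutation analysis is daunting
          by hand, the sketch below enumerates every ordering of five
          genomes, treating the first accession in each ordering as the
          reference and the rest as queries in order of entry. The
          function name is hypothetical, and the accession numbers are
          those of the five poxvirus genomes discussed later in the
          text, used purely as labels.

```python
from itertools import permutations
from typing import List, Tuple


def run_orders(accessions: List[str]) -> List[Tuple[str, Tuple[str, ...]]]:
    """List every (reference, ordered queries) combination."""
    return [(order[0], order[1:]) for order in permutations(accessions)]


genomes = ["NC_001559", "NC_001731", "NC_001266", "NC_001993", "NC_002520"]
orders = run_orders(genomes)
print(len(orders))  # 120 orderings, i.e. 5!
print(orders[0])    # first accession is the reference, the rest are queries
```

          Whether all 120 orderings need to be run, or only those with
          distinct references, depends on how sensitive the result is
          to query order; the sketch simply enumerates the full set.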
          Accession numbers of each gene and very brief
          descriptions are presented in each individual block
          within this matrix, as extracted directly from the
          GenBank database. Each individual gene is hyperlinked
          from this table to the NCBI website to allow the
          investigator an opportunity to view the unique GenBank
          file for the gene of interest.
        
        
          Using CoreGenes
          
            Similarity ranges
            The top-level window contains a field to
            define the minimum protein similarity score
            (i.e., the "BLASTP" threshold score).
            This can be either the default ("75") or a
            user-defined value. Score ranges are related to the
            similarities of the proteins being queried [ 7 8 ] .
            For reference, the three similarity ranges that can be
            defined for running GeneOrder2.0 are highest ("A"),
            high ("B") and low ("C") [ 7 8 ] . The BLASTP threshold
            score ranges for each are as follows: "A" is defined
            from [200-∞), "B" is defined from [100-200) and "C" is
            defined from [75-100). Genes with matches in the "A"
            range are true homologs, while those in the "B" range
            are likely related and those in the "C" range require
            visual validation of the level of identity in order to
            ensure a true match. Related gene matching values for
            CoreGenes are also defined in this manner. Caveat: it
            is always recommended that the results between two
            BLAST matches be scrutinized as reports have suggested
            that the closest BLAST match is often not the nearest
            neighbor [ 30 ] .
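
            The three ranges can be expressed as a small lookup, shown
            here as a sketch. The interval boundaries are those given
            above; the function name and the use of None for scores
            below 75 are assumptions made for illustration.

```python
from typing import Optional


def similarity_range(score: float) -> Optional[str]:
    """Return the range label for a BLASTP score:
    "A" for [200, inf), "B" for [100, 200), "C" for [75, 100),
    and None for scores below 75."""
    if score >= 200:
        return "A"
    if score >= 100:
        return "B"
    if score >= 75:
        return "C"
    return None


for s in (250, 200, 150, 100, 80, 75, 60):
    print(s, similarity_range(s))
```

            Matches labeled "C" are the ones that call for visual
            validation of the alignment before they are accepted.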
          
          
            Examples of CoreGenes analyses
            This tool has been validated with analyses of
            several diverse virus, chloroplast and mitochondrion
            genomes. For example, a set of four chloroplast genomes
            (Figures 2 and 3) and a set of five mitochondrion
            genomes (data not shown) from evolutionarily divergent
            sets of organisms were run independently to demonstrate
            the power and capabilities of CoreGenes. Shown in
            Figure 3 is an output from one of these analyses. With
            the BLASTP threshold score set at "75," the "core"
            genes are cataloged and displayed with brief
            identifying information from the GenBank database.
            Sixty-one "core" genes were cataloged from the set of
            chloroplast genomes (data not shown). The genomes are
            as follows: Arabidopsis thaliana, NC_000932;
            Nicotiana tabacum, NC_001879;
            Oryza sativa, NC_001320; and
            Chlorella vulgaris, NC_001865.
            Mitochondrion genomes are as follows:
            Homo sapiens (NC_001807),
            Gallus gallus (NC_001323),
            Caenorhabditis elegans (NC_001328),
            Drosophila melanogaster (NC_001709) and
            Schizosaccharomyces pombe (NC_001326). An analysis was also performed
            with a mixture of mitochondrion and chloroplast
            genomes. Interestingly, several putatively related
            genes were detected in this particular analysis (data
            not shown).
          
          
            Additional validations
            In addition to the aforementioned chloroplast and
            mitochondrion genomes, and of more interest to our
            research group, CoreGenes has been validated with virus
            genomes ranging in size from 35 kb to 330 kb (data not
            shown). Specifically, it has been run with combinations
            and permutations of adenovirus genomes,
            ca. 35 kb (NC_001405, NC_001406,
            NC_002067, NC_001454, NC_001460, NC_000942, NC_001813
            and NC_002501); poxvirus genomes, ca. 250 kb (NC_001559,
            NC_001266, NC_001266, NC_003027, NC_001132, NC_001731
            and NC_002642); and other viruses of varying sizes:
            ca. 150 kb (e.g., baculoviruses:
            Helicoverpa armigera nucleopolyhedrovirus G4, NC_002654 and
            Lymantria dispar nucleopolyhedrovirus, NC_001973) and
            ca. 330 kb (Paramecium bursaria Chlorella virus 1, NC_000852).
            A group of three chordopoxviruses (vaccinia virus,
            NC_001559; Molluscum contagiosum virus,
            NC_001731; and fowlpox virus, NC_001266) and two
            entomopoxviruses (Melanoplus sanguinipes
            entomopoxvirus, NC_001993 and Amsacta moorei
            entomopoxvirus, NC_002520) was analyzed with
            CoreGenes. With related genomes such as these, the data
            can also be used as a predictive tool for the
            elucidation of an "alphabet" of essential genes,
            especially in combination with "wet bench" analyses
            such as the characterization of temperature-sensitive
            mutants of, for example, poxviruses (data not shown).
          
        
        
          Limitations
          
            Server Connectivity
            CoreGenes run time is a function of the network
            connections. If one party, such as the NCBI server, is
            experiencing heavy traffic or is down due to technical
            difficulties, then the application will stall and be
            unsuccessful. Sets of orthopoxviruses, 
            ca. 250 kb, take approximately 25
            minutes to run on a PowerMac G3 running Mac OS 9.0 and
            Netscape Communicator 6.1. Larger genomes are currently
            problematic due to the computational speed, the NCBI
            server and/or the user's connection timing out. This
            issue is being addressed.
            Some network "firewalls" may be incompatible with
            this software, causing the connections to terminate
            prematurely. An error message "An internal error has
            occurred. Please try again later
            java.lang.NullPointerException." will be displayed. Entering incorrect
            accession numbers may also give this same message.
            In contrast, CoreGenes has been run successfully on
            university and public library terminals with internet
            access. These organizations do not seem to have the
            same "firewall" restrictions as other organizations.
          
          
            Platform Limitations
            CoreGenes has been validated with several different
            platforms and also with different web browsers:
            Macintosh (Explorer 4.5 and Netscape 6.1), PC (Explorer
            5.0 and Netscape 4.08), SGI (Netscape) and SUN
            (Netscape) workstations. There are compatibility issues
            between CoreGenes and Netscape 4.7 and below on the
            Macintosh; using Netscape 6.1 surmounts these problems.
            The problem appears to lie in the JAVA applet included
            with the earlier versions of Netscape for Macintosh.
            Moving an Apple-supplied "JAVA Accelerator for PowerPC"
            into the "extensions" folder may allow earlier versions
            of Netscape to run this program. Printing the CoreGenes
            applet-generated graph may be problematic due to an
            applet incompatibility; capturing the graph as a
            "screenshot" on the PC or Mac platform and
            printing it independently circumvents this.
            Run times vary from 1 minute and 21 seconds for a
            set of five adenovirus genomes (ca. 35 kb) to 40 minutes for a
            set of five poxvirus genomes (ca. 250 kb). Currently, if there
            are multiple requests, the computation may take much
            longer as the requests are queued. This inconvenience
            is due to the server hosting the
            software and is being addressed. Depending on the hardware, some local servers
            may time out during this period while waiting for this
            request to be processed, which will result in an error
            message stating that "The attempt to load 'servlet'
            failed." Adjusting "preference" settings on the local
            web browser may rectify this problem. Immediate goals
            of improvement include an option to have results
            e-mailed back to the user. We expect that there will be
            additional improvements in both speed and response
            issues when we upgrade our server hardware and rewrite
            some of the CoreGenes software to accommodate the
            larger megabase-sized genomes.
          
          
            Software Limitations
            Only the NCBI database can be searched at this time;
            in other words, only GenBank accession numbers can be
            used. If an accession number is entered incorrectly,
            then an error message will be
            displayed, e.g., "The attempt to load
            'servlet' failed." Improvements to this software will
            include providing an additional field to enter
            proprietary and non-GenBank genome data, similar to an
            option developed for GeneOrder2.0 [ 8 ] .
          
        
      
      
        Conclusions
        CoreGenes fits into the niche for GUI-based interactive
        computational tools [ 1 2 3 4 5 6 7 8 9 10 ] that enhance
        the visualization of DNA sequence data, especially in the
        context of genome comparisons. It meets a critical need for
        tool sets containing global "whole genome" analysis tools.
        As noted earlier, small genomes are still of great interest
        to many researchers. This tool is a base to expand upon,
        for example, to build more robust, elegant and
        complementary "whole genome" computational tools. Although
        CoreGenes successfully expedites the determination of
        "core" genes during the comparisons of several small whole
        genomes simultaneously, it will likely be succeeded by
        improved software to compare and analyze even much larger
        genomes, especially in the megabase range. This feature is
        being pursued with urgency. One known current limitation in
        analyzing larger genomes is computational, e.g., hardware;
        this will be
        addressed shortly. Increasingly powerful workstations
        acting as servers will allow the much more computationally
        intensive comparisons of megabase-sized genomes. However,
        this version of CoreGenes is very useful and fills a
        current unmet need in genome analyses, that of collecting
        related genes in a family of genomes. In addition to
        stimulating the development of similar tools, CoreGenes
        provides a base for continuing improvements. We plan to
        support this version of CoreGenes aggressively, updating it
        with improvements and additional features, as well as to
        work on a more robust, faster version.