        Background

          Computational methods for whole genome studies
          Comparative genomics and more specialized fields, such
          as comparative virology, involve the comparison of DNA
          sequences, genes and genomes [ 12 13 14 ]. Recent rapid
          data acquisition allows the analysis of whole genome
          sequences, especially the smaller genomes such as those
          of mitochondria and chloroplasts [ 15 16 17 ], as well
          as the larger bacterial genomes [ 18 19 ] and large
          tracts of eukaryotic chromosomes, especially from
          related organisms [ 12 13 14 20 21 22 23 ]. These
          studies include the determination of the order of
          genes, i.e., co-linearity [ 24 25 ], the location of
          synteny [ 26 27 28 ] and the identification of clusters
          of orthologous genes (COGs) between two genomes
          [ 21 22 23 ]. Along similar lines, it should be
          extremely useful to locate, identify and catalog the
          sets of "core" genes common to these genomes, which may
          otherwise be related, semi-related or unrelated in
          other respects. These global views allow for a deeper
          understanding of one organism in the context of
          another, especially with regard to their genomic
          contents. In addition, the comparison of multiple
          genomes and the identification of related genes and
          "core" genes can lead to insight into the structure and
          function of genes and genomes [ 4 ]. This is very
          useful in genome annotation and also in the
          identification and characterization of functions for
          "newly found" putative genes.
          Identification of "core" genes from small whole
          genomes is useful and complements other data derived from
          these genomes. Small genomes include those from viruses
          [ 3 ], mitochondria [ 14 15 ] and chloroplasts [ 16 ].
          The increasing importance of the large amount of DNA
          sequence data recently collected from these small genomes
          is reflected in the better understanding of their biology
          [ 3 4 12 13 14 ] and in the upsurge of publications
          analyzing these genomes and the organisms to which they
          belong [ 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ].
          Genome co-linearity, gene clustering and homolog
          identification are three global genome analyses which are
          important in many fields of research, including resolving
          phylogenetic and evolutionary relationships [ 15 16 17 ].

        Results and Discussion

          Description of CoreGenes
          CoreGenes is written in Java and incorporates the
          'setdb' and 'BLASTP' programs from the WU-BLAST package
          of Washington University, http://BLAST.wustl.edu. The
          iterative comparison rests on the BLASTP algorithm
          [ 29 ]. A flowchart of the processes is illustrated in
          Figure 1. This software allows for the identification,
          characterization, cataloging and visualization of
          putatively essential "core" genes in sets of two to
          five genomes in a user-friendly GUI environment. A
          table with additional content information is generated
          from the analyses. CoreGenes has been validated with
          representative genomes from several families of
          viruses, as well as mitochondrion and chloroplast
          genomes. In these examples, it locates and identifies
          putatively related genes directly and gene clustering
          indirectly. The similarities reported by CoreGenes
          warrant closer inspection of the relationships between
          the matched genes, given that high BLAST scores between
          two genes do not always imply an orthologous
          relationship [ 30 ]. In other words, the complexity of
          these BLAST scores suggests that the user should
          perform rigorous phylogenetic analysis of each set of
          homologous genes to determine true orthology. Using a
          high threshold value with CoreGenes will, however,
          increase the chances of retrieving orthologous genes.
          One obvious application is to use this tool as a step
          in the characterization of an "alphabet" of putatively
          essential "core" genes in a set of closely related
          genomes, such as a collection of poxvirus genomes
          (unpublished data).

          CoreGenes graphical user interface
          The CoreGenes GUI contains three levels of data
          input/output, starting with an interface for the entry of
          two to five genomes via GenBank accession numbers (Figure
          2) and ending with a display of the corresponding protein
          of interest as archived in the NCBI database. Contained
          within the top-level GUI (Figure 2) is an entry field for
          up to five genome sequences. Note that entering
          mistyped GenBank accession numbers (for example, with
          transposed digits) will result in error messages. It is
          preferable to use the recent versions of GenBank
          accession numbers, i.e., those prefixed with "NC_..."
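          As a minimal sketch of how an accession number could
          be checked before it is submitted, the following Python
          function tests for the "NC_" prefix. The function name
          and the assumption that the prefix is followed by six
          digits (as in every accession listed in this text),
          optionally with a version suffix, are illustrative and
          are not taken from CoreGenes itself.

```python
import re

# Assumed shape: "NC_" followed by six digits, with an optional ".version" suffix.
# This pattern is an illustration, not the check performed by CoreGenes.
_NC_PATTERN = re.compile(r"NC_\d{6}(\.\d+)?")


def looks_like_nc_accession(text: str) -> bool:
    """Return True if text has the assumed NC_ accession shape."""
    return _NC_PATTERN.fullmatch(text.strip()) is not None
```

          A check of this kind catches only malformed entries; an
          accession with transposed digits can still be well
          formed and refer to a different genome.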
          Once the program is initiated, the respective genome
          data are downloaded from the GenBank database (Figure 1).
          These genome sequence data are subsequently parsed into
          protein-coding sequences (as annotated in the GenBank
          database) and are converted by CoreGenes into
          "GeneOrder2.0-FASTA" format [ 7 8 29 31 ]. Comparisons
          are performed and the results are presented in a tabular
          format in the subsequent GUI. Each gene has a hyperlink
          to its entry in the NCBI database.

          Data mining algorithm
          BLASTP protein similarity analyses [ 29 ] between the
          reference sequence and the first query sequence are
          performed sequentially, with each query protein compared
          individually to the entire protein database of the
          reference genome. This is similar to the algorithm for
          the GeneOrder analyses [ 7 8 ]. If the alignment score
          between the reference protein and a query protein meets
          or exceeds a defined similarity threshold, then the
          proteins are paired and their accession numbers stored.
          A consensus map of related genes is generated and
          stored. Hierarchical comparisons with additional query
          genomes, up to four in total, are performed in each
          session.
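          The pairing rule described here can be sketched in a
          few lines of Python. This is an illustration under
          stated assumptions, not the CoreGenes implementation:
          the BLASTP scores are assumed to be already available
          as (reference accession, query accession, score)
          tuples, and the function name and data layout are
          hypothetical.

```python
from typing import Iterable, List, Tuple

DEFAULT_THRESHOLD = 75.0  # default BLASTP threshold score given in the text

Hit = Tuple[str, str, float]  # (reference accession, query accession, score)


def pair_proteins(hits: Iterable[Hit],
                  threshold: float = DEFAULT_THRESHOLD) -> List[Tuple[str, str]]:
    """Keep every reference/query pair whose score meets or exceeds the threshold."""
    pairs = []
    for reference_acc, query_acc, score in hits:
        if score >= threshold:
            pairs.append((reference_acc, query_acc))
    return pairs
```

          A score exactly equal to the threshold is kept,
          matching the "meets or exceeds" wording.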
          In detail, the process continues as the data for query
          genome number 2 are retrieved from GenBank and treated
          as described above, i.e., this set of proteins is
          compared against the first consensus set of paired genes
          formed between the reference genome and query genome
          number 1. A second consensus set of related genes is
          generated and stored. Query genome numbers 3 and 4 are
          iteratively and separately analyzed in an analogous
          manner. A caveat is that if query genome number 1 does
          not have a match to the reference genome, then a
          subsequent match between query genome number 2 and the
          original reference genome (i.e., a possibly true
          related gene) will be discarded. In other words,
          hierarchical matches must occur between the reference
          genome, query genome number 1 and query genome number 2
          in order for CoreGenes to identify BLAST matches
          between the reference genome and query genome number 2.
          A visual presentation of this is shown in Figure 3 (top
          panel), where the genomes are aligned with the
          reference genome serving as the "x-axis." Genes from
          query genomes that have the desired BLAST matches are
          arrayed vertically above the reference genome. Despite
          its shortcoming of terminating further analysis of a
          gene when there is no match between two successive
          genomes, this display is useful as a simple map of the
          order of genes contained in the reference genome. It
          also serves as a quick, simple survey of the set of
          genomes in terms of BLAST matches.
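          The hierarchical behavior, including the caveat that a
          gene is dropped as soon as one query genome lacks a
          match, can be sketched as follows. This is a simplified
          model and not the CoreGenes code: each consensus set is
          assumed to be keyed by the reference gene, the matches
          for each query genome are assumed to be precomputed,
          and all names are hypothetical.

```python
from typing import Dict, List, Sequence


def core_genes(reference_genes: Sequence[str],
               matches_per_query: Sequence[Dict[str, List[str]]]) -> Dict[str, List[List[str]]]:
    """Narrow the consensus set one query genome at a time.

    matches_per_query holds one dict per query genome, in order, mapping a
    reference gene to the query genes that matched it above the threshold.
    """
    consensus: Dict[str, List[List[str]]] = {ref: [] for ref in reference_genes}
    for matches in matches_per_query:
        # A reference gene survives only if this query genome also matches it.
        consensus = {
            ref: found + [matches[ref]]
            for ref, found in consensus.items()
            if matches.get(ref)
        }
    return consensus
```

          With reference genes r1, r2 and r3, if query genome 1
          matches only r1 and r2 and query genome 2 matches only
          r1 and r3, the result contains r1 alone; the r3 match
          in query genome 2 is discarded, as the caveat describes.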
          However, permutations of the five genomes must be
          analyzed in order to collect the comprehensive set of
          putatively related core genes. With five genomes to be
          queried, this task is daunting to perform manually. It
          would also be useful to generate a table of genes that
          bin across only two, three or four genomes. This is
          being addressed actively, and it is anticipated that
          such a comprehensive table, including rows with matches
          across only two, three or four genomes, will be made
          available in the near future. Meanwhile, upon the
          completion of the above algorithm, a table containing
          the extracted GenBank data and summarizing the "core"
          genes within the queried genomes is generated (Figure
          3, bottom panel). The columns of this table can also be
          exported via "cut and paste" into Microsoft Excel and
          Word to generate publication-quality figures.
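          To give a sense of the scale of the manual task, the
          short Python sketch below enumerates the orderings of
          the five mitochondrion genomes named in this text. It
          assumes, for illustration only, that each ordering
          (first genome as reference, the rest as queries 1 to 4)
          counts as a separate run.

```python
from itertools import permutations

# Accession numbers of the five mitochondrion genomes listed in the text.
genomes = ["NC_001807", "NC_001323", "NC_001328", "NC_001709", "NC_001326"]

# Each ordering is treated as one run: reference first, then queries 1 to 4.
orderings = list(permutations(genomes))

print(len(orderings))  # 5! = 120 orderings
```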
          Accession numbers of each gene and very brief
          descriptions are presented in each individual block
          within this matrix, as extracted directly from the
          GenBank database. Each individual gene is hyperlinked
          from this table to the NCBI website to allow the
          investigator an opportunity to view the unique GenBank
          file for the gene of interest.

          Using CoreGenes

            Similarity ranges
            Contained in the top-level window is a field to
            define the minimum protein similarity score (i.e.,
            the "BLASTP" threshold score). This can be either the
            default ("75") or a user-defined value. Score ranges
            are related to the similarities of the proteins being
            queried [ 7 8 ]. For reference, the three similarity
            ranges that can be defined for running GeneOrder2.0
            are highest ("A"), high ("B") and low ("C") [ 7 8 ].
            The BLASTP threshold score ranges are as follows: "A"
            is defined as [200, ∞), "B" as [100, 200) and "C" as
            [75, 100). Genes with matches in the "A" range are
            considered true homologs, while those in the "B"
            range are likely related and those in the "C" range
            require visual validation of the level of identity in
            order to ensure a true match. Related gene matching
            values for CoreGenes are also defined in this manner.
            Caveat: it is always recommended that BLAST matches
            be scrutinized, as reports have suggested that the
            closest BLAST match is often not the nearest
            neighbor [ 30 ].
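            The three ranges map directly onto a small lookup
            function. The sketch below is illustrative only; the
            function name is hypothetical, and scores below the
            default threshold of 75 are assumed to fall outside
            all three ranges.

```python
from typing import Optional


def similarity_range(score: float) -> Optional[str]:
    """Map a BLASTP score to range "A", "B" or "C", or None if below 75."""
    if score >= 200:
        return "A"  # [200, infinity)
    if score >= 100:
        return "B"  # [100, 200)
    if score >= 75:
        return "C"  # [75, 100)
    return None  # below the default threshold; not in any range
```

            The lower bound of each range is inclusive and the
            upper bound exclusive, so a score of exactly 200
            falls in "A" and a score of exactly 100 in "B".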

            Examples of CoreGenes analyses
            This tool has been validated with analyses of
            several diverse virus, chloroplast and mitochondrion
            genomes. For example, a set of four chloroplast genomes
            (Figures 2 and 3) and a set of five mitochondrion
            genomes (data not shown) from evolutionarily divergent
            sets of organisms were run independently to demonstrate
            the power and capabilities of CoreGenes. Shown in
            Figure 3 is an output from one of these analyses. With
            the BLASTP threshold score set at "75," the "core"
            genes are cataloged and displayed with brief
            identifying information from the GenBank database.
            Sixty-one "core" genes were cataloged from the set of
            chloroplast genomes (data not shown). The chloroplast
            genomes are as follows: Arabidopsis thaliana,
            NC_000932; Nicotiana tabacum, NC_001879; Oryza sativa,
            NC_001320; and Chlorella vulgaris, NC_001865. The
            mitochondrion genomes are as follows: Homo sapiens,
            NC_001807; Gallus gallus, NC_001323; Caenorhabditis
            elegans, NC_001328; Drosophila melanogaster,
            NC_001709; and Schizosaccharomyces pombe, NC_001326.
            An analysis was also performed with a mixture of
            mitochondrion and chloroplast genomes. Interestingly,
            several putatively related genes were detected in this
            particular analysis (data not shown).

            Additional validations
            In addition to the aforementioned chloroplast and
            mitochondrion genomes, and of more interest to our
            research group, CoreGenes has been validated with virus
            genomes ranging in size from 35 kb to 330 kb (data not
            shown). Specifically, it has been run with combinations
            and permutations of adenovirus genomes, ca. 35 kb
            (NC_001405, NC_001406, NC_002067, NC_001454,
            NC_001460, NC_000942, NC_001813 and NC_002501);
            poxvirus genomes, ca. 250 kb (NC_001559, NC_001266,
            NC_001266, NC_003027, NC_001132, NC_001731 and
            NC_002642); and other viruses of varying sizes: ca.
            150 kb (e.g., the baculoviruses Helicoverpa armigera
            nucleopolyhedrovirus G4, NC_002654, and Lymantria
            dispar nucleopolyhedrovirus, NC_001973) and ca. 330 kb
            (Paramecium bursaria Chlorella virus 1, NC_000852).
            A group of three chordopoxviruses (vaccinia virus,
            NC_001559; Molluscum contagiosum virus, NC_001731; and
            fowlpox virus, NC_001266) and two entomopoxviruses
            (Melanoplus sanguinipes entomopoxvirus, NC_001993, and
            Amsacta moorei entomopoxvirus, NC_002520) was analyzed
            with CoreGenes. With related genomes such as these,
            the data can also be used as a predictive tool for the
            elucidation of an "alphabet" of essential genes,
            especially in collaboration with "wet bench" analyses
            such as the characterization of temperature-sensitive
            mutants of, for example, poxviruses (data not shown).

          Limitations

            Server Connectivity
            CoreGenes run time is a function of the network
            connections. If one party, such as the NCBI server, is
            experiencing heavy traffic or is down due to technical
            difficulties, then the application will stall and the
            run will fail. Sets of orthopoxvirus genomes, ca. 250
            kb, take approximately 25 minutes to run on a PowerMac
            G3 running Mac OS 9.0 and Netscape Communicator 6.1.
            Larger genomes are currently problematic due to the
            computational speed and to the NCBI server and/or the
            user's connection timing out. This issue is being
            addressed.
            Some network "firewalls" may be incompatible with
            this software, causing the connections to terminate
            prematurely. In that case, the error message "An
            internal error has occurred. Please try again later
            java.lang.NullPointerException." will be displayed.
            Entering incorrect accession numbers may give this
            same message. In contrast, CoreGenes has been run
            successfully on university and public library
            terminals with internet access. These organizations do
            not seem to have the same "firewall" needs or concerns
            as other organizations.

            Platform Limitations
            CoreGenes has been validated on several different
            platforms and with different web browsers:
            Macintosh (Explorer 4.5 and Netscape 6.1), PC (Explorer
            5.0 and Netscape 4.08), SGI (Netscape) and SUN
            (Netscape) workstations. There are compatibility issues
            between CoreGenes and Netscape 4.7 and below on the
            Macintosh. The problem appears to lie in the Java
            applet included with the earlier versions of Netscape
            for Macintosh; using Netscape 6.1 surmounts it.
            Moving an Apple-supplied "JAVA Accelerator for PowerPC"
            into the "extensions" folder may allow earlier versions
            of Netscape to run this program. Printing the CoreGenes
            applet-generated graph may be problematic due to an
            applet incompatibility; capturing the graph as a
            "screenshot" on the PC or Mac platform and printing it
            independently circumvents this.
            Run times vary from 1 minute and 21 seconds for a
            set of five adenovirus genomes (ca. 35 kb) to 40
            minutes for a set of five poxvirus genomes (ca. 250
            kb). Currently, if there are multiple requests, the
            computation may take much longer as the requests are
            queued. This inconvenience is due to the server
            hosting the software and is being addressed. Depending
            on the hardware, some local servers may time out while
            waiting for a request to be processed, which will
            result in an error message stating that "The attempt
            to load 'servlet' failed." Adjusting "preference"
            settings on the local web browser may rectify this
            problem. Immediate goals for improvement include an
            option to have results e-mailed back to the user. We
            expect additional improvements in both speed and
            response when we upgrade our server hardware and
            rewrite some of the CoreGenes software to accommodate
            the larger megabase-sized genomes.

            Software Limitations
            Only the NCBI database can be searched at this time;
            in other words, only GenBank accession numbers can be
            used. If an accession number is entered incorrectly,
            then an error message will be displayed, e.g., "The
            attempt to load 'servlet' failed." Improvements to
            this software will include providing an additional
            field to enter proprietary and non-GenBank genome
            data, similar to an option developed for GeneOrder2.0
            [ 8 ].

        Conclusions
        CoreGenes fits into the niche for GUI-based interactive
        computational tools [ 1 2 3 4 5 6 7 8 9 10 ] that enhance
        the visualization of DNA sequence data, especially in the
        context of genome comparisons. It meets a critical need for
        tool sets containing global "whole genome" analysis tools.
        As noted earlier, small genomes are still of great interest
        to many researchers. This tool is a base to expand upon,
        for example, to build more robust, elegant and
        complementary "whole genome" computational tools. Although
        CoreGenes successfully expedites the determination of
        "core" genes during the simultaneous comparison of several
        small whole genomes, it will likely be succeeded by
        improved software to compare and analyze much larger
        genomes, especially in the megabase range. This feature is
        being pursued with urgency. One known current limitation in
        analyzing larger genomes is computational, e.g., hardware;
        this will be addressed shortly. Increasingly powerful
        workstations acting as servers will allow the much more
        computationally intensive comparisons of megabase-sized
        genomes. Nevertheless, this version of CoreGenes is very
        useful and fills a current unmet need in genome analyses,
        that of collecting related genes in a family of genomes.
        In addition to stimulating the development of similar
        tools, CoreGenes will itself continue to be improved. We
        plan to actively support this version of CoreGenes,
        updating it with improvements and additional features, as
        well as to work on a more robust, faster version.