CoCalc -- 1471-2105-2-9.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-2-9.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Background
6
        High-throughput DNA sequencing makes it economically
7
        possible to produce very large sequence data sets in short
8
        time periods. With this technology it is now possible to do
9
        experiments that were impossible only a couple of years
10
        ago. For example, a series of landmark papers in the late
11
        1980's and early 1990's showed that microbial diversity
12
        could be analyzed by sequencing 16S rDNAs from
13
        environmental samples (reviewed by [ 1 ] ). Giovannoni used
14
        this approach to show that there is a cosmopolitan marine
15
        bacterium, designated SAR11, using 44 16S rDNA sequences [
16
        2 ] . Today, it would be reasonable to perform the same
17
        study with thousands of 16S rDNA sequences. This
18
        exponential increase in the size of sequence data sets
19
        necessitates new computer tools.
20
        Here we introduce a Java program, FastGroup, that is
21
        appropriate for comparing thousands of sequences to each
22
        other and grouping them based on user-defined criteria.
23
        While FastGroup is optimized to dereplicate libraries of
24
        16S rDNA sequences, it can easily be adapted to dereplicate
25
        any protein or DNA sequence library.
26
      
27
      
28
        Results and discussion
29
        
30
          Description of program and algorithms
31
          
32
            Overview of FastGroup program
33
            Figure 1shows the FastGroup graphical user interface
34
            (GUI). The GUI reflects the order in which operations
35
            are carried out by the FastGroup program. First,
36
            sequences are loaded into the program from a directory
37
            of files (e.g., seq or txt files) or from a
38
            FASTA-formatted document. The program trims the
39
            sequences according to user-defined parameters and the
40
            trimmed sequences are matched against each other and
41
            grouped. In the Grouping step, the user can either
42
            define a percent sequence identity (PSI) that will be
43
            used to group the sequences together or a consecutive
44
            number of mismatches (MM) that will prevent sequences
45
            from grouping together (both algorithms are described
46
            below).
47
          
48
          
49
            Trimming sequences
50
            Sections of the input sequences containing
51
            mismatched and/or ambiguous bases must be removed or
52
            they will prevent proper grouping. To make trimming as
53
            flexible as possible, FastGroup can trim sequences in
54
            three ways: 1) a user-specified number of bases from
55
            the 5' or 3' ends can be used (the rest of the sequence
56
            is discarded), 2) sequence 5' or 3' of a defined site
57
            can be removed, or 3) sequence with ambiguous bases
58
            (i.e., "Ns") can be removed from the ends. For the
59
            latter two methods, trimming criteria can be entered
60
            separately for the 5' and 3' ends. If a primer sequence
61
            is specified, the user may adjust the stringency of the
62
            match by varying the PSI or MM parameter (explained in
63
            detail below).
64
          
65
          
66
            Matching
67
            Both algorithms initiate grouping by first finding a
68
            window (i.e., a short sequence) that is shared between
69
            the two sequences being compared. Both the window size
70
            and direction of matching (e.g., 5' vs. 3') are
71
            specified by the user.
72
          
73
          
74
            Overview of grouping step
75
            When FastGroup is initiated, the first sequence in
76
            the library is trimmed and placed in a new group, g1.
77
            The second sequence in the library is then trimmed and
78
            compared against the sequence in g1. If the two
79
            sequences are determined to be similar, as defined by
80
            the user-derived matching and grouping criteria, both
81
            sequences are placed in group g1. If the sequences are
82
            not similar, the first sequence is placed into g1 and
83
            the second sequence is placed into a new group, g2. The
84
            next sequence in the input library is then retrieved,
85
            trimmed, and compared against the sequences in the
86
            groups. This process is repeated with every sequence in
87
            the library until all sequences belong to a group. New
88
            groups are created as necessary. Sequences in groups
89
            are Targets. A sequence being compared to the Targets
90
            is a Query sequence. It is important to note that the
91
            first sequence used to create a group is the sequence
92
            used for comparison against all subsequent sequences.
93
            The name for each group begins with "g#-", where the #
94
            is assigned sequentially as groups are found by the
95
            program. After the hyphen, the name of the first
96
            sequence put into the group is given.
97
          
98
          
99
            Percent Sequence Identity (PSI) algorithm
100
            The PSI algorithm starts at the first position after
101
            the matching window and compares each base in the Query
102
            sequence to that of the Target sequence. This is done
103
            in sequential order and at each position the algorithm
104
            records if the bases match. This process is repeated
105
            through the length of the smaller sequence. The PSI is
106
            calculated by dividing the number of bases found to be
107
            the same in both sequences by the number of bases in
108
            the smaller sequence. If two sequences have a percent
109
            sequence identity that is greater than or equal to the
110
            value entered by the user into the Percent Sequence
111
            Identity window, then the Query sequence is added to a
112
            Target sequence group.
113
          
114
          
115
            Mismatching (MM) algorithm
116
            The MM algorithm starts at the first position after
117
            the matching window and compares the bases in the Query
118
            sequence to the Target. If these two bases are the
119
            same, the program moves on to the next pair. If the
120
            bases are not equal, a one base pair gap is inserted
121
            into the Query sequence, effectively sliding the Query
122
            sequence relative to the Target sequence. The base in
123
            the Query sequence is then compared to the newly
124
            aligned Target base. If the bases match, the algorithm
125
            leaves the gap and moves to the next base for
126
            comparison. If the bases do not match, the gap in the
127
            Query sequence is removed and a gap is placed in the
128
            Target sequence. The newly aligned bases are then
129
            checked. If they are the same, the program moves to the
130
            next base in both sequences. However, if the gap in the
131
            Target sequence does not cause the bases to pair this
132
            is considered one mismatch. If the user-defined MM is
133
            <=1, the sequences will not be grouped. If a 2 base
134
            MM is assigned, the algorithm will also try using this
135
            size of the gap in both the Target and Query sequence,
136
            after initially using a 1 base gap. This algorithm is
137
            essentially the same as bounded diagonal band alignment
138
            [ 3 ] .
139
          
140
          
141
            Output files
142
            Once all sequences in a data set have been analyzed
143
            by FastGroup, five output text files are produced. The
144
            fasta_groups.txt output file contains the group name
145
            and a representative sequence from each group in FASTA
146
            format. The fasta_groups.txt file is particularly
147
            useful for subsequent Clustal X (Clustal X Help
148
            http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html;
149
            [ 4 5 ] ) and BLAST analyses (BLAST
150
            http://www.ncbi.nlm.nih.gov/BLAST/; [ 6 ] ). The second
151
            output file, group_seqs.txt, contains the group name
152
            and all sequences from the group. This file is most
153
            useful for visual confirmation of groupings. The third
154
            output file, group_files.txt, contains the group name,
155
            name of each sequence in the group, and the percent
156
            that each group makes of the total. The fourth output
157
            file, coverage.txt, shows how many sequences are in
158
            each group and calculates coverage by the method of
159
            Good [ 7 ] . Finally, the infile.txt file contains all
160
            the user specified parameters for a record of the
161
            analysis.
162
          
163
        
164
        
165
          Testing of FastGroup
166
          The library used to test FastGroup consisted of 94
167
          bacterial 16S rDNA obtained from an environmental sample.
168
          The library was made by PCR amplifying with the
169
          bacterial-specific primers Bact27F and Bact1492R, cloning
170
          into a plasmid vector, and then sequencing the inserts
171
          using the Bact27F primer (Figure 2). All sequences were
172
          single-pass and unedited.
173
          A number of factors were considered when designing an
174
          approach for dereplicating 16S rDNA libraries. First,
175
          miscalled bases would prevent related sequences from
176
          grouping together. To remove these bases, it was assumed
177
          that: 1) miscalled and ambiguous bases occur together
178
          (i.e., the presence of N's could be used to differentiate
179
          "bad" sequence), and 2) as you move 3' of the sequencing
180
          primer miscalled and ambiguous bases become more
181
          prevalent, especially beyond ~600 bp. Therefore, trimming
182
          criteria that remove 3' sequence are necessary. A second
183
          factor influencing our trimming strategy arises from the
184
          fact that FastGroup must find a window in common between
185
          two sequences before it starts the grouping algorithm.
186
          Therefore, a conserved region at the matching end would
187
          be expected to increase analysis speed. For bacterial 16S
188
          rDNAs (Figure 2) sequenced using Bact27F, the Bact517
189
          conserved site is ideal because: 1) the Bact517 site is
190
          highly conserved and should be easy to find in most
191
          bacterial 16S rDNA sequences 2) the site is ~500 bp away
192
          from the sequencing primer, therefore sequence 5' of this
193
          site should be good quality, and 3) the Bact517 site is
194
          just 3' of the V3 region. This final point is important
195
          because the V3 region is highly variable and usually
196
          contains information sufficient to differentiate between
197
          different bacterial species. Therefore, including the V3
198
          region increases the resolving power of this approach for
199
          measuring bacterial diversity.
200
        
201
        
202
          Analyses of trimming and matching parameters
203
          Our analysis strategy necessitated that the Bact517
204
          site be accurately identified in the 16S rDNA sequences.
205
          As shown in Table 1, if the Bact517 site must be
206
          perfectly matched to be identified (PSI for primer
207
          matching = 100%), 75 out of the 94 sequences (80%) were
208
          trimmed correctly. If the PSI parameter for matching to
209
          the Bact517 site was lowered to 70%, 82 out of the 94
210
          sequences (87%) were correctly trimmed. However, lowering
211
          the detection stringency for the Bact517 site also
212
          increased the possibility that false positive sites would
213
          be detected, resulting in prematurely trimmed sequences.
214
          False sites did not appear to be a problem with
215
          PSI>=70%, but lowering the PSI for finding the Bact517
216
          site to 60% did result in 9 false positives (Table 1).
217
          Therefore, for our data set, using a 70% PSI for finding
218
          the Bact517 site appeared optimal. We specifically chose
219
          a library of low quality sequences for the FastGroup
220
          analyses. Therefore, the bact517 position was not found
221
          in many of the test sequences because of sequencing
222
          errors.
223
          As predicted, using the 3' conserved region for
224
          trimming and matching from the 3' end resulted in quicker
225
          FastGroup analysis (Table 2), presumably because the
226
          conserved region increases the chance that a window will
227
          be quickly found. Aligning from the 3' end also increased
228
          grouping frequency (Table 2), possibly because the
229
          conserved region increased the accuracy of the matching
230
          step. Because both algorithms require accurate matching
231
          for initiation, the added accuracy offered by the
232
          conserved regions as the matching sites increased the
233
          efficiency of grouping. Even when the trimming criteria
234
          did utilize the Bact517 site, the presence of this site
235
          in the sequence increased grouping efficiency. For an
236
          example of this phenomenon, compare the analyses where
237
          the sequence was trimmed by taking the first 500 bases
238
          and then was matched from the 5' versus the 3' ends
239
          (Table 2). The presence of the conserved sequence
240
          increased the grouping efficiency. Trimming to the
241
          Bact517 site also allowed smaller windows to be used,
242
          which dramatically increased grouping speed (Table 2).
243
          Trimming sections of sequence with ambiguous bases did
244
          not improve the sequence quality enough for accurate
245
          grouping (Table 2).
246
        
247
        
248
          Comparison of the PSI and MM algorithms
249
          As shown in Table 3, the MM algorithm was much faster
250
          than PSI. The sequence composition of Groups obtained
251
          using a PSI value of 97% were roughly equivalent with
252
          those obtained using a MM = 2. The MM = 2 did result in
253
          some of the bigger groups being broken into two or more
254
          smaller groups. We believe that the PSI algorithm was
255
          more appropriate for analyses of 16S rDNA for a number of
256
          reasons. First, gaps in unedited sequences were not as
257
          big of a problem as we initially believed. We have
258
          analyzed one bacterial 16S rDNA library in which 96% of
259
          the sequences were grouped together using the PSI
260
          algorithm. This result would not have been obtained if
261
          gaps were a major problem. The second reason we prefer
262
          the PSI algorithm for analyses of 16S rDNA is that there
263
          are reasons to believe that bacteria with 16S rDNA
264
          >=97% identity belong to closely related bacteria [ 8
265
          ] .
266
        
267
        
268
          Analyzing partial sequences to increase speed of
269
          FastGroup analyses
270
          With a large data set, it may be desirable to speed up
271
          the FastGroup analysis, possibly by using only part of
272
          the input sequence during grouping. This approach would
273
          only work if most of the information positions are not
274
          lost by the truncation. That is, if a sequence is 500 bp
275
          after trimming and only 80% of the sequence (i.e., 400
276
          bp) is used in the Grouping step, how representative are
277
          the results? It was expected that, since the
278
          hypervariable region V3 is immediately 5' of the Bact517
279
          site, grouping should be much faster and representative
280
          if matching was initiated from the 3' end. As shown in
281
          Table 4, using partial sequence does dramatically speed
282
          up FastGroup, but with a significant loss of resolution.
283
          The loss of resolution occurred even though the V3 region
284
          was included in the portion of the sequence analyzed. For
285
          this reason, we suggest using the longest sequence
286
          possible.
287
        
288
        
289
          Comparison of FastGroup with ClustalX output
290
          ClustalX [ 4 ] uses the ClustalW algorithm [ 5 ] to
291
          align sequences using a combination of progressive
292
          alignment and dynamic programming, making this algorithm
293
          sensitive to divergence between closely related sequences
294
          (<35% identity). The ClustalW algorithm was used to
295
          align the 94 test sequences using default parameters. A
296
          tree was generated from aligned sequences using
297
          ClustalX's Draw Neighbor Joining (NJ) Tree program. The
298
          resulting tree data were plotted (Figures 3) using
299
          NJPLOT, which was included as part of ClustalX software
300
          distribution. The average running time to produce an
301
          alignment from 94 sequences was one hour and 20 minutes
302
          plus an average of 5 minutes to generate tree data using
303
          Draw NJ tree.
304
          FastGroup was used to group the same test sequences
305
          using the PSI algorithm. Sequences were trimmed at the 5'
306
          end for every N occurring within 50 bases, and at the 3'
307
          end to 70% of the Bact517 site. Trimmed sequences were
308
          grouped at 97% PSI. All other FastGroup parameters were
309
          left at default values. Run time was ~25 seconds.
310
          The NJ tree from the ClustalX analysis is shown in
311
          Figure 3. The groups from FastGroup are color coded on
312
          the Tree (Figure 3). In general, the ClustalX Clades and
313
          FastGroups are identical. The main exception were the
314
          FastGroups 1 and 8, which corresponds to ClustalX Clades
315
          1-5. If the PSI is raised to 99% (i.e., 1 bp change per
316
          100 bp) in FastGroup, then the two major ClustalX Groups
317
          become apparent (e.g., Group 1 includes Clades 1-4 and
318
          Group 2 is equivalent to Clade 5). The FastGroup 8
319
          contain sequences that differ from FastGroup 1 by a one
320
          bp gap, which explains the reason that ClustalX placed
321
          these sequences in Clade 1. Because this gap occurred in
322
          all four FastGroup 8 sequences, and not in any of the
323
          FastGroup 1 sequences, these two groups probably
324
          represent different 16S rDNAs and possibly two bacterial
325
          species.
326
          What other options exist for dereplicating large
327
          libraries of 16S rDNA besides FastGroup? One possibility
328
          is to align the sequences with Clustal X and then use the
329
          alignments to determine which sequences are the same.
330
          This approach is time consuming because it requires that:
331
          1) the sequences be trimmed individually before the
332
          alignment, and 2) duplicate sequences be manually removed
333
          from the original library after the alignment. Advantages
334
          of the Clustal X approach is that visual confirmation of
335
          grouping is easy. However, results can also be visualized
336
          in FastGroup by having the program display the sequences
337
          from the 3' end and looking at the group_seqs.txt output
338
          file. FastGroup can also speed up alignment analysis by
339
          rapidly trimming, dereplicating, and outputting sequences
340
          to the fasta_groups.txt file, which is ideal for Clustal
341
          X alignments. A second possible approach to library
342
          dereplication is to compare sequences against each other
343
          using BLAST2
344
          http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.htmland then
345
          delete duplicates. This approach works well but is too
346
          time consuming for libraries over a couple of hundred
347
          sequences. A third way that large libraries are often
348
          dereplicated requires submitting the sequences as batch
349
          files to a database (either local or remote), then
350
          searching the same sequences against the updated database
351
          using BLAST or Sequence Similarity
352
          http://www.cme.msu.edu/RDP/docs/sim_matrix_issues.html.
353
          Again, this method works well for a small number of
354
          sequences but is very time intensive with large data
355
          sets.
356
          Due to technological advances, it is now possible to
357
          cheaply sequence thousands of 16S rDNA per day. This
358
          change in sequencing power necessitates a reassessment of
359
          how microbial diversity and biogeography is studied. Many
360
          of the techniques commonly used for these sorts of
361
          studies were designed to minimize efforts and cost in the
362
          pre-genomics era [ 9 10 ] . However, these techniques
363
          suffer from a number of limitations. In the case of
364
          denaturing gradient gel electrophoresis (DGGE) it is
365
          essentially impossible to compare samples from one gel to
366
          another. Because the DGGE banding patterns can not be
367
          standardized, DGGE data does not represent a permanent
368
          record of microbial diversity or biogeography. In fact,
369
          to get a permanent record of what microbe each band on
370
          the DGGE represents it is necessary to clone and sequence
371
          the band. This is costly both in time and reagents.
372
          Terminal-restriction fragment length polymorphism
373
          (T-RFLP) banding patterns can be standardized. Therefore,
374
          T-RFLP data represents a permanent record of microbial
375
          diversity. T-RFLP resolution is, however, limited (e.g.,
376
          it is dependent on the different restriction sites being
377
          present) and it is hard to link the T-RFLP pattern to a
378
          specific microbial species. To make this connection, it
379
          is necessary to analyze clones both by T-RFLP and by
380
          sequencing. In contrast, 16S rDNA sequences allow
381
          bacteria to be placed in taxonomical groups. Ribosomal
382
          16S DNA sequences also allow the occurrence of a specific
383
          phylotype to be documented in an unequivocal manner.
384
          This, in turn, will allow databases of microbial
385
          biogeography to be constructed.
386
          Sequencing large 16S rDNA libraries as we have
387
          outlined here offers the advantages of sequence data,
388
          while minimizing cost (i.e., 1 sequencing reaction per
389
          clone). The disadvantages of this approach include: 1) an
390
          underestimation of diversity because only part of the 16S
391
          rDNA locus is used, 2) the smaller sequences (~500 bp)
392
          are not ideal for taxonomical identification, and 3)
393
          dirty data (i.e., sequences with mistakes). A conscious
394
          effort should be made to save these libraries. That way,
395
          if the "cleaner" data or larger sequences are needed in
396
          the future, the libraries can be resequenced. Another
397
          concern with this approach is that it will cost more
398
          money than alternative methods. High-throughput
399
          sequencing is becoming very cheap. For example, the Joint
400
          Genome Institute estimates that each sequencing reaction
401
          costs $1.00-1.50 (Paul Predki, personal communication).
402
          When compared to the cost of people-power, extra
403
          reagents, and impermanence of the data of the other
404
          approaches, sequencing of 16S rDNA libraries is probably
405
          already a bargain, and it is only getting cheaper.
406
        
407
      
408
      
409
        Conclusions
410
        As high-throughput sequencing of 16S rDNA libraries
411
        becomes more common, data analysis becomes the bottle-neck.
412
        The FastGroup program is a first generation bioinformatics
413
        tool for analyzing these data sets. It is designed for
414
        moderately sized 16S rDNA libraries produced in individual
415
        laboratories. Future generations of FastGroup should be
416
        incorporated into relational databases that link the
417
        sequence to other relevant data (e.g., where, when, how the
418
        sequence was obtained). These sorts of databases will allow
419
        detailed analyses of microbial biogeography and diversity
420
        to be made.
421
      
422
      
423
        Materials and methods
424
        FastGroup was written in Java 1.3. Unless otherwise
425
        stated, FastGroup was tested on a Compaq Armada E700
426
        (Pentium III, 300 MHz, 300 Mb RAM) running Windows 2000.
427
        The FastGroup executable can be found as an additional file
428
        ( Additional File 1) . The dataset used in these analyses
429
        are available as a FASTA formatted document ( Additional
430
        File 2). Frequently Asked Questions (FAQs) and instructions
431
        for installing FastGroup are given in Additional File
432
        3.
433
        The 16S rDNA library was constructed as previously
434
        described [ 11 ] . The clones in the libraries were
435
        sequenced once from the 5' end using Bact27F (ABI PRISM
436
        BigDye Terminators on an ABI377XL sequencer (PE Applied
437
        Biosystems, Inc.; Foster City, CA) at the San Diego State
438
        University Microchemical Core Facility). Unedited sequence
439
        was used in all analyses (i.e., all sequences were
440
        single-pass and exactly as the sequencer software, ABI
441
        Prism Sequencing Analysis v. 3.3, called them).
442
        Additional file 1
443
        This is the FastGroup program.
444
        Click here for file
445
        Additional file 2
446
        The bacterial 16S rDNA sequences that FastGroup was
447
        tested on are included in a text document.
448
        Click here for file
449
        Additional file 3
450
        A list of FAQs for a user trying to install and execute
451
        the FastGroup program.
452
        Click here for file
453
      
454
    
455
  
456

457
Product

Resources

Company