Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
High-throughput DNA sequencing makes it economically
7
possible to produce very large sequence data sets in short
8
time periods. With this technology it is now possible to do
9
experiments that were impossible only a couple of years
10
ago. For example, a series of landmark papers in the late
11
1980's and early 1990's showed that microbial diversity
12
could be analyzed by sequencing 16S rDNAs from
13
environmental samples (reviewed by [ 1 ] ). Giovannoni used
14
this approach to show that there is a cosmopolitan marine
15
bacterium, designated SAR11, using 44 16S rDNA sequences [
16
2 ] . Today, it would be reasonable to perform the same
17
study with thousands of 16S rDNA sequences. This
18
exponential increase in the size of sequence data sets
19
necessitates new computer tools.
20
Here we introduce a Java program, FastGroup, that is
21
appropriate for comparing thousands of sequences to each
22
other and grouping them based on user-defined criteria.
23
While FastGroup is optimized to dereplicate libraries of
24
16S rDNA sequences, it can easily be adapted to dereplicate
25
any protein or DNA sequence library.
26
27
28
Results and discussion
29
30
Description of program and algorithms
31
32
Overview of FastGroup program
33
Figure 1shows the FastGroup graphical user interface
34
(GUI). The GUI reflects the order in which operations
35
are carried out by the FastGroup program. First,
36
sequences are loaded into the program from a directory
37
of files (e.g., seq or txt files) or from a
38
FASTA-formatted document. The program trims the
39
sequences according to user-defined parameters and the
40
trimmed sequences are matched against each other and
41
grouped. In the Grouping step, the user can either
42
define a percent sequence identity (PSI) that will be
43
used to group the sequences together or a consecutive
44
number of mismatches (MM) that will prevent sequences
45
from grouping together (both algorithms are described
46
below).
47
48
49
Trimming sequences
50
Sections of the input sequences containing
51
mismatched and/or ambiguous bases must be removed or
52
they will prevent proper grouping. To make trimming as
53
flexible as possible, FastGroup can trim sequences in
54
three ways: 1) a user-specified number of bases from
55
the 5' or 3' ends can be used (the rest of the sequence
56
is discarded), 2) sequence 5' or 3' of a defined site
57
can be removed, or 3) sequence with ambiguous bases
58
(i.e., "Ns") can be removed from the ends. For the
59
latter two methods, trimming criteria can be entered
60
separately for the 5' and 3' ends. If a primer sequence
61
is specified, the user may adjust the stringency of the
62
match by varying the PSI or MM parameter (explained in
63
detail below).
64
65
66
Matching
67
Both algorithms initiate grouping by first finding a
68
window (i.e., a short sequence) that is shared between
69
the two sequences being compared. Both the window size
70
and direction of matching (e.g., 5' vs. 3') are
71
specified by the user.
72
73
74
Overview of grouping step
75
When FastGroup is initiated, the first sequence in
76
the library is trimmed and placed in a new group, g1.
77
The second sequence in the library is then trimmed and
78
compared against the sequence in g1. If the two
79
sequences are determined to be similar, as defined by
80
the user-derived matching and grouping criteria, both
81
sequences are placed in group g1. If the sequences are
82
not similar, the first sequence is placed into g1 and
83
the second sequence is placed into a new group, g2. The
84
next sequence in the input library is then retrieved,
85
trimmed, and compared against the sequences in the
86
groups. This process is repeated with every sequence in
87
the library until all sequences belong to a group. New
88
groups are created as necessary. Sequences in groups
89
are Targets. A sequence being compared to the Targets
90
is a Query sequence. It is important to note that the
91
first sequence used to create a group is the sequence
92
used for comparison against all subsequent sequences.
93
The name for each group begins with "g#-", where the #
94
is assigned sequentially as groups are found by the
95
program. After the hyphen, the name of the first
96
sequence put into the group is given.
97
98
99
Percent Sequence Identity (PSI) algorithm
100
The PSI algorithm starts at the first position after
101
the matching window and compares each base in the Query
102
sequence to that of the Target sequence. This is done
103
in sequential order and at each position the algorithm
104
records if the bases match. This process is repeated
105
through the length of the smaller sequence. The PSI is
106
calculated by dividing the number of bases found to be
107
the same in both sequences by the number of bases in
108
the smaller sequence. If two sequences have a percent
109
sequence identity that is greater than or equal to the
110
value entered by the user into the Percent Sequence
111
Identity window, then the Query sequence is added to a
112
Target sequence group.
113
114
115
Mismatching (MM) algorithm
116
The MM algorithm starts at the first position after
117
the matching window and compares the bases in the Query
118
sequence to the Target. If these two bases are the
119
same, the program moves on to the next pair. If the
120
bases are not equal, a one base pair gap is inserted
121
into the Query sequence, effectively sliding the Query
122
sequence relative to the Target sequence. The base in
123
the Query sequence is then compared to the newly
124
aligned Target base. If the bases match, the algorithm
125
leaves the gap and moves to the next base for
126
comparison. If the bases do not match, the gap in the
127
Query sequence is removed and a gap is placed in the
128
Target sequence. The newly aligned bases are then
129
checked. If they are the same, the program moves to the
130
next base in both sequences. However, if the gap in the
131
Target sequence does not cause the bases to pair this
132
is considered one mismatch. If the user-defined MM is
133
<=1, the sequences will not be grouped. If a 2 base
134
MM is assigned, the algorithm will also try using this
135
size of the gap in both the Target and Query sequence,
136
after initially using a 1 base gap. This algorithm is
137
essentially the same as bounded diagonal band alignment
138
[ 3 ] .
139
140
141
Output files
142
Once all sequences in a data set have been analyzed
143
by FastGroup, five output text files are produced. The
144
fasta_groups.txt output file contains the group name
145
and a representative sequence from each group in FASTA
146
format. The fasta_groups.txt file is particularly
147
useful for subsequent Clustal X (Clustal X Help
148
http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html;
149
[ 4 5 ] ) and BLAST analyses (BLAST
150
http://www.ncbi.nlm.nih.gov/BLAST/; [ 6 ] ). The second
151
output file, group_seqs.txt, contains the group name
152
and all sequences from the group. This file is most
153
useful for visual confirmation of groupings. The third
154
output file, group_files.txt, contains the group name,
155
name of each sequence in the group, and the percent
156
that each group makes of the total. The fourth output
157
file, coverage.txt, shows how many sequences are in
158
each group and calculates coverage by the method of
159
Good [ 7 ] . Finally, the infile.txt file contains all
160
the user specified parameters for a record of the
161
analysis.
162
163
164
165
Testing of FastGroup
166
The library used to test FastGroup consisted of 94
167
bacterial 16S rDNA obtained from an environmental sample.
168
The library was made by PCR amplifying with the
169
bacterial-specific primers Bact27F and Bact1492R, cloning
170
into a plasmid vector, and then sequencing the inserts
171
using the Bact27F primer (Figure 2). All sequences were
172
single-pass and unedited.
173
A number of factors were considered when designing an
174
approach for dereplicating 16S rDNA libraries. First,
175
miscalled bases would prevent related sequences from
176
grouping together. To remove these bases, it was assumed
177
that: 1) miscalled and ambiguous bases occur together
178
(i.e., the presence of N's could be used to differentiate
179
"bad" sequence), and 2) as you move 3' of the sequencing
180
primer miscalled and ambiguous bases become more
181
prevalent, especially beyond ~600 bp. Therefore, trimming
182
criteria that remove 3' sequence are necessary. A second
183
factor influencing our trimming strategy arises from the
184
fact that FastGroup must find a window in common between
185
two sequences before it starts the grouping algorithm.
186
Therefore, a conserved region at the matching end would
187
be expected to increase analysis speed. For bacterial 16S
188
rDNAs (Figure 2) sequenced using Bact27F, the Bact517
189
conserved site is ideal because: 1) the Bact517 site is
190
highly conserved and should be easy to find in most
191
bacterial 16S rDNA sequences 2) the site is ~500 bp away
192
from the sequencing primer, therefore sequence 5' of this
193
site should be good quality, and 3) the Bact517 site is
194
just 3' of the V3 region. This final point is important
195
because the V3 region is highly variable and usually
196
contains information sufficient to differentiate between
197
different bacterial species. Therefore, including the V3
198
region increases the resolving power of this approach for
199
measuring bacterial diversity.
200
201
202
Analyses of trimming and matching parameters
203
Our analysis strategy necessitated that the Bact517
204
site be accurately identified in the 16S rDNA sequences.
205
As shown in Table 1, if the Bact517 site must be
206
perfectly matched to be identified (PSI for primer
207
matching = 100%), 75 out of the 94 sequences (80%) were
208
trimmed correctly. If the PSI parameter for matching to
209
the Bact517 site was lowered to 70%, 82 out of the 94
210
sequences (87%) were correctly trimmed. However, lowering
211
the detection stringency for the Bact517 site also
212
increased the possibility that false positive sites would
213
be detected, resulting in prematurely trimmed sequences.
214
False sites did not appear to be a problem with
215
PSI>=70%, but lowering the PSI for finding the Bact517
216
site to 60% did result in 9 false positives (Table 1).
217
Therefore, for our data set, using a 70% PSI for finding
218
the Bact517 site appeared optimal. We specifically chose
219
a library of low quality sequences for the FastGroup
220
analyses. Therefore, the bact517 position was not found
221
in many of the test sequences because of sequencing
222
errors.
223
As predicted, using the 3' conserved region for
224
trimming and matching from the 3' end resulted in quicker
225
FastGroup analysis (Table 2), presumably because the
226
conserved region increases the chance that a window will
227
be quickly found. Aligning from the 3' end also increased
228
grouping frequency (Table 2), possibly because the
229
conserved region increased the accuracy of the matching
230
step. Because both algorithms require accurate matching
231
for initiation, the added accuracy offered by the
232
conserved regions as the matching sites increased the
233
efficiency of grouping. Even when the trimming criteria
234
did utilize the Bact517 site, the presence of this site
235
in the sequence increased grouping efficiency. For an
236
example of this phenomenon, compare the analyses where
237
the sequence was trimmed by taking the first 500 bases
238
and then was matched from the 5' versus the 3' ends
239
(Table 2). The presence of the conserved sequence
240
increased the grouping efficiency. Trimming to the
241
Bact517 site also allowed smaller windows to be used,
242
which dramatically increased grouping speed (Table 2).
243
Trimming sections of sequence with ambiguous bases did
244
not improve the sequence quality enough for accurate
245
grouping (Table 2).
246
247
248
Comparison of the PSI and MM algorithms
249
As shown in Table 3, the MM algorithm was much faster
250
than PSI. The sequence composition of Groups obtained
251
using a PSI value of 97% were roughly equivalent with
252
those obtained using a MM = 2. The MM = 2 did result in
253
some of the bigger groups being broken into two or more
254
smaller groups. We believe that the PSI algorithm was
255
more appropriate for analyses of 16S rDNA for a number of
256
reasons. First, gaps in unedited sequences were not as
257
big of a problem as we initially believed. We have
258
analyzed one bacterial 16S rDNA library in which 96% of
259
the sequences were grouped together using the PSI
260
algorithm. This result would not have been obtained if
261
gaps were a major problem. The second reason we prefer
262
the PSI algorithm for analyses of 16S rDNA is that there
263
are reasons to believe that bacteria with 16S rDNA
264
>=97% identity belong to closely related bacteria [ 8
265
] .
266
267
268
Analyzing partial sequences to increase speed of
269
FastGroup analyses
270
With a large data set, it may be desirable to speed up
271
the FastGroup analysis, possibly by using only part of
272
the input sequence during grouping. This approach would
273
only work if most of the information positions are not
274
lost by the truncation. That is, if a sequence is 500 bp
275
after trimming and only 80% of the sequence (i.e., 400
276
bp) is used in the Grouping step, how representative are
277
the results? It was expected that, since the
278
hypervariable region V3 is immediately 5' of the Bact517
279
site, grouping should be much faster and representative
280
if matching was initiated from the 3' end. As shown in
281
Table 4, using partial sequence does dramatically speed
282
up FastGroup, but with a significant loss of resolution.
283
The loss of resolution occurred even though the V3 region
284
was included in the portion of the sequence analyzed. For
285
this reason, we suggest using the longest sequence
286
possible.
287
288
289
Comparison of FastGroup with ClustalX output
290
ClustalX [ 4 ] uses the ClustalW algorithm [ 5 ] to
291
align sequences using a combination of progressive
292
alignment and dynamic programming, making this algorithm
293
sensitive to divergence between closely related sequences
294
(<35% identity). The ClustalW algorithm was used to
295
align the 94 test sequences using default parameters. A
296
tree was generated from aligned sequences using
297
ClustalX's Draw Neighbor Joining (NJ) Tree program. The
298
resulting tree data were plotted (Figures 3) using
299
NJPLOT, which was included as part of ClustalX software
300
distribution. The average running time to produce an
301
alignment from 94 sequences was one hour and 20 minutes
302
plus an average of 5 minutes to generate tree data using
303
Draw NJ tree.
304
FastGroup was used to group the same test sequences
305
using the PSI algorithm. Sequences were trimmed at the 5'
306
end for every N occurring within 50 bases, and at the 3'
307
end to 70% of the Bact517 site. Trimmed sequences were
308
grouped at 97% PSI. All other FastGroup parameters were
309
left at default values. Run time was ~25 seconds.
310
The NJ tree from the ClustalX analysis is shown in
311
Figure 3. The groups from FastGroup are color coded on
312
the Tree (Figure 3). In general, the ClustalX Clades and
313
FastGroups are identical. The main exception were the
314
FastGroups 1 and 8, which corresponds to ClustalX Clades
315
1-5. If the PSI is raised to 99% (i.e., 1 bp change per
316
100 bp) in FastGroup, then the two major ClustalX Groups
317
become apparent (e.g., Group 1 includes Clades 1-4 and
318
Group 2 is equivalent to Clade 5). The FastGroup 8
319
contain sequences that differ from FastGroup 1 by a one
320
bp gap, which explains the reason that ClustalX placed
321
these sequences in Clade 1. Because this gap occurred in
322
all four FastGroup 8 sequences, and not in any of the
323
FastGroup 1 sequences, these two groups probably
324
represent different 16S rDNAs and possibly two bacterial
325
species.
326
What other options exist for dereplicating large
327
libraries of 16S rDNA besides FastGroup? One possibility
328
is to align the sequences with Clustal X and then use the
329
alignments to determine which sequences are the same.
330
This approach is time consuming because it requires that:
331
1) the sequences be trimmed individually before the
332
alignment, and 2) duplicate sequences be manually removed
333
from the original library after the alignment. Advantages
334
of the Clustal X approach is that visual confirmation of
335
grouping is easy. However, results can also be visualized
336
in FastGroup by having the program display the sequences
337
from the 3' end and looking at the group_seqs.txt output
338
file. FastGroup can also speed up alignment analysis by
339
rapidly trimming, dereplicating, and outputting sequences
340
to the fasta_groups.txt file, which is ideal for Clustal
341
X alignments. A second possible approach to library
342
dereplication is to compare sequences against each other
343
using BLAST2
344
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.htmland then
345
delete duplicates. This approach works well but is too
346
time consuming for libraries over a couple of hundred
347
sequences. A third way that large libraries are often
348
dereplicated requires submitting the sequences as batch
349
files to a database (either local or remote), then
350
searching the same sequences against the updated database
351
using BLAST or Sequence Similarity
352
http://www.cme.msu.edu/RDP/docs/sim_matrix_issues.html.
353
Again, this method works well for a small number of
354
sequences but is very time intensive with large data
355
sets.
356
Due to technological advances, it is now possible to
357
cheaply sequence thousands of 16S rDNA per day. This
358
change in sequencing power necessitates a reassessment of
359
how microbial diversity and biogeography is studied. Many
360
of the techniques commonly used for these sorts of
361
studies were designed to minimize efforts and cost in the
362
pre-genomics era [ 9 10 ] . However, these techniques
363
suffer from a number of limitations. In the case of
364
denaturing gradient gel electrophoresis (DGGE) it is
365
essentially impossible to compare samples from one gel to
366
another. Because the DGGE banding patterns can not be
367
standardized, DGGE data does not represent a permanent
368
record of microbial diversity or biogeography. In fact,
369
to get a permanent record of what microbe each band on
370
the DGGE represents it is necessary to clone and sequence
371
the band. This is costly both in time and reagents.
372
Terminal-restriction fragment length polymorphism
373
(T-RFLP) banding patterns can be standardized. Therefore,
374
T-RFLP data represents a permanent record of microbial
375
diversity. T-RFLP resolution is, however, limited (e.g.,
376
it is dependent on the different restriction sites being
377
present) and it is hard to link the T-RFLP pattern to a
378
specific microbial species. To make this connection, it
379
is necessary to analyze clones both by T-RFLP and by
380
sequencing. In contrast, 16S rDNA sequences allow
381
bacteria to be placed in taxonomical groups. Ribosomal
382
16S DNA sequences also allow the occurrence of a specific
383
phylotype to be documented in an unequivocal manner.
384
This, in turn, will allow databases of microbial
385
biogeography to be constructed.
386
Sequencing large 16S rDNA libraries as we have
387
outlined here offers the advantages of sequence data,
388
while minimizing cost (i.e., 1 sequencing reaction per
389
clone). The disadvantages of this approach include: 1) an
390
underestimation of diversity because only part of the 16S
391
rDNA locus is used, 2) the smaller sequences (~500 bp)
392
are not ideal for taxonomical identification, and 3)
393
dirty data (i.e., sequences with mistakes). A conscious
394
effort should be made to save these libraries. That way,
395
if the "cleaner" data or larger sequences are needed in
396
the future, the libraries can be resequenced. Another
397
concern with this approach is that it will cost more
398
money than alternative methods. High-throughput
399
sequencing is becoming very cheap. For example, the Joint
400
Genome Institute estimates that each sequencing reaction
401
costs $1.00-1.50 (Paul Predki, personal communication).
402
When compared to the cost of people-power, extra
403
reagents, and impermanence of the data of the other
404
approaches, sequencing of 16S rDNA libraries is probably
405
already a bargain, and it is only getting cheaper.
406
407
408
409
Conclusions
410
As high-throughput sequencing of 16S rDNA libraries
411
becomes more common, data analysis becomes the bottle-neck.
412
The FastGroup program is a first generation bioinformatics
413
tool for analyzing these data sets. It is designed for
414
moderately sized 16S rDNA libraries produced in individual
415
laboratories. Future generations of FastGroup should be
416
incorporated into relational databases that link the
417
sequence to other relevant data (e.g., where, when, how the
418
sequence was obtained). These sorts of databases will allow
419
detailed analyses of microbial biogeography and diversity
420
to be made.
421
422
423
Materials and methods
424
FastGroup was written in Java 1.3. Unless otherwise
425
stated, FastGroup was tested on a Compaq Armada E700
426
(Pentium III, 300 MHz, 300 Mb RAM) running Windows 2000.
427
The FastGroup executable can be found as an additional file
428
( Additional File 1) . The dataset used in these analyses
429
are available as a FASTA formatted document ( Additional
430
File 2). Frequently Asked Questions (FAQs) and instructions
431
for installing FastGroup are given in Additional File
432
3.
433
The 16S rDNA library was constructed as previously
434
described [ 11 ] . The clones in the libraries were
435
sequenced once from the 5' end using Bact27F (ABI PRISM
436
BigDye Terminators on an ABI377XL sequencer (PE Applied
437
Biosystems, Inc.; Foster City, CA) at the San Diego State
438
University Microchemical Core Facility). Unedited sequence
439
was used in all analyses (i.e., all sequences were
440
single-pass and exactly as the sequencer software, ABI
441
Prism Sequencing Analysis v. 3.3, called them).
442
Additional file 1
443
This is the FastGroup program.
444
Click here for file
445
Additional file 2
446
The bacterial 16S rDNA sequences that FastGroup was
447
tested on are included in a text document.
448
Click here for file
449
Additional file 3
450
A list of FAQs for a user trying to install and execute
451
the FastGroup program.
452
Click here for file
453
454
455
456
457