CoCalc -- 1471-2105-3-30.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-3-30.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Background
6
        In higher eukaryotes, many genes feature differential
7
        spatial-temporal expression during development and
8
        throughout the life cycle of the organism. Their complex
9
        transcription regulation is thought to be achieved by the
10
        combinatorial action of multiple transcription factors
11
        which bind to 
12
        cis- regulatory DNA sequences. Here,
13
        transcription factors are defined as proteins which
14
        recognize and bind regulatory sites and have a potential to
15
        modulate directly or indirectly through the recruitment of
16
        cofactors the activity of the basal transcriptional
17
        apparatus of proximal genes. The number of transcription
18
        factors is a substantial part of the total number of genes
19
        in any organism, for example about 700 out of 13,500 genes
20
        in Drosophila [ 1 ] .
21
        Although combinatorial action of transcription factors
22
        has been studied throughout the life cycle of organisms [ 2
23
        ] , perhaps the most coherent picture has emerged in the
24
        context of developmental processes [ 3 4 ] . Here, a great
25
        number of experiments suggest that a major part of the gene
26
        regulatory apparatus is organized in the form of separable 
27
        cis- regulatory modules [ 3 ] . A
28
        given module defines specific aspects of the
29
        spatio-temporal pattern of gene expression by the
30
        combinatorial action of multiple transcription factors
31
        which together define the rate of transcription. Modules
32
        thus integrate inputs from several genes and regulate
33
        another gene to form developmental networks. Modules seem
34
        to share several architectural features [ 5 ] : They are
35
        typically only hundreds of nucleotides in length and
36
        contain multiple binding sites for as many as 4-5 different
37
        transcription factors. The frequent occurence of multiple
38
        copies of the same motif as well as the enrichment of
39
        certain combinations of motifs in a module in comparison
40
        with the genome at large provide the basis for our
41
        computational strategies to predict genes which are part of
42
        the same regulatory network. Existing algorithms for
43
        discovering modules [ 6 7 8 ] are based on counting the
44
        number of matches of a certain minimal strength to known
45
        motifs and thus require ad-hoc parameters for each motif,
46
        resulting in parameter dependent predictions. These
47
        algorithms are also bound to miss multiple weak binding
48
        sites which are known to be present in many modules. We
49
        demonstrate a novel algorithm, Ahab, which overcomes these
50
        problems.
51
        However, the binding sites (motifs) which reside in
52
        modules are often not known. A recent paper [ 9 ] proposed
53
        an algorithm for identfying these sites; here we show that
54
        a standard method is capable of identifying typically half
55
        of the known binding sites in a module. Its entire output
56
        can then be used as input to Ahab (Figure 1). Finally, we
57
        ask if the redundancy of sites inside modules is strong
58
        enough to predict modules using only genomic sequence. To
59
        our knowledge, our algorithm Argos is the first successful
60
        attempt to do this for a metazoan genome.
61
        Our system of choice is the body patterning of the early
62
        Drosophila embryo which is established by a multi-tiered
63
        hierarchy of transcription factors [ 10 ] . Broadly
64
        distributed maternal factors trigger zygotic gap gene
65
        expression in discrete domains along the anterior-posterior
66
        axis of the embryo. Maternal and gap gene factors together
67
        trigger pair rule gene expression in 7 alternating stripes,
68
        which in turn regulate segment polarity and homeotic gene
69
        expression in 14 stripes. Many of these factors are known,
70
        their binding sites have been studied, and more than 20
71
        modules have been identified. For this system, we show by
72
        comparing our predictions to literature results that we can
73
        predict regulatory modules using different levels of input
74
        - binding sites, regulatory sequence identified by
75
        'promoter bashing'/sufficiency tests, and genomic sequence
76
        itself. Altogether, we predict roughly 200 new modules and
77
        7 new sequence motifs and we analyze and validate the
78
        performance of each of the three algorithms.
79
        Figure 1summarizes the input/output levels and serves as
80
        a reference for what follows.
81
      
82
      
83
        Results
84
        
85
          Using known binding sites, Ahab finds 135
86
          significant new modules in the genome
87
          There are a sufficient number of binding sites in the
88
          literature (see additional File 1and 2) for us to
89
          construct frequency weight matrices for the maternal
90
          transription factors Bicoid (Bcd), Caudal (Cad) and
91
          Dorsal (Dl), as well as the zygotic gap gene factors
92
          Hunchback (Hb), Kruppel (Kr), Knirps (Kni), and Tailless
93
          (Tll), and the torso response element (torRE). With the
94
          exception of giant, which is too ill-defined to model,
95
          these are all the maternal/gap genes at the top of the
96
          segmentation gene hierarchy.
97
          Modules can be located by scanning the genome in
98
          windows, counting the number of matches to each matrix
99
          with a score better than some value, and then combining
100
          scores and ranking [ 6 7 ] . We have designed an
101
          algorithm, Ahab, which eliminates all parameters other
102
          than a final rank, and, following the logic of the
103
          mobydick algorithm [ 11 ] , computes via maximum
104
          likelihood the probability that the window sequence is
105
          made up by sampling from the known weight matrices or
106
          background. Mobydick was designed to find overrepresented
107
          "words" in noncoding DNA, and applied to yeast. However,
108
          in the genomes of higher eukaryotes, binding sites are
109
          much fuzzier than in yeast and simple motif models are
110
          unlikely to yield meaningful results. Therefore, Ahab
111
          generalizes the motif model to positional weightmatrices.
112
          Furthermore, we introduce a 
113
          local background model to cut down
114
          the number of false positives arising from local
115
          variations in sequence composition and degeneracy of
116
          motifs (for example motifs that contain a poly A string).
117
          Finally, the maximum likelihood fit has to be done
118
          separately for each window and thus hundreds of millions
119
          of times for the genomes of higher eukaryotes, thus
120
          requiring an efficient implementation of the numerical
121
          procedures. Ahab tallies all possible segmentations of
122
          the sequence into binding sites (also called parsings).
123
          Thus motif overlaps are allowed and weighted according to
124
          how well they explain the data, and multiple weak copies
125
          of a factor are all counted [ 5 12 ] . Our background
126
          model is a Markov model fit to all triples of bases in
127
          the window and accounts for local compositional variation
128
          which could otherwise push up the representation of any
129
          matrix that matched one of the high copy number base
130
          triples (eg hb matches poly A tracts).
131
          Additional file 1
132
          
133
          Click here for file
134
          Additional file 2
135
          
136
          Click here for file
137
          Additional file 3
138
          
139
          Click here for file
140
          Additional file 4
141
          
142
          Click here for file
143
          Additional file 5
144
          
145
          Click here for file
146
          Additional file 6
147
          
148
          Click here for file
149
          Additional file 7
150
          
151
          Click here for file
152
          We used Ahab with a 500 bp window to scan the
153
          Drosophila genome. As a representative/typical example we
154
          show the hairy locus, Fig. 2, a pair-rule gene expressed
155
          in 7 stripes. Five modules driving individual stripes and
156
          stripe pairs have been identified experimentally. Ahab
157
          predicts four of them (stripes 1, 5, 6, 7), only the
158
          stripe 3+4 element is not recovered. Note that two
159
          modules, stripes 1 and 7, were not used as training data.
160
          The support for each local maximum corresponds nicely to
161
          the experimentally estimated sizes of the modules. Ahab
162
          also reports scores for putative binding sites in each
163
          window as defined by (4). As an example, we show the
164
          even-skipped stripe 3+7 module, one of the best studied
165
          cases in the literature (Fig. 3). Most of the known sites
166
          are recovered and some new ones predicted.
167
          Ahab finds 146 highly significant modules in the
168
          genome, most located in the non-transcribed regions. 27
169
          modules are inside introns, only 6 overlap with exons. It
170
          recovers the 11 modules used to construct our weight
171
          matrices and predicts 6 other known modules with
172
          maternal/gap gene input. Thus, 17 out of 27 known modules
173
          (see additional File 1) are found. Three of the missing
174
          10 modules (kni64, snail, sog) are very high in score
175
          (< rank 100) but lost when filtering for the presence
176
          of at least three different factors. Three modules are
177
          lost because they contain only dorsal sites (rho, twist,
178
          zen) and we do not have matrices for the other factors
179
          known to bind (such as snail and twist), however
180
          searching with only the dorsal weight matrix does recover
181
          them. The four remaining missing modules are low in score
182
          (rank > 700), but Kruppel CD2 is recovered for a
183
          window size of 700; the others are evidently low in
184
          maternal/gap gene binding sites.
185
          To determine whether any of the 129 novel module
186
          predictions were correct, we asked whether any of the
187
          adjacent genes were patterned in the blastoderm. For 15
188
          modules patterned expression has been reported for one of
189
          the adjacent genes; the patterns are either gap or pair
190
          rule like. An additional 11 modules are suggestive, since
191
          they are clustered with another, nonoverlapping
192
          prediction (Table 1). Interestingly, two predicted
193
          modules are in proximity to well studied segmentation
194
          genes (giant, runt) but still outside known regulatory
195
          regions.
196
          We tested the stability of Ahab against unspecific
197
          input weight matrices by eliminating the least specific
198
          matrix in our list (Tailless). Tailless has ~600
199
          predicted sites in the 146 modules (Table 2), twice as
200
          much as any other factor. However, we found that roughly
201
          75 % of the predictions without using tailless (but with
202
          the same significance cutoff) were also present in the
203
          list of 146. Thus, although Tailless is rather
204
          unspecific, it makes a contribution to the predictions
205
          without dominating them.
206
          We also varied the window size to 700 bp and masked
207
          out the repeats genome wide, before running Ahab, (see
208
          web site). Our top 150 predictions now included 11 out of
209
          the 27 known modules (see additional File 1and 3).
210
          Lowering the cutoff value in the module score did not
211
          help, only one additional known module was recovered when
212
          including ranks 151-250. Overall, 84 modules or 58 % of
213
          the window 500 dataset are also among the top 200 of the
214
          window 700 set, and thus might be accorded more
215
          significance. Although some known modules disappeared,
216
          some very interesting new modules are predicted, in
217
          window 700 runs, including modules for key genes in the
218
          segmentation process such as caudal, grainyhead, odd
219
          skipped (odd), and sloppy paired 1,2 (slp1,2) (see
220
          additional File 3). Thus Ahab could be improved by
221
          allowing for variable window lengths.
222
          We estimated the false positive rate by scrambling the
223
          columns (positions) in the input frequency matrices. The
224
          new matrices are thus unlikely to be functional, but
225
          retain the same specificity. Ahab found roughly half as
226
          many "modules" for the same score cutoff as for the
227
          original set ie a 50% false positive rate. Note that this
228
          is a conservative estimate because part of the consensus
229
          motif recognized by hunchback and caudal is largely a
230
          poly T motif, and thus preserved by scrambling. Moreover,
231
          only very few known patterned gene are predicted by the
232
          scrambled matrices.
233
          Another perhaps more straightforward estimate of
234
          statistical significance and our true positive rate is to
235
          use the experimental result (see footnote in [ 6 ] ) that
236
          less than 2% of the genome or ~300 genes are patterned
237
          during the blastoderm. Since we can not tell which of two
238
          neighboring genes is regulated by each of the 102
239
          intergenic modules we predict, we are obliged to label
240
          237 genes (adding the 33 genes with intragenic modules)
241
          as potentially patterned. For 237 random predictions one
242
          expects 2% or five genes to be patterned, and the
243
          probability to get 21 or more genes (see additional File
244
          3) by chance (p-value) is ~10 -10. It should be stressed
245
          that the true success rate of Ahab will be much higher
246
          since the number of genes for which blastoderm expression
247
          is demonstrated is (< 100) or 1/3 of the total. Thus
248
          we expect an additional 50 genes in our set to be
249
          patterned, so a 50% overall positive rate for our module
250
          predictions.
251
        
252
        
253
          The Gibbs sampler finds binding sites within
254
          experimentally characterized modules
255
          Binding site information for most of the transcription
256
          factors relevant to any developmental process is only
257
          rarely available. More common are modules obtained by
258
          'promoter bashing' from several genes with similar
259
          expression. Thus it is natural to ask, in view of the
260
          site repetition within modules, whether standard motif
261
          finders are able to recover good weight matrices from
262
          modules and if so can these be used as input to Ahab to
263
          find genes with similar regulatory inputs.
264
          We have tailored the Gibbs algorithm (see methods) to
265
          this problem by searching for only one site at a time,
266
          and then masking only the central 1-2 bases of each
267
          sequence motif found before iterating. The results were
268
          thereby much more reproducible between runs. Most
269
          importantly, motifs were allowed to overlap, a very
270
          common occurrance in modules and arguably important for
271
          their function [ 3 ] .
272
          To gain confidence in the capabilities of the Gibbs
273
          algorithm, we prepared two synthetic data sets
274
          representative of the data we wanted to examine (eg
275
          several kb long, 30-50% of the sequence covered by
276
          motifs, the remaining sequence random and 60% A/T). Data
277
          set 1 was made by equally sampling our Hb, Cad, Kr and
278
          Tll matrices. Data set 2 was generated from four
279
          synthetic frequency matrices of specificity equal to the
280
          known ones. The customized Gibbs virtually perfectly
281
          recovers all the synthetic weight matrices from dataset
282
          2. By contrast, only half of the natural sites were
283
          recovered from dataset 1. The sites overlapped and their
284
          delineation was imperfect. Thus Gibbs detects sequence
285
          correlations among the factors we chose, but probably
286
          exaggerates them, because it computes significance based
287
          on single base frequencies.
288
          We then ran the Gibbs algorithm on several modules
289
          with extensive binding site data, Table 3. In accord with
290
          experiment, Gibbs predicts that about 30-50% of the
291
          sequence is covered by motifs. The specificity of the
292
          Gibbs motifs is typically higher than for the
293
          experimental ones, presumably because a smaller number of
294
          sites is sampled. Even when the majority of the sequences
295
          composing the Gibbs motif match a single factor, there
296
          are a few other strong factor matches generally in
297
          different positions and perhaps the other orientation.
298
          Nevertheless the Gibbs motifs are largely reproducible
299
          between runs. Generally, we recover about half the
300
          factors known to influence the module, and interestingly
301
          predict several new motifs. The lack of a 1:1
302
          correspondence between the experimental motifs (generally
303
          composed from a wider range of data then presented to
304
          Gibbs) and those we find, points to a real ambiguity, we
305
          believe, in how to parse a sequence into binding
306
          sites.
307
        
308
        
309
          Using only three modules as input, Ahab finds 63
310
          significant modules in the genome
311
          Next we tested whether Gibbs derived matrices can be
312
          fed to Ahab for a genome-wide search for modules. As
313
          samples, we used three known modules that drive
314
          expression of the pair rule gene hairy in stripes 5, 6,
315
          and 7, which are known to receive input from Bcd, Cad,
316
          Hb, Kni, Kr, and Tll (see additional File 1).
317
          Customized Gibbs finds 6 highly significant weight
318
          matrices within the 2 kb composite module (see additional
319
          File 4). One matches the Kr weight matrix with high
320
          quality, another Kni, and a third represents a mixture of
321
          Hb and Cad, whose matrices are indeed quite similar due
322
          to a poly T motif. The other three motifs are new. Using
323
          these 6 weight matrices as input, Ahab finds 63 highly
324
          significant modules genome-wide (Table 4, additional File
325
          5).
326
          The top four modules overlap with those used in the
327
          Gibbs sampling. In addition, 13 new modules were
328
          contiguous to genes that are known to be patterned in the
329
          blastoderm, and two fall close to a single gene of
330
          unknown function. One of the top scoring modules is the
331
          hunchback central stripe module, other particularly
332
          interesting hits are in the intron of knirps and 18 kb
333
          upstream of hairy outside the known regulatory region,
334
          and proximal to emc, a known transcriptional
335
          co-repressor. Compared to the Ahab predictions, we find
336
          more modules near homeotic genes (Abd-A, abd-B, Ubx,
337
          hth), due to the presence of the novel Gibbs motifs. The
338
          statistical significance of our predictions and the
339
          inferred true positive rate are comparable to the
340
          segmentation gene results.
341
        
342
        
343
          Argos: prediction of regulatory modules from raw
344
          genomic sequence
345
          As a final generalization, we ask whether there is
346
          enough repetition of sequence motifs within a module for
347
          its discovery using no information other than the
348
          noncoding sequence in the genome. To determine whether a
349
          given motif is locally overrepresented, its frequency of
350
          occurrence has to be scored against some statistical
351
          model. Both Ahab and Gibbs, use counts of short strings
352
          or single bases, within the window of interest to compute
353
          the significance of longer motifs. We attempted running
354
          Gibbs on successive windows of sequence and scoring the
355
          resulting motifs with Ahab, but were not able to
356
          discriminate the known modules from the remainder of the
357
          upstream region.
358
          We therefore devised an alternative strategy that uses
359
          the information available in all the noncoding sequence
360
          and thus extrinsic to the window of interest. We
361
          enumerate all motifs in a class (a consensus of length 8
362
          and 2 mutations worked best), and use their frequency to
363
          assign a probability for observing an overrepresentation
364
          of any one of them in the window of interest. The actual
365
          binding motifs may be longer, we only need to capture the
366
          most significant region. Typically several hundred motifs
367
          are significant for each window. They heavily overlap so
368
          the individual motif scores can not be simply added for
369
          an overall score. Although we tried using Ahab to
370
          eliminate redundant motifs, we got better results with a
371
          greedy algorithm that looks only at the motifs without
372
          placing them on the sequence. The greedy algorithm
373
          winnows the list of motifs down by starting with the
374
          highest scoring one and eliminates any motif related to
375
          it under shifts and a limited number of mutations. The
376
          next remaining motif is retained and overlaps with it are
377
          eliminated, until ~5 quasi independent motifs are
378
          obtained whose scores can be added (details in additional
379
          File 7).
380
          We evaluated the results of Argos for the the modules
381
          in additional File 1. Log probability scores > 70 were
382
          taken as significant, since they are found very rarely
383
          using randomized noncoding sequence 1. With this
384
          threshold and at least l00 bp overlap, half of the
385
          modules were recovered, indicating a 50% false negative
386
          rate. The number of false positives is difficult to
387
          assess, because the code looks for 
388
          any regulatory module. However for
389
          several well studied segmentation genes (specifically
390
          even-skipped, giant, hairy, hunchback, knirps, Kruppel,
391
          and tailless), with 15 known modules, we squarely hit 7
392
          when looking over the entire 10 kb upstream of
393
          translation start (gt, kni, Kr-730, Kr-CD2, eve3-7, eve
394
          autoregulatory element, h-7), Fig. 4, but only three
395
          predictions are outside of known modules (one for
396
          tailless and two for hunchback). This indicates a low
397
          false positive rate. Genome wide we are predicting about
398
          one module per 5 kb of noncoding sequence averaged over
399
          the genome (with a strong bias for noncoding vs coding),
400
          which corresponds to roughly one module per gene.
401
        
402
      
403
      
404
        Discussion
405
        We have demonstrated algorithms that exploit three very
406
        different levels of prior information and lead to
407
        statistically highly significant predictions for early
408
        developmental modules in the fly. The Ahab algorithm is
409
        perhaps closest to the 'calculation' actually performed by
410
        the cell. The weight matrix match is a surrogate for the
411
        energetic preference of a transcription factor for a
412
        particular sequence, and Ahab models the competition of
413
        several factors and their binding energies for a stretch of
414
        DNA (a module). Ahab ignores distances between binding
415
        sites and the actual factor concentrations. Thus, the
416
        success of Ahab suggests that just modeling the binding
417
        energies is already predictive. It will be interesting to
418
        see how well Ahab performs in situations where the
419
        concerted binding of cofactors constrains the spacing of
420
        binding sites [ 13 14 ] .
421
        Finding overrepresented weight matrices is a well
422
        studied problem for which Gibbs sampling constitutes a
423
        reasonable solution if the data consists of distinct motifs
424
        separated by random bases. The difficulty we have
425
        encountered with this algorithm in dissecting regulatory
426
        modules for binding sites is not rare or diffuse motifs but
427
        rather too much signal, namely the overlay of motifs of
428
        different sizes and specificities. The Gibbs statistical
429
        model is not strictly correct for our data. A more adequate
430
        algorithm would allow competition among motifs of different
431
        lengths [ 15 ] . Irrespective of technical problems, the
432
        discovery of binding motifs by site repetition is
433
        qualitatively a more difficult task than their recognition
434
        by transcription factors [ 16 ] . Thus our ability to
435
        recover plausible motifs for about half the known factors
436
        was not obvious in advance and is another manifestation of
437
        the redundancy in module design.
438
        Reference [ 6 ] describes another approach to locating
439
        modules from clusters of known weight matrices. They count
440
        the number of matches of each weight matrix in an interval
441
        with a score above some empirically defined cutoff, and
442
        then score a 700 bp window as significant when the total
443
        number of matches for all factors is large enough.
444
        Information about the background is implicitly encoded in
445
        their choice of threshold. We do not have factor specific
446
        cutoffs, and use a locally defined background model, which
447
        renders our algorithm more automatic and less sensitive to
448
        local variation in sequence composition, eg poly A
449
        runs.
450
        Although we are predicting many more modules than in [ 6
451
        ] , the positive hit rates are comparable between the two
452
        methods (50% vs 10 positives out of 28 predictions [ 6 ] ).
453
        A more detailed comparison of both data sets reveals,
454
        however, that the 28 modules predicted by [ 6 ] , with the
455
        exception of the giant one, do not overlap with any of the
456
        top 137 modules predicted by Ahab, although there are 4
457
        genes in common to our sets. More strikingly, the 10
458
        modules for which experimental results in [ 6 ] suggest
459
        functionality based on blastoderm expression of a
460
        neighboring gene fall below 500 in our ranking with
461
        exception of giant and one of the hairy derived modules for
462
        nub, Table 4. Presumably due to the difference in
463
        background model, their modules are dominated by Hb sites,
464
        while ours are not, which contributes considerably to the
465
        divergence of the predictions. Clearly, only direct
466
        experimental validation of predicted modules through
467
        reporter gene fusions will help to compare the different
468
        methods. In this fashion, we plan to test a number of the
469
        new modules predicted for key genes in the segmentation
470
        system such as h, run, gt, odd, prd, slpl/2 and cad.
471
        In order to understand regulatory networks of genes, it
472
        is useful to generalize from a few genes or modules with
473
        common functions to new candidates. When control is
474
        combinatoric, a purely experimental approach tends to be
475
        more tedious than screening a modest list of candidates.
476
        Thus a potentially important aspect of our work is the
477
        combination of motif discovery from modules via Gibbs
478
        sampling and generalization to the entire genome with Ahab.
479
        We have demonstrated the feasibility of this procedure when
480
        we worked from the hairy stripe 5-7 modules. Interestingly,
481
        the candidate list of similar modules genome wide was quite
482
        small, but had little overlap with the top scoring modules
483
        predicted from the full set of gap gene weight matrices.
484
        Hopefully some of the new motifs discovered by Gibbs
485
        sampling are real; perhaps they are binding sites for
486
        corepressors. Clearly the first step is to confirm a
487
        striped expression for some of the genes in Table 4.
488
        Our algorithm Argos for predicting enhancers from raw
489
        genomic sequence works astonishingly well. It will be most
490
        interesting to use this approach together in conjunction
491
        with the customized Gibbs sampler and Ahab in situations
492
        where nothing is known experimentally about the
493
        transcriptional regulation of genes of interest to identify
494
        co-regulated genes. Namely, following the hierarchical
495
        structure in Figure 1, Argos could be used to predict
496
        modules, then the customized Gibbs sampler to predict
497
        binding sites (weight matrices) and finally Ahab to
498
        predict, genome wide, genes in the same regulatory
499
        network.
500
        Several recent papers [ 6 7 9 ] as well as ours have
501
        taken only the very first steps in applying computational
502
        approaches to the the elucidation of 
503
        cis- regulatory modules. For body
504
        patterning in the fly, it is very encouraging that such
505
        limited information as we have used works so well. It
506
        remains to be seen if the same approaches work on systems
507
        where a single master regulatory gene initiates a
508
        developmental cascade, or where integration of
509
        developmental cues occurs partly at the level of signal
510
        transduction.
511
      
512
      
513
        Conclusions
514
        Predicting and understanding transcriptional regulation
515
        is a fundamental problem in biology. We have designed new
516
        algorithms for the detection of 
517
        cis- regulatory modules in the
518
        genomes of higher eukaryotes which is a first step in
519
        unraveling transcriptional regulatory networks. We have
520
        demonstrated, in the case of body patterning in the 
521
        Drosophila embryo, that our
522
        algorithms allow the genome-wide identification of
523
        regulatory modules when the motifs for the transcription
524
        factors are known (algorithm Ahab), or when only related
525
        modules are known (customized Gibbs sampler in conjunction
526
        with Ahab), or when only genomic sequence is analyzed with
527
        Argos. We believe that Ahab overcomes many problems of
528
        recent studies and we estimated the false positive rate to
529
        be about 50%. Argos is the first successful attempt to
530
        predict regulatory modules using only the genome without
531
        training data. All our results and module predictions
532
        across the 
533
        Drosophila genome are available at
534
        http://uqbar.rockefeller.edu/~siggia/. The Ahab code is
535
        available upon request from the authors.
536
      
537
      
538
        Methods
539
        
540
          Genomic data
541
          We downloaded the Release 2 genomic sequence and
542
          annotation for 
543
          Drosophila melanogaster from
544
          "Gadfly" (Genome Annotation Database of 
545
          
546
          Drosophila, http://www.fruitfly.org/(Oct 2000)).
547
          Using a map provided by Chris Mungall (private
548
          communication) we mapped the annotation, which was done
549
          on separate contigs, to chromosomal coordinates.
550
          "Flybase" http://flybase.bio.Indiana.edu/provides a
551
          curated assembly of genetic and molecular data from the
552
          existing literature. Our web sites links this database to
553
          the Gadfly annotation using a map provided by David
554
          Emmert (private communication). Since our algorithms are
555
          based on searching for clusters of common sites,
556
          microsatellites can score high, but tend not to be
557
          functional (for an exception see [ 9 ] ). We used the
558
          Tandem Repeat Remover [ 17 ] to mask microsatellites,
559
          with scoring parameters (2 5 5 75 20 20 500)
560
          (respectively; match, mismatch, indel scores; percentage
561
          priors for mismatch and indels; minimum score, and
562
          maximum length) which are as promiscuous as possible yet
563
          did not detect appreciable microsatellites in random
564
          sequences. With these settings, ~5.7% of all non coding
565
          sequences are masked.
566
          We collected from the experimental literature modules
567
          that drive blastoderm specific expression of a reporter
568
          gene in response to several of the factors in our list.
569
          In many cases the module was shown to be the minimal
570
          element. The modules mapped to chromosomal coordinates
571
          are reported in the additional File 1.
572
          We collected a total of 199 experimentally
573
          characterized sites for the factors bicoid (30 sites),
574
          caudal (21), dorsal (32), hunchback (43), knirps (27),
575
          kruppel (20), tailless (20), and torRE (6). Giant sites
576
          are too few in number and ill defined to be useable. Each
577
          binding site was mapped to the genome and padded with six
578
          bases on both sides. The multiple alignment program
579
          WCONSENSUS [ 18 ] (options, -f -d -sl -a and background
580
          frequencies representative of noncoding sequence (60%
581
          A/T)) was used to align and orient the sites and create a
582
          weight matrix for each factor. Results and references are
583
          in the additonal File 2.
584
        
585
        
586
          Algorithm Ahab: fitting multiple weightmatrices to
587
          sequence
588
          Ahab computes an optimal probabilistic segmentation of
589
          a sequence 
590
          S into binding sites and background
591
          for a fixed set 
592
          W of sequence motifs modelled by
593
          weight matrices. Ahab is related to the mobydick
594
          algorithm [ 11 ] , but has several novel key features
595
          described in detail in the results section and the
596
          additional File 6. It fits the probabilities 
597
          p 
598
          
599
            w 
600
           of the (fixed) matrices 
601
          w and background 
602
          p 
603
          
604
            B 
605
           , so as to maximize the likelihood of generating 
606
          S under a certain explicit model.
607
          Namely, select a weight matrix or background according to
608
          its probability 
609
          p 
610
          
611
            w 
612
           , 
613
          p 
614
          
615
            B 
616
           , sample according to the predetermined frequencies,
617
          and add the resulting bases to the sequence under
618
          construction. The fit of the model to 
619
          S is accomplished by defining the
620
          probability of a particular segmentation T of 
621
          S as
622
          
623
          where 
624
          k = 1, 2,..., 
625
          N ( 
626
          T ) labels the weight matrices (or
627
          background) which were used in segmentation 
628
          T . The quality of the match
629
          between the weight matrix 
630
          w 
631
          
632
            k 
633
           and the subsequence 
634
          s = ( 
635
          n 
636
          1 ,... 
637
          n 
638
          
639
            l 
640
           ) is incorporated in 
641
          m ( 
642
          s | 
643
          w 
644
          
645
            k 
646
           ) = , where 
647
          f 
648
          
649
            j 
650
           ( 
651
          n | 
652
          w 
653
          
654
            k 
655
           ) are the normalized frequencies of nucleotide 
656
          n at position 
657
          j for weight matrix 
658
          w 
659
          
660
            k 
661
           .
662
          An important consequence of equation (1) is that
663
          multiple binding sites with weak matches to the weight
664
          matrix for the same factor ( large, 
665
          m ( 
666
          s | 
667
          w 
668
          
669
            k 
670
           ) small) may make an important contribution to 
671
          P ( 
672
          T ). In many cases these redundant
673
          sites with low weight matrix scores have been observed in
674
          experimentally known modules. Any algorithm that would
675
          just count matches of sequences to a weight matrix above
676
          a certain threshold would have to use ad-hoc measures to
677
          incorporate these sites into the score. Note also that
678
          the weight matrix only captures the sequence dependent
679
          part of the binding energy, so 'weak' binding could
680
          equally well be termed 'nonspecific'. We know too little
681
          about the physical binding energies of transcription
682
          factors, and their cofactors and protein concentrations
683
          in vivo, to calculate whether any modules are actually
684
          occupied by factors.
685
          Ahab uses a 
686
          local Markov model of order 
687
          q for background sequence, that is,
688
          a single base 
689
          n at site 
690
          j is segmented with probability 
691
          p 
692
          
693
            B 
694
           
695
          f 
696
          
697
            j 
698
           ( 
699
          n | 
700
          B ), where ∑ 
701
          
702
            n 
703
           
704
          f 
705
          
706
            j 
707
           ( 
708
          n | 
709
          B ) = 1, and 
710
          f 
711
          
712
            j 
713
           ( 
714
          n | 
715
          B ) is contingent on the 
716
          q preceding bases following the
717
          usual Markov model definitions. The 
718
          f 
719
          
720
            j 
721
           are computed by enumerating all 
722
          q + 1 tuples of bases in 
723
          S, which has the effect of
724
          suppressing the number of copies of any 
725
          w which match frequent triples of
726
          bases, eg poly A tracts (we typically use 
727
          q = 2).
728
          The likelihood 
729
          Z to observe 
730
          S is then
731
          
732
          Dynamic programming allows the calculation of 
733
          Z ( 
734
          S ) in a time proportional to the
735
          sequence length and the number of weight matrices. The
736
          maximization of 
737
          Z ( 
738
          S ) then determines 
739
          p 
740
          
741
            w,B 
742
           (see additional File 6). The likelihood 
743
          Z 
744
          
745
            B 
746
           that 
747
          S comes from background only (ie 
748
          p 
749
          
750
            B 
751
           == 1) is trivially computed from the Markov model.
752
          The score 
753
          R that 
754
          S is a regulatory module is then in
755
          log-odds units
756
          
757
          At each position 
758
          i = 1, ..., 
759
          L in 
760
          S the probability 
761
          P 
762
          
763
            i 
764
           ( 
765
          w | 
766
          S ) to observe the start of weight
767
          matrix 
768
          w of length 
769
          l 
770
          
771
            w 
772
           is computed by standard posterior decoding. Let 
773
          Z ( 
774
          i, j ) denote the likelihood for
775
          the sequence 
776
          S from position 
777
          i up to 
778
          j. (The symbol 
779
          Z ( 
780
          S ) used above, is just 
781
          Z (1, 
782
          L ).)
783
          
784
          using the optimized 
785
          p 
786
          
787
            w 
788
           's. The sum of equation (4) over all positions is
789
          the average copy number of 
790
          w in the data. Summing over all
791
          segmentations (1) naturally allows for overlapping sites
792
          and the 'profile' (4) quantifies the competition between
793
          different factors for the same bases.
794
          It takes a modern LINUX workstation about 2 days to
795
          run Ahab over the entire genome with a 500 bp window
796
          moved in 20 bp steps, fitting the gap gene weight
797
          matrices we collected. We enumerated all local maxima in
798
          the score 
799
          R larger than 15 and eliminated
800
          those within 500 bp of a higher scoring peak and obtained
801
          216 disjoint regions. If we insist that at least 3
802
          different factors contribute to the module with
803
          individual average copy numbers of at least 1 the number
804
          of modules reduces to 169, and eliminating all candidates
805
          with 80 or more basepairs masked by the Tandem Repeat
806
          Remover gave a final list of 146 modules. Predictions
807
          with more than 200 bases overlap with a known module or
808
          with an overlap of at least 50% of the length of the
809
          known module were considered to be a recovered known
810
          module.
811
        
812
        
813
          Determination of motifs from modules
814
          We ran locally the Gibbs sampler algorithm provided by
815
          C. Lawrence's group at
816
          http://www.wadsworth.org/resnres/bioinfo/, as described
817
          in Results. We generally used a motif length of 11 bp and
818
          allowed the algorithm to vary this by +-2 if the data
819
          warranted, and took a prior copy number of 7 when fitting
820
          1-2 kb of data (again the algorithm will adjust this
821
          number). Other parameters were taken as default, and
822
          under these conditions typically 5 distinct motifs were
823
          fit.
824
          To decide whether Gibbs derived motifs matched known
825
          ones we ran the known weight matrices in both
826
          orientations over the individual sequences composing the
827
          motif plus five flanking bases and computed the position
828
          that maximized the information score. The Gibbs motif was
829
          deemed to match a known factor if:
830
          1) a single known matrix was the top match to a
831
          majority of the sequences,
832
          2) the optimal match occurred at the same position
833
          (with some variability allowed for factors such as hb,
834
          with poly-A regions),
835
          3) the information score of the match was comparable
836
          to the score of the sequences which define the matrix to
837
          the matrix itself and not dominated by the flanking
838
          bases.
839
          4) the preceding conditions were met in two
840
          independent runs (with typically 2-4 runs done for each
841
          data set).
842
        
843
        
844
          Algorithm Argos
845
          Argos is described in detail in the results section
846
          and additional File 7.
847
        
848
        
849
          Genome-wide display of our results
850
          Our webpage
851
          http://uqbar.rockefeller.edu/~siggia/contains all our
852
          predictions. For each module, all nearby genes were
853
          extracted from the annotation, their position relative to
854
          the module (ie up/down stream, intronic), and Flybase
855
          links for gene function were collected into a table. The
856
          number of binding sites for each factor in the module is
857
          listed and their position and score along with known
858
          binding sites can be viewed in graphs which were produced
859
          with "gff2ps" [ 19 ] .
860
          To view our results interactively on a larger scale
861
          together with the current fly annotation we installed the
862
          Gbrowse software from L. Stein's Generic Model Organism
863
          Systems Database Project http://www.gmod.org. Our modules
864
          (predicted and experimental), binding sites, and
865
          restriction sites are included in the display. A function
866
          was added that allows to plot the Ahab score (equation 3)
867
          along the genome. Thus, the user can explore where
868
          additional putative modules fall relative to any gene of
869
          interest.
870
        
871
      
872
      
873
        Contributions
874
        Authors 1, 2 and 4 carried out the computional part of
875
        this study, author 3 annotated the Ahab results. All
876
        authors read and approved the final manuscript.
877
      
878
      
879
        Note
880
        1Randomized sequence was produced by randomly pooling
881
        and concatenating 100 basepair chunks from genomic
882
        noncoding sequence
883
      
884
    
885
  
886

887
Product

Resources

Company