Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
In higher eukaryotes, many genes feature differential
7
spatial-temporal expression during development and
8
throughout the life cycle of the organism. Their complex
9
transcription regulation is thought to be achieved by the
10
combinatorial action of multiple transcription factors
11
which bind to
12
cis- regulatory DNA sequences. Here,
13
transcription factors are defined as proteins which
14
recognize and bind regulatory sites and have a potential to
15
modulate directly or indirectly through the recruitment of
16
cofactors the activity of the basal transcriptional
17
apparatus of proximal genes. The number of transcription
18
factors is a substantial part of the total number of genes
19
in any organism, for example about 700 out of 13,500 genes
20
in Drosophila [ 1 ] .
21
Although combinatorial action of transcription factors
22
has been studied throughout the life cycle of organisms [ 2
23
] , perhaps the most coherent picture has emerged in the
24
context of developmental processes [ 3 4 ] . Here, a great
25
number of experiments suggest that a major part of the gene
26
regulatory apparatus is organized in the form of separable
27
cis- regulatory modules [ 3 ] . A
28
given module defines specific aspects of the
29
spatio-temporal pattern of gene expression by the
30
combinatorial action of multiple transcription factors
31
which together define the rate of transcription. Modules
32
thus integrate inputs from several genes and regulate
33
another gene to form developmental networks. Modules seem
34
to share several architectural features [ 5 ] : They are
35
typically only hundreds of nucleotides in length and
36
contain multiple binding sites for as many as 4-5 different
37
transcription factors. The frequent occurence of multiple
38
copies of the same motif as well as the enrichment of
39
certain combinations of motifs in a module in comparison
40
with the genome at large provide the basis for our
41
computational strategies to predict genes which are part of
42
the same regulatory network. Existing algorithms for
43
discovering modules [ 6 7 8 ] are based on counting the
44
number of matches of a certain minimal strength to known
45
motifs and thus require ad-hoc parameters for each motif,
46
resulting in parameter dependent predictions. These
47
algorithms are also bound to miss multiple weak binding
48
sites which are known to be present in many modules. We
49
demonstrate a novel algorithm, Ahab, which overcomes these
50
problems.
51
However, the binding sites (motifs) which reside in
52
modules are often not known. A recent paper [ 9 ] proposed
53
an algorithm for identfying these sites; here we show that
54
a standard method is capable of identifying typically half
55
of the known binding sites in a module. Its entire output
56
can then be used as input to Ahab (Figure 1). Finally, we
57
ask if the redundancy of sites inside modules is strong
58
enough to predict modules using only genomic sequence. To
59
our knowledge, our algorithm Argos is the first successful
60
attempt to do this for a metazoan genome.
61
Our system of choice is the body patterning of the early
62
Drosophila embryo which is established by a multi-tiered
63
hierarchy of transcription factors [ 10 ] . Broadly
64
distributed maternal factors trigger zygotic gap gene
65
expression in discrete domains along the anterior-posterior
66
axis of the embryo. Maternal and gap gene factors together
67
trigger pair rule gene expression in 7 alternating stripes,
68
which in turn regulate segment polarity and homeotic gene
69
expression in 14 stripes. Many of these factors are known,
70
their binding sites have been studied, and more than 20
71
modules have been identified. For this system, we show by
72
comparing our predictions to literature results that we can
73
predict regulatory modules using different levels of input
74
- binding sites, regulatory sequence identified by
75
'promoter bashing'/sufficiency tests, and genomic sequence
76
itself. Altogether, we predict roughly 200 new modules and
77
7 new sequence motifs and we analyze and validate the
78
performance of each of the three algorithms.
79
Figure 1summarizes the input/output levels and serves as
80
a reference for what follows.
81
82
83
Results
84
85
Using known binding sites, Ahab finds 135
86
significant new modules in the genome
87
There are a sufficient number of binding sites in the
88
literature (see additional File 1and 2) for us to
89
construct frequency weight matrices for the maternal
90
transription factors Bicoid (Bcd), Caudal (Cad) and
91
Dorsal (Dl), as well as the zygotic gap gene factors
92
Hunchback (Hb), Kruppel (Kr), Knirps (Kni), and Tailless
93
(Tll), and the torso response element (torRE). With the
94
exception of giant, which is too ill-defined to model,
95
these are all the maternal/gap genes at the top of the
96
segmentation gene hierarchy.
97
Modules can be located by scanning the genome in
98
windows, counting the number of matches to each matrix
99
with a score better than some value, and then combining
100
scores and ranking [ 6 7 ] . We have designed an
101
algorithm, Ahab, which eliminates all parameters other
102
than a final rank, and, following the logic of the
103
mobydick algorithm [ 11 ] , computes via maximum
104
likelihood the probability that the window sequence is
105
made up by sampling from the known weight matrices or
106
background. Mobydick was designed to find overrepresented
107
"words" in noncoding DNA, and applied to yeast. However,
108
in the genomes of higher eukaryotes, binding sites are
109
much fuzzier than in yeast and simple motif models are
110
unlikely to yield meaningful results. Therefore, Ahab
111
generalizes the motif model to positional weightmatrices.
112
Furthermore, we introduce a
113
local background model to cut down
114
the number of false positives arising from local
115
variations in sequence composition and degeneracy of
116
motifs (for example motifs that contain a poly A string).
117
Finally, the maximum likelihood fit has to be done
118
separately for each window and thus hundreds of millions
119
of times for the genomes of higher eukaryotes, thus
120
requiring an efficient implementation of the numerical
121
procedures. Ahab tallies all possible segmentations of
122
the sequence into binding sites (also called parsings).
123
Thus motif overlaps are allowed and weighted according to
124
how well they explain the data, and multiple weak copies
125
of a factor are all counted [ 5 12 ] . Our background
126
model is a Markov model fit to all triples of bases in
127
the window and accounts for local compositional variation
128
which could otherwise push up the representation of any
129
matrix that matched one of the high copy number base
130
triples (eg hb matches poly A tracts).
131
Additional file 1
132
133
Click here for file
134
Additional file 2
135
136
Click here for file
137
Additional file 3
138
139
Click here for file
140
Additional file 4
141
142
Click here for file
143
Additional file 5
144
145
Click here for file
146
Additional file 6
147
148
Click here for file
149
Additional file 7
150
151
Click here for file
152
We used Ahab with a 500 bp window to scan the
153
Drosophila genome. As a representative/typical example we
154
show the hairy locus, Fig. 2, a pair-rule gene expressed
155
in 7 stripes. Five modules driving individual stripes and
156
stripe pairs have been identified experimentally. Ahab
157
predicts four of them (stripes 1, 5, 6, 7), only the
158
stripe 3+4 element is not recovered. Note that two
159
modules, stripes 1 and 7, were not used as training data.
160
The support for each local maximum corresponds nicely to
161
the experimentally estimated sizes of the modules. Ahab
162
also reports scores for putative binding sites in each
163
window as defined by (4). As an example, we show the
164
even-skipped stripe 3+7 module, one of the best studied
165
cases in the literature (Fig. 3). Most of the known sites
166
are recovered and some new ones predicted.
167
Ahab finds 146 highly significant modules in the
168
genome, most located in the non-transcribed regions. 27
169
modules are inside introns, only 6 overlap with exons. It
170
recovers the 11 modules used to construct our weight
171
matrices and predicts 6 other known modules with
172
maternal/gap gene input. Thus, 17 out of 27 known modules
173
(see additional File 1) are found. Three of the missing
174
10 modules (kni64, snail, sog) are very high in score
175
(< rank 100) but lost when filtering for the presence
176
of at least three different factors. Three modules are
177
lost because they contain only dorsal sites (rho, twist,
178
zen) and we do not have matrices for the other factors
179
known to bind (such as snail and twist), however
180
searching with only the dorsal weight matrix does recover
181
them. The four remaining missing modules are low in score
182
(rank > 700), but Kruppel CD2 is recovered for a
183
window size of 700; the others are evidently low in
184
maternal/gap gene binding sites.
185
To determine whether any of the 129 novel module
186
predictions were correct, we asked whether any of the
187
adjacent genes were patterned in the blastoderm. For 15
188
modules patterned expression has been reported for one of
189
the adjacent genes; the patterns are either gap or pair
190
rule like. An additional 11 modules are suggestive, since
191
they are clustered with another, nonoverlapping
192
prediction (Table 1). Interestingly, two predicted
193
modules are in proximity to well studied segmentation
194
genes (giant, runt) but still outside known regulatory
195
regions.
196
We tested the stability of Ahab against unspecific
197
input weight matrices by eliminating the least specific
198
matrix in our list (Tailless). Tailless has ~600
199
predicted sites in the 146 modules (Table 2), twice as
200
much as any other factor. However, we found that roughly
201
75 % of the predictions without using tailless (but with
202
the same significance cutoff) were also present in the
203
list of 146. Thus, although Tailless is rather
204
unspecific, it makes a contribution to the predictions
205
without dominating them.
206
We also varied the window size to 700 bp and masked
207
out the repeats genome wide, before running Ahab, (see
208
web site). Our top 150 predictions now included 11 out of
209
the 27 known modules (see additional File 1and 3).
210
Lowering the cutoff value in the module score did not
211
help, only one additional known module was recovered when
212
including ranks 151-250. Overall, 84 modules or 58 % of
213
the window 500 dataset are also among the top 200 of the
214
window 700 set, and thus might be accorded more
215
significance. Although some known modules disappeared,
216
some very interesting new modules are predicted, in
217
window 700 runs, including modules for key genes in the
218
segmentation process such as caudal, grainyhead, odd
219
skipped (odd), and sloppy paired 1,2 (slp1,2) (see
220
additional File 3). Thus Ahab could be improved by
221
allowing for variable window lengths.
222
We estimated the false positive rate by scrambling the
223
columns (positions) in the input frequency matrices. The
224
new matrices are thus unlikely to be functional, but
225
retain the same specificity. Ahab found roughly half as
226
many "modules" for the same score cutoff as for the
227
original set ie a 50% false positive rate. Note that this
228
is a conservative estimate because part of the consensus
229
motif recognized by hunchback and caudal is largely a
230
poly T motif, and thus preserved by scrambling. Moreover,
231
only very few known patterned gene are predicted by the
232
scrambled matrices.
233
Another perhaps more straightforward estimate of
234
statistical significance and our true positive rate is to
235
use the experimental result (see footnote in [ 6 ] ) that
236
less than 2% of the genome or ~300 genes are patterned
237
during the blastoderm. Since we can not tell which of two
238
neighboring genes is regulated by each of the 102
239
intergenic modules we predict, we are obliged to label
240
237 genes (adding the 33 genes with intragenic modules)
241
as potentially patterned. For 237 random predictions one
242
expects 2% or five genes to be patterned, and the
243
probability to get 21 or more genes (see additional File
244
3) by chance (p-value) is ~10 -10. It should be stressed
245
that the true success rate of Ahab will be much higher
246
since the number of genes for which blastoderm expression
247
is demonstrated is (< 100) or 1/3 of the total. Thus
248
we expect an additional 50 genes in our set to be
249
patterned, so a 50% overall positive rate for our module
250
predictions.
251
252
253
The Gibbs sampler finds binding sites within
254
experimentally characterized modules
255
Binding site information for most of the transcription
256
factors relevant to any developmental process is only
257
rarely available. More common are modules obtained by
258
'promoter bashing' from several genes with similar
259
expression. Thus it is natural to ask, in view of the
260
site repetition within modules, whether standard motif
261
finders are able to recover good weight matrices from
262
modules and if so can these be used as input to Ahab to
263
find genes with similar regulatory inputs.
264
We have tailored the Gibbs algorithm (see methods) to
265
this problem by searching for only one site at a time,
266
and then masking only the central 1-2 bases of each
267
sequence motif found before iterating. The results were
268
thereby much more reproducible between runs. Most
269
importantly, motifs were allowed to overlap, a very
270
common occurrance in modules and arguably important for
271
their function [ 3 ] .
272
To gain confidence in the capabilities of the Gibbs
273
algorithm, we prepared two synthetic data sets
274
representative of the data we wanted to examine (eg
275
several kb long, 30-50% of the sequence covered by
276
motifs, the remaining sequence random and 60% A/T). Data
277
set 1 was made by equally sampling our Hb, Cad, Kr and
278
Tll matrices. Data set 2 was generated from four
279
synthetic frequency matrices of specificity equal to the
280
known ones. The customized Gibbs virtually perfectly
281
recovers all the synthetic weight matrices from dataset
282
2. By contrast, only half of the natural sites were
283
recovered from dataset 1. The sites overlapped and their
284
delineation was imperfect. Thus Gibbs detects sequence
285
correlations among the factors we chose, but probably
286
exaggerates them, because it computes significance based
287
on single base frequencies.
288
We then ran the Gibbs algorithm on several modules
289
with extensive binding site data, Table 3. In accord with
290
experiment, Gibbs predicts that about 30-50% of the
291
sequence is covered by motifs. The specificity of the
292
Gibbs motifs is typically higher than for the
293
experimental ones, presumably because a smaller number of
294
sites is sampled. Even when the majority of the sequences
295
composing the Gibbs motif match a single factor, there
296
are a few other strong factor matches generally in
297
different positions and perhaps the other orientation.
298
Nevertheless the Gibbs motifs are largely reproducible
299
between runs. Generally, we recover about half the
300
factors known to influence the module, and interestingly
301
predict several new motifs. The lack of a 1:1
302
correspondence between the experimental motifs (generally
303
composed from a wider range of data then presented to
304
Gibbs) and those we find, points to a real ambiguity, we
305
believe, in how to parse a sequence into binding
306
sites.
307
308
309
Using only three modules as input, Ahab finds 63
310
significant modules in the genome
311
Next we tested whether Gibbs derived matrices can be
312
fed to Ahab for a genome-wide search for modules. As
313
samples, we used three known modules that drive
314
expression of the pair rule gene hairy in stripes 5, 6,
315
and 7, which are known to receive input from Bcd, Cad,
316
Hb, Kni, Kr, and Tll (see additional File 1).
317
Customized Gibbs finds 6 highly significant weight
318
matrices within the 2 kb composite module (see additional
319
File 4). One matches the Kr weight matrix with high
320
quality, another Kni, and a third represents a mixture of
321
Hb and Cad, whose matrices are indeed quite similar due
322
to a poly T motif. The other three motifs are new. Using
323
these 6 weight matrices as input, Ahab finds 63 highly
324
significant modules genome-wide (Table 4, additional File
325
5).
326
The top four modules overlap with those used in the
327
Gibbs sampling. In addition, 13 new modules were
328
contiguous to genes that are known to be patterned in the
329
blastoderm, and two fall close to a single gene of
330
unknown function. One of the top scoring modules is the
331
hunchback central stripe module, other particularly
332
interesting hits are in the intron of knirps and 18 kb
333
upstream of hairy outside the known regulatory region,
334
and proximal to emc, a known transcriptional
335
co-repressor. Compared to the Ahab predictions, we find
336
more modules near homeotic genes (Abd-A, abd-B, Ubx,
337
hth), due to the presence of the novel Gibbs motifs. The
338
statistical significance of our predictions and the
339
inferred true positive rate are comparable to the
340
segmentation gene results.
341
342
343
Argos: prediction of regulatory modules from raw
344
genomic sequence
345
As a final generalization, we ask whether there is
346
enough repetition of sequence motifs within a module for
347
its discovery using no information other than the
348
noncoding sequence in the genome. To determine whether a
349
given motif is locally overrepresented, its frequency of
350
occurrence has to be scored against some statistical
351
model. Both Ahab and Gibbs, use counts of short strings
352
or single bases, within the window of interest to compute
353
the significance of longer motifs. We attempted running
354
Gibbs on successive windows of sequence and scoring the
355
resulting motifs with Ahab, but were not able to
356
discriminate the known modules from the remainder of the
357
upstream region.
358
We therefore devised an alternative strategy that uses
359
the information available in all the noncoding sequence
360
and thus extrinsic to the window of interest. We
361
enumerate all motifs in a class (a consensus of length 8
362
and 2 mutations worked best), and use their frequency to
363
assign a probability for observing an overrepresentation
364
of any one of them in the window of interest. The actual
365
binding motifs may be longer, we only need to capture the
366
most significant region. Typically several hundred motifs
367
are significant for each window. They heavily overlap so
368
the individual motif scores can not be simply added for
369
an overall score. Although we tried using Ahab to
370
eliminate redundant motifs, we got better results with a
371
greedy algorithm that looks only at the motifs without
372
placing them on the sequence. The greedy algorithm
373
winnows the list of motifs down by starting with the
374
highest scoring one and eliminates any motif related to
375
it under shifts and a limited number of mutations. The
376
next remaining motif is retained and overlaps with it are
377
eliminated, until ~5 quasi independent motifs are
378
obtained whose scores can be added (details in additional
379
File 7).
380
We evaluated the results of Argos for the the modules
381
in additional File 1. Log probability scores > 70 were
382
taken as significant, since they are found very rarely
383
using randomized noncoding sequence 1. With this
384
threshold and at least l00 bp overlap, half of the
385
modules were recovered, indicating a 50% false negative
386
rate. The number of false positives is difficult to
387
assess, because the code looks for
388
any regulatory module. However for
389
several well studied segmentation genes (specifically
390
even-skipped, giant, hairy, hunchback, knirps, Kruppel,
391
and tailless), with 15 known modules, we squarely hit 7
392
when looking over the entire 10 kb upstream of
393
translation start (gt, kni, Kr-730, Kr-CD2, eve3-7, eve
394
autoregulatory element, h-7), Fig. 4, but only three
395
predictions are outside of known modules (one for
396
tailless and two for hunchback). This indicates a low
397
false positive rate. Genome wide we are predicting about
398
one module per 5 kb of noncoding sequence averaged over
399
the genome (with a strong bias for noncoding vs coding),
400
which corresponds to roughly one module per gene.
401
402
403
404
Discussion
405
We have demonstrated algorithms that exploit three very
406
different levels of prior information and lead to
407
statistically highly significant predictions for early
408
developmental modules in the fly. The Ahab algorithm is
409
perhaps closest to the 'calculation' actually performed by
410
the cell. The weight matrix match is a surrogate for the
411
energetic preference of a transcription factor for a
412
particular sequence, and Ahab models the competition of
413
several factors and their binding energies for a stretch of
414
DNA (a module). Ahab ignores distances between binding
415
sites and the actual factor concentrations. Thus, the
416
success of Ahab suggests that just modeling the binding
417
energies is already predictive. It will be interesting to
418
see how well Ahab performs in situations where the
419
concerted binding of cofactors constrains the spacing of
420
binding sites [ 13 14 ] .
421
Finding overrepresented weight matrices is a well
422
studied problem for which Gibbs sampling constitutes a
423
reasonable solution if the data consists of distinct motifs
424
separated by random bases. The difficulty we have
425
encountered with this algorithm in dissecting regulatory
426
modules for binding sites is not rare or diffuse motifs but
427
rather too much signal, namely the overlay of motifs of
428
different sizes and specificities. The Gibbs statistical
429
model is not strictly correct for our data. A more adequate
430
algorithm would allow competition among motifs of different
431
lengths [ 15 ] . Irrespective of technical problems, the
432
discovery of binding motifs by site repetition is
433
qualitatively a more difficult task than their recognition
434
by transcription factors [ 16 ] . Thus our ability to
435
recover plausible motifs for about half the known factors
436
was not obvious in advance and is another manifestation of
437
the redundancy in module design.
438
Reference [ 6 ] describes another approach to locating
439
modules from clusters of known weight matrices. They count
440
the number of matches of each weight matrix in an interval
441
with a score above some empirically defined cutoff, and
442
then score a 700 bp window as significant when the total
443
number of matches for all factors is large enough.
444
Information about the background is implicitly encoded in
445
their choice of threshold. We do not have factor specific
446
cutoffs, and use a locally defined background model, which
447
renders our algorithm more automatic and less sensitive to
448
local variation in sequence composition, eg poly A
449
runs.
450
Although we are predicting many more modules than in [ 6
451
] , the positive hit rates are comparable between the two
452
methods (50% vs 10 positives out of 28 predictions [ 6 ] ).
453
A more detailed comparison of both data sets reveals,
454
however, that the 28 modules predicted by [ 6 ] , with the
455
exception of the giant one, do not overlap with any of the
456
top 137 modules predicted by Ahab, although there are 4
457
genes in common to our sets. More strikingly, the 10
458
modules for which experimental results in [ 6 ] suggest
459
functionality based on blastoderm expression of a
460
neighboring gene fall below 500 in our ranking with
461
exception of giant and one of the hairy derived modules for
462
nub, Table 4. Presumably due to the difference in
463
background model, their modules are dominated by Hb sites,
464
while ours are not, which contributes considerably to the
465
divergence of the predictions. Clearly, only direct
466
experimental validation of predicted modules through
467
reporter gene fusions will help to compare the different
468
methods. In this fashion, we plan to test a number of the
469
new modules predicted for key genes in the segmentation
470
system such as h, run, gt, odd, prd, slpl/2 and cad.
471
In order to understand regulatory networks of genes, it
472
is useful to generalize from a few genes or modules with
473
common functions to new candidates. When control is
474
combinatoric, a purely experimental approach tends to be
475
more tedious than screening a modest list of candidates.
476
Thus a potentially important aspect of our work is the
477
combination of motif discovery from modules via Gibbs
478
sampling and generalization to the entire genome with Ahab.
479
We have demonstrated the feasibility of this procedure when
480
we worked from the hairy stripe 5-7 modules. Interestingly,
481
the candidate list of similar modules genome wide was quite
482
small, but had little overlap with the top scoring modules
483
predicted from the full set of gap gene weight matrices.
484
Hopefully some of the new motifs discovered by Gibbs
485
sampling are real; perhaps they are binding sites for
486
corepressors. Clearly the first step is to confirm a
487
striped expression for some of the genes in Table 4.
488
Our algorithm Argos for predicting enhancers from raw
489
genomic sequence works astonishingly well. It will be most
490
interesting to use this approach together in conjunction
491
with the customized Gibbs sampler and Ahab in situations
492
where nothing is known experimentally about the
493
transcriptional regulation of genes of interest to identify
494
co-regulated genes. Namely, following the hierarchical
495
structure in Figure 1, Argos could be used to predict
496
modules, then the customized Gibbs sampler to predict
497
binding sites (weight matrices) and finally Ahab to
498
predict, genome wide, genes in the same regulatory
499
network.
500
Several recent papers [ 6 7 9 ] as well as ours have
501
taken only the very first steps in applying computational
502
approaches to the the elucidation of
503
cis- regulatory modules. For body
504
patterning in the fly, it is very encouraging that such
505
limited information as we have used works so well. It
506
remains to be seen if the same approaches work on systems
507
where a single master regulatory gene initiates a
508
developmental cascade, or where integration of
509
developmental cues occurs partly at the level of signal
510
transduction.
511
512
513
Conclusions
514
Predicting and understanding transcriptional regulation
515
is a fundamental problem in biology. We have designed new
516
algorithms for the detection of
517
cis- regulatory modules in the
518
genomes of higher eukaryotes which is a first step in
519
unraveling transcriptional regulatory networks. We have
520
demonstrated, in the case of body patterning in the
521
Drosophila embryo, that our
522
algorithms allow the genome-wide identification of
523
regulatory modules when the motifs for the transcription
524
factors are known (algorithm Ahab), or when only related
525
modules are known (customized Gibbs sampler in conjunction
526
with Ahab), or when only genomic sequence is analyzed with
527
Argos. We believe that Ahab overcomes many problems of
528
recent studies and we estimated the false positive rate to
529
be about 50%. Argos is the first successful attempt to
530
predict regulatory modules using only the genome without
531
training data. All our results and module predictions
532
across the
533
Drosophila genome are available at
534
http://uqbar.rockefeller.edu/~siggia/. The Ahab code is
535
available upon request from the authors.
536
537
538
Methods
539
540
Genomic data
541
We downloaded the Release 2 genomic sequence and
542
annotation for
543
Drosophila melanogaster from
544
"Gadfly" (Genome Annotation Database of
545
546
Drosophila, http://www.fruitfly.org/(Oct 2000)).
547
Using a map provided by Chris Mungall (private
548
communication) we mapped the annotation, which was done
549
on separate contigs, to chromosomal coordinates.
550
"Flybase" http://flybase.bio.Indiana.edu/provides a
551
curated assembly of genetic and molecular data from the
552
existing literature. Our web sites links this database to
553
the Gadfly annotation using a map provided by David
554
Emmert (private communication). Since our algorithms are
555
based on searching for clusters of common sites,
556
microsatellites can score high, but tend not to be
557
functional (for an exception see [ 9 ] ). We used the
558
Tandem Repeat Remover [ 17 ] to mask microsatellites,
559
with scoring parameters (2 5 5 75 20 20 500)
560
(respectively; match, mismatch, indel scores; percentage
561
priors for mismatch and indels; minimum score, and
562
maximum length) which are as promiscuous as possible yet
563
did not detect appreciable microsatellites in random
564
sequences. With these settings, ~5.7% of all non coding
565
sequences are masked.
566
We collected from the experimental literature modules
567
that drive blastoderm specific expression of a reporter
568
gene in response to several of the factors in our list.
569
In many cases the module was shown to be the minimal
570
element. The modules mapped to chromosomal coordinates
571
are reported in the additional File 1.
572
We collected a total of 199 experimentally
573
characterized sites for the factors bicoid (30 sites),
574
caudal (21), dorsal (32), hunchback (43), knirps (27),
575
kruppel (20), tailless (20), and torRE (6). Giant sites
576
are too few in number and ill defined to be useable. Each
577
binding site was mapped to the genome and padded with six
578
bases on both sides. The multiple alignment program
579
WCONSENSUS [ 18 ] (options, -f -d -sl -a and background
580
frequencies representative of noncoding sequence (60%
581
A/T)) was used to align and orient the sites and create a
582
weight matrix for each factor. Results and references are
583
in the additonal File 2.
584
585
586
Algorithm Ahab: fitting multiple weightmatrices to
587
sequence
588
Ahab computes an optimal probabilistic segmentation of
589
a sequence
590
S into binding sites and background
591
for a fixed set
592
W of sequence motifs modelled by
593
weight matrices. Ahab is related to the mobydick
594
algorithm [ 11 ] , but has several novel key features
595
described in detail in the results section and the
596
additional File 6. It fits the probabilities
597
p
598
599
w
600
of the (fixed) matrices
601
w and background
602
p
603
604
B
605
, so as to maximize the likelihood of generating
606
S under a certain explicit model.
607
Namely, select a weight matrix or background according to
608
its probability
609
p
610
611
w
612
,
613
p
614
615
B
616
, sample according to the predetermined frequencies,
617
and add the resulting bases to the sequence under
618
construction. The fit of the model to
619
S is accomplished by defining the
620
probability of a particular segmentation T of
621
S as
622
623
where
624
k = 1, 2,...,
625
N (
626
T ) labels the weight matrices (or
627
background) which were used in segmentation
628
T . The quality of the match
629
between the weight matrix
630
w
631
632
k
633
and the subsequence
634
s = (
635
n
636
1 ,...
637
n
638
639
l
640
) is incorporated in
641
m (
642
s |
643
w
644
645
k
646
) = , where
647
f
648
649
j
650
(
651
n |
652
w
653
654
k
655
) are the normalized frequencies of nucleotide
656
n at position
657
j for weight matrix
658
w
659
660
k
661
.
662
An important consequence of equation (1) is that
663
multiple binding sites with weak matches to the weight
664
matrix for the same factor ( large,
665
m (
666
s |
667
w
668
669
k
670
) small) may make an important contribution to
671
P (
672
T ). In many cases these redundant
673
sites with low weight matrix scores have been observed in
674
experimentally known modules. Any algorithm that would
675
just count matches of sequences to a weight matrix above
676
a certain threshold would have to use ad-hoc measures to
677
incorporate these sites into the score. Note also that
678
the weight matrix only captures the sequence dependent
679
part of the binding energy, so 'weak' binding could
680
equally well be termed 'nonspecific'. We know too little
681
about the physical binding energies of transcription
682
factors, and their cofactors and protein concentrations
683
in vivo, to calculate whether any modules are actually
684
occupied by factors.
685
Ahab uses a
686
local Markov model of order
687
q for background sequence, that is,
688
a single base
689
n at site
690
j is segmented with probability
691
p
692
693
B
694
695
f
696
697
j
698
(
699
n |
700
B ), where ∑
701
702
n
703
704
f
705
706
j
707
(
708
n |
709
B ) = 1, and
710
f
711
712
j
713
(
714
n |
715
B ) is contingent on the
716
q preceding bases following the
717
usual Markov model definitions. The
718
f
719
720
j
721
are computed by enumerating all
722
q + 1 tuples of bases in
723
S, which has the effect of
724
suppressing the number of copies of any
725
w which match frequent triples of
726
bases, eg poly A tracts (we typically use
727
q = 2).
728
The likelihood
729
Z to observe
730
S is then
731
732
Dynamic programming allows the calculation of
733
Z (
734
S ) in a time proportional to the
735
sequence length and the number of weight matrices. The
736
maximization of
737
Z (
738
S ) then determines
739
p
740
741
w,B
742
(see additional File 6). The likelihood
743
Z
744
745
B
746
that
747
S comes from background only (ie
748
p
749
750
B
751
== 1) is trivially computed from the Markov model.
752
The score
753
R that
754
S is a regulatory module is then in
755
log-odds units
756
757
At each position
758
i = 1, ...,
759
L in
760
S the probability
761
P
762
763
i
764
(
765
w |
766
S ) to observe the start of weight
767
matrix
768
w of length
769
l
770
771
w
772
is computed by standard posterior decoding. Let
773
Z (
774
i, j ) denote the likelihood for
775
the sequence
776
S from position
777
i up to
778
j. (The symbol
779
Z (
780
S ) used above, is just
781
Z (1,
782
L ).)
783
784
using the optimized
785
p
786
787
w
788
's. The sum of equation (4) over all positions is
789
the average copy number of
790
w in the data. Summing over all
791
segmentations (1) naturally allows for overlapping sites
792
and the 'profile' (4) quantifies the competition between
793
different factors for the same bases.
794
It takes a modern LINUX workstation about 2 days to
795
run Ahab over the entire genome with a 500 bp window
796
moved in 20 bp steps, fitting the gap gene weight
797
matrices we collected. We enumerated all local maxima in
798
the score
799
R larger than 15 and eliminated
800
those within 500 bp of a higher scoring peak and obtained
801
216 disjoint regions. If we insist that at least 3
802
different factors contribute to the module with
803
individual average copy numbers of at least 1 the number
804
of modules reduces to 169, and eliminating all candidates
805
with 80 or more basepairs masked by the Tandem Repeat
806
Remover gave a final list of 146 modules. Predictions
807
with more than 200 bases overlap with a known module or
808
with an overlap of at least 50% of the length of the
809
known module were considered to be a recovered known
810
module.
811
812
813
Determination of motifs from modules
814
We ran locally the Gibbs sampler algorithm provided by
815
C. Lawrence's group at
816
http://www.wadsworth.org/resnres/bioinfo/, as described
817
in Results. We generally used a motif length of 11 bp and
818
allowed the algorithm to vary this by +-2 if the data
819
warranted, and took a prior copy number of 7 when fitting
820
1-2 kb of data (again the algorithm will adjust this
821
number). Other parameters were taken as default, and
822
under these conditions typically 5 distinct motifs were
823
fit.
824
To decide whether Gibbs derived motifs matched known
825
ones we ran the known weight matrices in both
826
orientations over the individual sequences composing the
827
motif plus five flanking bases and computed the position
828
that maximized the information score. The Gibbs motif was
829
deemed to match a known factor if:
830
1) a single known matrix was the top match to a
831
majority of the sequences,
832
2) the optimal match occurred at the same position
833
(with some variability allowed for factors such as hb,
834
with poly-A regions),
835
3) the information score of the match was comparable
836
to the score of the sequences which define the matrix to
837
the matrix itself and not dominated by the flanking
838
bases.
839
4) the preceding conditions were met in two
840
independent runs (with typically 2-4 runs done for each
841
data set).
842
843
844
Algorithm Argos
845
Argos is described in detail in the results section
846
and additional File 7.
847
848
849
Genome-wide display of our results
850
Our webpage
851
http://uqbar.rockefeller.edu/~siggia/contains all our
852
predictions. For each module, all nearby genes were
853
extracted from the annotation, their position relative to
854
the module (ie up/down stream, intronic), and Flybase
855
links for gene function were collected into a table. The
856
number of binding sites for each factor in the module is
857
listed and their position and score along with known
858
binding sites can be viewed in graphs which were produced
859
with "gff2ps" [ 19 ] .
860
To view our results interactively on a larger scale
861
together with the current fly annotation we installed the
862
Gbrowse software from L. Stein's Generic Model Organism
863
Systems Database Project http://www.gmod.org. Our modules
864
(predicted and experimental), binding sites, and
865
restriction sites are included in the display. A function
866
was added that allows to plot the Ahab score (equation 3)
867
along the genome. Thus, the user can explore where
868
additional putative modules fall relative to any gene of
869
interest.
870
871
872
873
Contributions
874
Authors 1, 2 and 4 carried out the computional part of
875
this study, author 3 annotated the Ahab results. All
876
authors read and approved the final manuscript.
877
878
879
Note
880
1Randomized sequence was produced by randomly pooling
881
and concatenating 100 basepair chunks from genomic
882
noncoding sequence
883
884
885
886
887