Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
The complete sequencing of several genomes, including
7
that of the human, has signaled the beginning of a new era
8
in which scientists are becoming increasingly interested in
9
functional genomics; that is, uncovering both the
10
functional roles of different genes, and how these genes
11
interact with, and/or influence, each other. Increasingly,
12
this question is being addressed through the simultaneous
13
analysis of hundreds to thousands of unique genetic
14
elements with microarrays. Already, analytical strategies
15
have subdivided into distinct 'omic' domains, such as
16
genomics, proteomics, and metabolomics. This enables
17
researchers to examine not only genetic elements, but also
18
the corresponding proteins and metabolites derived from
19
these genes. All 'omic' technologies share the need for
20
fresh, innovative looks at data analysis. To date,
21
transcriptomics is the most widely studied molecular
22
approach, enabling researchers to examine subtle
23
differences in thousands of mRNA levels between
24
experimental samples, medical biopsies,
25
etc . Although mRNA is not the end
26
product of a gene, the transcription of a gene is both
27
critical and highly regulated, thereby providing an ideal
28
point of investigation [ 1 2 ] . Development of microarrays
29
has permitted global measurement of gene expression at the
30
transcript level and provided a glimpse into the
31
coordinated control and interactions between genes.
32
Presently, two technologies dominate the field of
33
high-density microarrays: cDNA arrays and oligonucleotide
34
arrays. The cDNA array has a long history of development [
35
3 ] stemming from immunodiagnostic work in the 1980s;
36
however, it has been most widely developed in recent years
37
by Stanford University (California) researchers depositing
38
cDNA tags onto glass slides, or chips, with precise robotic
39
printers [ 4 ] . Labeled cDNA fragments are then hybridized
40
to the tags on the chip, scanned, and differences in mRNA
41
between samples identified and visualized using a variation
42
of the red/green matrix originally introduced by Eisen and
43
colleagues [ 5 ] . The light-generated oligonucleotide
44
array, developed by Affymetrix, Inc. (Santa Clara, CA),
45
involves synthesizing short 25-mer oligonucleotide probes
46
directly onto a glass slide using photolithographic masks [
47
6 7 ] . Sample processing includes the production of
48
labeled cRNA, hybridization to a microarray, and
49
quantification of the obtained signal after laser scanning.
50
Regardless of the array used, the output can be readily
51
transferred to commercially available data analysis
52
programs for the selection and clustering of significantly
53
modified genes.
54
Differentially expressed genes will be defined herein as
55
gene data determined to be statistical outliers from some
56
standard state, and which can not be ascribed to chance or
57
natural variabilty. Various creative techniques have been
58
proposed and implemented for the selection of
59
differentially expressed genes; however, none have yet
60
gained widespread acceptance for microarray analysis.
61
Despite this, there remains a great impulse to develop new
62
data analysis techniques, partly driven by the obvious need
63
to move beyond setting simple fold change cut-offs which
64
are out of context with the rest of the experimental and
65
biological data at hand [ 8 9 10 11 ] . This has been the
66
case for many studies, where the selection of differential
67
gene expression is performed through a simple fold change
68
cut-off, typically between 1.8 and 3.0. There is an
69
inherent problem with this selection criterion, as genes of
70
low absolute expression have a greater inherent error in
71
their measured levels. These genes will then tend to
72
numerically meet any given fold change cut-off even if the
73
gene is not truly differentially expressed. The inverse
74
also holds true, where highly expressed genes, having less
75
error in their measured levels, may not meet an arbitrary
76
fold-change cut-off of 2.0 even when they are truly
77
differentially expressed [ 12 ] . Therefore, selecting
78
differentially regulated genes based only on a single fold
79
change across the entire range of experimental data
80
preferentially selects lowly expressed genes [ 8 ] . This
81
commonly used approach does not accommodate for background
82
noise, variability, non-specific binding, or low copy
83
numbers- characteristics typical of microarray data which
84
may not be homogeneously distributed. Other approaches
85
entail the use of standard statistical measures such as a
86
student's
87
t -test or ANOVA for every individual
88
gene. However, due to the cost of repeating microarray
89
experiments, the number of replicates usually remains low,
90
leading to inaccurate estimates of variance [ 8 ] .
91
Furthermore, due to the low number of replicates, the power
92
of these "gene-by-gene" statistical tests to differentiate
93
between regulated and non-regulated genes also remains very
94
low.
95
The present article describes a model that considers
96
both expression levels and fold changes for the
97
identification of significant differentially expressed
98
genes. This simple model allows the experimenter to
99
estimate the relationship between these two parameters in
100
the absence of large numbers of experimental replicates,
101
where the inherent error of measures cannot be accurately
102
estimated. Subsequently, gene transcripts determined to be
103
outliers from the trend can be considered differentially
104
expressed genes. An added strength to the model lies in its
105
ease of application to any data set. This model should be
106
considered a progressive and cyclical process, where the
107
data analyst can quickly and globally identify a list of
108
potentially differentially regulated genes with confidence,
109
based on the inherent qualities of the data set under
110
evaluation.
111
The model presented herein was developed with a data set
112
from a nutritional experiment in a mouse model using
113
Affymetrix Mu11K chips, where the effects of four diets
114
were compared in a number of organs (pool of five mice for
115
each sample in each organ): (1) control diet A in duplicate
116
from the same pool; (2) diet B; (3) diet C; and (4) diet D.
117
Details of the dietary treatments will be reported
118
elsewhere. The present article will take only the data from
119
the liver as an example for the development of a gene
120
selection model. The model was validated by real-time
121
polymerase chain reaction (RT-PCR) and indicates good
122
concordance between the two experimental techniques.
123
124
125
Results and Discussion
126
127
Selection of differentially regulated genes and
128
data analysis
129
The method developed herein includes: (A)
130
determination of the upper X% of highest fold changes
131
within narrow bins of absolute expression levels in order
132
to generate a limit fold change (LFC) function; and (B)
133
subsequent ranking of genes by a combined fold
134
change/absolute expression calculation. The following
135
discussions describe the development of the model within
136
the context of our nutritional study; however, a generic
137
protocol can be found in the Materials and Methods
138
section.
139
140
141
(A) Selection of the upper X% of highest fold
142
changes within binned absolute expression levels
143
The principal parameter for gene expression data
144
stemming from a typical Affymetrix experiment is the
145
average difference intensity (ADI), which is a
146
representation of the absolute expression of a gene. As
147
indicated in the literature, it is common practice to
148
establish a minimal expression threshold below which data
149
are considered to be noise. In the case of Affymetrix
150
data, it is often necessary to discard minimal and
151
negative ADIs, as these data are both biologically and
152
mathematically difficult to interpret.
153
A number of previous reports have used an ADI
154
threshold (
155
A
156
157
t
158
) value of 20 in the standard Affymetrix range [ 13
159
14 15 16 ] ,
160
i.e. probe sets with ADI's of less
161
than 20 would either be rejected or set to 20 as
162
meaningful differences in gene expression can purportedly
163
be evaluated above this level. Although empirically
164
supported, an
165
A
166
167
t
168
of 20 is essentially an arbitrary selection and not
169
all groups select the same threshold value. The exact
170
setting of this lower
171
A
172
173
t
174
is not inherent to the LFC modeling process, and the
175
reader is encouraged to set the
176
A
177
178
t
179
value based on additional criteria, such as that
180
previously published by Gerhold
181
et al . [ 17 ] and Dieckgraefe
182
et al . [ 18 ] . However, an
183
A
184
185
t
186
of 20 will be used in the present work, for which
187
the selection of differentially expressed genes in the
188
context of ADI dependent variance is the central focus.
189
Therefore, all ADI's less than 20 were set to 20 and any
190
probe set with a value of 20 across all dietary
191
treatments were discarded. After eliminating the probe
192
sets which met these criteria there remained 9391 genes
193
out of the original 13179 genes represented on the Mu11K
194
GeneChip.
195
An additional parameter, highest fold change (HFC),
196
was then applied to these remaining genes. HFC is defined
197
as:
198
199
where A, B, C, and D represent the individual
200
microarray results for each gene. The HFC is inherently a
201
ratio metric of the maximum change in measured gene
202
expression between any combination of experimental
203
treatments. The present experiment has four dietary
204
conditions with microarray data; however, it should be
205
noted that the HFC equation could be expanded to any
206
number of conditions or experimental treatments.
207
The determination of HFC is highly influenced by
208
absolute expression, and trends can be readily observed
209
in our data set where HFC is negatively correlated with
210
absolute expression (Figure 1a). For example, with
211
absolute expression values greater than 5000 it is of low
212
probability to observe an HFC greater than 2. However,
213
with absolute expression values near 50, an HFC of
214
greater than 2 is readily seen. Although not shown in
215
Figure 1a, this trend could be observed for any pair or
216
triplet of experimental comparison in the current data
217
set,
218
i.e. AB, AC, AD, BC, BD, ABC, BCD.
219
It has also been observed across multiple experiments
220
examined in our laboratory (data not shown). This
221
consistancy can be explained by the fact that there are
222
very few genes out of the entire transcriptome which are
223
differentially expressed due to treatment. Therefore,
224
most measured gene transcripts display a typical
225
coefficient of variation independent of treatment. The
226
few genes which are differentially expressed do not
227
unduly affect the overall trend. Therefore, the trend
228
lends itself to characterization and may be used as a
229
metric for determining differential gene expression
230
across multiple experiments.
231
This empirically implies that natural variation,
232
expressed here as HFC, tends to be much greater at low
233
expression levels. This concept is supported in the
234
literature [ 12 ] and questions the appropriateness of
235
using a linear fold change cut-off in a system
236
characterized by heterogenous variance.
237
As stated previously, the selection of differentially
238
expressed genes is essentially a search for outliers,
239
i.e. gene data lying outside some
240
standard distribution of differences relative to a
241
control state, and which cannot be ascribed to chance or
242
natural variabilty. To determine those genes which are
243
outliers, it is necessary to measure either the
244
variability of the system or to make valid assumptions
245
regarding the distribution of variability. In the present
246
model we assume that: (1) as mentioned above, variability
247
in the measurement of gene expression is related to the
248
ADI; and (2) if a broad sampling of the transcriptome is
249
measured, only a small number of genes will actually be
250
outliers even in the harshest of experimental conditions.
251
Assumption (1) is a fairly general analytical concept,
252
i.e. the closer data is to the
253
measurement threshold, the higher the variability is in
254
that measurement [ 12 19 ] . Assumption (2) appears to be
255
empiricaly valid when surveying the literature for
256
high-density microarray experiments which evaluate severe
257
biological events, from caloric restriction [ 20 21 ] to
258
apoptosis [ 22 23 ] . In these experiments [ 20 21 22 23
259
] , regardless of the gene selection method used, less
260
than 5% of the total number of genes probed were
261
differentially regulated. Therefore, to develop the
262
present model of gene selection, the validity of
263
selecting outliers was evaluated for a range of highly
264
variable genes using the top 5% as a benchmark. Model
265
trends were then examined from 1% to 10%.
266
The model was developed by first binning gene
267
expression data into tight classes across the entire
268
range of absolute expression values, where genes with an
269
equal absolute expression value were randomly ordered,
270
and then selecting the upper 5% of HFC values for further
271
consideration. Binning was carried out to divide the
272
entire range of absolute expression values into bins
273
containing an equal number,
274
m , of genes, where
275
m = 200. Therefore, bin widths
276
(ADI) were not necessarily equal, yet the number of genes
277
contained in each bin was equivalent. For the first round
278
of analysis, the upper 5%, or 95 thpercentile, of HFC
279
genes in each bin were selected for further consideration
280
(Figure 1a). It was possible to search separately for the
281
5% of genes with the greatest HFCs in each class;
282
however, in order to simplify the overall selection, we
283
plotted the relationship between absolute expression,
284
defined as min ADI (A,B,C,D), and HFC, in order to set
285
the LFC function. Herein, the min ADI (A,B,C,D) will
286
simply be referred to as min ADI. This relationship was
287
then modeled using a simple equation of the form
288
LFC = a + (b/min ADI) , which is
289
fitted to the 95 thpercentile of each bin (Figure 1a) to
290
produce the LFC curve that best models the expression
291
data. This modeled LFC curve (5% LFC model = 1.74 +
292
91.55/min ADI) fit the data well (R 2= 0.98) and further
293
analysis indicated the residuals were randomly
294
distributed (data not shown). The equation for the line
295
of best fit contains two parameters that have various
296
repercussions on gene selection, both of which can be
297
defined in commercially available software using common
298
"solve" functions (
299
e.g . Microsoft Excel). First,
300
a sets the asymptote, which
301
corresponds to the minimum HFC value that can be observed
302
at any given ADI. Second,
303
b raises/lowers the limit function
304
at a given ADI, and is therefore highly influenced by
305
this latter value. For example, the smaller the ADI the
306
greater the LFC, and vice versa. Figure 1bshows that as
307
the selection criteria becomes more strict (top 5% → 1%
308
of genes), the curve shifts (1% LFC model = 2.43 +
309
166.12/min ADI) and becomes more restrictive in the
310
selection of differential genes,
311
i.e. at any given absolute
312
expression level a higher fold change must be observed
313
for a gene to be considered differentially expressed. The
314
opposite is true when the selection criteria becomes less
315
strict (top 5% → 10% of genes), where the curve shifts
316
(10% LFC model = 1.59 + 69.47/min ADI) and results in a
317
more permissive selection of differential genes.
318
Using the aforementioned equations the selection of
319
genes for further consideration becomes simple and
320
'global' (
321
i.e. across the entire range of
322
expression levels); where a gene is selected with the HFC
323
approach if max ADI/min ADI > a+(b/min ADI). After
324
applying the 10% LFC gene filter, 869 genes remained in
325
the list out of the 9391 candidate genes selected from
326
the original 13179 genes on the GeneChip. When interested
327
in the top 5% and 1% of significant genes, the total
328
number of genes that meet the LFC requirements is 471 and
329
82, respectively.
330
Lastly it should be noted that the LFC,
331
i.e. the modeled trend of HFC vs.
332
min ADI, is based on binned data of hundreds of genes
333
across multiple conditions leading to a highly powerful
334
characterization of a given threshold. In other words,
335
there is a large amount of data available in order to
336
accurately characterize the trend. The same argument
337
holds for the generation of a modeled confidence interval
338
based on low numbers of replicates, as will be described
339
below. This is in contrast to the relatively low
340
statistical power of conventional "gene-by-gene" tests
341
such as the
342
t -test or ANOVA, often used for
343
the selection of differentially expressed genes [ 8 ]
344
.
345
346
347
(B) Assignment of gene rank
348
Following gene selection, a rank of 'importance' or
349
'interest level' was assigned to each selected gene. It
350
should be noted that the LFC is not dependent on the rank
351
calculation; rather rank simply lends relative
352
'importance' to selected genes by incorporating both the
353
magnitude of fold change and absolute expression values.
354
The rank number (RN) for each gene was determined by
355
first calculating a rank value (RV), which can be defined
356
as: RV = HFC * (max ADI - min ADI). After calculation of
357
RV, gene lists were sorted and then assigned a simple RN
358
of 1,2,3,4..., where a gene with a RN of 1 corresponds to
359
the gene with the highest RV. The RV is an arbitrary
360
value that simply lends importance to selected genes with
361
both high fold changes and high differences in absolute
362
expression. Both RV and RN aid in the discussion of
363
differential gene effects by adding the concept of
364
relative weight or importance amongst selected genes.
365
This concept aids in the choice of genes for validation
366
or follow-up studies, as detailed below.
367
368
369
(C) Model validation
370
371
Validation of the LFC model via characterization
372
of measurement variability
373
Hess and colleagues have recently examined the
374
concept that variability and absolute expression are
375
related; however, they examined only the variability of
376
replicate spots on a single slide [ 24 ] . Herein, we
377
extended this concept to examine the variability
378
between genes on different microarrays. Measurement
379
variance was examined following the development of the
380
LFC model, and was therefore used simply as a
381
confirmation of this model. To further understand the
382
nature of measurement variability within the current
383
study, duplicate Mu11K Affymetrix microarrays for the
384
controls were examined (see Materials and Methods
385
section). A pooled RNA sample from mice (
386
n = 5) fed the control diet was
387
hybridized to two different chips, and the data was
388
analyzed to characterize measurement variability. It
389
was apparent from the trend that as absolute expression
390
levels (ADI) increase, the coefficient of variation
391
(CV= SD/MAE) decreases. The trendline was calculated as
392
detailed in the Materials and Methods section. This
393
trendline was overlayed on the entire data set, in
394
addition to the 5% LFC selected data (shown in red), in
395
Figure 1c. By overlaying the trendline of the within
396
variability data on those genes determined to be
397
significantly regulated by the LFC model, the CV upper
398
confidence limit for these selected genes had a
399
p value ≤ 0.001. Thus, the 5%
400
LFC-selected data lies outside the 99.9% confidence
401
interval surrounding measurement variability,
402
reinforcing the validity of the results.
403
404
405
Real-time polymerase chain reaction
406
(RT-PCR)
407
The results obtained from a microarray experiment
408
are influenced by each step in the experimental
409
procedure, from array manufacturing to sample
410
preparation and application to image analysis [ 25 ] .
411
The preparation of the cRNA sample is highly correlated
412
to the efficiency of the reverse transcription step,
413
where reagents and enzymes alike can influence the
414
reaction outcome. These factors affect the
415
representation of transcripts in the cRNA sample,
416
necessitating the need for validations by complementary
417
techniques. Analyses by northern blot and RNAse
418
protection assays are commonly reported; however, the
419
emerging 'gold-standard' validation technique is RT-PCR
420
[ 26 ] . Microarrays tend to have a low dynamic range,
421
which can lead to small yet significant
422
under-representations of fold changes in gene
423
expression [ 27 ] . As RT-PCR has a greater dynamic
424
range, it is often used to validate the observed trends
425
rather than duplicate the fold changes obtained by chip
426
experiments [ 26 28 29 ] .
427
Having chosen genes that lie across the range of RN,
428
and therefore the range of model selection criteria,
429
RT-PCR was performed in triplicate for each
430
experimental condition (Diet A,B,C,D) using the same
431
pooled stocks of liver RNA (5 mice per experiment).
432
Genes were compared to the endogenous controls β-actin
433
and GAPDH, which did not significantly change across
434
the dietary treatments. As determined by our LFC
435
selection model, the GeneChip microarrays indicated no
436
significant differences amongst the 4 diets for either
437
GAPDH or β-actin. Subsequent confirmation that both
438
GAPDH and β-actin did not change was provided by
439
RT-PCR, where a simple student's
440
t -test with a predefined nominal
441
α level of 0.05 indicated no significant differences
442
between the experimental diets (B,C,D) and the control
443
diet A. RT-PCR provided a means to confirm the effects
444
of the 3 dietary treatments on 9 genes (Table 1) and
445
the concordance between these 27 microarray and RT-PCR
446
results was examined. Perfect concordance was not to be
447
expected due to the inherent differences in sensitivity
448
and dynamic range between the two techniques. However,
449
a good overall concordance of 77.7% for differential
450
gene expression was observed,
451
i.e. the fold change for a given
452
gene seen by microarray was directionally consistent
453
with that seen by RT-PCR, regardless whether the
454
results were significant by either the 5% LFC model
455
(for microarray data) or a student's T-test (for RT-PCR
456
data). When examining only those genes considered
457
significantly changed by RT-PCR (α = 0.05, starred
458
values in Table 1), concordance increases to 85.7%.
459
Therefore, the value of 85.7% indicates the overall
460
concordance between significantly changed genes seen by
461
RT-PCR and those microarray pairwise comparisons
462
(treatment vs. control) that meet the LFC model
463
criteria (§ values in Table 1).
464
What is noticeable through the color scheme (Table
465
1) is genes with high RN (low RV) have relatively less
466
concordance between the two techniques; where red
467
indicates no concordance and blue indicates only one or
468
two (out of three) of the results agreed. However, the
469
majority of genes are colored in green, indicating
470
perfect directional concordance. When specifically
471
examining fatty acid synthase (FAS), a highly expressed
472
gene, microarray fold changes of less than 2 can be
473
corroborated between the two experimental techniques,
474
reinforcing the strength of this fold change model.
475
Furthermore, it is clear from the RT-PCR data that at
476
very low expression levels, high fold changes are still
477
problematic to verify and remain questionable. The
478
present model takes this into account by raising the
479
criteria appropriately at the low expression range,
480
i.e. a higher fold change at low
481
expression levels is required for a gene to be
482
considered differentially expressed.
483
As the selection criteria with microarray data was
484
that the HFC must be greater than the LFC model limits,
485
the expectation was that the LFC function could be
486
validated by RT-PCR (underlined values in Table
487
1indicate HFC for each gene). This is predominantly the
488
case across the full dynamic range of data selected by
489
the model (77.7% / 85.7 % concordance), except for very
490
lowly expressed genes such as the RAS oncogene. For
491
genes with a slightly lower RN (higher RV), such as
492
ABC1 member 7, some concordance is seen, indicating
493
confidence is gaining as RV increases. For genes with
494
an RN lower than 130 (RV > 1156;
495
e.g. USF-2) concordance quickly
496
approaches 100%, indicating high confidence when
497
discussing gene trends or individual gene results.
498
These results reinforce the concept that RN is
499
correlated with confidence and validity when discussing
500
the gene set produced by the LFC model.
501
Interestingly, one might expect that genes with an
502
RN lower then 130 would be concentrated only at higher
503
expression levels; however, when the spread of genes
504
with an RN between 1-130 were examined, these genes
505
were found to lie across the entire range of absolute
506
expressions (data not shown). This indicates that a 5%
507
LFC model is confidently selecting differentially
508
regulated genes across the full range of absolute
509
expression. Therefore, the 5% LFC model appears to be
510
an appropriate selection criteria for the present
511
experimental data set; however, the fold change
512
percentage could easily be varied to meet other
513
acceptable levels of risk, as is done with conventional
514
hypothesis testing (
515
e.g. α-,
516
p -, and χ 2-values). The X%
517
selection criteria should then be re-evaluated for
518
other experimental data sets in relation to the
519
variance and validation data at hand.
520
521
522
523
524
Conclusions
525
The analysis of microarray data is a developing field of
526
study aimed at enabling the biomedical community to cope
527
with the waves of large microarray data sets. Already, an
528
evolution can be observed with respect to the methods for
529
selecting significantly changed genes. Researchers are
530
moving away from simple fold change cut-offs and
531
incorporating the use of robust statistical concepts. The
532
conclusion that highly expressed genes will rarely have a
533
2-fold change in mRNA levels and that lowly expressed genes
534
will commonly have a greater than 2-fold change led to the
535
development of a model that would accommodate for this real
536
biological characteristic of gene expression measurements.
537
The fold change model presented in this paper considers
538
both the absolute expression level and fold change of every
539
gene across the entire range of observed absolute
540
expressions. In addition, the concept of increased
541
variation in lowly expressed genes is incorporated into the
542
selection model through the higher fold change requirements
543
for differential gene selection at low expression levels.
544
Following gene selection using an initial criterion of X%,
545
gene rank was introduced as a basis for choosing genes to
546
validate the model. Therefore, a limited but judicious
547
choice of model parameters to select genes across a broad
548
range of gene rank can then be used to reset the X% in
549
order to correspond with the data at hand (Figure 2). The
550
variance data characterizing measurement variability
551
supports the selection model, indicating that selected
552
genes lie outside measurement variability at very high
553
confidence limits (> 99.9% CL). Further validation of
554
this model in the current data set by RT-PCR confirmed
555
these relationships, reinforcing that genes with fold
556
changes even less than 1.8 can be measured, assuming
557
adequate absolute expression levels. This demonstrates that
558
biological changes in sample concentration of mRNA, even at
559
low fold change levels, can be confidently determined.
560
In summary, the X% LFC model enables one to define
561
experiment specific selection stringency while maintaining
562
simplicity and ensuring global coverage for the detection
563
of differential gene selection.
564
565
566
Materials and Methods
567
568
Mice and feeding conditions
569
Mice were male Rj:NMRI mice from Elevage Janvier, Le
570
Genest-Saint-Isle France, weighing 10-11 g at delivery
571
and 33-51 grams on day 42, were housed 10 per cage. Mice
572
received
573
ad libitum quantities of bottled
574
distilled water and purified powdered diets (7.5 g/mouse)
575
in ceramic cups (10/group) for 42 d. Experimental diets
576
will be described in detail in a biological follow up
577
publication.
578
579
580
Dissection of mice
581
After administration of the aforementioned diets to 10
582
mice per group, 5 mice were randomly selected for
583
inclusion in the gene expression analysis experiment.
584
Organs were dissected according to standard protocols,
585
then cut into 100-150 mg subsections, flash frozen in
586
liquid nitrogen, and stored at -80°C until gene
587
expression analysis.
588
589
590
Nucleic acid preparation
591
Tissue from each organ was extracted from 5 individual
592
mice and extracted separately using Qiagen RNeasy
593
mini-kits (Basel, Switzerland) according to
594
manufacturer's instructions with one exception: During
595
extractions, all RNeasy columns were impregnated with
596
DNase I (Roche, Basel, Switzerland) to remove possible
597
genomic DNA contamination. After extraction, equal
598
amounts of material were pooled to obtain 10 μg total RNA
599
per dietary group. RNA samples were quantified with the
600
RiboGreen RNA Quantification Kit according to
601
manufacturer's instructions (Molecular Probes, Eugene
602
Oregon), and then analyzed via agarose gel
603
electrophoresis for intact 18 and 28s rRNA. All samples
604
included in the study were judged to contain high-quality
605
RNA in sufficient amounts for hybridization.
606
607
608
Gene expression analysis using the murine 11k
609
GeneChip
610
611
cRNA preparation
612
Total RNA (15 μg) was used as starting material for
613
all samples. In all cases, a "test chip" provided by
614
the manufacturer was run prior to using the Murine 11k
615
GeneChip. In each case this confirmed that sufficient
616
high quality RNA was present to detect gene expression
617
in the various tissue samples. The first and second
618
strand cDNA synthesis was performed using the
619
SuperScript Choice System (Life Technologies) according
620
to manufacturer's instructions, but using oligo-dT
621
primer containing a T7 RNA polymerase binding site.
622
Labeled cRNA was prepared using the MEGAscript,
623
In Vitro Transcription kit
624
(Ambion). Biotin labeled CTP and UTP (Enzo) was used
625
together with unlabeled NTP's in the reaction.
626
Following the
627
in vitro transcription reaction,
628
unincorporated nucleotides were removed using RNeasy
629
columns (Qiagen).
630
631
632
Array hybridization and scanning
633
cRNA (10 μg) was fragmented at 94°C for 35 min in
634
buffer containing 40 mM/L Tris-acetate pH 8.1, 100 mM/L
635
KOAc, 30 mM/L MgOAc. Prior to hybridization, fragmented
636
cRNA in a 6×SSPE-T hybridization buffer (1 M/L NaCl, 10
637
mM Tris pH 7.6, 0.005% Triton), was heated to 95°C for
638
5 min, cooled to 40°C, and then loaded onto the
639
Affymetrix probe array cartridge. The probe array was
640
incubated for 16 h at 40° C at 60 rpm. The probe array
641
was washed 10× in 6×SSPE-T at 25°C followed by 4 washes
642
in 0.5×SSPE-T at 50°C. The biotinylated cRNA was
643
stained with a streptavidin-phycoerythrin conjugate, 10
644
g/ml (Molecular Probes) in 6×SSPE-T for 30 min at 25°C
645
followed by 10 washes in 6×SSPE-T at 25°C. The probe
646
arrays were scanned at 560 nm using a confocal
647
laser-scanning microscope (made for Affymetrix by
648
Hewlett-Packard). Readings from the quantitative
649
scanning were analyzed with Affymetrix Gene Expression
650
Analysis Software.
651
652
653
654
A step-by-step method to apply the LFC model to an
655
experimental data set
656
657
1. Data handling and 2-dimensional
658
visualization
659
Overall, the values of all genes are compared across
660
any number (
661
p ) of experimental conditions.
662
The absolute expression value of the
663
k -th gene for the
664
j -th treatment is coded Z
665
666
kj
667
. When considering any given gene, the following
668
data-handling rules are applied:
669
• All values below an ADI threshold (
670
A
671
672
t
673
) are set to
674
A
675
676
t
677
.
678
• If the values for gene
679
k are
680
A
681
682
t
683
for all
684
p treatments, the gene is defined
685
as not expressed and isn't considered further.
686
• The absolute expression value for gene
687
k is defined as
688
min (
689
Z
690
691
k 1 , ...,
692
Z
693
694
kp
695
).
696
• The highest fold change (HFC) of gene
697
k is defined as the
698
following:
699
700
When visualizing all genes on a bivariate plot
701
according to absolute expression and fold change, one
702
obtains a data distribution similar to that of Figure
703
1a.
704
705
706
2. Modeling a discrete limit fold change
707
model
708
The goal is to select the upper X% of genes with
709
highest fold change across the entire range of
710
expression levels. Therefore, the following rules are
711
applied:
712
• Genes are ordered according to their absolute
713
expression value
714
min (
715
Z
716
717
k 1 , ...,
718
Z
719
720
kp
721
), where equally expressed genes are randomly
722
ordered.
723
• The overall expression range is divided into bins
724
of different width, but containing an equal number
725
m of genes.
726
• In each bin, the (1-X)-percentile fold change
727
corresponding to a fold change that is exceeded by X%
728
of genes in the bin is determined. For X% between 1%
729
and 10%,
730
m = 200 appears to be
731
suitable.
732
When visualizing the (1-X)-percentile fold change in
733
each bin, one obtains a data distribution similar to
734
that seen in Figure 1a.
735
736
737
3. Modeling a continuous limit fold change (LFC)
738
model
739
A continuous model is derived from the discrete one
740
by relating the mean expressions of each bin with the
741
corresponding (1-X)-limit fold change, using a least
742
squares fit of the equation:
743
(1-X)-LFC =
744
a +
745
b /Z (minimum expression)
746
This equation appears to fit the data very well and,
747
the interpretation of the parameters (
748
a and
749
b ) is straightforward:
750
• Parameter
751
a represents the asymptote of the
752
curve. For very large expressions, the (1-
753
X )-limit fold change tends to be
754
equal to parameter
755
a .
756
• Parameter
757
b is proportional to the
758
difference between the (1-X)-limit fold change of small
759
and high expressions.
760
When visualizing this continuous limit fold change
761
model, one obtains a curve similar to that observed in
762
Figure 1a. In addition, increasing the (1-X)-percentile
763
fold change shifts the curve up the
764
y -axis and results in an
765
increased stringency for gene selection,
766
i.e. fewer genes meet the LFC
767
requirement (Figure 1b).
768
769
770
Validation by Coefficient of Variance
771
For experiments that are performed without
772
replicates, the LFC-model selects genes with the
773
highest between-treatment variability (previously
774
defined as fold change) at any expression level. If
775
replicates are available, the inherent error of
776
measures, the within-treatment variability, can be
777
estimated. Therefore, it becomes possible to select the
778
genes with the highest ratio of
779
between-treatment-variability /
780
within-treatment-variability.
781
In the data set that was used for illustrating the
782
development of the LFC-model, duplicate measures were
783
available for one of the four treatments. The
784
within-treatment-variability appears to be highly
785
dependent of the expression level of the gene,
786
confirming the findings of Hess
787
et al. [ 24 ] .
788
In order to estimate the CV without taking into
789
account extreme values of the duplicate we used a
790
robust estimator, represented by the following
791
equation:
792
793
where the χ
794
795
inverse
796
function returns the inverse of the one-tailed
797
probability of the χ-squared distribution.
798
Applying the CV derived from replicate sample data (
799
800
eqn. 2 ) to the quadruplicate
801
diet data enabled the calculation of the CV upper
802
confidence level (by bins of absolute expression level)
803
using the following equation:
804
805
where the χ
806
807
inverse
808
function returns the inverse of the one-tailed
809
probability of the χ-squared distribution.
810
811
Eqn. 3 allows one to identify
812
genes with a variance above measurement variability.
813
This greater variability arose due to combined pool
814
(biological) and treatment variabilities.
815
This confidence level could be raised or lowered
816
according to the level of confidence desired by
817
altering the
818
p value. Therefore, modeling the
819
variance data provides a complimentary method for
820
examining the variation of genes across the complete
821
range of absolute expression values.
822
The upper 99.9% confidence limit (CL) of a robust
823
estimation of the coefficient of variance (CV) for
824
replicates (within-treatment variability) has been
825
modeled as a function of absolute minimum expression of
826
all treatments using the following model:
827
Upper 99.9% CL =
828
c +
829
d /mean expression
830
The selected genes are now those for which the CV of
831
treatment expressions (between-treatment variability)
832
is larger than this limit (Figure 1c). By overlapping
833
those genes selected by the LFC model (red dots) on the
834
graph indicating the 99.9% CL (blue line), one observes
835
that the LFC model is considerably more restrictive
836
when selecting lowly expressed genes (Figure 1c).
837
838
839
Validation by real-time PCR (RT-PCR)
840
A subset of differentially expressed genes were
841
selected to confirm the LFC model, where genes were
842
selected across the range of absolute expressions and
843
with varying fold changes. Although not discussed in
844
the present manuscript, a good description of the
845
technique and an example of an excellent experimental
846
design can be found in previous publications [ 26 30 ]
847
, respectively. In brief, all genes were amplified in
848
the Applied Biosystems 5700 instrument using SYBR
849
®green (Molecular Probes), a dye that binds
850
double-stranded DNA. Data represented means of
851
triplicates for each experimental treatment using
852
pooled RNA samples (
853
n = 5). Amplification was
854
performed using an ABI 5700 machine (Applied
855
Biosystems, Foster City, CA, USA) with a hot start at
856
95°C for 10 minutes, followed by 40 cycles of 95°C for
857
15 s and 60°C for 1 min for denaturation, annealing and
858
elongation. Genes were normalized to either β-actin or
859
GAPDH, and then experimental diets (B,C,D) were
860
compared to the control diet (A). All fold changes were
861
subjected to a student's
862
t -test (α = 0.05) to ensure fold
863
changes observed by RT-PCR were statistically
864
significant. Comparisons between microarray data and
865
RT-PCR were then performed.
866
867
868
869
870
Abbreviations
871
872
CV: coefficient of variation
873
874
FC: fold change
875
876
HFC: highest fold change
877
878
LFC: limit fold change function
879
880
MAE: mean absolute expression
881
882
RN: rank number
883
884
RT-PCR: real time polymerase chain
885
reaction
886
887
RV: rank value
888
889
SD: standard deviation
890
891
892
Definitions
893
894
Average Difference Intensity
895
(ADI): average measure of intensity of hybridization
896
for a series of match and mismatch probe pairs tiled across
897
a particular gene transcript. ADI is an indicator of the
898
absolute expression of a gene.
899
900
Concordance: state of agreement between
901
two complementary measurement techniques which is
902
directionally consistent,
903
e.g. two techniques determine that
904
values are statistically significant and that they are both
905
either positive or negative.
906
907
908
Author contributions
909
DM integrated the mathematical and biological
910
interpretation of the experiment that resulted in the
911
writing of this manuscript. AR and RM developed the
912
mathematical formula describing the limit fold change
913
model. AB and MR designed and carried out the DNA
914
microarray studies in mice. MR initiated the development of
915
robust mathematical techniques to evaluate microarray data
916
at the Nestlé Research Center.
917
All authors read and approved the final manuscript.
918
919
920
921
922