Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
Affymetrix microarrays are used by many laboratories to
7
study differences in gene expression associated with
8
experimental treatments, diseases, development, aging, and
9
other conditions. Typically, an arbitrary value for
10
expression ratios (or fold-change values) is chosen to
11
define significant differences in gene expression between
12
conditions. For example, in several studies of aging [ 1 2
13
3 4 5 6 ] , only differences > 1.7-fold in magnitude
14
were considered to be significant. None of the reports
15
indicated whether there were smaller effects that were
16
statistically significant. It has been pointed out that
17
statistically significant differences in gene expression
18
often are of small magnitude (sometimes as low as
19
1.2-fold), and that larger effects often are artefacts of
20
high variance [ 7 8 ] . For those interested in detecting
21
these smaller effects, it is important to minimize
22
nonspecific sources of inter-array variance.
23
To understand the approach described in this report, it
24
is necessary to understand the design of Affymetrix
25
microarrays and analysis software (Microarray Suite). There
26
are multiple probe pairs for each mRNA (8-20 for the arrays
27
used in the present study). A probe pair consists of a 25
28
base oligonucleotide that matches an mRNA sequence (perfect
29
match, or PM probe) and an oligonucleotide with a
30
mismatched base in the center (MM probe). The specific
31
hybridization signal for each probe pair is the difference
32
between the PM intensity and the MM intensity (although the
33
latest version of Affymetrix Microarray Suite, 5.0, has
34
special rules for handling MM probes that have higher
35
signals than their PM partner). No single hybridization
36
condition is optimal for all oligonucleotide probes, so it
37
is inevitable that there is variability among the signals
38
within a probe set. The expression level reported for each
39
probe set (by the Affymetrix "absolute analysis" algorithm)
40
is based on a weighted average of the signals from the
41
individual probe pairs, with signals near the median given
42
more weight than those far from the median. We refer to
43
this as the signal method in this report. The weights
44
assigned to each probe pair can vary from one array to
45
another, but it is unclear whether variable weighting adds
46
significantly to inter-array variance. Microarray Suite
47
also has a procedure ("comparative analysis" algorithm) for
48
comparing two arrays at the level of individual probe
49
pairs. With this algorithm, ratios of signals (PM-MM for
50
each probe pair) from one array to those of the other array
51
are computed first. Weighted averages of these ratios are
52
then computed. We refer to this as the ratio method. This
53
method is supposed to be more precise than the signal
54
method for inter-array comparisons. Thus, many
55
investigators use this algorithm for all possible
56
one-to-one comparisons across groups (e.g., 9 comparisons
57
for 3 arrays per group) and report the average of the
58
ratios as the change in gene expression [ 1 2 3 4 5 9 ] . A
59
problem with this approach is that there is no absolute or
60
relative expression level assigned to each mRNA on
61
individual arrays, so that formal statistical approaches
62
(e.g., t-tests) cannot be used to rate the statistical
63
significance of differences. In this report, we describe
64
how we circumvented this problem by using the ratio method
65
to generate a composite gene expression score for each mRNA
66
on each array.
67
The procedure used to estimate the statistical
68
significance of differences determines which genes, and how
69
many genes, are defined as being differentially expressed.
70
For a comparison between two groups, the t-test is the most
71
commonly used procedure in biological research. However,
72
with 6 arrays per group, even a single outlier can markedly
73
reduce the value of t even when there is no overlap between
74
groups. Therefore, we also used the nonparametric rank sum
75
test, which is insensitive to a skewed distribution. False
76
detection rates were estimated with the Significance
77
Analysis of Microarrays (SAM) program [ 8 ] .
78
79
80
Results
81
82
Normalization method
83
Comparisons among arrays are meaningful only after
84
accounting for variability in overall target
85
concentration ("target" is the Affymetrix nomenclature
86
for a labeled cRNA that hybridizes with a probe),
87
hybridization efficiency, staining, etc. The
88
normalization procedure recommended by Affymetrix is to
89
multiply raw signals by a scaling factor such that the
90
trimmed mean (excluding 2% highest and 2% lowest) of
91
signals is always the same (500 in this study). This
92
procedure could be problematic if a variable proportion
93
(>2%) of the signals are beyond the linear range of
94
the system. Another concern about the normalization
95
procedure was that the majority of the targets did not
96
produce signals that were significantly greater than
97
those caused by nonspecific hybridization. After the
98
recommended normalization procedure was applied, we
99
confirmed that there was negligible inter-array
100
variability of the mean signal (with 5% of signals
101
trimmed from both the high and low ends) across the 4629
102
targets that passed the presence / absence filter
103
(described in the next section). The trimmed mean was 649
104
± 14 (standard deviation) arbitrary units for the 6
105
arrays probing RNA from young muscle, and 643 ± 18 for
106
the 6 arrays probing RNA from older muscle. These data
107
were not used to re-scale the arrays because the variance
108
would have been reduced by less than 2%. Plots of signals
109
from individual arrays versus the average of all 12
110
arrays generally showed the expected scatter around the
111
line of identity (Figure 1A), but a few showed systematic
112
deviations either above or below the line of identity for
113
signals > ~10 4arbitrary units (worst-case array shown
114
in Figure 1B). While this problem might be addressed by
115
using a different scaling factor for high-intensity
116
signals [ 10 ] , few targets produced such high signals,
117
and the magnitude of the effect was small. Thus, the
118
Affymetrix normalization method was employed without
119
modification.
120
121
122
Exclusion of targets based on the absolute
123
detection algorithm
124
Microarray Suite estimates probabilities that targets
125
are absent (P
126
detection ) based on ratios of signals
127
from PM probes to those of MM probes, together with the
128
degree of consistency across the multiple probe pairs for
129
each target. As illustrated in Figure 2, the average
130
signal from the multiple probe pairs cannot be used to
131
decide whether a target should be considered present or
132
absent. We restricted the data analyses to targets for
133
which P
134
detection was less than 0.1 for at
135
least 3 of the 6 samples from either the younger or older
136
group. This filter reduced the number of targets included
137
in the statistical analyses to 4629. While excluding data
138
does not affect the nominal value of P for each
139
comparison made with a t-test or rank sum test, it
140
significantly reduces the estimated false detection rate
141
(see
142
t-Tests and
143
SAM below).
144
145
146
Signal method vs. ratio method
147
When two arrays are compared, the gene expression
148
ratios obtained by the signal method and those obtained
149
by the ratio method (see
150
Background for explanation of terms)
151
are highly correlated. However, the results often differ
152
by more than 1.5-fold (Figure 3). The advantage of the
153
signal method is that Microarray Suite provides, for each
154
target, a number describing the level of gene expression
155
(in arbitrary units) that can be used for t-tests or
156
other statistical procedures. However, according to
157
Affymetrix (Microarray Suite 5.0 User's Guide),
158
comparisons between arrays are more precise when the
159
ratio method is used, so the values on the horizontal
160
axis of Figure 3should be more accurate. The Affymetrix
161
ratio method only provides ratios between two arrays, and
162
does not provide gene expression values for the
163
individual arrays that can be used with standard tests of
164
statistical significance. We therefore extended the ratio
165
method to generate a relative expression score for each
166
target on each array, as follows:
167
The first step was to name one of the arrays as the
168
baseline in the comparative analysis program (Microarray
169
Suite 5.0). Every other array included in the study was
170
then compared with that baseline array. This procedure
171
yielded, for each target, a set of 11 expression ratios (
172
173
r ) representing the relative
174
expression level on each array compared with the baseline
175
array.
176
The next step was to compute, for each target, the
177
expression level (
178
R ) of the baseline array relative
179
to all 12 arrays included in the study. For array #1,
180
R was computed with the
181
formula:
182
183
R
184
1 = 12/(1 +
185
r
186
2 vs. 1 +
187
r
188
3 vs. 1 + ... +
189
r
190
12 vs. 1 )
191
The value of 1 in the denominator of this formula
192
represents the comparison of array #1 with itself. The
193
number of arrays is the numerator rather than the
194
denominator in this formula because the Affymetrix
195
comparative analysis program sets the baseline array as
196
the denominator, so that values of
197
r are the inverse of the relevant
198
ratios.
199
A different array was then named as the baseline.
200
E.g., for array #2 as the baseline:
201
202
R
203
2 = 12/(
204
r
205
1 vs. 2 + 1 +
206
r
207
3 vs. 2 + ... +
208
r
209
12 vs. 2 )
210
These steps were repeated until all 12 arrays had been
211
named as the baseline. The values R
212
1 through R
213
12 were then used for comparisons
214
between age groups with t-tests, rank sum tests, and SAM
215
as described below.
216
For the 4629 probe sets that passed the presence /
217
absence filter, the expression ratios (mean value in old
218
group / mean value in young group) generated by the
219
signal method and those generated by the ratio method
220
were highly correlated (r = 0.89). There also was a
221
fairly good correlation between the signal and ratio
222
methods with respect to the level of statistical
223
significance (log P) of the age-related differences (r =
224
0.74). The advantage of the ratio method was that it
225
usually reduced the within-group variance so that the
226
same mean difference between young and old was associated
227
with a higher level of statistical significance. The
228
average within-group coefficient of variation (CV,
229
standard deviation / mean) was 20% with the ratio method
230
and 25% with the signal method (average CVs were the same
231
for young and old groups). The distribution of CVs
232
improved significantly with the ratio method (Figure 4).
233
Table 1shows that more differences were detected by the
234
ratio method whether we used t-tests, rank sum tests, or
235
SAM to define significant differences. Moreover,
236
consistency between the initial scan and the
237
antibody-enhanced scan was significantly improved by the
238
ratio method, for both expression ratios and for the
239
statistical significance of differences between young and
240
old (Table 2). With the signal method, 38% of the
241
differences significant at P < 0.01 (by t-test) on the
242
initial scan were also significant at P < 0.01 on the
243
antibody-enhanced scan, and 65% were significant at P
244
< 0.05 on the antibody enhanced scan. With the ratio
245
method, 51% of the differences significant at P < 0.01
246
on the initial scan were also significant at P < 0.01
247
on the antibody-enhanced scan, and 77% were significant
248
at P < 0.05 on the antibody enhanced scan.
249
250
251
t-Tests
252
A plot of expression ratio vs. statistical
253
significance (Figure 5) shows that differences with high
254
statistical significance (by t-test) usually were less
255
than 1.7-fold in magnitude. The validity of at least some
256
of the small differences is demonstrated by the fact that
257
~1.3-fold differences were detected (P < 0.01) for 17
258
genes encoding proteins involved in mitochondrial
259
electron transport or ATP synthesis (Table 3). This
260
finding replicates our previous research, in which SAGE
261
and quantitative RT-PCR assays demonstrated ~1.3-fold
262
declines in older muscle of several mRNAs encoding
263
components of NADH dehydrogenase, cytochrome c oxidase,
264
and ATP synthase complexes [ 11 ] . For all of these
265
mRNAs, both the signal and ratio methods detected the
266
differences at P < 0.03, whereas the ratio method was
267
a bit more likely to detect them at P < 0.01 (14/17
268
genes) than was the signal method (12/17 genes, Table
269
3).
270
The P values generated by the t-tests were not
271
adjusted for multiple comparisons. However, a Bonferroni
272
correction would be too stringent for exploratory
273
research [ 12 ] . A useful alternative to P in studies
274
involving thousands of comparisons is the estimated false
275
detection rate, which is the ratio of the expected number
276
of chance differences (P × number of comparisons) to the
277
number of differences observed. If we use P < 0.01 to
278
define a significant difference, we should expect ~46
279
chance differences (0.01 × 4629 comparisons) if there is
280
no effect of aging on gene expression. Because 124
281
differences were significant at P < 0.01 (by the ratio
282
method), the estimated false detection rate was 46/124,
283
or 37%. When no presence / absence filter was applied
284
(12533 probe sets included in the analysis), the
285
estimated false detection rate (ratio method) increased
286
from 37% to 73% because there were fewer differences (at
287
P < 0.01) among the "absent" mRNAs than expected by
288
chance (48 observed vs. 79 expected by chance).
289
290
291
Mann-Whitney rank sum tests
292
The Mann-Whitney rank sum test was used to detect
293
transcripts for which there was little or no overlap of
294
values between groups. This test revealed 107 differences
295
at P < 0.01 (exact P = 0.00866 at rank sum cutoff
296
values) when the ratio method was used. We would expect
297
to find 40 differences by chance alone (0.00866 × 4629
298
genes), so the false detection rate (40/107 = 37%) is the
299
same as that estimated from t-tests. There were 21
300
differences significant at P < 0.01 by rank sum tests
301
but not by t-tests according to the ratio method. Thus,
302
for exploratory research being done to generate lists of
303
mRNAs that warrant further study, use of both parametric
304
and nonparametric tests is one way to significantly
305
expand the list.
306
307
308
SAM
309
SAM computes a value, termed d [d = (mean
310
1 - mean
311
2 )/(s
312
d + s
313
0 )], that is similar to t [t = (mean
314
1 - mean
315
2 )/s
316
d ]. The "exchangeability factor", s
317
0 , minimizes the number of extreme d
318
values among targets with small variances in signal
319
intensity. When absolute signals are analyzed, these
320
small variances are associated with targets that are more
321
difficult to quantify accurately because of low
322
concentrations. We already have filtered most of these
323
targets from the analysis. When data based on the signal
324
method were analyzed, s
325
0 was very small (lowest percentile of
326
the s
327
d values). When relative expression
328
levels were computed by the ratio method, all means were
329
close to 1 regardless of the absolute signal intensity.
330
In this case, s
331
0 was large (53rd percentile of the s
332
d values), and lower s
333
d values were associated with stronger
334
rather than weaker signals. This problem precluded the
335
use of SAM for data normalized in this fashion. However,
336
by multiplying each value of
337
R by the gene-specific mean signal
338
(mean of all 12 arrays), we generated a data set
339
compatible with SAM.
340
SAM lists genes for which d exceeds (by an adjustable
341
threshold termed Δ) the value that would be expected by
342
chance (d
343
e ). Values of d
344
e are generated by computing the d
345
distribution numerous times with random permutations of
346
the group assignments (we instructed SAM to perform 100
347
permutations). The average distribution from these
348
permutations defines the values of d
349
e . Reducing Δ expands the list of
350
"significant" genes, but also increases the false
351
detection rate. When we chose a value of Δ corresponding
352
to a false detection rate < 20%, there were 124
353
differences according to the ratio method but only 56
354
differences according to the signal method. There were 20
355
differences for which the false detection rate was <
356
5% when the ratio method was used (including 9 for genes
357
involved in energy metabolism that are listed in Table
358
3), but none when the signal method was used. When no
359
presence / absence filter was applied, the highest-ranked
360
differences had false detection rates of 30% even with
361
the ratio method, and only 34 genes achieved this level.
362
Thus, it is very important to remove noisy data before
363
using SAM.
364
365
366
367
Discussion
368
Careful subject selection and consistency in
369
experimental conditions, sample collection procedures, and
370
sample processing obviously are needed if small differences
371
in gene expression are to be detected. A further reduction
372
in total within-group variance can be achieved by using the
373
ratio method described in this report. This method is based
374
on the Affymetrix comparative analysis algorithm, which was
375
designed for comparisons between two arrays. To use the
376
procedure for groups rather than individual arrays, we
377
assigned each target on each array a score that was the
378
average ratio from all one-to-one comparisons of that array
379
with every array included in the study. The best
380
illustration of the increase in power gained by the ratio
381
method was the fact that that 20 differences were below the
382
5% false detection rate (by SAM) when this method was used,
383
whereas no differences below the 5% false detection rate
384
were found when the signal method was used. The major
385
drawback of the ratio method is increased computational
386
time.
387
It has been suggested that inter-array variance can be
388
reduced by ignoring data from MM probes, or by using an
389
alternative computation to take advantage of the MM data [
390
13 14 15 16 ] . In previous versions of Microarray Suite,
391
MM signals were always subtracted from PM signals, which
392
often led to negative expression levels. The newer version
393
(5.0), used in this study, has a different procedure for
394
dealing with high signals from MM probes. It is not clear
395
whether alternative algorithms for using the MM signals, or
396
ignoring MM signals, would improve the accuracy of the
397
data. We therefore used the Affymetrix algorithm, which is
398
likely to be used by most investigators until there is
399
definitive evidence that alternative methods are
400
superior.
401
There is no consensus about the optimal statistical
402
approach for finding differences in expression among
403
thousands of genes. When a specific hypothesis is being
404
tested, "shopping" for the best statistical test to support
405
the hypothesis should be avoided. In contrast, when the
406
goal is to generate descriptive information to guide
407
decisions about future research directions, there is no
408
reason not to use multiple approaches to obtain as much
409
information as possible. For a comparison of two groups,
410
the t-test is simple and robust, and does not require
411
special software. Some investigators may be wary about
412
using t-tests rather than mean differences to rank genes
413
because one or two extreme values can reduce t even when
414
there is no overlap between groups. The rank sum test can
415
be used to detect such effects. Log transformation of the
416
data also can minimize the impact of outliers with high
417
signals. However, log transformation reduces t when the
418
outliers are close to zero. It has been suggested that this
419
feature of the log transformation is advantageous because
420
it can exclude effects that are artefacts of noisy data at
421
low expression levels [ 17 ] . We prefer to filter noisy
422
data by using the P
423
detection algorithm.
424
SAM [ 8 ] is an alternative to t-tests or rank sum
425
tests. The false detection rates computed by SAM were
426
markedly increased when we did not apply our presence /
427
absence filter. When the filter was used, SAM generated a
428
lower estimate of the number of false differences than
429
estimates based on multiplying P(t) or P(rank sum) by the
430
number of comparisons. Perhaps this observation can be
431
explained by the fact that expression levels of many genes
432
are correlated with expression levels of many others. The
433
number of random differences to be expected among thousands
434
of comparisons with t-tests or rank sum tests is based on
435
the assumption that comparisons are independent of one
436
another. SAM computes the false detection rate with a
437
method that does not rely on this assumption.
438
Some of the arrays included in this study (n = 4) probed
439
RNA pooled from multiple subjects, whereas others (n = 8)
440
probed RNA from individual subjects. The heterogeneous
441
nature of the samples conceivably could influence the
442
statistical properties of this data set. However, there was
443
good uniformity among arrays in terms of scaling factors
444
and percentages of probe sets with P
445
detection < 0.1, and both age groups
446
were comprised of 2 pools and 4 individual samples, which
447
should minimize the influence of using both pooled and
448
individual samples (see Additional file 1). After the data
449
analyses described in this report were completed, we
450
reanalyzed the individual samples along with 8 new
451
individual samples (total of 8 individual samples per age
452
group) using U133A and U133B arrays, which have ~44000
453
probe sets with 11 probe pairs per set. The reduction of
454
within-group variance by the ratio method was replicated
455
(Table 4), demonstrating that it is not unique to U95A
456
arrays, and is not an artefact of including both pooled and
457
individual samples. We cannot guarantee that the same
458
reduction in variance will occur in all cases. If variance
459
caused by biological heterogeneity or by inconsistent
460
laboratory procedures is very high, then the difference
461
between the signal and ratio methods might be trivial in
462
relation to overall variance. The proposed computational
463
method appears to reduce the inflation of variance caused
464
by variable weighting of individual probes within a set,
465
but cannot compensate for variance from other sources.
466
Additional Files
467
Source of RNA, scaling factors, and percentage of probe
468
sets with Pdetection < 0.1 for 12 arrays included in
469
this study
470
Click here for file
471
The data generated in this study have been deposited in
472
the NCBI Gene Expression Omnibus (GEO, accession numbers
473
GSM2390 through GSM2401, Series accession number GSE80)
474
http://www.ncbi.nlm.nih.gov/geo/and the AMDeC Microarray
475
Resource Center Gene Expression Database www.amdec.org.
476
Both the signal data and the ratio data have been
477
deposited. The Affymetrix files (*.exp, *.dat, *.cel,
478
*.chp) can be obtained from the corresponding author.
479
480
481
Conclusions
482
The ratio method reduces inter-array variance and
483
thereby enhances statistical power. SAM is very sensitive
484
to noisy data, which should be removed a priori.
485
486
487
Methods
488
489
Subjects
490
The subjects were 16 young (21-31 yr) and 19 older men
491
(62-77 yr). All had normal neuromuscular function and
492
were healthy according to history, physical examination,
493
and laboratory tests. None of them was involved in any
494
type of regular exercise program involving strenuous
495
activity. All subjects gave written consent after the
496
procedures and risks were explained. The research was
497
approved by the University of Rochester Research Subjects
498
Review Board.
499
500
501
Procedures
502
Subjects were asked to refrain from any activity more
503
strenuous than walking for 3 days before the muscle
504
biopsy. Each subject stayed at the University of
505
Rochester General Clinical Research Center the night
506
before the biopsy to minimize variability between
507
subjects in the amount of activity performed on the day
508
of the biopsy. The needle biopsy was obtained from the
509
vastus lateralis. The skin and muscle were anesthetized
510
with lidocaine a few minutes before tissue removal. The
511
muscle sample was frozen in liquid nitrogen within 30
512
seconds of removal, then stored at -70°C until
513
analysis.
514
515
516
Analysis of gene expression by high density
517
oligonucleotide arrays
518
RNA was extracted from the muscle samples as described
519
previously [ 11 ] . All RNA samples were of high quality
520
as indicated by the pattern of staining with ethidium
521
bromide in an agarose gel after electrophoretic
522
separation. RNA was probed with Affymetrix HG-U95A high
523
density oligonucleotide arrays, which have ~12500 probe
524
sets. Twelve arrays were examined: four (two for each age
525
group) that probed RNA pooled from 4-8 subjects and eight
526
(four for each age group) that probed RNA from a single
527
subject. Pooling of RNA was necessary in some cases
528
because most of the RNA from some samples had been used
529
for other purposes. Additional file 1shows the source of
530
RNA, scaling factors, and percentage of transcripts
531
present (P
532
detection < 0.1) for each
533
array.
534
Analytical procedures were carried out using standard
535
operating procedures developed and validated by the
536
University of Rochester Microarray Core Facility. After
537
hybridization and washing, arrays were stained with
538
phycoerythrin-streptavidin, which binds to the
539
biotinylated cRNAs hybridized with the probes. The
540
initial scan detected the fluorescence of the
541
phycoerythrin-streptavidin. The analyses described in
542
this report are based on data from this initial scan.
543
After the initial scan, signals were amplified by
544
staining the array with biotin-labeled anti-streptavidin
545
antibody followed by phycoerythrin-streptavidin. Analyses
546
of the antibody-enhanced scans are not presented, except
547
for correlations with the initial scans, since the same
548
statistical issues are relevant to both scans. These
549
scans supported the results discussed above. Data from
550
both scans were deposited in the GEO and AMDeC
551
databases.
552
553
554
Software
555
The statistical algorithms of Microarray Suite 5.0
556
were used with the default parameters to generate
557
signals, ratios of signals between two arrays, and P
558
detection values. Data generated by
559
Microarray Suite were exported to Microsoft Excel for
560
further analysis. SAM runs within Excel.
561
562
563
564
Authors' contributions
565
SW was responsible for data analyses and was the
566
principal author. AIB was responsible for generating
567
microarray data, and contributed to writing the manuscript.
568
CAT was responsible for obtaining the muscle samples, and
569
contributed to writing the manuscript. All authors read and
570
approved the final manuscript.
571
572
573
574
575