Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
Genomics sequencing projects have rapidly generated
7
tremendous amount of information. At the time of writing,
8
the NCBI UniGene database [ 1 ]
9
http://www.ncbi.nlm.nih.gov/UniGenecontained 96,109 Homo
10
sapiens clusters and 85,047 Mus musculus clusters.
11
Predictions from the Human Genome Project [ 2 ] and Celera
12
Genomics [ 3 ] suggest there are about 26,000-40,000 human
13
genes. Other recent studies suggest that these numbers may
14
be an underestimation and that the human genome appears
15
more complicated [ 4 ] . Understanding the functions of
16
such a large number of genes has been an unprecedented
17
challenge for functional genomics research. As the array of
18
hope in recent years, gene expression array technology has
19
quickly grown into a powerful tool to chart a gene atlas in
20
various biological sources and under various conditions in
21
a massively parallel manner [ 5 6 7 ] . Facing the
22
challenge of annotating such a huge amount of genomic data,
23
increasing array information density and improving analysis
24
algorithms have become two critical research areas to
25
ensure that gene expression profiling proceeds in an
26
efficient and cost effective manner.
27
Take an Affymetrix high-density oligonucleotide GeneChip
28
http://www.affymetrix.comfor example. Firstly, its human
29
U95 series chip consists of 5 chip types with 12,000 coding
30
clusters each, which makes it expensive to profile all the
31
human genes in samples of interest. Can a gene chip take
32
more genes? Comparing its U95 chip and Human 6800 chip,
33
Affymetrix has already increased chip information density
34
by 20% by reducing the number of probe pairs per gene from
35
20 to 16. Since demand for higher information density has
36
still not been met, it is of interest to study the probe
37
number effect in detail. Secondly, most optional research
38
efforts focus on the downstream statistical and clustering
39
analysis. However, on the upstream side, Affymetrix chip
40
users are still dependent on the Microarray Suite ®software
41
that comes with the measurement system to interpret raw
42
data. The Affymetrix algorithm implemented in its
43
Microarray Suite 4.0 package (referred hereafter as MAS4)
44
uses empirical rules derived from its internal research
45
data to assign absolute calls for the significance of gene
46
presence and assign fold change calls for the significance
47
of expression variations. Such discrete categorizations are
48
not the most appropriate language to describe quantities of
49
continuous nature. Although it is well known that fold
50
change numbers have defined behaviors of uncertainty, there
51
are very few studies in this area. How does one assign
52
statistical significance to expression analysis results?
53
This work presents our preliminary research results for the
54
two questions raised above.
55
The Affymetrix gene chip layout used in this study
56
contains the same number of perfect match (PM) probes and
57
mismatch (MM) probes. MAS4 uses differences between these
58
two types of probes for gene expression signals. The
59
primary goal of Match-Only Integral Distribution (MOID)
60
algorithm is to discard mismatch information, which allows
61
immediate doubling of the chip information density. In this
62
study, the performance of both algorithms were benchmarked
63
using 366 known fold change values derived from 34 spiking
64
experiments. Their false positive tendencies were assessed
65
by no-change expression experiments. Computer simulations
66
were used to study their noise tolerances, and to determine
67
the minimum number of probes required for chip
68
analysis.
69
The idea of using PM-only information is based on the
70
following observations: MAS4 essentially discards the
71
one-one correspondence between a PM and its MM partner (for
72
details, see materials and methods on MAS4 algorithm for
73
absolute analysis) and still gives satisfactory
74
interpretation, which suggests the contribution of MM
75
probes might be approximated in a nonspecific manner
76
overall. After we designed the first mismatch-free gene
77
chip (GNF-HS1) in July 1999, the match-only expression
78
analysis idea was proposed in other independent studies as
79
well. Li (submitted, 2001) adjusted their previous
80
model-based analysis of oligonucleotide arrays [ 8 ] to
81
PM-only calculations and found their results correlate well
82
with that of using PM and MM information. In addition, the
83
idea is endorsed by recent studies of Naef
84
et al [ 9 ] and Irizarry
85
et al (2002, in prepare)
86
http://biosun01.biostat.jhsph.edu/~ririzarr/papers/.
87
One difficulty in comparing algorithms for gene
88
expression analysis is the lack of "known" results. Here we
89
overcome the problem by resorting to a spiking set and set
90
of no-change experiments, where results are unambiguous.
91
Model-based methods [ 8 ] generally require a reasonable
92
numbers of training experiments of the same chip type,
93
among which probes under study give significantly large
94
signals in at least some experiments. Considering the fact
95
that the 34 spiking experiments used were obtained by three
96
different chip types, and in addition some experiments are
97
replicates under different concentrations and hybridization
98
conditions, it is impractical to sufficiently train
99
model-based algorithms in this study. We limit our study to
100
MOID and MAS4, where training is not required.
101
In the materials and methods section, we summarize
102
Affymetrix chip technologies and describe the MOID
103
algorithms in detail. MAS4 algorithms for both absolute
104
expression analysis and comparison analysis are included
105
afterwards. The benchmarks used for algorithm comparison
106
were explained and comparison results were shown in the
107
results session. Issues such as noise-tolerance of both
108
algorithms and further reduction of the probe set size are
109
also included. We discuss generalization of the
110
normalization algorithm, which may be of interest to other
111
researchers. Finally, the main differences between MAS4 and
112
MOID are summarized in a tabular form as conclusions.
113
114
115
Results and Discussion
116
117
Spiking experiments
118
Spiking experiments were done by adding to tissue
119
samples a certain number of control genes with known
120
concentrations. Since signal intensity is not exactly
121
proportional to gene concentration across different probe
122
sets, only the fold change values of each gene between
123
comparisons are considered reliable and used to benchmark
124
both algorithms. We are fortunate to have access to 34
125
spiking experiments done previously in GNF for a
126
hybridization protocol study, where 366 independent known
127
fold change numbers were derived. These experiments
128
covered three Affymetrix chip types: Hu35kSubA, Hu6800,
129
and Mu11kSubA. True experimental fold change values,
130
f
131
exp , range from 1.0 to 10.0. The raw
132
data (34 CEL files) and experimental fold change values
133
are available from
134
http://carrier.gnf.org/publications/MOID.
135
We applied both algorithms to these array data and the
136
fold change numbers calculated were compared to known
137
experimental values. Because of the way we defined the
138
base and experiment,
139
f
140
exp are always no less than 1.0. For
141
MAS4, we used all the default parameters used by the
142
Microarray Suite 4.0 program in the calculation. For
143
MOID, we found using 70 percentile for
144
both pct and
145
pct
146
147
f
148
(defined in materials and methods) gave optimal
149
performance. Figure 1shows the comparison results (data
150
available from http://carrier.gnf.org/publications/MOID).
151
Two algorithms give virtually similar performance.
152
In the comparison, we defined relative error,
153
R
154
err , as the following to benchmark
155
each algorithm,
156
157
R
158
err1 = <|(
159
f
160
calc -
161
f
162
exp> )/
163
f
164
exp |>
165
all pairs ,
166
167
R
168
err2 = <|(1/
169
f
170
calc - 1)/
171
f
172
exp ) ·
173
f
174
exp |>
175
all pairs ,
176
and
177
178
R
179
err = <|(
180
f
181
exp /
182
f
183
calc -
184
f
185
calc /
186
f
187
exp )|>
188
all pairs .
189
Where
190
f
191
exp and
192
f
193
calc are fold change values from
194
experiment and calculation, respectively.
195
R
196
err1 and
197
R
198
err2 are defined symmetrically for
199
relative errors of
200
f
201
exp and 1/
202
f
203
exp , reflecting the fact that base
204
and experiment can be defined arbitrarily, and the final
205
R
206
err is defined as the average of the
207
two. Each relative error number is an average over all
208
the 366 data points.
209
It seems the two algorithms have very close
210
performance Table 1. They all have greater error margins
211
for calculating down-regulated fold changes (captured by
212
R
213
err2 ) than up-regulated ones
214
(captured by
215
R
216
err1 ). MAS4 is more asymmetric than
217
MOID in this aspect.
218
219
220
No-Change experiments
221
Another set of experiments used for benchmarking
222
algorithms is the comparison between two experiments
223
where mRNA prepared from the same tissue sample was
224
hybridized twice under slightly different conditions.
225
Those fold change numbers that deviate largely from 1.0
226
are considered to be false positives. When we applied the
227
algorithms to ten experiments done with either human
228
brain or human lung replicate samples (same mRNA,
229
slightly different hybridization conditions), the results
230
were all similar. Figure 2shows one of the typical
231
comparison results. When only "Present" genes (defined by
232
MAS4) were taken into account, 80% of fold change numbers
233
were spread across 0.54 for MAS4 and only 0.30 for MOID.
234
Clearly MOID assigns much less false positive fold
235
changes than MAS4 does in this comparison. If all the
236
genes were counted, the spreads of MAS4 and MOID results
237
would be 0.89 and 0.29 for the same data sets. This test
238
suggests MOID is more robust than MAS4.
239
240
241
Reduced Probe Set Simulation
242
In the studies above, we demonstrated the feasibility
243
of a match-only gene chip design based on the MOID
244
algorithm. In order to further increase chip information
245
density, it is of common interest to understand how many
246
probes are sufficient for expression analysis and push
247
for the lower limit of the size of a probe set. This is
248
done via computer simulations.
249
In the simulation, for each probe set, a subset of
250
n
251
252
r
253
probes were randomly chosen to be used in the
254
calculation. Both the spiking and no-change calculations
255
presented above were repeated with the selected subset,
256
as we gradually reduced
257
n
258
259
r
260
. Figure 3and figure 4show the results for the
261
spiking calculation and no-change calculation,
262
respectively. As the graph suggested, accuracy of MOID is
263
essentially unaffected while reducing the number of
264
probes down to ten. This result enables us to almost
265
triple the amount of clusters one can put on a gene chip
266
using MOID design. Combined with other new design ideas,
267
MOID lays the foundation for the first universal human
268
chip that contains 75,000 UniGene clusters (release 116).
269
The results have recently been validated and led to many
270
interesting discoveries, which provides indirect support
271
for MOID (to be submitted).
272
273
274
Noise Tolerance of both Algorithms
275
Computer simulations also enable us to study the
276
behaviors of both algorithms under noise disturbance. In
277
simulations, a subset of
278
n
279
280
n
281
probes were randomly chosen, their PM values were
282
multiplied by ten to mimic the effect of serious noise
283
effect. We repeated the calculations for both spiking
284
experiments and no-change experiments, while gradually
285
increasing
286
n
287
288
n
289
. Figure 5and figure 6show the results for spiking
290
calculation and no-change calculation, respectively. As
291
the graph suggested, both MAS4 and MOID can resist noise
292
perturbation up to 2 cells out of 16, the test favors
293
MOID slightly.
294
295
296
Probe-Specific Effect
297
MOID is based on the assumption that the background
298
for probes is mainly non-probe-specific. The results
299
indicate that this assumption is sound overall. However,
300
it is clear that by better understanding probe-specific
301
behaviors, the analysis accuracy can be further improved.
302
For instance, probe response factors might be derived by
303
accumulating and studying many experiments. If there are
304
a sufficient number of experiments, where the target gene
305
is significantly present, probe response factors may be
306
retrieved by some statistical modeling. A recent study [
307
8 ] provides an important example in this direction.
308
Researchers at Corimbia http://www.corimbia.comalso
309
developed some proprietary methods to identify "good"
310
probes and assign different weights to them to improve
311
data analysis. Efforts can be made along similar lines
312
with the MOID algorithm in the future. If probe response
313
factors can be calculated, the "bad" but "stable" probes
314
can be scaled, and the distribution may be expected to be
315
closer to normal, therefore expression levels and their
316
uncertainties can be calculated in a more statistically
317
sound manner.
318
Future studies for more accurate background
319
subtraction models may also improve the fold change
320
distribution; therefore yield better statistics for fold
321
change evaluation as well. The relative error in spiking
322
experiments (28%) suggests that there is room for future
323
improvement.
324
325
326
LogP and Absolute Calls
327
In the derivation of LogP (see materials and methods,
328
Figure 7), it is assumed that the statistics of any
329
absent probe can be described by the general background
330
curve
331
B. The
332
P value should not be literally
333
interpreted as accurate probabilities, since it is only
334
as good as the underlining assumption. Users should also
335
bear in mind that a
336
P value is for the null hypothesis
337
that the discrepancy between observables and generic
338
mismatches is generated by noise, which is different from
339
the likelihood of gene presence conditioned on the
340
observed discrepancy. However, LogP values more or less
341
serve as a statistical indicator for the sorting of genes
342
by their significance. The discrete absolute calls in
343
MAS4 are determined in a completely different empirically
344
manner. Although higher LogP values have a higher
345
tendency of being assigned "Absent" in MAS4, the exact
346
correspondence between the two varies from experiment to
347
experiment. Roughly, LogP values above -3.0 have a
348
greater chance to be called "Absent" than "Present" by
349
MAS4. Users should be aware not to over interpret this
350
correspondence. One interesting observation we had in
351
this study was that discrete calls by MAS4 are not as
352
stable as LogP values. Sometimes a gene can be reassigned
353
to "Present" from "Absent" by MAS4 due to a small
354
perturbation in the underlying quantities without going
355
through a "Marginal" transitional stage.
356
357
358
Curve Normalization
359
It is clear by observing various experiments that the
360
response of intensity to signal varies in different
361
intensity regions. Occasionally, the normalization
362
constant
363
n
364
365
f
366
does not cover well for all intensity values. A more
367
generalized normalization procedure should use a
368
normalization curve instead of a constant factor. The
369
MOID normalization algorithm can be easily modified to
370
the following (see [ 10 ] for a similar approach):
371
Step 1: include all genes in the normalization gene
372
list.
373
Step 2: using the
374
E
375
376
k
377
values for all the genes in the list, generate
378
integral intensity distributions for both experiments. A
379
normalization curve
380
NF(I) is constructed in such a way
381
that the two integral distributions are identical after
382
normalization.
383
Step 3: normalize the data sets using
384
NF obtained in the previous step;
385
refine the gene list by further excluding those genes
386
whose intensity values changed by more than a certain
387
fold, >
388
f
389
max or < 1/
390
f
391
max , between the two data sets.
392
Step 4: repeat Step 2 and update
393
NF to
394
NF', until one of the following
395
conditions is met:
396
1) the maximum number of iterations,
397
Itr
398
max , is reached;
399
2) max(|
400
NF (
401
I )/
402
NF '(
403
I ) - 1|) ≤
404
max , where
405
max is a small predefined
406
threshold;
407
3) size of the normalization gene list drops below a
408
predefined threshold,
409
Sz
410
min .
411
This algorithm offers a chance to correct
412
non-linearity in the chip system to a certain extent. To
413
demonstrate the algorithm, we applied the procedure to
414
two measurements, where genomics DNA sample of yeast
415
s288c strain were hybridized onto Affymetrix YG_S98
416
arrays. The second experiment was a repeat of the first
417
one after about 50 days, where some of hybridization and
418
scanning parameters had been changed over the course. It
419
is shown in Figure 8that the curve normalization
420
procedure out-performed constant normalization as
421
expected.
422
423
424
425
Conclusions
426
MOID algorithm allows at least double or even triple the
427
information density of current (U95 human chip) Affymetrix
428
high-density oligo nucleotide arrays without compromising
429
analysis accuracy. Table 2summarizes feature comparisons
430
between MOID and MAS4.
431
It should be noted that at the time of this study,
432
Affymetrix U95 chip uses 16 probe pairs per set. As
433
Affymetirx is planning on further reducing probe set size
434
in their next design, the density improvement by using MOID
435
may be less than the estimation given above.
436
437
438
Materials and Methods
439
440
Affymetrix Oligonucleotide Chip
441
At the time of this study, Affymetrix synthesizes
442
25-mer oligonucleotides on a 640 × 640 array with 20 μm
443
feature size using photolithographic fabrication
444
techniques (some data used in this study were collected
445
from some previous generation arrays of 540 × 540 cells
446
and 24 μm feature size). Each cell (also called a probe)
447
in the array contains the same oligonucleotide sequences.
448
As shown in Figure 9, typically a set of 16-20 probes are
449
designed for each targeting cluster (called a probe set,
450
or sometimes a "gene".). An expression value will be
451
derived per probe set as the result of analysis. Probes
452
taken as fragments from a target sequence are called
453
perfect matches (PM). Multiple matches per set serve as
454
independent signal detectors and provide a possibility to
455
capture statistical uncertainties. For each match, there
456
is also a corresponding mismatch (MM) probe, whose
457
sequence differs from its match by a single base in the
458
middle. Mismatches are meant to detect cross
459
hybridization components of their corresponding matches,
460
and are used by MAS4 for probe-specific background
461
subtraction. There are also several Affymetrix control
462
sets on each chip used for quality references, which were
463
used in the spiking experiments to validate and refine
464
hybridization protocols.
465
466
467
Intensity Distribution
468
In an ideal case, all the probes in a set should give
469
similar signals and serve as replicate measurements. One
470
common rule in probe design is to select probes with
471
similar melting temperatures to minimize the variances
472
among probes. The chemistry involved in a hybridization
473
experiment is often too complicated to be predicted
474
computationally at the current stage, therefore in
475
reality probe intensity distribution for a set is usually
476
fairly wide and has a long tail, which causes all kinds
477
of difficulty for data analysis.
478
One usual attraction in expression analysis is to
479
assume a normal distribution for probe intensities. This
480
hypothesis can be examined by the following calculation.
481
For each probe set, we applied a linear transformation to
482
probe intensities, so that the mean and the standard
483
deviation of the intensities in the set were normalized
484
to zero and one, respectively. Then all the resulting
485
match intensities were used to generate an overall
486
distribution. Our calculation shows the normal
487
distribution assumption is not an appropriate one. When
488
similar analysis was applied to mismatch intensities and
489
mismatch-subtracted match intensities, the conclusion
490
stays more or less the same.
491
Despite the fact that the intensity distribution of a
492
set is non-Gaussian, it is still crucial to find out the
493
distribution function for match or mismatch intensities
494
to generate meaningful statistics tests, if such
495
distributions actually exist. The authors tried several
496
analytical functions on match and mismatch probe
497
intensities. Unfortunately, the distributions seemed to
498
be quite "bad", the tail portions even fade out slower
499
than the extreme value distribution. It was found that
500
distributions of mismatch signals also significantly
501
depended on the sample and experimental conditions. One
502
may reasonably suspect that cross hybridization,
503
intensity saturation, overall sample concentration, chip
504
production irregularities, and miscellaneous noise may
505
all cause a change in the signal distribution. The study
506
seems to suggest the distribution of cross hybridization
507
should be taken from each experiment as a result of
508
measurement instead of using an
509
a priori determined analytical
510
expression. This guideline is used in MOID algorithm.
511
512
513
MOID algorithm: hypothesis and principles
514
As mentioned in the intensity distribution analysis,
515
MOID assumes cross hybridization behavior of probes is
516
mainly non-probe-specific, and therefore all the match
517
cells can share the same cross hybridization background.
518
MOID primarily aims at saving 50% of the chip space
519
currently dedicated to mismatch cells in the Affymetrix
520
design under this study. It was also discussed before
521
that cross hybridization distributions should be taken
522
from experimental data instead of from some analytical
523
function. Based upon the general belief that there is
524
always a significant amount of genes unexpressed in any
525
mRNA experiment, we will use the match cell signals from
526
the 5% darkest probe sets in the current public
527
Affymetrix arrays to prove the point that it is possible
528
for MOID not to use any mismatch signal. For the next
529
generation gene chip designed based on MOID algorithm,
530
some designated probes may be used as generic mismatches
531
to collect non-probe-specific cross hybridization data.
532
In fact, on GNF-HS1, 16,460 probes from 476 viral genes
533
are used for such a purpose.
534
As discussed, the distributions of quantities in
535
expression analysis are usually asymmetric and contain a
536
long tail, regardless whether it is match intensity,
537
mismatch intensity, or average intensity difference. For
538
such abnormal distributions, concepts like mathematical
539
average and standard deviation are not the most
540
appropriate statistical terms to use. Pre-filtering the
541
data, like what MAS4 does, certainly helps reshape the
542
distribution, but by no means this can bring the
543
distribution back to normal without throwing out a
544
significant portion of measurements. The MOID algorithm
545
is specifically designed to avoid these problems by using
546
percentile instead of mean and using confidence intervals
547
instead of standard deviation in all possible cases.
548
549
550
MOID algorithm for significance of gene
551
presence
552
A non-probe-specific integral background distribution
553
(representing noise, cross hybridization, and possible
554
other factors),
555
B (
556
I ), is derived from the
557
intensities of the 5% darkest probe sets (or from generic
558
mismatches in the future).
559
I stands for intensity;
560
B satisfies boundary conditions:
561
B (0) = 0 and
562
B (∞) = 1. For each probe set
563
k, the match signals are sorted and
564
the integral distribution,
565
S
566
567
k
568
(
569
I ), is generated as well. Figure
570
7is a schematic diagram.
571
If the gene represented by a probe set is absent, the
572
intensity data from matches are due to pure background
573
contribution, therefore
574
S
575
576
k
577
is likely to be close to
578
B. We choose the Kolmogorov-Smirnov
579
test to determine likelihood that the observed signal
580
distribution,
581
S
582
583
k
584
, could be explained by
585
B. If we define
586
d
587
588
k max as the maximum vertical
589
distance between
590
B and
591
S
592
593
k
594
, according to K-S statistics, the probability of
595
observing discrepancies greater than
596
d
597
598
k max is determined by a P
599
value [ 11 ] ,
600
601
where
602
603
604
n
605
606
k
607
is the effective number of data points in the K-S
608
statistics, which is the number of PM for gene
609
k in our case.
610
P carries the meaning of
611
probability, therefore is a number between 0 and 1.
612
Practically, Log
613
10
614
P , a negative value, is used as
615
our final representation. In this way, MOID uses a
616
continuous LogP criterion to replace MAS4 absolute calls.
617
Those signal sets that can be easily explained by noise
618
are assigned a LogP value closer to zero.
619
620
621
MOID algorithm for expression level
622
calculation
623
MOID uses the horizontal distance between
624
S
625
626
k
627
and
628
B to represent gene expression
629
level (Figure 7). Since the darkest probes may be more
630
likely caused by their poor binding properties to the
631
target gene, and the brightest probes may be more likely
632
caused by serious cross hybridization issues, different
633
parts of the integral distribution tend to have different
634
qualities. Therefore, the MOID algorithm uses the
635
horizontal distance measured at a certain percentile,
636
pct, instead of the whole curve.
637
Based on the analysis of the spiking experiments
638
discussed later, it is empirically determined that 70% is
639
the optimal percentile for signal retrieval. We also
640
tried to use the average of horizontal distances from
641
several
642
pct, but without significant
643
improvement. Therefore the expression level
644
E
645
646
k
647
, related to gene concentration for gene
648
k, is defined as
649
650
E
651
652
k
653
= max(
654
I |
655
656
S
657
658
k
659
=
660
pct -
661
I |
662
663
B =
664
pct , 0).
665
Instead of using the standard deviation, we take
666
confidence intervals directly from the distribution curve
667
668
S
669
670
k
671
, with the background subtracted. E.g., 80%
672
confidence intervals can be represented by a lower bound
673
(
674
E
675
676
kl
677
) at 10 percentile and an upper bound (
678
E
679
680
ku
681
) at 90 percentile of the distribution, i.e.,
682
683
E
684
685
kl
686
0(0.1) = max (
687
I |
688
689
S
690
691
k
692
= 0.1 -
693
I |
694
695
B =
696
pct , 0)
697
and
698
699
E
700
701
ku
702
0(0.9) = max (
703
I |
704
705
S
706
707
k
708
= 0.9 -
709
I |
710
711
B =
712
pct , 0).
713
As one might expect, increment of the probe set size
714
helps narrowing down the confidence intervals in a manner
715
determined by the statistics of probe distribution.
716
However, this piece of information is unknown and we took
717
an
718
ad hoc approach by modifying the
719
boundaries formulae as the following:
720
721
and
722
723
724
725
MOID algorithm for normalization
726
The most common way of understanding gene functions by
727
expression profiling is to study the change of their
728
expression levels in various disease states and cellular
729
environments. The observed expression level is a product
730
of complicated protocols, and is subjected to various
731
factors from sample preparation to final image scanning.
732
Therefore, it is a common practice to normalize
733
expression data drawn from multiple profiles before
734
making any reliable interpretation.
735
The normalization procedure in MAS4 aims at
736
normalizing the average intensities of probe sets
737
(excluding the top and bottom 2%). To avoid possible
738
signal contamination, MOID takes similar approach by
739
normalizing the probe sets between 10 and 90 percentiles.
740
However, it is noticed that an ideal normalization factor
741
should be derived from only those genes that do not
742
change their expression levels between the two
743
experiments. MAS software provides the possibility to use
744
a list of housekeeping genes for such purposes; however,
745
it certainly requires careful downstream research to
746
validate any such a list. Some common normalization
747
criteria were summarized in a recent study by Zien et al.
748
[ 12 ] . MOID uses a heuristic bootstrap method to
749
identify an approximate list of unchanged genes between
750
two data sets:
751
Step 1: include all genes as the normalization gene
752
list.
753
Step 2: sort the
754
E
755
756
k
757
values (already background subtracted) for all the
758
genes in the list; the interesting portion of expression
759
values (between 10% and 90% in our case) is retrieved and
760
average intensity is calculated for each experiment,
761
respectively. The initial normalization factor
762
nf is calculated by the ratio of
763
the two average intensities.
764
Step 3: normalize the data sets using
765
nf obtained in the previous step;
766
refine the gene list by further excluding those genes,
767
whose intensity values changed by more than a certain
768
fold, >
769
f
770
max or < 1/
771
f
772
max , between the two data sets.
773
Step 4: repeat Step 2 and update
774
nf to
775
nf', until one of the following
776
conditions is met:
777
1) the maximum number of iterations,
778
Itr
779
max , is reached;
780
2) |
781
nf/nf' - 1| ≤
782
max , where
783
max is a small predefined
784
threshold;
785
3) size of the normalization gene list drops below a
786
predefined threshold,
787
Sz
788
min .
789
In practice, a single iteration is usually sufficient
790
for most comparisons. User can adjust threshold
791
parameters towards their preferred stringent levels. In
792
this study we use
793
f
794
max = 2.0,
795
max = 0.05,
796
Itr
797
max = 5,
798
Sz
799
min = 1000.
800
801
802
MOID algorithm for comparison analysis
803
Different probes within the same probe set generally
804
respond very differently to mRNA sample fragments. We
805
examined various probe properties such as melting
806
temperature, nucleotide base composition, possible
807
sequence motifs, and potential secondary structures in
808
order to understand the cause of such diverse response
809
properties. That study did not find any conclusive factor
810
that explains observed probe signal distribution
811
satisfactory. Although it seems rather difficult to
812
pre-compute and predict probe responses, it is possible
813
to derive some features afterwards for a probe based on a
814
significant amount of expression data where the target
815
gene of interest is present [ 8 ] . Instead of using a
816
modeling approach to explain various signal
817
contributions, MOID uses a new approach to avoid the
818
affect of diversities of probe response factors in
819
calculating expression fold changes.
820
Let us assume any two probes in a probe set
821
k of size
822
n
823
824
k
825
always respond to their target gene concentration
826
with factor
827
r
828
1 and
829
r
830
2 , respectively. That is, if the gene
831
is present at concentration
832
c
833
1 and
834
c
835
2 in two hybridizations, the two
836
probes in average give signal intensities
837
r
838
1
839
c
840
1 and
841
r
842
2
843
c
844
1 in the first experiment,
845
r
846
1
847
c
848
2 and
849
r
850
2
851
c
852
2 in the second experiment. As in
853
MAS4, fold change is calculated by the ratio between
854
r
855
1
856
c
857
2 +
858
r
859
2
860
c
861
2 and
862
r
863
1
864
c
865
1 +
866
r
867
2
868
c
869
1 , where the result is essentially
870
dominated by the probes with the largest response
871
factors, and the statistics of the ratio becomes
872
difficult to estimate. However, we observe that by taking
873
the ratio of the two signals for each probe individually,
874
one essentially is looking at
875
n
876
877
k
878
independent measurements of
879
c
880
2 /
881
c
882
1 for that probe set. This opens a
883
possibility for obtaining a distribution of fold change
884
values.
885
If one assumes most probes have a rather stable
886
intrinsic response factor,
887
r, a fold change number can be
888
calculated from each probe
889
i in the set independently, which
890
hopefully removes the affect of the unknown response
891
factors
892
r
893
894
i
895
. However, it seems the response factor of each
896
probe has a non-negligible intrinsic spread as well; this
897
may be further complicated by alternative splicing and
898
tissue-specific cross hybridization issues. In addition,
899
the
900
n
901
902
k
903
ratio numbers calculated for a probe set are not
904
normally distributed (it is well known that even the
905
ratio of two normally distributed values is no longer
906
normally distributed). The current background subtraction
907
algorithm may not take fully into account response
908
factor-unrelated signals, therefore could further
909
increase the non-normal component. To overcome this
910
difficulty, MOID uses a percentile,
911
pct
912
913
f
914
, of the integral distribution formed by the
915
n
916
917
k
918
fold change numbers,
919
F
920
921
k
922
(
923
f ), which satisfied the boundary
924
conditions:
925
F (0) = 0 and
926
F (∞) = 1. Empirically, 70% is used
927
for
928
pct
929
930
f
931
as determined by spiking experiments discussed
932
later. The confidence intervals are also directly taken
933
from relevant percentiles in the distribution
934
corresponding to the lower bound and upper bound,
935
respectively. The effect of the number
936
n
937
938
k
939
is taken into account in the same heuristic manner
940
as we did previously for absolute analysis. The final
941
formulae are:
942
943
944
945
MAS4 Algorithm for Absolute Analysis
946
After a sample is hybridized to probes on a chip, the
947
chip is scanned and fluorescent signals are collected and
948
stored in an Affymetrix DAT file. For cell
949
i, after filtering out boundary
950
pixels,
951
N
952
953
i
954
numbers of pixels are used to calculate an average
955
intensity value
956
I
957
958
i
959
and standard deviation σ
960
961
i
962
. Hereafter we index a probe set by
963
k, which contains
964
n
965
966
k
967
probe pairs. The calculated intensity values for the
968
969
j th pair in the
970
k th probe set are denoted as
971
PM
972
973
kj
974
for PM and
975
MM
976
977
kj
978
for MM. Background intensity,
979
bg, is derived from the average
980
intensity of the 2% darkest cells. Noise level,
981
Q, is determined as
982
983
where average is done over all background cells.
984
MAS4 assumes that the cross hybridization component of
985
a match is captured by its mismatch companion, therefore
986
the difference,
987
PM
988
989
k
990
-
991
MM
992
993
k
994
, is used to derive expression levels. MAS4 also
995
uses a "super-Olympic scoring" procedure to iteratively
996
filter out outliers, defined as those probe pairs with
997
difference values more than three times the standard
998
deviation away from the mean. After filtering, the
999
average difference value for a probe set,
1000
D
1001
1002
k
1003
, is used to represent the expression level of that
1004
particular gene.
1005
The significance of gene presence is determined by an
1006
empirical absolute call decision matrix coupled with
1007
proprietary MAS4 algorithms, parts of which are described
1008
in the Microarray Suite documentation. As the result, a
1009
probe set is marked as either "Present", "Absent", or
1010
"Marginal".
1011
MAS4 calculates average difference by
1012
1013
D
1014
1015
k
1016
= <
1017
PM
1018
1019
kj
1020
-
1021
MM
1022
1023
kj
1024
>
1025
1026
j
1027
,
1028
which is essentially
1029
1030
D
1031
1032
k
1033
= <
1034
PM
1035
1036
kj
1037
>
1038
1039
j
1040
- <
1041
MM
1042
1043
kj
1044
>
1045
1046
j
1047
,
1048
where the average is taken over all the probe pairs
1049
surviving the super-Olympic filtering process mentioned
1050
above. Based on this observation, the one-one
1051
correspondences between
1052
PM
1053
1054
kj
1055
and
1056
MM
1057
1058
kj
1059
, typically with an average correlation coefficient
1060
around 0.8, are essentially lost during such averaging.
1061
Since MAS4 is successful in various applications despite
1062
the fact it averages out probe-specific mismatch signals,
1063
we hypothesize that contributions of the mismatch probes
1064
are mainly probe-nonspecific, if the pairwise
1065
correspondence between
1066
PM
1067
1068
kj
1069
and
1070
MM
1071
1072
kj
1073
are not enforced. This observation led to the idea
1074
of only using match probes in the MOID algorithm
1075
discussed above.
1076
1077
1078
MAS4 Algorithm for Normalization
1079
MAS4 introduces a normalization factor (a scaling
1080
factor in MAS4 serves the same purpose). The goal of
1081
normalization is to reduce false positives in the
1082
expression fold change calculation and ensure
1083
housekeeping genes are marked as unchanged. Since it is
1084
usually unknown
1085
a priori what genes sustain their
1086
expression level during the two experiments, different
1087
heuristic normalization schemes may be adopted and it is
1088
unclear at this point which particular approach is
1089
optimal. MAS4 normally derives the normalization factor
1090
by scaling average
1091
D
1092
1093
k
1094
for all genes with expression levels within the
1095
central 96% to a certain target intensity constant. This
1096
algorithm is based on the assumption that the total copy
1097
of mRNA per cell is a conserved number among
1098
experiments.
1099
1100
1101
MAS4 Algorithm for Comparison Analysis
1102
In comparison of two chips, we follow Affymetrix
1103
convention of calling the reference chip measurement as
1104
the base (
1105
b ), the other one as the
1106
experiment (
1107
e ). After the above normalization
1108
procedure, the fold change value for gene
1109
k, f
1110
1111
k
1112
, is essentially calculated by dividing
1113
D
1114
1115
k
1116
(
1117
e ) with
1118
D
1119
1120
k
1121
(
1122
b ). MAS4 uses a filtering process
1123
to ensure that a common subset of "good" probe pairs
1124
between
1125
b and
1126
e are used to recalculate
1127
D
1128
1129
k
1130
. Very weak expression signals (sometimes negative)
1131
are rounded to the noise level mentioned before. The
1132
final formula for
1133
f is:
1134
1135
f
1136
1137
k
1138
= [
1139
D
1140
1141
k
1142
(
1143
e ) + max (
1144
Q -
1145
D
1146
1147
k
1148
(
1149
b ), 0)] / max (
1150
D
1151
1152
k
1153
(
1154
b ),
1155
Q ), if
1156
D
1157
1158
k
1159
(
1160
e ) ≥
1161
D
1162
1163
k
1164
(
1165
b ),
1166
or
1167
1168
f
1169
1170
k
1171
= max (
1172
D
1173
1174
k
1175
(
1176
e ),
1177
Q ) / [
1178
D
1179
1180
k
1181
(
1182
b ) + max (
1183
Q -
1184
D
1185
1186
k
1187
(
1188
e ), 0)], otherwise.
1189
An empirical difference call decision matrix and a
1190
proprietary MAS4 algorithm, partly described in the
1191
Microarray Suite 4.0 documentation, determine the
1192
significance of the fold change. As a result, the probe
1193
set is marked as "Increase", "No Change", "Decrease",
1194
"Marginally Increase", or "Marginally Decrease". It is
1195
obvious that the fold change number is only meaningful if
1196
the probe set is clearly present at a level above the
1197
noise in both data sets.
1198
1199
1200
1201
List of Abbreviations
1202
MOID - match-only integral distribution
1203
MAS4 - Affymetrix algorithms implemented in its
1204
Microarray Suit 4.0 software
1205
PM - perfect match
1206
DMM - mismatch
1207
1208
1209
1210
1211