CoCalc -- 1471-2105-3-3.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-3-3.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Background
6
        Genomics sequencing projects have rapidly generated
7
        tremendous amount of information. At the time of writing,
8
        the NCBI UniGene database [ 1 ]
9
        http://www.ncbi.nlm.nih.gov/UniGenecontained 96,109 Homo
10
        sapiens clusters and 85,047 Mus musculus clusters.
11
        Predictions from the Human Genome Project [ 2 ] and Celera
12
        Genomics [ 3 ] suggest there are about 26,000-40,000 human
13
        genes. Other recent studies suggest that these numbers may
14
        be an underestimation and that the human genome appears
15
        more complicated [ 4 ] . Understanding the functions of
16
        such a large number of genes has been an unprecedented
17
        challenge for functional genomics research. As the array of
18
        hope in recent years, gene expression array technology has
19
        quickly grown into a powerful tool to chart a gene atlas in
20
        various biological sources and under various conditions in
21
        a massively parallel manner [ 5 6 7 ] . Facing the
22
        challenge of annotating such a huge amount of genomic data,
23
        increasing array information density and improving analysis
24
        algorithms have become two critical research areas to
25
        ensure that gene expression profiling proceeds in an
26
        efficient and cost effective manner.
27
        Take an Affymetrix high-density oligonucleotide GeneChip
28
        http://www.affymetrix.comfor example. Firstly, its human
29
        U95 series chip consists of 5 chip types with 12,000 coding
30
        clusters each, which makes it expensive to profile all the
31
        human genes in samples of interest. Can a gene chip take
32
        more genes? Comparing its U95 chip and Human 6800 chip,
33
        Affymetrix has already increased chip information density
34
        by 20% by reducing the number of probe pairs per gene from
35
        20 to 16. Since demand for higher information density has
36
        still not been met, it is of interest to study the probe
37
        number effect in detail. Secondly, most optional research
38
        efforts focus on the downstream statistical and clustering
39
        analysis. However, on the upstream side, Affymetrix chip
40
        users are still dependent on the Microarray Suite ®software
41
        that comes with the measurement system to interpret raw
42
        data. The Affymetrix algorithm implemented in its
43
        Microarray Suite 4.0 package (referred hereafter as MAS4)
44
        uses empirical rules derived from its internal research
45
        data to assign absolute calls for the significance of gene
46
        presence and assign fold change calls for the significance
47
        of expression variations. Such discrete categorizations are
48
        not the most appropriate language to describe quantities of
49
        continuous nature. Although it is well known that fold
50
        change numbers have defined behaviors of uncertainty, there
51
        are very few studies in this area. How does one assign
52
        statistical significance to expression analysis results?
53
        This work presents our preliminary research results for the
54
        two questions raised above.
55
        The Affymetrix gene chip layout used in this study
56
        contains the same number of perfect match (PM) probes and
57
        mismatch (MM) probes. MAS4 uses differences between these
58
        two types of probes for gene expression signals. The
59
        primary goal of Match-Only Integral Distribution (MOID)
60
        algorithm is to discard mismatch information, which allows
61
        immediate doubling of the chip information density. In this
62
        study, the performance of both algorithms were benchmarked
63
        using 366 known fold change values derived from 34 spiking
64
        experiments. Their false positive tendencies were assessed
65
        by no-change expression experiments. Computer simulations
66
        were used to study their noise tolerances, and to determine
67
        the minimum number of probes required for chip
68
        analysis.
69
        The idea of using PM-only information is based on the
70
        following observations: MAS4 essentially discards the
71
        one-one correspondence between a PM and its MM partner (for
72
        details, see materials and methods on MAS4 algorithm for
73
        absolute analysis) and still gives satisfactory
74
        interpretation, which suggests the contribution of MM
75
        probes might be approximated in a nonspecific manner
76
        overall. After we designed the first mismatch-free gene
77
        chip (GNF-HS1) in July 1999, the match-only expression
78
        analysis idea was proposed in other independent studies as
79
        well. Li (submitted, 2001) adjusted their previous
80
        model-based analysis of oligonucleotide arrays [ 8 ] to
81
        PM-only calculations and found their results correlate well
82
        with that of using PM and MM information. In addition, the
83
        idea is endorsed by recent studies of Naef 
84
        et al [ 9 ] and Irizarry 
85
        et al (2002, in prepare)
86
        http://biosun01.biostat.jhsph.edu/~ririzarr/papers/.
87
        One difficulty in comparing algorithms for gene
88
        expression analysis is the lack of "known" results. Here we
89
        overcome the problem by resorting to a spiking set and set
90
        of no-change experiments, where results are unambiguous.
91
        Model-based methods [ 8 ] generally require a reasonable
92
        numbers of training experiments of the same chip type,
93
        among which probes under study give significantly large
94
        signals in at least some experiments. Considering the fact
95
        that the 34 spiking experiments used were obtained by three
96
        different chip types, and in addition some experiments are
97
        replicates under different concentrations and hybridization
98
        conditions, it is impractical to sufficiently train
99
        model-based algorithms in this study. We limit our study to
100
        MOID and MAS4, where training is not required.
101
        In the materials and methods section, we summarize
102
        Affymetrix chip technologies and describe the MOID
103
        algorithms in detail. MAS4 algorithms for both absolute
104
        expression analysis and comparison analysis are included
105
        afterwards. The benchmarks used for algorithm comparison
106
        were explained and comparison results were shown in the
107
        results session. Issues such as noise-tolerance of both
108
        algorithms and further reduction of the probe set size are
109
        also included. We discuss generalization of the
110
        normalization algorithm, which may be of interest to other
111
        researchers. Finally, the main differences between MAS4 and
112
        MOID are summarized in a tabular form as conclusions.
113
      
114
      
115
        Results and Discussion
116
        
117
          Spiking experiments
118
          Spiking experiments were done by adding to tissue
119
          samples a certain number of control genes with known
120
          concentrations. Since signal intensity is not exactly
121
          proportional to gene concentration across different probe
122
          sets, only the fold change values of each gene between
123
          comparisons are considered reliable and used to benchmark
124
          both algorithms. We are fortunate to have access to 34
125
          spiking experiments done previously in GNF for a
126
          hybridization protocol study, where 366 independent known
127
          fold change numbers were derived. These experiments
128
          covered three Affymetrix chip types: Hu35kSubA, Hu6800,
129
          and Mu11kSubA. True experimental fold change values, 
130
          f 
131
          exp , range from 1.0 to 10.0. The raw
132
          data (34 CEL files) and experimental fold change values
133
          are available from
134
          http://carrier.gnf.org/publications/MOID.
135
          We applied both algorithms to these array data and the
136
          fold change numbers calculated were compared to known
137
          experimental values. Because of the way we defined the
138
          base and experiment, 
139
          f 
140
          exp are always no less than 1.0. For
141
          MAS4, we used all the default parameters used by the
142
          Microarray Suite 4.0 program in the calculation. For
143
          MOID, we found using 70 percentile for 
144
          both pct and 
145
          pct 
146
          
147
            f 
148
           (defined in materials and methods) gave optimal
149
          performance. Figure 1shows the comparison results (data
150
          available from http://carrier.gnf.org/publications/MOID).
151
          Two algorithms give virtually similar performance.
152
          In the comparison, we defined relative error, 
153
          R 
154
          err , as the following to benchmark
155
          each algorithm,
156
          
157
          R 
158
          err1 = <|( 
159
          f 
160
          calc - 
161
          f 
162
          exp> )/ 
163
          f 
164
          exp |> 
165
          all pairs ,
166
          
167
          R 
168
          err2 = <|(1/ 
169
          f 
170
          calc - 1)/ 
171
          f 
172
          exp ) · 
173
          f 
174
          exp |> 
175
          all pairs ,
176
          and
177
          
178
          R 
179
          err = <|( 
180
          f 
181
          exp / 
182
          f 
183
          calc - 
184
          f 
185
          calc / 
186
          f 
187
          exp )|> 
188
          all pairs .
189
          Where 
190
          f 
191
          exp and 
192
          f 
193
          calc are fold change values from
194
          experiment and calculation, respectively. 
195
          R 
196
          err1 and 
197
          R 
198
          err2 are defined symmetrically for
199
          relative errors of 
200
          f 
201
          exp and 1/ 
202
          f 
203
          exp , reflecting the fact that base
204
          and experiment can be defined arbitrarily, and the final 
205
          R 
206
          err is defined as the average of the
207
          two. Each relative error number is an average over all
208
          the 366 data points.
209
          It seems the two algorithms have very close
210
          performance Table 1. They all have greater error margins
211
          for calculating down-regulated fold changes (captured by 
212
          R 
213
          err2 ) than up-regulated ones
214
          (captured by 
215
          R 
216
          err1 ). MAS4 is more asymmetric than
217
          MOID in this aspect.
218
        
219
        
220
          No-Change experiments
221
          Another set of experiments used for benchmarking
222
          algorithms is the comparison between two experiments
223
          where mRNA prepared from the same tissue sample was
224
          hybridized twice under slightly different conditions.
225
          Those fold change numbers that deviate largely from 1.0
226
          are considered to be false positives. When we applied the
227
          algorithms to ten experiments done with either human
228
          brain or human lung replicate samples (same mRNA,
229
          slightly different hybridization conditions), the results
230
          were all similar. Figure 2shows one of the typical
231
          comparison results. When only "Present" genes (defined by
232
          MAS4) were taken into account, 80% of fold change numbers
233
          were spread across 0.54 for MAS4 and only 0.30 for MOID.
234
          Clearly MOID assigns much less false positive fold
235
          changes than MAS4 does in this comparison. If all the
236
          genes were counted, the spreads of MAS4 and MOID results
237
          would be 0.89 and 0.29 for the same data sets. This test
238
          suggests MOID is more robust than MAS4.
239
        
240
        
241
          Reduced Probe Set Simulation
242
          In the studies above, we demonstrated the feasibility
243
          of a match-only gene chip design based on the MOID
244
          algorithm. In order to further increase chip information
245
          density, it is of common interest to understand how many
246
          probes are sufficient for expression analysis and push
247
          for the lower limit of the size of a probe set. This is
248
          done via computer simulations.
249
          In the simulation, for each probe set, a subset of 
250
          n 
251
          
252
            r 
253
           probes were randomly chosen to be used in the
254
          calculation. Both the spiking and no-change calculations
255
          presented above were repeated with the selected subset,
256
          as we gradually reduced 
257
          n 
258
          
259
            r 
260
           . Figure 3and figure 4show the results for the
261
          spiking calculation and no-change calculation,
262
          respectively. As the graph suggested, accuracy of MOID is
263
          essentially unaffected while reducing the number of
264
          probes down to ten. This result enables us to almost
265
          triple the amount of clusters one can put on a gene chip
266
          using MOID design. Combined with other new design ideas,
267
          MOID lays the foundation for the first universal human
268
          chip that contains 75,000 UniGene clusters (release 116).
269
          The results have recently been validated and led to many
270
          interesting discoveries, which provides indirect support
271
          for MOID (to be submitted).
272
        
273
        
274
          Noise Tolerance of both Algorithms
275
          Computer simulations also enable us to study the
276
          behaviors of both algorithms under noise disturbance. In
277
          simulations, a subset of 
278
          n 
279
          
280
            n 
281
           probes were randomly chosen, their PM values were
282
          multiplied by ten to mimic the effect of serious noise
283
          effect. We repeated the calculations for both spiking
284
          experiments and no-change experiments, while gradually
285
          increasing 
286
          n 
287
          
288
            n 
289
           . Figure 5and figure 6show the results for spiking
290
          calculation and no-change calculation, respectively. As
291
          the graph suggested, both MAS4 and MOID can resist noise
292
          perturbation up to 2 cells out of 16, the test favors
293
          MOID slightly.
294
        
295
        
296
          Probe-Specific Effect
297
          MOID is based on the assumption that the background
298
          for probes is mainly non-probe-specific. The results
299
          indicate that this assumption is sound overall. However,
300
          it is clear that by better understanding probe-specific
301
          behaviors, the analysis accuracy can be further improved.
302
          For instance, probe response factors might be derived by
303
          accumulating and studying many experiments. If there are
304
          a sufficient number of experiments, where the target gene
305
          is significantly present, probe response factors may be
306
          retrieved by some statistical modeling. A recent study [
307
          8 ] provides an important example in this direction.
308
          Researchers at Corimbia http://www.corimbia.comalso
309
          developed some proprietary methods to identify "good"
310
          probes and assign different weights to them to improve
311
          data analysis. Efforts can be made along similar lines
312
          with the MOID algorithm in the future. If probe response
313
          factors can be calculated, the "bad" but "stable" probes
314
          can be scaled, and the distribution may be expected to be
315
          closer to normal, therefore expression levels and their
316
          uncertainties can be calculated in a more statistically
317
          sound manner.
318
          Future studies for more accurate background
319
          subtraction models may also improve the fold change
320
          distribution; therefore yield better statistics for fold
321
          change evaluation as well. The relative error in spiking
322
          experiments (28%) suggests that there is room for future
323
          improvement.
324
        
325
        
326
          LogP and Absolute Calls
327
          In the derivation of LogP (see materials and methods,
328
          Figure 7), it is assumed that the statistics of any
329
          absent probe can be described by the general background
330
          curve 
331
          B. The 
332
          P value should not be literally
333
          interpreted as accurate probabilities, since it is only
334
          as good as the underlining assumption. Users should also
335
          bear in mind that a 
336
          P value is for the null hypothesis
337
          that the discrepancy between observables and generic
338
          mismatches is generated by noise, which is different from
339
          the likelihood of gene presence conditioned on the
340
          observed discrepancy. However, LogP values more or less
341
          serve as a statistical indicator for the sorting of genes
342
          by their significance. The discrete absolute calls in
343
          MAS4 are determined in a completely different empirically
344
          manner. Although higher LogP values have a higher
345
          tendency of being assigned "Absent" in MAS4, the exact
346
          correspondence between the two varies from experiment to
347
          experiment. Roughly, LogP values above -3.0 have a
348
          greater chance to be called "Absent" than "Present" by
349
          MAS4. Users should be aware not to over interpret this
350
          correspondence. One interesting observation we had in
351
          this study was that discrete calls by MAS4 are not as
352
          stable as LogP values. Sometimes a gene can be reassigned
353
          to "Present" from "Absent" by MAS4 due to a small
354
          perturbation in the underlying quantities without going
355
          through a "Marginal" transitional stage.
356
        
357
        
358
          Curve Normalization
359
          It is clear by observing various experiments that the
360
          response of intensity to signal varies in different
361
          intensity regions. Occasionally, the normalization
362
          constant 
363
          n 
364
          
365
            f 
366
           does not cover well for all intensity values. A more
367
          generalized normalization procedure should use a
368
          normalization curve instead of a constant factor. The
369
          MOID normalization algorithm can be easily modified to
370
          the following (see [ 10 ] for a similar approach):
371
          Step 1: include all genes in the normalization gene
372
          list.
373
          Step 2: using the 
374
          E 
375
          
376
            k 
377
           values for all the genes in the list, generate
378
          integral intensity distributions for both experiments. A
379
          normalization curve 
380
          NF(I) is constructed in such a way
381
          that the two integral distributions are identical after
382
          normalization.
383
          Step 3: normalize the data sets using 
384
          NF obtained in the previous step;
385
          refine the gene list by further excluding those genes
386
          whose intensity values changed by more than a certain
387
          fold, > 
388
          f 
389
          max or < 1/ 
390
          f 
391
          max , between the two data sets.
392
          Step 4: repeat Step 2 and update 
393
          NF to 
394
          NF', until one of the following
395
          conditions is met:
396
          1) the maximum number of iterations, 
397
          Itr 
398
          max , is reached;
399
          2) max(| 
400
          NF ( 
401
          I )/ 
402
          NF '( 
403
          I ) - 1|) ≤ 
404
          max , where 
405
          max is a small predefined
406
          threshold;
407
          3) size of the normalization gene list drops below a
408
          predefined threshold, 
409
          Sz 
410
          min .
411
          This algorithm offers a chance to correct
412
          non-linearity in the chip system to a certain extent. To
413
          demonstrate the algorithm, we applied the procedure to
414
          two measurements, where genomics DNA sample of yeast
415
          s288c strain were hybridized onto Affymetrix YG_S98
416
          arrays. The second experiment was a repeat of the first
417
          one after about 50 days, where some of hybridization and
418
          scanning parameters had been changed over the course. It
419
          is shown in Figure 8that the curve normalization
420
          procedure out-performed constant normalization as
421
          expected.
422
        
423
      
424
      
425
        Conclusions
426
        MOID algorithm allows at least double or even triple the
427
        information density of current (U95 human chip) Affymetrix
428
        high-density oligo nucleotide arrays without compromising
429
        analysis accuracy. Table 2summarizes feature comparisons
430
        between MOID and MAS4.
431
        It should be noted that at the time of this study,
432
        Affymetrix U95 chip uses 16 probe pairs per set. As
433
        Affymetirx is planning on further reducing probe set size
434
        in their next design, the density improvement by using MOID
435
        may be less than the estimation given above.
436
      
437
      
438
        Materials and Methods
439
        
440
          Affymetrix Oligonucleotide Chip
441
          At the time of this study, Affymetrix synthesizes
442
          25-mer oligonucleotides on a 640 × 640 array with 20 μm
443
          feature size using photolithographic fabrication
444
          techniques (some data used in this study were collected
445
          from some previous generation arrays of 540 × 540 cells
446
          and 24 μm feature size). Each cell (also called a probe)
447
          in the array contains the same oligonucleotide sequences.
448
          As shown in Figure 9, typically a set of 16-20 probes are
449
          designed for each targeting cluster (called a probe set,
450
          or sometimes a "gene".). An expression value will be
451
          derived per probe set as the result of analysis. Probes
452
          taken as fragments from a target sequence are called
453
          perfect matches (PM). Multiple matches per set serve as
454
          independent signal detectors and provide a possibility to
455
          capture statistical uncertainties. For each match, there
456
          is also a corresponding mismatch (MM) probe, whose
457
          sequence differs from its match by a single base in the
458
          middle. Mismatches are meant to detect cross
459
          hybridization components of their corresponding matches,
460
          and are used by MAS4 for probe-specific background
461
          subtraction. There are also several Affymetrix control
462
          sets on each chip used for quality references, which were
463
          used in the spiking experiments to validate and refine
464
          hybridization protocols.
465
        
466
        
467
          Intensity Distribution
468
          In an ideal case, all the probes in a set should give
469
          similar signals and serve as replicate measurements. One
470
          common rule in probe design is to select probes with
471
          similar melting temperatures to minimize the variances
472
          among probes. The chemistry involved in a hybridization
473
          experiment is often too complicated to be predicted
474
          computationally at the current stage, therefore in
475
          reality probe intensity distribution for a set is usually
476
          fairly wide and has a long tail, which causes all kinds
477
          of difficulty for data analysis.
478
          One usual attraction in expression analysis is to
479
          assume a normal distribution for probe intensities. This
480
          hypothesis can be examined by the following calculation.
481
          For each probe set, we applied a linear transformation to
482
          probe intensities, so that the mean and the standard
483
          deviation of the intensities in the set were normalized
484
          to zero and one, respectively. Then all the resulting
485
          match intensities were used to generate an overall
486
          distribution. Our calculation shows the normal
487
          distribution assumption is not an appropriate one. When
488
          similar analysis was applied to mismatch intensities and
489
          mismatch-subtracted match intensities, the conclusion
490
          stays more or less the same.
491
          Despite the fact that the intensity distribution of a
492
          set is non-Gaussian, it is still crucial to find out the
493
          distribution function for match or mismatch intensities
494
          to generate meaningful statistics tests, if such
495
          distributions actually exist. The authors tried several
496
          analytical functions on match and mismatch probe
497
          intensities. Unfortunately, the distributions seemed to
498
          be quite "bad", the tail portions even fade out slower
499
          than the extreme value distribution. It was found that
500
          distributions of mismatch signals also significantly
501
          depended on the sample and experimental conditions. One
502
          may reasonably suspect that cross hybridization,
503
          intensity saturation, overall sample concentration, chip
504
          production irregularities, and miscellaneous noise may
505
          all cause a change in the signal distribution. The study
506
          seems to suggest the distribution of cross hybridization
507
          should be taken from each experiment as a result of
508
          measurement instead of using an 
509
          a priori determined analytical
510
          expression. This guideline is used in MOID algorithm.
511
        
512
        
513
          MOID algorithm: hypothesis and principles
514
          As mentioned in the intensity distribution analysis,
515
          MOID assumes cross hybridization behavior of probes is
516
          mainly non-probe-specific, and therefore all the match
517
          cells can share the same cross hybridization background.
518
          MOID primarily aims at saving 50% of the chip space
519
          currently dedicated to mismatch cells in the Affymetrix
520
          design under this study. It was also discussed before
521
          that cross hybridization distributions should be taken
522
          from experimental data instead of from some analytical
523
          function. Based upon the general belief that there is
524
          always a significant amount of genes unexpressed in any
525
          mRNA experiment, we will use the match cell signals from
526
          the 5% darkest probe sets in the current public
527
          Affymetrix arrays to prove the point that it is possible
528
          for MOID not to use any mismatch signal. For the next
529
          generation gene chip designed based on MOID algorithm,
530
          some designated probes may be used as generic mismatches
531
          to collect non-probe-specific cross hybridization data.
532
          In fact, on GNF-HS1, 16,460 probes from 476 viral genes
533
          are used for such a purpose.
534
          As discussed, the distributions of quantities in
535
          expression analysis are usually asymmetric and contain a
536
          long tail, regardless whether it is match intensity,
537
          mismatch intensity, or average intensity difference. For
538
          such abnormal distributions, concepts like mathematical
539
          average and standard deviation are not the most
540
          appropriate statistical terms to use. Pre-filtering the
541
          data, like what MAS4 does, certainly helps reshape the
542
          distribution, but by no means this can bring the
543
          distribution back to normal without throwing out a
544
          significant portion of measurements. The MOID algorithm
545
          is specifically designed to avoid these problems by using
546
          percentile instead of mean and using confidence intervals
547
          instead of standard deviation in all possible cases.
548
        
549
        
550
          MOID algorithm for significance of gene
551
          presence
552
          A non-probe-specific integral background distribution
553
          (representing noise, cross hybridization, and possible
554
          other factors), 
555
          B ( 
556
          I ), is derived from the
557
          intensities of the 5% darkest probe sets (or from generic
558
          mismatches in the future). 
559
          I stands for intensity; 
560
          B satisfies boundary conditions: 
561
          B (0) = 0 and 
562
          B (∞) = 1. For each probe set 
563
          k, the match signals are sorted and
564
          the integral distribution, 
565
          S 
566
          
567
            k 
568
           ( 
569
          I ), is generated as well. Figure
570
          7is a schematic diagram.
571
          If the gene represented by a probe set is absent, the
572
          intensity data from matches are due to pure background
573
          contribution, therefore 
574
          S 
575
          
576
            k 
577
           is likely to be close to 
578
          B. We choose the Kolmogorov-Smirnov
579
          test to determine likelihood that the observed signal
580
          distribution, 
581
          S 
582
          
583
            k 
584
           , could be explained by 
585
          B. If we define 
586
          d 
587
          
588
          k max as the maximum vertical
589
          distance between 
590
          B and 
591
          S 
592
          
593
            k 
594
           , according to K-S statistics, the probability of
595
          observing discrepancies greater than 
596
          d 
597
          
598
          k max is determined by a P
599
          value [ 11 ] ,
600
          
601
          where
602
          
603
          
604
          n 
605
          
606
            k 
607
           is the effective number of data points in the K-S
608
          statistics, which is the number of PM for gene 
609
          k in our case. 
610
          P carries the meaning of
611
          probability, therefore is a number between 0 and 1.
612
          Practically, Log 
613
          10 
614
          P , a negative value, is used as
615
          our final representation. In this way, MOID uses a
616
          continuous LogP criterion to replace MAS4 absolute calls.
617
          Those signal sets that can be easily explained by noise
618
          are assigned a LogP value closer to zero.
619
        
620
        
621
          MOID algorithm for expression level
622
          calculation
623
          MOID uses the horizontal distance between 
624
          S 
625
          
626
            k 
627
           and 
628
          B to represent gene expression
629
          level (Figure 7). Since the darkest probes may be more
630
          likely caused by their poor binding properties to the
631
          target gene, and the brightest probes may be more likely
632
          caused by serious cross hybridization issues, different
633
          parts of the integral distribution tend to have different
634
          qualities. Therefore, the MOID algorithm uses the
635
          horizontal distance measured at a certain percentile, 
636
          pct, instead of the whole curve.
637
          Based on the analysis of the spiking experiments
638
          discussed later, it is empirically determined that 70% is
639
          the optimal percentile for signal retrieval. We also
640
          tried to use the average of horizontal distances from
641
          several 
642
          pct, but without significant
643
          improvement. Therefore the expression level 
644
          E 
645
          
646
            k 
647
           , related to gene concentration for gene 
648
          k, is defined as
649
          
650
          E 
651
          
652
            k 
653
           = max( 
654
          I | 
655
          
656
          S 
657
          
658
            k 
659
           = 
660
          pct  - 
661
          I | 
662
          
663
          B = 
664
          pct  , 0).
665
          Instead of using the standard deviation, we take
666
          confidence intervals directly from the distribution curve
667
          
668
          S 
669
          
670
            k 
671
           , with the background subtracted. E.g., 80%
672
          confidence intervals can be represented by a lower bound
673
          ( 
674
          E 
675
          
676
            kl 
677
           ) at 10 percentile and an upper bound ( 
678
          E 
679
          
680
            ku 
681
           ) at 90 percentile of the distribution, i.e.,
682
          
683
          E 
684
          
685
            kl 
686
           0(0.1) = max ( 
687
          I | 
688
          
689
          S 
690
          
691
            k 
692
           = 0.1 - 
693
          I | 
694
          
695
          B = 
696
          pct  , 0)
697
          and
698
          
699
          E 
700
          
701
            ku 
702
           0(0.9) = max ( 
703
          I | 
704
          
705
          S 
706
          
707
            k 
708
           = 0.9 - 
709
          I | 
710
          
711
          B = 
712
          pct  , 0).
713
          As one might expect, increment of the probe set size
714
          helps narrowing down the confidence intervals in a manner
715
          determined by the statistics of probe distribution.
716
          However, this piece of information is unknown and we took
717
          an 
718
          ad hoc approach by modifying the
719
          boundaries formulae as the following:
720
          
721
          and
722
          
723
        
724
        
725
          MOID algorithm for normalization
726
          The most common way of understanding gene functions by
727
          expression profiling is to study the change of their
728
          expression levels in various disease states and cellular
729
          environments. The observed expression level is a product
730
          of complicated protocols, and is subjected to various
731
          factors from sample preparation to final image scanning.
732
          Therefore, it is a common practice to normalize
733
          expression data drawn from multiple profiles before
734
          making any reliable interpretation.
735
          The normalization procedure in MAS4 aims at
736
          normalizing the average intensities of probe sets
737
          (excluding the top and bottom 2%). To avoid possible
738
          signal contamination, MOID takes similar approach by
739
          normalizing the probe sets between 10 and 90 percentiles.
740
          However, it is noticed that an ideal normalization factor
741
          should be derived from only those genes that do not
742
          change their expression levels between the two
743
          experiments. MAS software provides the possibility to use
744
          a list of housekeeping genes for such purposes; however,
745
          it certainly requires careful downstream research to
746
          validate any such a list. Some common normalization
747
          criteria were summarized in a recent study by Zien et al.
748
          [ 12 ] . MOID uses a heuristic bootstrap method to
749
          identify an approximate list of unchanged genes between
750
          two data sets:
751
          Step 1: include all genes as the normalization gene
752
          list.
753
          Step 2: sort the 
754
          E 
755
          
756
            k 
757
           values (already background subtracted) for all the
758
          genes in the list; the interesting portion of expression
759
          values (between 10% and 90% in our case) is retrieved and
760
          average intensity is calculated for each experiment,
761
          respectively. The initial normalization factor 
762
          nf is calculated by the ratio of
763
          the two average intensities.
764
          Step 3: normalize the data sets using 
765
          nf obtained in the previous step;
766
          refine the gene list by further excluding those genes,
767
          whose intensity values changed by more than a certain
768
          fold, > 
769
          f 
770
          max or < 1/ 
771
          f 
772
          max , between the two data sets.
773
          Step 4: repeat Step 2 and update 
774
          nf to 
775
          nf', until one of the following
776
          conditions is met:
777
          1) the maximum number of iterations, 
778
          Itr 
779
          max , is reached;
780
          2) | 
781
          nf/nf' - 1| ≤ 
782
          max , where 
783
          max is a small predefined
784
          threshold;
785
          3) size of the normalization gene list drops below a
786
          predefined threshold, 
787
          Sz 
788
          min .
789
          In practice, a single iteration is usually sufficient
790
          for most comparisons. User can adjust threshold
791
          parameters towards their preferred stringent levels. In
792
          this study we use 
793
          f 
794
          max = 2.0, 
795
          max = 0.05, 
796
          Itr 
797
          max = 5, 
798
          Sz 
799
          min = 1000.
800
        
801
        
802
          MOID algorithm for comparison analysis
803
          Different probes within the same probe set generally
804
          respond very differently to mRNA sample fragments. We
805
          examined various probe properties such as melting
806
          temperature, nucleotide base composition, possible
807
          sequence motifs, and potential secondary structures in
808
          order to understand the cause of such diverse response
809
          properties. That study did not find any conclusive factor
810
          that explains observed probe signal distribution
811
          satisfactory. Although it seems rather difficult to
812
          pre-compute and predict probe responses, it is possible
813
          to derive some features afterwards for a probe based on a
814
          significant amount of expression data where the target
815
          gene of interest is present [ 8 ] . Instead of using a
816
          modeling approach to explain various signal
817
          contributions, MOID uses a new approach to avoid the
818
          affect of diversities of probe response factors in
819
          calculating expression fold changes.
820
          Let us assume any two probes in a probe set 
821
          k of size 
822
          n 
823
          
824
            k 
825
           always respond to their target gene concentration
826
          with factor 
827
          r 
828
          1 and 
829
          r 
830
          2 , respectively. That is, if the gene
831
          is present at concentration 
832
          c 
833
          1 and 
834
          c 
835
          2 in two hybridizations, the two
836
          probes in average give signal intensities 
837
          r 
838
          1 
839
          c 
840
          1 and 
841
          r 
842
          2 
843
          c 
844
          1 in the first experiment, 
845
          r 
846
          1 
847
          c 
848
          2 and 
849
          r 
850
          2 
851
          c 
852
          2 in the second experiment. As in
853
          MAS4, fold change is calculated by the ratio between 
854
          r 
855
          1 
856
          c 
857
          2 + 
858
          r 
859
          2 
860
          c 
861
          2 and 
862
          r 
863
          1 
864
          c 
865
          1 + 
866
          r 
867
          2 
868
          c 
869
          1 , where the result is essentially
870
          dominated by the probes with the largest response
871
          factors, and the statistics of the ratio becomes
872
          difficult to estimate. However, we observe that by taking
873
          the ratio of the two signals for each probe individually,
874
          one essentially is looking at 
875
          n 
876
          
877
            k 
878
           independent measurements of 
879
          c 
880
          2 / 
881
          c 
882
          1 for that probe set. This opens a
883
          possibility for obtaining a distribution of fold change
884
          values.
885
          If one assumes most probes have a rather stable
886
          intrinsic response factor, 
887
          r, a fold change number can be
888
          calculated from each probe 
889
          i in the set independently, which
890
          hopefully removes the affect of the unknown response
891
          factors 
892
          r 
893
          
894
            i 
895
           . However, it seems the response factor of each
896
          probe has a non-negligible intrinsic spread as well; this
897
          may be further complicated by alternative splicing and
898
          tissue-specific cross hybridization issues. In addition,
899
          the 
900
          n 
901
          
902
            k 
903
           ratio numbers calculated for a probe set are not
904
          normally distributed (it is well known that even the
905
          ratio of two normally distributed values is no longer
906
          normally distributed). The current background subtraction
907
          algorithm may not take fully into account response
908
          factor-unrelated signals, therefore could further
909
          increase the non-normal component. To overcome this
910
          difficulty, MOID uses a percentile, 
911
          pct 
912
          
913
            f 
914
           , of the integral distribution formed by the 
915
          n 
916
          
917
            k 
918
           fold change numbers, 
919
          F 
920
          
921
            k 
922
           ( 
923
          f ), which satisfied the boundary
924
          conditions: 
925
          F (0) = 0 and 
926
          F (∞) = 1. Empirically, 70% is used
927
          for 
928
          pct 
929
          
930
            f 
931
           as determined by spiking experiments discussed
932
          later. The confidence intervals are also directly taken
933
          from relevant percentiles in the distribution
934
          corresponding to the lower bound and upper bound,
935
          respectively. The effect of the number 
936
          n 
937
          
938
            k 
939
           is taken into account in the same heuristic manner
940
          as we did previously for absolute analysis. The final
941
          formulae are:
942
          
943
        
944
        
945
          MAS4 Algorithm for Absolute Analysis
946
          After a sample is hybridized to probes on a chip, the
947
          chip is scanned and fluorescent signals are collected and
948
          stored in an Affymetrix DAT file. For cell 
949
          i, after filtering out boundary
950
          pixels, 
951
          N 
952
          
953
            i 
954
           numbers of pixels are used to calculate an average
955
          intensity value 
956
          I 
957
          
958
            i 
959
           and standard deviation σ 
960
          
961
            i 
962
           . Hereafter we index a probe set by 
963
          k, which contains 
964
          n 
965
          
966
            k 
967
           probe pairs. The calculated intensity values for the
968
          
969
          j th pair in the 
970
          k th probe set are denoted as 
971
          PM 
972
          
973
            kj 
974
           for PM and 
975
          MM 
976
          
977
            kj 
978
           for MM. Background intensity, 
979
          bg, is derived from the average
980
          intensity of the 2% darkest cells. Noise level, 
981
          Q, is determined as
982
          
983
          where average is done over all background cells.
984
          MAS4 assumes that the cross hybridization component of
985
          a match is captured by its mismatch companion, therefore
986
          the difference, 
987
          PM 
988
          
989
            k 
990
           - 
991
          MM 
992
          
993
            k 
994
           , is used to derive expression levels. MAS4 also
995
          uses a "super-Olympic scoring" procedure to iteratively
996
          filter out outliers, defined as those probe pairs with
997
          difference values more than three times the standard
998
          deviation away from the mean. After filtering, the
999
          average difference value for a probe set, 
1000
          D 
1001
          
1002
            k 
1003
           , is used to represent the expression level of that
1004
          particular gene.
1005
          The significance of gene presence is determined by an
1006
          empirical absolute call decision matrix coupled with
1007
          proprietary MAS4 algorithms, parts of which are described
1008
          in the Microarray Suite documentation. As the result, a
1009
          probe set is marked as either "Present", "Absent", or
1010
          "Marginal".
1011
          MAS4 calculates average difference by
1012
          
1013
          D 
1014
          
1015
            k 
1016
           = < 
1017
          PM 
1018
          
1019
            kj 
1020
           - 
1021
          MM 
1022
          
1023
            kj 
1024
           > 
1025
          
1026
            j 
1027
           ,
1028
          which is essentially
1029
          
1030
          D 
1031
          
1032
            k 
1033
           = < 
1034
          PM 
1035
          
1036
            kj 
1037
           > 
1038
          
1039
            j 
1040
           - < 
1041
          MM 
1042
          
1043
            kj 
1044
           > 
1045
          
1046
            j 
1047
           ,
1048
          where the average is taken over all the probe pairs
1049
          surviving the super-Olympic filtering process mentioned
1050
          above. Based on this observation, the one-one
1051
          correspondences between 
1052
          PM 
1053
          
1054
            kj 
1055
           and 
1056
          MM 
1057
          
1058
            kj 
1059
           , typically with an average correlation coefficient
1060
          around 0.8, are essentially lost during such averaging.
1061
          Since MAS4 is successful in various applications despite
1062
          the fact it averages out probe-specific mismatch signals,
1063
          we hypothesize that contributions of the mismatch probes
1064
          are mainly probe-nonspecific, if the pairwise
1065
          correspondence between 
1066
          PM 
1067
          
1068
            kj 
1069
           and 
1070
          MM 
1071
          
1072
            kj 
1073
           are not enforced. This observation led to the idea
1074
          of only using match probes in the MOID algorithm
1075
          discussed above.
1076
        
1077
        
1078
          MAS4 Algorithm for Normalization
1079
          MAS4 introduces a normalization factor (a scaling
1080
          factor in MAS4 serves the same purpose). The goal of
1081
          normalization is to reduce false positives in the
1082
          expression fold change calculation and ensure
1083
          housekeeping genes are marked as unchanged. Since it is
1084
          usually unknown 
1085
          a priori what genes sustain their
1086
          expression level during the two experiments, different
1087
          heuristic normalization schemes may be adopted and it is
1088
          unclear at this point which particular approach is
1089
          optimal. MAS4 normally derives the normalization factor
1090
          by scaling average 
1091
          D 
1092
          
1093
            k 
1094
           for all genes with expression levels within the
1095
          central 96% to a certain target intensity constant. This
1096
          algorithm is based on the assumption that the total copy
1097
          of mRNA per cell is a conserved number among
1098
          experiments.
1099
        
1100
        
1101
          MAS4 Algorithm for Comparison Analysis
1102
          In comparison of two chips, we follow Affymetrix
1103
          convention of calling the reference chip measurement as
1104
          the base ( 
1105
          b ), the other one as the
1106
          experiment ( 
1107
          e ). After the above normalization
1108
          procedure, the fold change value for gene 
1109
          k, f 
1110
          
1111
            k 
1112
           , is essentially calculated by dividing 
1113
          D 
1114
          
1115
            k 
1116
           ( 
1117
          e ) with 
1118
          D 
1119
          
1120
            k 
1121
           ( 
1122
          b ). MAS4 uses a filtering process
1123
          to ensure that a common subset of "good" probe pairs
1124
          between 
1125
          b and 
1126
          e are used to recalculate 
1127
          D 
1128
          
1129
            k 
1130
           . Very weak expression signals (sometimes negative)
1131
          are rounded to the noise level mentioned before. The
1132
          final formula for 
1133
          f is:
1134
          
1135
          f 
1136
          
1137
            k 
1138
           = [ 
1139
          D 
1140
          
1141
            k 
1142
           ( 
1143
          e ) + max ( 
1144
          Q - 
1145
          D 
1146
          
1147
            k 
1148
           ( 
1149
          b ), 0)] / max ( 
1150
          D 
1151
          
1152
            k 
1153
           ( 
1154
          b ), 
1155
          Q ), if 
1156
          D 
1157
          
1158
            k 
1159
           ( 
1160
          e ) ≥ 
1161
          D 
1162
          
1163
            k 
1164
           ( 
1165
          b ),
1166
          or
1167
          
1168
          f 
1169
          
1170
            k 
1171
           = max ( 
1172
          D 
1173
          
1174
            k 
1175
           ( 
1176
          e ), 
1177
          Q ) / [ 
1178
          D 
1179
          
1180
            k 
1181
           ( 
1182
          b ) + max ( 
1183
          Q - 
1184
          D 
1185
          
1186
            k 
1187
           ( 
1188
          e ), 0)], otherwise.
1189
          An empirical difference call decision matrix and a
1190
          proprietary MAS4 algorithm, partly described in the
1191
          Microarray Suite 4.0 documentation, determine the
1192
          significance of the fold change. As a result, the probe
1193
          set is marked as "Increase", "No Change", "Decrease",
1194
          "Marginally Increase", or "Marginally Decrease". It is
1195
          obvious that the fold change number is only meaningful if
1196
          the probe set is clearly present at a level above the
1197
          noise in both data sets.
1198
        
1199
      
1200
      
1201
        List of Abbreviations
1202
        MOID - match-only integral distribution
1203
        MAS4 - Affymetrix algorithms implemented in its
1204
        Microarray Suit 4.0 software
1205
        PM - perfect match
1206
        DMM - mismatch
1207
      
1208
    
1209
  
1210

1211
Product

Resources

Company