Background
Alizadeh et al. [1] carried out a large-scale, long-term study of diffuse large B-cell lymphoma (DLBCL) using microarray chips. By performing cluster analysis on these data, they were able to diagnose 96 donors with an accuracy of 93% for this specific lymphoma; however, they were not able to predict which individual patients would survive to the end of the long-term study. The International Prognostic Index for this disease was incorrect for 30% of these patients.
Cluster analysis, together with other statistical methods for identifying and correlating minimal gene lists with outcome, has become established as the primary tool for the analysis of microarray data in cancer studies. We wished to test a different approach: artificial neural networks (ANN).
These two approaches to the analysis of microarray data differ substantially in their mode of operation. In the first, clustering, as applied in numerous recent cancer studies, is an unsupervised mapping of the input data examples based on the overall pairwise similarity of those examples to each other (here, similarity with respect to the expression levels of thousands of genes); the method is unsupervised in that no information about the desired outcome is provided. Subsequent analysis of the clusters in these studies generally attempts to reduce the gene set to the subset of genes that are most informative for the problem at hand. This is a supervised step, since there is an explicit effort to find correlations in the pattern of gene expression that match the classification one is attempting to make among the input examples (see Discussion for specific examples). The input for this supervised step is thus the product of an unsupervised step. As this subselection is not routinely subjected to an independent test using input examples originally withheld from the subselection process, it is generally not possible to judge how specifically the subselection choices relate to this particular set of examples as opposed to the general population of potential examples. To the extent that the gene set employed is much larger than the gene set that really determines the classification, it is possible that much of the clustering result will be based on irrelevant similarities.
Backpropagation neural networks, on the other hand, are a supervised learning method with an excellent reputation for classification problems. During the training phase, the ANN are supplied with both the input data and the answer and are specifically tasked to make the classification of interest, given a training set of examples from all classes. That is, the ANN are constantly checking whether they have gotten the 'correct' answer, the answer being the actual classification, not just the overall similarity of the inputs.
Networks accomplish this by continually adjusting their internal weighted connections to reduce the observed error in matching input to output. When the network has achieved a solution that correctly identifies all training examples, the weights are fixed; it is then tested on input examples that were not part of the training set to see if the solution is a general one. It is only in this independent test that the quality of the network is judged.
Investigators are not limited to a single network. It is feasible to train a series of networks using, say, 90% of the examples for training and holding back 10% for testing. A different 10% can be tested in a second network, and so on. In this way, with the training of ten networks, each input appears in a test set exactly once and can, therefore, be independently evaluated. The data presented below, with the exception of a few cases, are the output of ten slightly different trained networks, operating in test mode, which collectively evaluate the entire donor pool. This 'round-robin' procedure was employed, in duplicate, in every trial described throughout this work. The fact that one ends up with 10 networks is not an impediment to analysis, since any future examples could be submitted to all 10 networks for evaluation, with a majority poll deciding the classification. That is, six networks in agreement on a particular input datum would determine the classification of that input. These networks are, of course, likely to be very similar, in that their training sets differ only slightly.
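This round-robin is essentially what is now called k-fold cross-validation, combined with a majority poll over the resulting networks. The sketch below illustrates the scheme in Python; it is an illustration only, using scikit-learn's MLPClassifier rather than the NeuralWorks package employed in this study, and the gene matrix X and survival labels y are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-ins: one encoded expression vector per donor,
# one label per donor (1 = non-survivor, 0 = survivor).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 200)).astype(float)
y = rng.integers(0, 2, size=40)

networks = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Train one network per fold; each donor lands in exactly one test set.
    net = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
    net.fit(X[train_idx], y[train_idx])
    print("fold accuracy:", net.score(X[test_idx], y[test_idx]))
    networks.append(net)

# A future example is submitted to all 10 networks; a majority poll
# (six or more in agreement) decides its classification.
new_example = X[:1]
votes = sum(int(net.predict(new_example)[0]) for net in networks)
print("majority classification:", int(votes >= 6))
```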
A second major advantage of backpropagation networks follows from the first. Not only are neural networks trained to the specific question, rather than a loose derivative of that question, and tested for generality, but they can also be asked for a quantitative assessment of how they got the correct answer. Numerical partial differentiation of the network with respect to a given test input example [2,3] allows one to see the network's evaluation of the relative impact of each gene in arriving at the correct answer for that particular input. Cluster analysis, including the statistical correlations, has no comparably focused means of targeting specific, as opposed to non-specific, similarities. To the extent that this is true, neural networks should be able to identify relatively small gene subsets which will significantly outperform the initial gene sets in classification and which will also significantly outperform the gene subsets suggested by cluster analysis.
Results
Determining patient prognosis from microarray data
Cluster analysis [1,4] had shown that the 4026-gene expression panels for 40 DLBCL patients contained some information relevant to the question of prognosis, but these authors did not attempt to provide survival predictions for individual patients.
We wished to see if the neural network strategy of train, test, differentiate, retrain on the reduced gene set, and retest could produce any useful result with respect to prognosis on an individual basis. The approach would be: (1) use the entire gene set, without preprocessing, to train a network, testing to confirm that it had at least a good fit to the problem; and (2) use the network's definition of the problem, by differentiating the network, to focus on those genes most essential to the classification. These genes would then form the basis for training new networks with, hopefully, improved performance. Over 130 networks were trained for this study. Figure 1 shows a workflow schematic for this study. Table 1 provides a summary overview of the data, including data not shown.
Initially, a network was trained to accept microarray data on the complete panel of 4026 genes from 40 patients. This network had 12078 input neurons carrying a semi-quantitative assessment of each gene, 100 middle-layer neurons, and a single output neuron. The networks were originally designed with 3 input bits per datum: one for sign ('-' = 1) and 2 for the quantitative degree of signal, with 00 being 0 to 0.5, 01 being >0.5 to 1.0, 10 being >1.0 to 2.0, and 11 being >2.0. Thus '011' would indicate a particular gene whose expression, relative to control, was increased at a magnitude >2.
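As an illustration of this coding (a minimal sketch we supply; the study's own conversion program is not reproduced here), a Python encoder might look like the following:

```python
def encode_3bit(value: float) -> str:
    """Encode a log-ratio expression value as a sign bit plus 2 magnitude
    bits, per the coding above: 00 for 0-0.5, 01 for >0.5-1.0,
    10 for >1.0-2.0, 11 for >2.0; a sign bit of 1 means negative."""
    sign = '1' if value < 0 else '0'
    m = abs(value)
    if m <= 0.5:
        bits = '00'
    elif m <= 1.0:
        bits = '01'
    elif m <= 2.0:
        bits = '10'
    else:
        bits = '11'
    return sign + bits

# '011': expression increased at a magnitude > 2, as in the text.
assert encode_3bit(2.5) == '011'
assert encode_3bit(-0.7) == '101'
```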
The training set included 30 donors, with 10 additional donors being held back as test data. The network was trained by processing 12 iterations of the complete training set. The test set, drawn from a mixture of survivors and non-survivors, was then run. The entire process was then repeated with a different choice of test data each time. In this round-robin fashion, all donors serve as test data for one of the networks, and each training set is necessarily slightly different. A round-robin series of 4 networks was generated. Data underlying Figure 5 of the earlier report (http://llmpp.nih.gov/lymphoma/data.shtml) were used for training. The networks were asked to predict, based on the 4026 gene set, which of the 40 DLBCL patients would survive to the end of the study (longest point = 10.8 yrs). Networks initially varied from 1 to 3 errors on 10 test patients each, for a total of 31 of 40 patients correctly predicted (data not shown).^1
However, a trained neural network can be numerically differentiated [2,3] to show the relative dependence of the output (classification) on each active input neuron within an input vector. Briefly stated, the differentiation process involves slightly perturbing the activation (down from 1.0 to 0.85) of each active input neuron, one at a time, to note the specific change in the output value. In that there is one gene for each active node, the largest change in the output points to the most influential gene.
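A minimal sketch of this perturbation procedure, assuming a trained network exposed through a generic predict function mapping an input vector to the output activation (the function and variable names are illustrative, not from the software of [2]):

```python
import numpy as np

def rank_inputs_by_influence(predict, x, perturbed_activation=0.85):
    """Perturb each active input neuron from 1.0 down to 0.85, one at a
    time, and record the change in the network's output; the largest
    change points to the most influential gene."""
    baseline = predict(x)
    deltas = np.zeros(len(x))
    for i in np.flatnonzero(x == 1.0):   # only active input neurons
        x_pert = x.copy()
        x_pert[i] = perturbed_activation
        deltas[i] = abs(predict(x_pert) - baseline)
    return np.argsort(deltas)[::-1]      # indices, most influential first
```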
We then trained qualitative networks, with 2 bits per gene, on the 4026 gene set in order to differentiate them ('10' for expression greater than or equal to the control, '01' for less than the control). These networks had 67 middle-layer neurons. This coding has the effect that there is an active neuron for each gene in the set regardless of expression level, and the total number of active input neurons is constant from input to input. By taking the top 25% of genes in each of 12 differentiations and requiring agreement of at least 4 of the 12 patients in choosing each gene, we obtained a set of 34 genes. (These cutoff criteria are necessarily arbitrary and are justified only by the subsequent proof that they produced gene subsets having the desired information.)
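A sketch of this voting criterion, assuming one influence ranking per differentiated patient as produced by the perturbation sketch above (names are illustrative):

```python
import numpy as np

def select_genes_by_vote(rankings, top_fraction=0.25, min_votes=4):
    """Given one influence ranking per differentiated patient (most
    influential gene first), keep genes falling in the top fraction of
    at least min_votes of the rankings."""
    n_genes = rankings.shape[1]
    cutoff = int(n_genes * top_fraction)
    votes = np.zeros(n_genes, dtype=int)
    for ranking in rankings:
        votes[ranking[:cutoff]] += 1
    return np.flatnonzero(votes >= min_votes)
```

The analogous criterion used later for diagnosis (a contribution of at least 10% of the maximum, agreed on by 3 or more donors) can be expressed the same way, with a threshold on the influence values in place of a rank cutoff.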
A round-robin series of 10 networks, with 4 test donors each, produced a single error (DLCL0018) in survival predictions when trained on these 34 genes (data not shown).^1 A second round-robin training with the same gene set produced no errors, correctly evaluating all 40 patients in a series of 10 test sets (Table 2).
For a second study, we took 20 patients and held them in reserve to model information from a "follow-up" study. Twenty networks were trained, on the 34 gene set, using the remaining 20 patients; each had 19 patients in the training set and 1 in the test set. Collectively, these networks made no errors in the prognosis of 20 patients. The data for the 20 reserve patients were then tested on all 20 trained networks to emulate follow-up data. Out of 400 individual scores, there were 5 errors, distributed over 2 patients. A poll of the 20 networks therefore produced no errors by majority, correctly classifying all 20 members of the follow-up group (data not shown).
The 34 genes are given in Table 3. In 5 of 12 cases, the gene chosen as most influential in determining the correct prognosis was 18593, a tyrosine kinase receptor gene. While this gene set may not be the absolute best possible, it clearly does contain sufficient information for error-free predictions on these patients. The identification of this gene set will, we hope, eventually lead, through future studies, to a better understanding of the interaction of these genes in this disease.
Diagnosing lymphoma from microarray data
The diagnosis of DLBCL by biopsy is not trivial. Even with gene expression data, clustering techniques produced a misreading of 7 out of 96 donors [1], a result unimproved in those authors' hands by further analysis of reduced gene panels. We wished to see if backpropagation neural networks could do better using the same data set. Figure 2 shows a workflow schematic for this study. Testing over the whole donor set with 4026 genes produced 6 errors in diagnosis (data not shown).
Thus, in the first round, the ANN merely matched cluster analysis. In preparation for differentiation, a network was trained with the same donor sets as the first network above, but coded qualitatively. This network correctly classified the 10 members of the test set (data not shown). The 5 positive donors from the test set were each used, in turn, to differentiate the network. In these cases, the first criterion for selection was broad: the gene had to contribute at least 10% as much as the gene making the maximum contribution to the correct classification; the second criterion was that 3 or more of the donors had to agree on the selection. This produced a subset of 292 genes. The number of genes referenced by a given donor under identical criteria ranged from 45 to 1448. Only 38% of these genes overlapped the 670-gene subset identified by cluster analysis. It was of interest to see if these genes were sufficient for correct classification of the donors. Ten different networks were trained with the 292-gene subset. Three errors (OCI Ly1, DLBCL0009, and tonsil) were produced over 96 donors in 2 separate series (data not shown).
At this point, the neural networks were producing a much-improved diagnosis; it remained to be seen whether the gene set could be further refined. The set of 292 genes was then treated in two different ways: (1) it was arbitrarily split into even and odd halves, with each half being used to train ten new networks; (2) it was used whole to train ten qualitative networks for further differentiation.
Twenty different networks were then trained using a 146-gene (odd- or even-numbered) subset of the 292-gene set, in 2 series of 10. The odd set again produced 3 errors (data not shown). With the even set, a single error was made over 96 donors in ten different test sets, identifying the 'tonsil' inlier from the earlier cluster analysis [1] as positive (Table 4). Ten additional networks were trained on the even set with the same result (data not shown).
The differentiation of the networks trained on the 292-gene set pointed to 8 genes. Given the high accuracy of the even 146-gene set, we also trained networks on this set for differentiation. These pointed to 11 additional genes. In these cases, only genes in the top 20% in influence, chosen in common by at least 25% of the differentiated examples, were considered. Networks trained on these 19 genes produced 2 errors over 96 donors in 10 test sets (Table 5). The 19 genes, using the designations from the initial report, are given in Table 6.
We also wished to test this gene set in the context of a follow-up study. For this purpose, we set aside 50 donors as "follow-up" data, using the remaining 46 donors in the usual training/testing round robin. Eleven networks were trained: 9 with 42 training vectors and 4 test vectors, and 2 with 41 training vectors and 5 test vectors. Collectively, these produced 3 errors over 46 donors, or 93% correct. The follow-up donors were then tested on the 11 networks. A poll of these networks showed a majority vote yielding 1 error, or 98% correct.
Discussion
The rather remarkable conclusion of this analysis is that there is sufficient information in a single gene expression time point of fewer than 5 dozen genes to provide perfect prognosis (out to ten years) and near-perfect diagnosis for this set of donors. Furthermore, neural networks, through a strategy of train and differentiate, bring that information to the fore by progressively focusing on the genes within the larger set which are most responsible for the correct classifications, providing at once a reduction in the noise level and specific donor profiles. This focus on the specific classification problem led to a set of 34 genes for prognosis and a second set of 19 genes for diagnosis. These sets are mutually exclusive. The gene subsets suggested by cluster analysis [1] are not supersets of these sets; the 670-gene set of the initial report captured only 7 of the 19-gene set used for diagnosis, and the 148-gene staging set captured only 2 of the 34-gene set used for prognosis. The 234-gene subset proposed by Hastie et al. [4] for prognosis contains 6 of the 34-gene set. There was no overlap with the 13-gene set identified by Shipp et al. [5] as correlating with their cured/fatal classes for this disease. At first it might seem surprising that the gene subsets identified here do not appear to be subsets of those identified earlier by Alizadeh et al., but this surprise is based on a naive intuition. The fact is that we do not know the level of information redundancy that exists in these large arrays. Apropos of this point, Alon et al. [6] discarded the 1500 genes indicated by cluster analysis as most discriminatory in their study of colon cancer and, upon reclustering, found their diagnosis unimpaired. Likewise, it may be that while the top 10% of relevant genes might be sufficient for perfect classification, so might the next 10%; these sets are by definition mutually exclusive. By extension, it is not difficult to believe that some other large gene set might be able to get 75% of the classifications correct with little or no overlap with those genes in the top 10%.
We have been careful to avoid any claim that the gene sets extracted by this procedure are the "best" gene sets. Only in one, highly qualified, sense can they be said to be best: in classifying this data set, there are no other gene sets which offer a statistically significant improvement in classification accuracy. That is not to say that there may not be other sets which could do as well. Nor is there any implication that these genes are seminal in the etiology of this disease. They may not be necessary, but they are sufficient for this classification. They may not be sufficient for the classification of a much larger patient set; forty patients are unlikely to be fully representative of the general patient population with this disease. It should be noted, however, that the same caveats apply to the analysis of these data by any other method.
There have been a number of additional studies of cancer using microarray data for either prognostic or diagnostic purposes. The following listing includes a brief discussion of 7 of these studies:
(1) Shipp et al. [5] did a study of 58 DLBCL patients and 19 follicular lymphoma (FL) patients. They first sought to classify DLBCL and FL patients. They clustered 6817 genes. Using their own weighted combination of informative gene markers, they picked out 30 genes whose expression levels would be used to do a 2-way classification. They correctly classified 71/77 patients, for a diagnostic accuracy of 92%. They then attempted to develop high-risk and low-risk groups with respect to 5-year prognosis. They used several different methods for associating particular gene clusters with survival outcome: Kaplan-Meier analysis, Support Vector Machine, and K-nearest-neighbor analysis. They selected 13 genes as most informative and achieved their best result with SVM modeling. They did not explicitly state how many patients initially sorted into the high-risk and low-risk groups, but other data suggest 17 and 41, respectively. The only way in which these survival probability plots can be compared to the patient-by-patient predictions presented above is to associate low risk with survival and high risk with non-survival (please note: this equivalence was not asserted by any of the authors, with the exception of (3) below, in discussing risk groups). If one makes this association, their best result is 14/58 errors, for a 5-yr survival accuracy of 76%.
(2) Rosenwald et al. [7] did what they termed a follow-up study on the original Alizadeh et al. study of DLBCL patients. However, it was not really a follow-up study, because a different chip was used for the microarray data. The Alizadeh study had identified 2 groups based on an analysis weighting the gene cluster groups: germinal center B cell-like tumors, which correlated with low risk, and activated B cell-like tumors, which correlated with high risk. If these groups were made survivors and non-survivors, the prognosis accuracy would have been 75%. In the follow-up, the authors found it necessary to introduce a third group, consisting of patients who did not fit either of the previous 2 categories. Although lacking the associated gene profile, this third group had a survival pattern much like that of the activated B cell-like group. The authors used Cox proportional-hazards modeling to assign groups on the basis of the expression of 100 genes. The 5-yr survival was 60% for the low-risk group, 35% for the activated B cell-like group, and 39% for the 3rd group. An improved result was obtained using 16 genes drawn from 4 signature gene groupings, plus a score for BMP6 expression. Kaplan-Meier estimates of survival were determined for 4 quartiles, for which the 5-yr survival rates were 73%, 71%, 34%, and 15%. If these 4 are collapsed into 2 categories of survivor and non-survivor, this would produce 62/240 errors, for a prognosis accuracy of 74%.
(3) van't Veer et al. [8] did a study of 78 patients with breast cancer. Starting with 5000 signature genes, they narrowed the gene pool down to 231 genes by examining the correlation coefficient of each gene with the prognostic outcome. They then rank-ordered these genes and added them 5 at a time to a leave-one-out test of their 77 patients for predicted outcome. This was repeated until an optimum outcome classification was reached, which occurred at 70 genes. A patient-by-patient classification based on the weighting of these 70 genes was able to produce a survival classification with 13/78 errors, for an accuracy of 83%.
(4) Beer et al. [9] used clustering and Cox hazard analysis to generate a list of 50 genes to be used in Kaplan-Meier 5-yr projections of survival. They had 86 patients with lung cancer in the study. With 22 patients originally assigned to the low-risk group and 19 to the high-risk group, the corresponding 5-yr survival rates were 83% and 40%. If treated as survival categories, this would produce 12/41 errors, for a prognosis classification accuracy of 71%. Although these authors had complete 5-yr survival data on 41 of the patients in the study, they at no point attempted to analyze this group specifically for direct comparison with predictions.
(5) Khan et al. [10] used linear neural networks to analyze microarray data from patients with small round blue-cell tumors. They wished to classify the 4 subcategories of this tumor. Principal Component Analysis was used to reduce 2308 genes to 10 components. Neural networks were trained using 2/3 of a 63-patient pool to train and 1/3 to test, in a fully cross-validated fashion. The groups were shuffled 1250 times to produce 3750 networks. These networks correctly classified all 63 patients in a 4-way classification. The networks were analyzed for the most influential inputs to produce a list of 96 genes. New networks were calibrated with just these 96 genes; these again correctly classified the 63 patients and also correctly classified the 25 patients who had been withheld from the whole process.
(6) Dhanasekaran et al. [11] did a study of 60 prostate biopsy samples: 24 non-tumorous, 14 tumor in situ, and 20 metastatic tumor. Cluster analysis of microarray data from nearly 10,000 genes misplaced 2 samples out of 26, for a diagnostic accuracy of 92%. The authors did not state why they limited the clustering result to 26 samples when they had 60. Although they performed additional analyses, these did not involve using the array data for either diagnosis or prognosis.
(7) Golub et al. [12] wished to be able to distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL). Starting with the expression of 6817 genes from 38 patients, they did a 2-class clustering. They then did a neighbor analysis to identify 1100 genes, occurring above chance levels, which related to the AML/ALL distinction. They chose an informative subset of 50 genes to weight for class assignment of the patients. They were able to correctly classify 29/34 patients, for a diagnostic accuracy of 85%. They next attempted to use self-organizing maps (SOM) for 2 classes in place of the initial clustering. This produced only 4/38 errors, for 89% diagnostic accuracy. Drawing a 20-gene predictor from these SOM classes, they again produced 4/38 errors, maintaining 89% accuracy. These authors also attempted to use array data to predict clinical outcome for 15 AML patients, but without success.
The identification of specific genes associated with a particular biological characteristic, such as malignant phenotype, would be useful in many settings. (1) Precise classification and staging of tumors is critical for the selection of the appropriate therapy. At present, classification is accomplished by morphologic, immunohistochemical, and limited biological analyses. Neural net analysis, in the form of specific donor profiles, could provide a fine-structure analysis of tumors, characterizing them by a precise weighting of the genes which they express differentially. (2) At present, only subsets of patients with a given type of tumor respond to therapy. Networks trained to distinguish responders from non-responders would allow a comparison of tumor-expressed genes in responders and non-responders to find those genes most predictive of response. Recently we have used neural networks on the data of Perou et al. [12] for classifying breast tumors as hormonally responsive or non-responsive. Networks that gave a perfect classification with 496 genes pointed to a subset of 12 genes. Retraining on these 12 genes produced no error in classifying 62 tissue samples from their study (unpublished data). We have also analyzed the data of Dhanasekaran et al. [11]. Here the original set of 9984 genes was reduced to 34 genes. Retraining on these 34 genes gave no errors in a 3-way (normal, early tumor, metastatic disease) classification of 53 patients (unpublished data). Given the significant impairment in the quality of life for many patients undergoing chemotherapy and/or radiation therapy, such prospective information would be extremely beneficial. (3) T cell and antibody-mediated immunotherapy may be efficacious approaches for limiting tumor growth in cancer patients. At present there is a paucity of known tumor rejection antigens that can be targeted. Neural net analysis may identify a panel of tumor-encoded genes shared by many patients with the same type of cancer and thereby provide a repertoire of potentially novel tumor rejection antigens. (4) For many patients with autoimmune disease, the target antigen(s) is unknown. Enhanced identification of cell-type-specific markers of the target organ through neural net profiling could identify potential target antigens as candidate molecules for testing and tolerance induction.
Conclusions
We believe neural networks will be an ideal tool to assimilate the vast amount of information contained in microarrays. The artificial networks presented here were not selected from a large number of attempts. The networks described here are the first or second attempts with the data and format stated; the longest training session lasted less than 5 minutes. Indeed, the trained neural network may, in the form of its weight matrix, have the best possible "understanding" of the very broad statement being made in the microarray, a view that is accessible via differentiation of the network. In this study, that viewpoint suggested a small subset of genes which proved sufficient to give a near-perfect classification in each of two problems. This approach should be suitable for any microarray study and, indeed, for other global studies, such as 2-D gels and mass-spec data, which contain sufficient information for training.
Methods
The data from microarray experiments are stored in spreadsheet form, representing the positive or negative level of expression, relative to some control state, of thousands of genes for two or more experimental conditions. A short software program is sufficient to translate these data directly into a binary representation suitable as input vectors for a neural network.
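A minimal sketch of such a conversion program, assuming a hypothetical tab-delimited spreadsheet with genes in rows and donors in columns (the file name and layout are assumptions; the 2-bit qualitative coding it emits is described below):

```python
import csv
import numpy as np

def load_input_vectors(path):
    """Read a tab-delimited expression spreadsheet (genes in rows, donors
    in columns, log-ratio values relative to control) and return one
    2-bit-per-gene binary input vector per donor: '10' for expression at
    or above the control, '01' for below; open fields are set to zero."""
    with open(path, newline='') as f:
        rows = list(csv.reader(f, delimiter='\t'))
    header, data = rows[0], rows[1:]
    donors = header[1:]
    vectors = {d: [] for d in donors}
    for row in data:
        for donor, field in zip(donors, row[1:]):
            value = float(field) if field.strip() else 0.0  # open field -> 0
            vectors[donor].extend([1.0, 0.0] if value >= 0 else [0.0, 1.0])
    return donors, np.array([vectors[d] for d in donors])

# donors, X = load_input_vectors("expression.tsv")  # hypothetical file
```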
The neural network software used throughout this study was NeuralWorks Professional II Plus v5.3. Neural networks were trained on the corresponding data sets, with a fraction of the data, typically 10%, withheld for testing purposes. All open fields in the data array were set to zero. The trained networks were then asked to classify new test data as to donor type. Since the gene expression levels are read directly from the spreadsheet, their order and names are provided by the spreadsheet. Given the large amount of input data, these networks generally converge to a low error level very quickly during training, often in a few minutes or less.
Subsequently, additional networks were trained with a simplified input that contained only qualitative information, in the form of a plus or minus sign characterizing the expression of each gene in the panel. This reduced the input size to 2 bits per gene: 01 for below the control and 10 for above, or equal to, the control. The output neuron was trained to output 1.0 for a positive donor and 0.0 for a negative donor in the diagnostic networks; for the prognostic networks, 1.0 indicated a non-survivor and 0.0 a survivor. The 4026-gene panel network was provided 100 or 67 middle-layer neurons for the 3-bit or 2-bit per gene inputs, respectively. With a very large number of input neurons it is possible to overload the middle-layer neurons, effectively operating them always at one extreme limit or the other; this can have the undesirable effect of reducing their sigmoid transfer function to a step function, with the loss of the network's non-linearity. This is clearly indicated if multiple output values are found to be exactly identical. Networks were trained to an error level below 0.05, after which they were tested with previously unseen data. A possible disadvantage of neural networks, especially with a large input space and a relatively small sample number, is overtraining. In overtraining, a network learns the specifics of each training example as opposed to finding a global solution for the entire training set. This behavior is characterized by a degradation in test scores as training sessions are extended. Although we saw no evidence of this in this study, we did look to see how much additional training would be necessary to degrade the test results in the case of the initial diagnosis networks with 4026 genes. It was not until we doubled the training iterations dictated by the 0.05 output error cutoff that we saw some increased test error. At double the normal training interval, 8 networks were unchanged, but 2 networks showed an increased error of 1. This is suggestive, but not proof, of the onset of overtraining. The networks trained on the reduced 34- or 19-gene sets had 6 or 4 middle-layer neurons, respectively.
To differentiate a trained network with respect to specific inputs, a network was trained on the 4026-gene panel with 2 bits per gene. The 5 positive donors from the test set were each differentiated, using software that we designed for that purpose [2]. The selected genes were then compared among the 5 sets, with genes occurring in 3 or more instances being included in the final subset. This requirement generated a subset of 292 genes from the original 4026 genes. Networks were trained on this 292-gene subset and on two 146-gene subsets representing every other gene from the 292 set. All were coded with 3 bits per gene and employed networks with 25 or 12 middle-layer neurons, respectively. Other networks were trained on the 292-gene set and the 146 'even' set, coded with 2 bits per gene, for subsequent differentiation.
The differentiation of the large-panel networks trained for prognosis arbitrarily employed more selective criteria (see text) for subset determination, with the result that a single differentiation reduced the gene set from 4026 genes to 34 genes. Subsequent networks demonstrated that this was a highly effective selection.
All networks in this study were three-layer backpropagation networks trained with a learning coefficient of 0.3 and a momentum coefficient of 0.4, using the generalized delta learning rule and the standard sigmoidal transfer function. The cutoff, in all cases, between positive and negative scoring was taken to be 0.05 RMS error at the output neuron. No network required more than 4 minutes of training time on a 650 MHz PC; in the majority of cases, the network was fully trained in less than a minute. Training and testing a 10-network round-robin series could generally be done in less than 20 minutes. Training was deliberately kept to a minimum to avoid overtraining. The networks represented here were in each case the first or second attempt for the given problem. There was no "data trolling."
Note
^1 All data not shown can be found at the site http://research.umbc.edu/~moneill/GBMS