Background
Among the many approaches to identifying functional relationships among genes, the use of bibliographic data to group genes that are functionally related has recently attracted great attention. The huge repository of biological literature, which is still growing at a rapid pace, makes it increasingly difficult for any individual to monitor exhaustively the constituent items related to a specific biological process. Therefore, automated data mining systems for biological literature are becoming a necessity.

The availability of biomedical literature in electronic format has made it possible to implement automatic text processing methods to expose implicit relationships among different documents, and more importantly, the functional relationships among the molecules and processes that these documents describe.

Shatkay et al. [1] proposed a method, which we denote as the "kernel document method", and applied it to the identification of functional relationships among yeast genes. Briefly, for each gene, a kernel document is carefully selected to establish a one-to-one correspondence between a gene and a kernel document. A set of "related documents" associated with each kernel document is identified using statistical information retrieval methods. The extent to which the two sets of related documents corresponding to each of a pair of kernel documents overlap reflects the relevance of these two kernel documents, and hence the possible functional relatedness of the corresponding genes.

The utility of this method relies heavily on the quality of the kernel documents. In this context, a good kernel document should focus on the functions of a gene, instead of on other topics such as the methods or techniques used to identify or study the gene. With carefully selected kernel documents, the relatedness of a gene to others can be made reliant on functional rather than, e.g., structural characteristics. For example, if the topic of one kernel document is "studying gene A by method X", and the topic of the other kernel document is "studying gene B by method X", two functionally unrelated genes A and B could be related to one another simply because they have both been studied by method X. Avoiding such "false positives" is a challenge in applying this method. The selection of functionally descriptive kernel documents is, therefore, a key to the success of this algorithm.

In the original kernel document method, all documents that are related to two kernel documents are weighted equally in establishing the qualitative and quantitative aspects of the relationship between these two kernel documents. A better practice is to give each document a weight reflecting the relative uniqueness of this document's relationship to the kernel documents. A document that is related to only a few kernel documents is given a greater weight than one that is related to many kernel documents. This argument can be illustrated with an intuitive example: if you are asked to identify two people from a crowd, it is not very helpful if the only information you are given is that each of the two has a nose. However, if you are told that each of the two has a mole on the forehead, it will not be too difficult to single them out. This is because "having a nose" is a feature common to almost everybody. But the description that each of two people has a mole on the forehead, an uncommon feature, is an important piece of information that can be used to establish a link between the two people.

The kernel document method was initially applied to yeast genes. Intense, relatively long-standing analysis of yeast genetics has resulted in a large number of PubMed entries on these genes. Whether the kernel document method could be applied to other, less abundantly represented genes, such as human genes, was not known. Here we apply this method to human genes and show that it can indeed produce meaningful results.

A potential limitation of the original kernel document method is that only one kernel document is chosen for each gene. Many genes encode multi-functional proteins, and one kernel document might relate only to a certain aspect of the gene's many functions. We addressed this problem by selecting multiple kernel documents for a gene, so that any known function of the gene would be discussed in at least one of these kernel documents.

Jenssen et al. [2] took a different approach. They analyzed the titles and abstracts of MEDLINE records to look for co-occurrence of gene symbols. The results are available at PubGene (http://www.pubgene.org). This approach is based on the assumption that if two gene symbols appear in the same MEDLINE record, the genes are likely to be related. Furthermore, the number of papers in which the pair of genes both appear is used to assess the strength of the relationship between the two genes. Jenssen et al. manually examined 1,000 randomly selected pairs from the network of genes that had been created using this method: the proportion of incorrect (biologically meaningless) pairs was 40% for the low-weight category and 29% for the high-weight category. The main advantage of this method in comparison with the kernel document method is that it avoids the difficulty of selecting an appropriate kernel document. However, this method cannot identify genes that are functionally related but are not mentioned together in any MEDLINE abstract. Such implicit relationships between genes are inherently more interesting in the context of mechanism/pathway discovery by computation.

In this paper, we employ a method that is based upon the kernel document concept, with several enhancements. First, instead of choosing one kernel document for each gene, we employ all of the reference articles cited for each gene symbol in OMIM. Admittedly, not all of these articles are good candidates for kernel documents. However, the reference articles cited under each OMIM entry are a set of documents selected by investigators familiar with the gene and are, therefore, related to the gene in some way. Furthermore, by a simple examination of the titles of the articles for keywords alluding to methods or techniques, many articles that would be likely to constitute false positives in this context are excluded. Second, instead of weighting each related article equally, a weight is calculated for each article that is related to two or more kernel documents. We call these articles "base vector documents", because eventually a kernel document will be represented by a vector whose elements are determined by whether it is related to each base vector document. The more kernel documents a base vector document is related to, the lower its weight.

Methods

Data Preparation

1. Download the list of OMIM genes

The OMIM gene list can be downloaded from NCBI (http://www.ncbi.nlm.nih.gov/Omim/Index/genetable.html). This list is inserted into a relational database table, which consists of only two fields: the symbol of a gene, and the corresponding OMIM identification number (OMIMID). However, due to inconsistencies in gene naming and use conventions, several gene symbols may correspond to the same OMIMID.

2. Download the references cited under each OMIMID

The reference papers listed under each OMIMID are then downloaded. Each distinct reference paper has a unique PubMed identification number (PMID). The titles of all such PubMed papers are also obtained. The data are stored in another table consisting of four fields: OMIMID, PMID, TITLE and KEEP. The first three fields are self-explanatory. KEEP is a flag indicating whether a particular PubMed paper should be treated as a kernel document. As indicated earlier, methodology papers are generally not good candidates for kernel documents. To reduce the number of such false positives, a list of keywords/phrases that include the commonly used methods and techniques is compiled. If the title of a paper includes any of the phrases in the list, the KEEP flag of the paper is turned off (set to zero).

3. Download the related documents

We treat each reference paper whose KEEP flag is on as if it were a kernel document. The documents related to each of these reference papers can be obtained from NCBI (http://www.ncbi.nlm.nih.gov/entrez/utils/pmneighbor.fcgi?pmidfepmid=PMID). A detailed description of the computational methods used by NCBI to identify related documents is available at http://www.ncbi.nim.nih.gov/PubMed/computation.html. The related documents (or neighbors) of a particular paper are listed according to their relevance to the paper. Documents that appear at the top of the list are more similar to the query than those that appear near the bottom. We keep only the PMIDs of the first 100 related documents in the list, and the data are stored in another table consisting of three fields: PMIDK (PMID of the kernel document), PMIDN (PMID of the related document, or neighbor), and RANK, a number from 1 to 100 indicating the position at which a document appears in the list of related documents. For any PMIDK, RANK = 1 if PMIDN = PMIDK, because a document is always most similar to itself.

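The title screening described in step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the phrase list `METHOD_PHRASES` and the function names are hypothetical, since the paper does not publish its keyword list, and a case-insensitive substring match is assumed.

```python
# Hypothetical phrase list for illustration; the paper does not publish its own list.
METHOD_PHRASES = (
    "a method for",
    "a technique for",
    "polymerase chain reaction",
    "in situ hybridization",
)


def keep_flag(title, phrases=METHOD_PHRASES):
    """Return 0 if the title contains any method phrase, otherwise 1."""
    lowered = title.lower()
    return 0 if any(phrase in lowered for phrase in phrases) else 1


def screen_references(rows):
    """Add a KEEP field to (OMIMID, PMID, TITLE) rows."""
    return [(omimid, pmid, title, keep_flag(title)) for omimid, pmid, title in rows]
```

A title such as "The ced-8 gene controls the timing of programmed cell death in C. elegans" keeps its flag on under this list, while a title containing "polymerase chain reaction" has it turned off.
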
Construction of Base Vector Documents

Using the data obtained in the previous section, the base vector documents are defined. These are the documents that are related to at least two kernel documents and are among the 50 top-ranking related documents of any kernel document. The result is inserted into another database table that consists of three fields: (1) PMID, the PubMed identifier of the base vector document; (2) LINKED2, the number of kernel documents of which the specified document is a neighbor; and (3) WEIGHT, which is an indication of the importance of a base vector document in revealing the relevance between two kernel documents. The weight $w_i$ for a base vector document $b_i$ is calculated using the following equation:

$$w_i = \log_2 \frac{N}{n_i}$$

where $n_i$ is the number of kernel documents to which $b_i$ is related (the LINKED2 value) and $N$ is the total number of kernel documents. This weight measurement method is based upon information theory [3], and is similar to the weight measure employed by Wilbur et al. [4] to evaluate the significance of a specific keyword in determining the relatedness of two papers.

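As a rough sketch of this step, the following Python builds the base vector table from (PMIDK, PMIDN, RANK) rows. It assumes the weight takes the form log2(N / n), which reproduces the maximum and minimum weights quoted in the Results (14.53 for n = 2 and 4.66 for n = 1873 with N = 47428). It also assumes that only neighbors ranked within the top 50 count toward LINKED2; the function name and the in-memory layout are hypothetical.

```python
import math
from collections import defaultdict


def base_vector_weights(neighbors, n_kernels, max_rank=50):
    """Return {PMID: (LINKED2, WEIGHT)} for the base vector documents.

    neighbors: iterable of (pmidk, pmidn, rank) rows.
    n_kernels: total number of kernel documents (N).
    Assumption: only neighbors with rank <= max_rank are counted.
    """
    linked = defaultdict(set)
    for pmidk, pmidn, rank in neighbors:
        if rank <= max_rank:
            linked[pmidn].add(pmidk)
    # A base vector document must be related to at least two kernel documents.
    return {
        pmid: (len(kernels), math.log2(n_kernels / len(kernels)))
        for pmid, kernels in linked.items()
        if len(kernels) >= 2
    }
```

Whether a kernel document's rank-1 match with itself should count toward LINKED2 is not stated in the text; this sketch counts it like any other row.
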
Vector Representation of a Kernel Document

Assuming that there are $M$ base vector documents, $b_1, b_2, \ldots, b_M$, and that the weight of $b_i$ is $w_i$, any kernel document $K$ can be represented by a vector $(k_1, k_2, \ldots, k_M)$, with

$$k_i = \begin{cases} w_i & \text{if } K \text{ is related to } b_i \\ 0 & \text{otherwise} \end{cases}$$

The norm $\|K\|$ of a kernel document $K$, i.e., the length of the corresponding vector, can be calculated as follows:

$$\|K\| = \sqrt{\sum_{i=1}^{M} k_i^2}$$

Calculation of Similarity Scores

The cosine similarity score $S_{ij}$ of any two kernel documents $K_i$ and $K_j$ can now be calculated:

$$S_{ij} = \frac{K_i \cdot K_j}{\|K_i\| \, \|K_j\|}$$

where $K_i = (k_{i1}, k_{i2}, \ldots, k_{iM})$, $K_j = (k_{j1}, k_{j2}, \ldots, k_{jM})$, and

$$K_i \cdot K_j = \sum_{m=1}^{M} k_{im} k_{jm}$$

is the dot product of the two vectors $K_i$ and $K_j$. $S_{ij}$ is between 0 and 1, i.e., $0 \le S_{ij} \le 1$. The closer $S_{ij}$ is to 1, the more similar the two kernel documents $K_i$ and $K_j$ are.

This is the most computationally intensive part of the calculation, and the code is implemented in C. Once the similarity scores for all possible pairs of PMIDs are calculated, the scores are stored in a relational database table, and it is not necessary to recalculate the scores for subsequent queries.

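The authors' implementation is in C and is not shown. Purely as an illustration, the score can be sketched in Python with sparse vectors, assuming each kernel document is stored as a dictionary that maps the PMIDs of its base vector documents to their weights (components equal to zero are simply absent). All names here are hypothetical.

```python
import math


def kernel_vector(neighbor_pmids, weights):
    """Sparse vector {base PMID: weight} for one kernel document."""
    return {p: weights[p] for p in neighbor_pmids if p in weights}


def norm(vec):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine(vec_a, vec_b):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    norm_a, norm_b = norm(vec_a), norm(vec_b)
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Iterate over the shorter vector; only shared base documents contribute.
    small, large = sorted((vec_a, vec_b), key=len)
    dot = sum(w * large[p] for p, w in small.items() if p in large)
    return dot / (norm_a * norm_b)
```

Because most kernel documents share no base vector document, a sparse representation skips the large majority of pairs whose score is zero.
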
Gene Relationship

The score $S_{ij}$ calculated for two kernel documents $K_i$ and $K_j$ does not directly reflect the relevance of two genes. To assess the functional relationship between two genes, gene symbols must be related to PMIDs.

In order to identify the set of genes that are relevant to a query gene G, the PMIDs of all reference papers listed under the OMIMID for the query gene are obtained. Each of these reference papers, except any paper whose KEEP flag is turned off, is treated as a kernel document.

There are several considerations that support this approach to selection of kernel documents:

• The reference papers listed under each OMIMID were selected specifically because of their relevance to the corresponding gene;
• The titles of these papers were screened to exclude those that describe commonly used methods or techniques in order to reduce the number of "false positives";
• The process can be fully automated to avoid manually selecting kernel documents.

An interface is provided to allow the user to "fine-tune" the query by manually selecting only some of the reference papers as kernel documents.

Next, all documents (represented by their PMIDs) that are related to each kernel document with a score higher than a specified threshold are identified. The OMIMIDs that have cited papers with any of these PMIDs are collected. Finally, these OMIMIDs are connected to their respective gene symbols. The entire process is shown in Figure 1.

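The lookup described above can be sketched as follows. This is a minimal in-memory illustration, not the authors' database implementation: the function name, the dictionary layouts, and the choice to exclude the query gene's own OMIM entry from the result are all assumptions.

```python
def related_genes(query_symbol, symbol_to_omim, omim_refs, keep, scores, cutoff):
    """Return gene symbols linked to the query gene through document similarity.

    symbol_to_omim: {gene symbol: OMIMID}
    omim_refs:      {OMIMID: list of reference PMIDs}
    keep:           {PMID: 1 or 0}, the KEEP flag
    scores:         {(PMID, PMID): cosine similarity score}
    """
    query_omim = symbol_to_omim[query_symbol]
    # Kernel documents: the query gene's references whose KEEP flag is on.
    kernels = {p for p in omim_refs[query_omim] if keep.get(p, 0)}
    # Documents scoring above the cutoff against any kernel document.
    hits = set()
    for (a, b), score in scores.items():
        if score > cutoff:
            if a in kernels:
                hits.add(b)
            if b in kernels:
                hits.add(a)
    # OMIM entries that cite any hit, excluding the query gene itself.
    hit_omims = {o for o, pmids in omim_refs.items()
                 if o != query_omim and hits.intersection(pmids)}
    # Map the OMIMIDs back to every gene symbol (aliases included).
    return sorted(s for s, o in symbol_to_omim.items() if o in hit_omims)
```

Passing a manually chosen subset of references through `keep` corresponds to the "fine-tuned" query in which the user selects the kernel documents.
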
User Interface

A user interface is available at http://gene.cpmc.columbia.edu/cgi-bin/gene.cgi. Once the gene symbol and a cutoff score (i.e., the cosine similarity score between two kernel documents that correspond to respective genes) are entered, a list of reference papers cited in OMIM for the gene is displayed. Only those papers whose KEEP flag is turned on are shown. The user may select specific paper(s) from the list as kernel documents, or simply check the "Check All" box to use all these papers as kernel documents.

Once the submit button is clicked, the genes with scores higher than the cutoff score are displayed.

Results

Summary of Raw Data

When the raw data were downloaded in July 2001, there were 11251 gene symbols in the OMIM gene list, corresponding to 7192 distinct OMIMIDs. Multiple gene symbols may have the same OMIMID because many genes have aliases, resulting in several symbols referring to the same gene.

Among the 7192 distinct OMIMIDs, 7085 cite reference papers with PMIDs, and 107 (about 1.5%) either do not cite any reference paper or cite only reference papers whose PMIDs are not specified in OMIM. In total, 54024 reference papers are listed under the 7085 distinct OMIMIDs. Some papers are referenced under several OMIMIDs; therefore, the actual number of distinct PMIDs is 47428.

The title of the corresponding document for each of these 47428 PMIDs was also obtained. After screening the titles using the method described earlier, the KEEP flags of 3680 documents (about 7.8%) were turned off. Ultimately, only those 43748 documents whose KEEP flags are turned on will be used as kernel documents. However, we initially treat all 47428 documents as kernel documents, allowing us to estimate the extent to which the documents whose KEEP flags are turned off contribute to false positives.

For each of the 47428 distinct PMIDs, the related documents ("neighbors") were obtained from NCBI. As indicated earlier, only the first 100 PMIDs of the list of related documents are stored, because they are the ones most related to the kernel document. The highest-ranking neighbor of any document is, of course, itself. This search resulted in 4629037 pairs of neighbors, a number that would be much larger if all neighbors of a document, instead of only the top 100, were kept.

Summary of Results of Calculation

The preliminary search identified 437382 base vector documents. Each of these documents is a neighbor of at least two kernel documents. On average, a base vector document is related to 9.1 kernel documents. The average weight of the base vector documents is 13.13. The maximum weight is 14.53, which corresponds to those base vector documents that are related to only two kernel documents; the minimum weight is 4.66, which corresponds to a base vector document with 1873 neighbors. As described in the Methods section, the weight of a base vector document indicates how much information is conveyed about the relevance of two kernel documents by knowing that both of them are neighbors of this particular base vector document. The more kernel documents a base vector document is related to, the lower its weight. Figure 2 shows this relationship. For example, a base vector document that is related to 740 kernel documents has a weight of 6, only half of the weight of a document that is related to 12 kernel documents.

Next, the norm of each kernel document is calculated. There are 95 kernel documents with a norm of zero. These documents do not have any neighbor that is one of the base vector documents. As a result, only 47333 kernel documents are left.

Finally, the cosine similarity score of each pair of kernel documents is calculated. A document is treated as a kernel document if its KEEP flag is on and its norm is greater than zero. There are 43658 such documents. Out of the 43658(43658-1)/2 = 952988653 possible pairs, only 6596918 (about 0.7%) have a similarity score that is greater than zero, indicating some relationship between the two kernel documents of the pair. The average score is 0.04. However, if both documents of a pair are listed as references under the same OMIMID, the average score is 0.14, which is much higher than the overall average score. This difference is expected because the documents listed under the same OMIMID have been selected because they all have some relationship to the gene that corresponds to the OMIMID. Furthermore, this average score also provides an indication of the approximate value of the threshold score that should be used to decide whether two kernel documents are closely related.

Documents that discuss methods or techniques are, by design, not included when the similarity scores are calculated, because these documents can lead to false positives, i.e., pairs of genes with a high score that are functionally unrelated. To investigate the impact of such documents, we intentionally included them in a calculation of the scores. Excluding these documents when responding to a query is straightforward: one needs only to check the KEEP flag of a document. The average similarity score of any pair in which both documents have a turned-off KEEP flag is 0.11, much higher than the overall average score of 0.04 and close to the average score of 0.14 for pairs of documents referenced by the same OMIMID. This result indicates that these documents should be excluded from calculations designed to find functional relationships.

Although documents that are likely to cause false positives have been excluded by the automated screening process described above, the screened set of documents may still include many that are not optimal kernel document candidates. A solution to this is to let the users select specific kernel documents from a list of documents.

An Example

As an illustration, we use this computational strategy to identify genes related to the apoptosis (programmed cell death) pathway in human. A brief recent review of this pathway has been given by DeFrancesco [9].

To use this strategy, it is necessary to have a gene to start with. This is usually a gene that is known to be associated with the pathway or function of interest. Usually, such a gene is known to the user who submits the query. If necessary, one can also perform a preliminary search of PubMed for the functions or processes of interest in order to obtain the name of a gene to start with.

We start with APAF1, a gene known to be involved in the apoptosis pathway [8]. A cutoff score of 0.2 is employed, and all reference papers cited in OMIM for this gene are used as kernel documents. The analysis identified the list of related genes displayed in Table 1.

CASP1, CASP2 and CASP3 all belong to the family of apoptosis-related cysteine proteases. Caspase activation is a key regulatory step for apoptosis [10, 11].

DIABLO, also known as SMAC (second mitochondria-derived activator of caspase), promotes caspase activation in a cytochrome c-APAF1-CASP9 pathway [5].

The identification of XK and ABC3 is more interesting, because they are not well recognized as components of the apoptosis pathway. In order to identify the process by which XK was included, we retrace the search path to find the two original kernel documents that related APAF1 to XK. They are: "Apaf-1, a human protein homologous to C. elegans CED-4, participates in cytochrome c-dependent activation of caspase-3" (PMID: 9267021), a paper linked to APAF1; and "The ced-8 gene controls the timing of programmed cell death in C. elegans" (PMID: 10882128), a paper linked to XK. XK is a Kell blood group precursor. Stanfield et al. [6] noted that the 458-amino acid CED8 transmembrane protein of C. elegans is weakly similar to the human XK protein. The CED8 and XK proteins share 19% amino acid identity, have similar hydropathy plots, and both contain 10 hydrophobic predicted membrane-spanning segments. CED8 functions downstream of, or in parallel to, the regulatory cell death gene CED9, and may function as a cell death effector downstream of the caspase encoded by the programmed cell death gene CED3. APAF1 is known to share amino acid similarity with CED3 and CED4, a protein that is believed to initiate apoptosis in C. elegans.

The gene ABC3 (ABC Transporter 3) is linked to APAF1 in a manner similar to that which connects XK to APAF1. It is reported that the CED7 protein has sequence similarity to ABC transporters. CED7 functions in the engulfment of cell remnants during programmed cell death [7].

There is evidence that BCL2 is a homolog of CED9: CED9 encodes a 280 amino acid protein showing sequence and structural similarity to BCL2 [12]. BCL2 is involved in programmed cell death [9].

A secondary search can be performed with each of the genes in Table 1. Usually, more stringent criteria are required for secondary searches because the genes used for secondary queries often have other functions not related to the one of interest. Kernel documents need to be selected more carefully, and a higher cutoff score might be used.

For example, for XK, if all papers cited in OMIM for the gene are used as kernel documents, there are many high-score hits that do not seem to be directly linked to apoptosis. Among the kernel document candidates for XK, the title of only one of the papers mentions programmed cell death. The majority of the papers discuss McLeod syndrome, which is associated with XK but has no recognized relationship with apoptosis.

Therefore, further inspection is necessary to determine whether these hits are really linked to the apoptosis pathway. To simplify the process and obtain better results, instead of using all reference papers cited in OMIM for each of these genes, we manually select kernel documents from the list of OMIM reference papers for these secondary searches, using the interface described earlier. For example, from a list of more than 20 papers cited for XK, we choose only one paper, titled "The ced-8 gene controls the timing of programmed cell death in C. elegans".

With the results of the initial and secondary searches, a network of genes nominally associated with apoptosis can be built. The network is shown in Figure 3.

If necessary, further searches can be performed with the hits from a previous search, so that the network can be expanded to include more genes.

Discussion
The similarity score is the only criterion used to determine whether two documents are related. Any two documents with a similarity score above the cutoff score are considered to be related.

Here we discuss how the cutoff score should be determined. To this end, it is necessary to investigate how the distribution of similarity scores differs between related and unrelated document pairs.

To simplify the problem, we assume that any two documents that are listed as references under the same OMIMID are related, and that the distribution of scores between such documents approximates the distribution of scores between related documents.

For any two documents that are not listed under the same OMIMID, it is reasonable to assume that they are unrelated, because the vast majority of such documents are, in fact, unrelated. Therefore, we assign the score distribution for unrelated documents to such pairs. It should be emphasized that this assumption is an approximation. Indeed, the most interesting documents are those that are not listed under the same OMIMID, but are found through analysis to be related. However, this assumption makes finding the distribution of similarity scores among unrelated documents much easier.

Table 2 is a summary of the score distributions of related and unrelated document pairs. Note that for unrelated documents, 75% of the scores are less than 0.03087, while for related documents, only 25% of the scores are less than 0.03027.

The probability $P(S > S_{\text{cutoff}})$ of the score $S$ being greater than a cutoff score, $S_{\text{cutoff}}$, can be easily found:

$$P(S > S_{\text{cutoff}}) = 1 - \frac{N(S \le S_{\text{cutoff}})}{N}$$

where $N(S \le S_{\text{cutoff}})$ is the number of document pairs whose similarity score is not greater than the cutoff score, and $N$ is the total number of document pairs.

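This probability is an empirical tail fraction and can be sketched as below, assuming the scores of one group of pairs (related or unrelated) are held in an ascending sorted list. The function name is hypothetical.

```python
from bisect import bisect_right


def prob_above(scores_sorted, cutoff):
    """P(S > cutoff) = 1 - N(S <= cutoff) / N for an ascending sorted list."""
    n_total = len(scores_sorted)
    # bisect_right counts the scores that are <= cutoff.
    n_not_greater = bisect_right(scores_sorted, cutoff)
    return 1.0 - n_not_greater / n_total
```

Sorting once and using a binary search keeps each cutoff lookup cheap, which helps when a whole curve of cutoff values is evaluated.
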
$P(S > S_{\text{cutoff}})$ was calculated separately for those pairs in which both documents were listed under the same OMIMID, i.e., the "related documents" according to the assumption above, and for those pairs in which the two documents were not listed under the same OMIMID, i.e., the "unrelated documents" according to our definitions. The results are shown in Figure 4. The solid curve is the probability $P(S > S_{\text{cutoff}})$ for related document pairs (true positives); the dotted curve is the probability $P(S > S_{\text{cutoff}})$ for unrelated document pairs (false positives).

Using a cutoff score of 0.05, about 61% of the related documents will be accepted; these documents are true positives. About 39% of the related documents will be rejected; these are the false negatives. Only 14% of the unrelated documents will be accepted; these are the false positives. And 86% of the unrelated documents will be rejected; these are the true negatives.

Based on these results, the sensitivity and specificity of the search can be calculated. The sensitivity is the proportion of related document pairs that are above the cutoff score, and therefore are accepted. The solid curve in Figure 4 is therefore also the sensitivity curve. The specificity is the proportion of unrelated document pairs that are below the cutoff score, and therefore are rejected. Specificity is equal to $1 - P(S > S_{\text{cutoff}})$, where $P(S > S_{\text{cutoff}})$ is the proportion of unrelated document pairs that are above the cutoff score $S_{\text{cutoff}}$. In Figure 4, the dashed curve is the specificity curve.

Figure 4 can be used to determine what cutoff score to use for any specific purpose. For example, using a high cutoff score such as 0.2, the specificity will be 0.985, corresponding to a false positive rate of only 1.5%. However, the corresponding sensitivity is 0.248, so that about three quarters of the related documents will also be rejected. On the other hand, choosing a low cutoff score will result in many false positives, while ensuring that most related documents are accepted. Using a cutoff score of 0.03, both the sensitivity and specificity will be around 0.75. However, because there are often many more unrelated documents than related documents, the search result will still contain many false positives. By referring to Figure 4, users can select a cutoff score that is best suited to their needs.

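The effect of class imbalance noted above can be made concrete with a short sketch. The function names are hypothetical, and the pair counts in the usage note are invented purely for illustration; they are not taken from the data.

```python
def sensitivity_specificity(related, unrelated, cutoff):
    """Sensitivity and specificity at a cutoff, from two lists of scores."""
    true_pos = sum(s > cutoff for s in related)
    false_pos = sum(s > cutoff for s in unrelated)
    return true_pos / len(related), 1 - false_pos / len(unrelated)


def precision(sens, spec, n_related, n_unrelated):
    """Fraction of accepted pairs that are truly related."""
    true_pos = sens * n_related
    false_pos = (1 - spec) * n_unrelated
    return true_pos / (true_pos + false_pos)
```

With sensitivity and specificity both at 0.75, and assuming (hypothetically) 1,000 related pairs among 100,000 unrelated ones, `precision(0.75, 0.75, 1000, 100000)` is about 0.03: roughly 97% of the accepted pairs would be false positives even though each rate looks acceptable on its own.
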
Conclusions

The key to the success of the kernel document method is the selection of the kernel documents. However, this is also the most difficult and tedious part of the implementation. An efficient way to select the kernel documents related to gene function is necessary for a large-scale literature mining effort using this method. We started with all of the reference papers listed in OMIM, and applied a filter to exclude those papers that are likely to focus primarily on methods and techniques. We can either treat the rest of the papers as kernel documents, or allow the user to select kernel documents from this small pool of papers (which usually contains around a dozen papers).

This process can be fully automated. Furthermore, since we are not limited to one kernel document per gene, a gene can correspond to multiple kernel documents that capture different aspects of its functions. This characteristic of the strategy makes it possible to identify genes that are related to the query gene through a variety of functional mechanisms.

In contrast to the gene co-occurrence method used by Jenssen et al., this approach does not require the symbols of two genes to appear in the title or abstract of the same paper in order to establish a relationship between them. As long as similar or related functions of the two genes are described in the literature, the relationship between the two genes is likely to be revealed. Furthermore, it is easier to identify the related functions of these genes because they are precisely those functions that relate one gene to another by computation. While the co-occurrence method is biased towards revealing gene relationships that have been explicitly described in the literature, the method we propose is more sensitive to implicit relationships between two genes that have not necessarily been explicitly identified.

The process of selecting kernel documents can also be improved by taking advantage of user feedback in a networked environment. For example, the user can be allowed to select kernel documents from a list of candidate papers. The papers selected most frequently by users can then be used as the basis for subsequent automatic kernel document selection in searches related to a specific gene or pathway.

Finally, it is important to take note of the limitations of literature mining tools: two genes could be found to be related for many reasons, some of which might not be biologically meaningful. The identified relationships could therefore have different biological meanings, if any. Further investigation is always necessary to determine the origin of such relatedness. However, bibliographic data mining efforts such as ours could shed light on the less obvious relationships between two genes. When considered in conjunction with other data, such as gene expression profiles, the results could lead to biologically meaningful conclusions.