        Background
        Among the many approaches to identifying functional
        relationships among genes, the use of bibliographic data to
        group genes that are functionally related has recently
        attracted great attention. The huge repository of
        biological literature, which is still growing at a rapid
        pace, makes it increasingly difficult for any individual to
        monitor exhaustively the constituent items related to a
        specific biological process. Therefore, automated data
        mining systems for biological literature are becoming a
        necessity.
        The availability of biomedical literature in electronic
        format has made it possible to implement automatic text
        processing methods to expose implicit relationships among
        different documents, and more importantly, the functional
        relationships among the molecules and processes that these
        documents describe.
        Shatkay et al. [1] proposed a method, which we denote as
        the "kernel document method", and applied it to the
        identification of functional relationships among yeast
        genes. Briefly, for each gene, a kernel document is
        carefully selected to establish a one-to-one correspondence
        between a gene and a kernel document. A set of "related
        documents" associated with each kernel document is
        identified using statistical information retrieval methods.
        The extent to which the two sets of related documents
        corresponding to each of a pair of kernel documents overlap
        reflects the relevance of these two kernel documents, and
        hence the possible functional relatedness of the
        corresponding genes.
        The utility of this method relies heavily on the quality
        of the kernel documents. In this context, a good kernel
        document should focus on the functions of a gene, instead
        of on other topics such as the methods or techniques used
        to identify or study the gene. With carefully selected
        kernel documents, the relatedness of this gene to others
        can be made reliant on functional rather than, e.g.,
        structural characteristics. For example, if the topic of
        one kernel document is "studying gene A by method X", and
        the topic of the other kernel document is "studying gene B
        by method X", two functionally unrelated genes A and B
        could be related to one another simply because they have
        both been studied by method X. Avoiding such "false
        positives" is a challenge in applying this method. The
        selection of functionally-descriptive kernel documents is,
        therefore, a key to the success of this algorithm.
        In the original kernel document method, all documents
        that are related to two kernel documents are weighted
        equally in establishing the qualitative and quantitative
        aspects of the relationship between these two kernel
        documents. A better practice is to give each document a
        weight reflecting the relative uniqueness of this
        document's relationship to the kernel documents. A document
        that is related to only a few kernel documents is given a
        greater weight than one that is related to many kernel
        documents. This argument can be illustrated with an
        intuitive example: if you are asked to identify two people
        from a crowd, it is not very helpful if the only
        information you are given is that each of the two has a
        nose. However, if you are told that each of the two has a
        mole on the forehead, it will not be too difficult to
        single them out. This is because "having a nose" is a
        feature common to almost everybody. But the description
        that each of two people has a mole on the forehead, an
        uncommon feature, is an important piece of information that
        can be used to establish a link between the two people.
        The kernel document method was initially applied to
        yeast genes. Intense, relatively long-standing analysis of
        yeast genetics has resulted in a large number of PubMed
        entries on these genes. Whether the kernel document method
        could be applied to other less abundantly represented
        genes, such as human genes, was not known. Here we apply
        this method to human genes and show that it can indeed
        produce meaningful results.
        A potential limitation of the original kernel document
        method is that only one kernel document is chosen for each
        gene. Many genes encode multi-functional proteins, and one
        kernel document might relate only to a certain aspect of
        the gene's many functions. We addressed this problem by
        selecting multiple kernel documents for a gene, so that any
        known function of the gene would be discussed in at least
        one of these kernel documents.
        Jenssen et al. [2] took a different approach. They
        analyzed the titles and abstracts of MEDLINE records to
        look for co-occurrence of gene symbols. The results are
        available at PubGene (http://www.pubgene.org). This
        approach is based on the assumption that if two gene
        symbols appear in the same MEDLINE record, the genes are
        likely to be related. Furthermore, the number of papers in
        which the pair of genes both appear is used to assess the
        strength of the relationship between the two genes.
        Jenssen et al. manually examined 1,000 randomly selected
        pairs from the network of genes that had been created
        using this method: the proportion of incorrect
        (biologically meaningless) pairs was 40% for the
        low-weight category and 29% for the high-weight category.
        The main advantage of this method in comparison with the
        kernel document method is that it avoids the difficulty of
        selecting an appropriate kernel document. However, this
        method cannot identify genes that are functionally related
        but are not mentioned together in any MEDLINE abstract.
        Such implicit relationships between genes are inherently
        more interesting in the context of mechanism/pathway
        discovery by computation.
        In this paper, we employ a method that is based upon the
        kernel document concept, with several enhancements. First,
        instead of choosing one kernel document for each gene, we
        employ all of the reference articles cited for each gene
        symbol in OMIM. Admittedly, not all of these articles are
        good candidates for kernel documents. However, the
        reference articles cited under each OMIM entry are a set of
        documents selected by investigators familiar with the gene
        and are, therefore, related to the gene in some way.
        Furthermore, by a simple examination of the titles of the
        articles for keywords alluding to methods or techniques,
        many articles that would be likely to constitute false
        positives in this context are excluded. Second, instead of
        weighting each related article equally, we calculate a
        weight for each article that is related to two or more
        kernel documents. We call these articles "base vector
        documents", because eventually a kernel document will be
        represented by a vector whose elements are determined by
        whether it is related to a base vector document. The more
        kernel documents a base vector document is related to, the
        lower its weight.

        Methods

          Data Preparation

            1. Download the list of OMIM genes
            The OMIM gene list can be downloaded from NCBI
            (http://www.ncbi.nlm.nih.gov/Omim/Index/genetable.html).
            This list is inserted into a relational database table,
            which consists of only two fields: the symbol of a
            gene, and the corresponding OMIM identification number
            (OMIMID). Note that, because of inconsistencies in gene
            naming conventions and usage, several gene symbols may
            correspond to the same OMIMID.

            2. Download the references cited under each OMIMID
            The reference papers listed under each OMIMID are
            then downloaded. Each distinct reference paper has a
            unique PubMed identification number (PMID). The titles
            of all such PubMed papers are also obtained. The data
            are stored in another table consisting of four fields:
            OMIMID, PMID, TITLE and KEEP. The first three fields
            are self-explanatory. KEEP is a flag indicating whether
            a particular PubMed paper should be treated as a kernel
            document. As indicated earlier, methodology papers are
            generally not good candidates for kernel documents. To
            reduce the number of such false positives, a list of
            keywords and phrases naming commonly used methods and
            techniques is compiled. If the title of a paper
            includes any of the phrases in the list, the KEEP flag
            of the paper is turned off (set to zero).
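The title screening in step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the phrase list `METHOD_PHRASES`, the function name `keep_flag`, and the sample rows are hypothetical, since the actual keyword list is not reproduced in the text.

```python
# Hypothetical phrase list; the real list of method/technique keywords is not given.
METHOD_PHRASES = (
    "polymerase chain reaction",
    "in situ hybridization",
    "linkage analysis",
)


def keep_flag(title, phrases=METHOD_PHRASES):
    """Return 0 if the title contains any method phrase (case-insensitive), else 1."""
    lowered = title.lower()
    return 0 if any(phrase in lowered for phrase in phrases) else 1


# Hypothetical (OMIMID, PMID, TITLE) rows, extended with the KEEP field.
rows = [
    (100001, 1111, "Mapping of gene A by in situ hybridization"),
    (100001, 2222, "Gene A regulates programmed cell death"),
]
table = [(omimid, pmid, title, keep_flag(title)) for omimid, pmid, title in rows]
```

A plain substring match like this is deliberately crude; it mirrors the "title includes any of the phrases" rule and would flag some function-focused papers whose titles merely mention a technique.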

            3. Download the related documents
            We treat each reference paper whose KEEP flag is on
            as if it were a kernel document. The documents related
            to each of these reference papers can be obtained from
            NCBI
            (http://www.ncbi.nlm.nih.gov/entrez/utils/pmneighbor.fcgi?pmidfepmid=PMID).
            A detailed description of the computational methods
            used by NCBI to identify related documents is available
            at
            http://www.ncbi.nlm.nih.gov/PubMed/computation.html.
            The related documents (or neighbors) of a particular
            paper are listed according to their relevance to the
            paper. Documents that appear at the top of the list are
            more similar to the query than those that appear near
            the bottom of the list. We keep only the PMIDs of the
            first 100 related documents in the list, and the data
            are stored in another table consisting of three fields:
            PMIDK (PMID of the kernel document), PMIDN (PMID of the
            related document, or neighbor), and RANK, a number
            from 1 to 100 indicating the position at which a
            document appears in the list of related documents. For
            any PMIDK, RANK = 1 when PMIDN = PMIDK, because a
            document is always most similar to itself.

          Construction of Base Vector Documents
          Using the data obtained in the previous section, the
          base vector documents are defined. These are the
          documents that are related to at least two other
          documents and are among the 50 top-ranking related
          documents of any document. The result is inserted into
          another database table that consists of three fields:
          (1) PMID, the PubMed identifier of the base vector
          document; (2) LINKED2, the number of kernel documents of
          which the specified document is a neighbor; and (3)
          WEIGHT, which indicates the importance of a base vector
          document in revealing the relevance between two kernel
          documents. The weight w_i for a base vector document b_i
          is calculated using the following equation:

          \[ w_i = \log_2 \frac{N}{n_i} \]

          where n_i is the number of kernel documents to which b_i
          is related and N is the total number of kernel
          documents. This weight measurement method is based upon
          information theory [3], and is similar to the weight
          measure employed by Wilbur et al. [4] to evaluate the
          significance of a specific keyword in determining the
          relatedness of two papers.
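As a rough sketch of how the base vector table could be derived from the (PMIDK, PMIDN, RANK) rows, the Python below counts, for each neighbor, the distinct kernel documents that list it within the top 50, keeps those linked to at least two, and assigns a weight. Two points are assumptions rather than statements from the text: the weight is taken to be log2(N / n), a form chosen because it reproduces the weights reported in the Results (for example 14.53 for n = 2 with N = 47428), and a kernel document's own rank-1 self-link is counted like any other link. The function name `base_vector_table` is hypothetical.

```python
import math
from collections import defaultdict


def base_vector_table(neighbor_rows, n_kernels, max_rank=50):
    """Build {PMID: (LINKED2, WEIGHT)} from (pmidk, pmidn, rank) rows.

    Assumes WEIGHT = log2(n_kernels / LINKED2); self-links are counted.
    """
    linked = defaultdict(set)
    for pmidk, pmidn, rank in neighbor_rows:
        if rank <= max_rank:  # only the top-ranking neighbors qualify
            linked[pmidn].add(pmidk)

    table = {}
    for pmid, kernels in linked.items():
        n = len(kernels)
        if n >= 2:  # a base vector document links at least two kernel documents
            table[pmid] = (n, math.log2(n_kernels / n))
    return table


# Hypothetical rows: document 10 is a top-50 neighbor of kernels 1 and 2 only.
rows = [(1, 1, 1), (1, 10, 2), (2, 2, 1), (2, 10, 3), (3, 3, 1), (3, 10, 60), (4, 4, 1)]
print(base_vector_table(rows, n_kernels=4))
```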

          Vector Representation of a Kernel Document
          Assuming that there are M base vector documents, b_1,
          b_2, ..., b_M, and that the weight of b_i is w_i, any
          kernel document K can now be represented by a vector
          (k_1, k_2, ..., k_M), with

          \[ k_i = \begin{cases} w_i & \text{if } K \text{ is related to } b_i \\ 0 & \text{otherwise} \end{cases} \]

          The norm ||K|| of a kernel document K, i.e., the length
          of the corresponding vector, can be calculated as follows:

          \[ \|K\| = \sqrt{\sum_{i=1}^{M} k_i^2} \]
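A minimal Python sketch of this representation follows. It assumes that an element of the vector equals the weight of the base vector document when the kernel document is related to it and zero otherwise, which is how the Background describes the vector; because most elements are zero, the vector is stored sparsely as a dictionary. The names `kernel_vector` and `norm` and the sample weights are hypothetical.

```python
import math


def kernel_vector(neighbor_pmids, weights):
    """Sparse vector {base PMID: weight} for the base vector documents
    that appear among the kernel document's neighbors."""
    return {pmid: weights[pmid] for pmid in neighbor_pmids if pmid in weights}


def norm(vector):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in vector.values()))


# Hypothetical weights for three base vector documents.
weights = {10: 3.0, 11: 4.0, 12: 2.0}
K = kernel_vector([10, 11, 99], weights)  # 99 is not a base vector document
print(K, norm(K))
```

A kernel document none of whose neighbors is a base vector document gets an empty vector and a norm of zero, which is the case the Results section sets aside.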

          Calculation of Similarity Scores
          The cosine similarity score S_ij of any two kernel
          documents K_i and K_j can now be calculated:

          \[ S_{ij} = \frac{K_i \cdot K_j}{\|K_i\| \, \|K_j\|} \]

          where K_i = (k_{i1}, k_{i2}, ..., k_{iM}),
          K_j = (k_{j1}, k_{j2}, ..., k_{jM}), and

          \[ K_i \cdot K_j = \sum_{m=1}^{M} k_{im} k_{jm} \]

          is the dot product of the two vectors K_i and K_j.
          S_ij is between 0 and 1, i.e., 0 ≤ S_ij ≤ 1. The closer
          S_ij is to 1, the more similar the two kernel documents
          K_i and K_j are.
          This is the most computationally intensive part of the
          calculation and the code is implemented in C. Once the
          similarity scores for all possible pairs of PMIDs are
          calculated, the scores are stored in a relational
          database table, and it is not necessary to recalculate
          the scores for subsequent queries.
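The score itself is an ordinary cosine similarity. The authors' implementation is in C and is not shown; the Python below is only an illustrative sketch that assumes each kernel document is held as a sparse dictionary mapping base vector PMIDs to weights. The function name `cosine_similarity` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as {key: weight} dicts."""
    # Iterate over the shorter vector; only shared keys contribute to the dot product.
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a
    dot = sum(w * vec_b[key] for key, w in vec_a.items() if key in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero-norm document is treated as unrelated to everything
    return dot / (norm_a * norm_b)


a = {10: 3.0, 11: 4.0}
b = {10: 3.0, 12: 4.0}
print(cosine_similarity(a, b))
```

Since all weights are non-negative, the score cannot fall below zero, and two kernel documents that share no base vector document score exactly zero.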

          Gene Relationship
          The score S_ij calculated for two kernel documents K_i
          and K_j does not directly reflect the relevance of two
          genes. To assess the functional relationship between two
          genes, gene symbols must be related to PMIDs.
          In order to identify the set of genes that are
          relevant to a query gene G, the PMIDs of all reference
          papers listed under the OMIMID for the query gene are
          obtained. Each of these reference papers, except any
          paper whose KEEP flag is turned off, is treated as a
          kernel document.
          There are several considerations that support this
          approach to selection of kernel documents:
          • The reference papers listed under each OMIMID were
          selected specifically because of their relevance to the
          corresponding gene;
          • The titles of these papers were screened to exclude
          those that describe commonly used methods or techniques
          in order to reduce the number of "false positives";
          • The process can be fully automated to avoid manually
          selecting kernel documents.
          An interface is provided to allow the user to
          "fine-tune" the query by manually selecting only some of
          the reference papers as kernel documents.
          Next, all documents (represented by their PMIDs) that
          are related to each kernel document with a score higher
          than a specified threshold are identified. The OMIMIDs
          that have cited papers with any of these PMIDs are
          collected. Finally, these OMIMIDs are connected to their
          respective gene symbols. The entire process is shown in
          Figure 1.
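The query steps described in this section can be sketched as a chain of lookups. The Python below uses in-memory dictionaries in place of the relational tables; the function name `related_genes`, the data layout, and the sample gene symbols and identifiers are all hypothetical and serve only to illustrate the flow from gene symbol to OMIMID to kernel documents to related documents and back to gene symbols.

```python
def related_genes(query_symbol, symbol_to_omim, omim_refs, keep, scores, cutoff):
    """Return gene symbols linked to the query gene through document similarity.

    symbol_to_omim: {gene symbol: OMIMID}
    omim_refs:      {OMIMID: set of PMIDs cited under it}
    keep:           {PMID: 1 or 0}  (KEEP flag)
    scores:         {frozenset({pmid_a, pmid_b}): similarity score}
    """
    query_omim = symbol_to_omim[query_symbol]
    # Kernel documents: references of the query gene whose KEEP flag is on.
    kernels = {p for p in omim_refs.get(query_omim, set()) if keep.get(p, 0)}

    # Documents related to any kernel document with a score above the cutoff.
    related = set()
    for pair, score in scores.items():
        if score > cutoff and pair & kernels:
            related |= pair - kernels

    # OMIMIDs citing any related document, then their gene symbols.
    hit_omims = {o for o, refs in omim_refs.items()
                 if o != query_omim and refs & related}
    return sorted(s for s, o in symbol_to_omim.items() if o in hit_omims)


symbol_to_omim = {"GENE_A": 1, "GENE_B": 2, "GENE_B_ALIAS": 2, "GENE_C": 3}
omim_refs = {1: {101, 102}, 2: {201}, 3: {301}}
keep = {101: 1, 102: 0, 201: 1, 301: 1}
scores = {
    frozenset({101, 201}): 0.35,
    frozenset({101, 301}): 0.05,
    frozenset({102, 301}): 0.90,  # ignored: 102 has its KEEP flag off
}
print(related_genes("GENE_A", symbol_to_omim, omim_refs, keep, scores, cutoff=0.2))
```

Scanning every stored pair is fine for a sketch; a database query indexed on the kernel PMIDs would be the natural choice at the scale reported in the Results.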

          User Interface
          A user interface is available at
          http://gene.cpmc.columbia.edu/cgi-bin/gene.cgi. Once the
          gene symbol and a cutoff score (i.e., the cosine
          similarity score between two kernel documents that
          correspond to respective genes) are entered, a list of
          reference papers cited in OMIM for the gene is displayed.
          Only those papers whose KEEP flag is turned on are shown.
          The user may select specific paper(s) from the list as
          kernel documents, or simply check the "Check All" box to
          use all these papers as kernel documents.
          Once the submit button is clicked, the genes with
          scores higher than the cutoff score are displayed.

        Results

          Summary of Raw Data
          At the time when the raw data were downloaded in July
          2001, there were 11251 gene symbols in the OMIM gene
          list, corresponding to 7192 distinct OMIMIDs. Multiple
          gene symbols may have the same OMIMID because many genes
          have aliases, resulting in several symbols referring to
          the same gene.
          Among the 7192 distinct OMIMIDs, 7085 cite reference
          papers with PMIDs, and 107 (about 1.5%) do not cite any
          reference paper, or cite only reference papers whose
          PMIDs are not specified in OMIM. A total of 54024
          reference papers are listed under the 7085 distinct
          OMIMIDs. Some papers are referenced under several
          OMIMIDs; therefore, the actual number of distinct PMIDs
          is 47428.
          The title of the corresponding document for each of
          these 47428 PMIDs is also obtained. After screening the
          titles using the method described earlier, the KEEP flags
          of 3680 documents (about 7.8%) were turned off.
          Ultimately, only those 43748 documents whose KEEP flags
          are turned on will be used as kernel documents. However,
          we initially treat all 47428 documents as kernel
          documents, allowing us to estimate the extent to which
          the documents whose KEEP flags are turned off
          contribute to false positives.
          For each of the 47428 distinct PMIDs, the related
          documents ("neighbors") are obtained from NCBI. As
          indicated earlier, only the first 100 PMIDs of the list
          of related documents are stored, because they are the
          ones most related to the kernel document. The highest
          ranking neighbor of any document is, of course, itself.
          This search resulted in 4629037 pairs of neighbors, a
          number that would be much larger if all neighbors of a
          document, instead of only the top 100, were kept.

          Summary of Results of Calculation
          The preliminary search identified 437382 base vector
          documents. Each of these documents is a neighbor of at
          least two kernel documents. On average, a base vector
          document is related to 9.1 kernel documents. The average
          weight of the base vector documents is 13.13; the
          maximum weight is 14.53, which corresponds to those base
          vector documents that are related to only two kernel
          documents; the minimum weight is 4.66, which corresponds
          to a base vector document with 1873 neighbors. As
          described in the Methods section, the weight of a base
          vector document indicates how much information is
          conveyed about the relevance of two kernel documents by
          knowing that both of them are neighbors of this
          particular base vector document. The more kernel
          documents a base vector document is related to, the lower
          its weight. Figure 2 shows this relationship. For example,
          a base vector document that is related to 740 kernel
          documents has a weight of 6, only half of the weight of a
          document that is related to 12 kernel documents.
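The weights quoted in this paragraph can be checked numerically. The short Python sketch below assumes that the weight has the form log2(N / n) with N = 47428, the number of documents initially treated as kernel documents; both the functional form and the choice of N are inferences from the reported figures, not statements taken from the text. The function name `weight` is hypothetical.

```python
import math

N_KERNELS = 47428  # documents initially treated as kernel documents


def weight(n_linked, n_kernels=N_KERNELS):
    """Assumed weight form: log2 of total kernel documents over linked kernel documents."""
    return math.log2(n_kernels / n_linked)


for n in (2, 12, 740, 1873):
    print(n, round(weight(n), 2))
```

Under this assumption, n = 2 and n = 1873 round to the reported maximum and minimum weights of 14.53 and 4.66, and n = 740 and n = 12 give weights close to 6 and 12 respectively.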
          Next, the norm of each kernel document is calculated.
          There are 95 kernel documents with a norm of zero. These
          documents do not have any neighbor that is one of the
          base vector documents. As a result, only 47333 kernel
          documents are left.
          Finally, the cosine similarity score of each pair of
          kernel documents is calculated. A document is treated as
          a kernel document if its KEEP flag is on and its norm is
          greater than zero. There are 43658 such documents. Out of
          the 43658 × (43658 - 1) / 2 = 952988653 possible pairs,
          only 6596918 (about 0.7%) have a similarity score that is
          greater than zero, indicating some relationship between
          the two kernel documents of the pair. The average score
          is 0.04. However, if both documents of a pair are listed
          as references under the same OMIMID, the average score is
          0.14, which is much higher than the overall average
          score. This difference is expected because the documents
          listed under the same OMIMID have been selected because
          they all have some relationship to the gene that
          corresponds to the OMIMID. Furthermore, this average
          score also provides an indication of the approximate
          value of the threshold score that should be used to
          decide whether two kernel documents are closely
          related.
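The pair count and the fraction of non-zero scores quoted above follow from simple arithmetic, reproduced here as a short Python check. The variable names are hypothetical; the figures are the ones given in the text.

```python
n_kernels = 43658          # kernel documents with KEEP on and non-zero norm
nonzero_pairs = 6596918    # pairs with a similarity score greater than zero

# Number of unordered pairs of distinct kernel documents: n * (n - 1) / 2.
total_pairs = n_kernels * (n_kernels - 1) // 2
fraction_nonzero = nonzero_pairs / total_pairs

print(total_pairs, f"{fraction_nonzero:.2%}")
```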
          Documents that discuss methods or techniques are
          normally not included when the similarity scores are
          calculated, because these documents can lead to false
          positives, i.e., pairs of genes with a high score that
          are functionally unrelated. To investigate the impact of
          such documents, however, we intentionally included them
          in the calculation of the scores. Excluding these
          documents when responding to a query is straightforward:
          one needs only to check the KEEP flag of a document. The
          average similarity score of any pair in which both
          documents have a turned-off KEEP flag is 0.11, much
          higher than the overall average score of 0.04 and close
          to the average score of 0.14 for pairs of documents
          referenced by the same OMIMID. This result indicates
          that these documents should be excluded from
          calculations designed to find functional
          relationships.
          Although documents that are likely to cause false
          positives have been excluded by the automated screening
          process described above, the screened set of documents
          may still include many that are not optimal kernel
          document candidates. A solution to this is to let the
          users select specific kernel documents from a list of
          documents.

          An Example
          As an illustration, we use this computational strategy
          to identify genes related to the apoptosis (programmed
          cell death) pathway in humans. A brief recent review of
          this pathway has been given by DeFrancesco [9].
          To use this strategy, it is necessary to have a gene
          to start with. This is usually a gene that is known to be
          associated with the pathway or function of interest.
          Usually, such a gene is known to the user who submits the
          query. If necessary, one can also perform a preliminary
          search of PubMed for the functions or processes of
          interest in order to obtain the name of a gene to start
          with.
          We start with APAF1, a gene known to be involved in
          the apoptosis pathway [8]. A cutoff score of 0.2 is
          employed, and all reference papers cited in OMIM for this
          gene are used as kernel documents. The analysis
          identified the list of related genes displayed in Table
          1.
          CASP1, CASP2 and CASP3 all belong to the family of
          apoptosis-related cysteine proteases. Caspase activation
          is a key regulatory step for apoptosis [10, 11].
          DIABLO, also known as SMAC (second
          mitochondria-derived activator of caspase), promotes
          caspase activation in a cytochrome c-APAF1-CASP9 pathway
          [5].
          The identification of XK and ABC3 is more interesting,
          because they are not well recognized as components of the
          apoptosis pathway. In order to identify the process by
          which XK was included, we retrace the search path to find
          the two original kernel documents that related APAF1 to
          XK. They are: "Apaf-1, a human protein homologous to C.
          elegans CED-4, participates in cytochrome c-dependent
          activation of caspase-3" (PMID: 9267021), a paper linked
          to APAF1; and "The ced-8 gene controls the timing of
          programmed cell death in C. elegans" (PMID: 10882128), a
          paper linked to XK. XK is a Kell blood group precursor.
          Stanfield et al. [6] noted that the 458-amino acid CED8
          transmembrane protein of C. elegans is weakly
          similar to the human XK protein. The CED8 and XK proteins
          share 19% amino acid identity, have similar hydropathy
          plots, and both contain 10 hydrophobic predicted
          membrane-spanning segments. CED8 functions downstream of,
          or in parallel to, the regulatory cell death gene CED9,
          and may function as a cell death effector downstream of
          the caspase encoded by the programmed cell death gene
          CED3. APAF1 is known to share amino acid similarity with
          CED3 and CED4, a protein that is believed to initiate
          apoptosis in C. elegans.
          The gene ABC3 (ABC Transporter 3) is linked to APAF1
          in a manner similar to that which connects XK to APAF1.
          It is reported that the CED7 protein has sequence
          similarity to ABC transporters. CED7 functions in the
          engulfment of cell remnants during programmed cell death
          [7].
          There is evidence that BCL2 is a homolog of CED9:
          CED9 encodes a 280 amino acid protein showing sequence
          and structural similarity to BCL2 [12]. BCL2 is
          involved in programmed cell death [9].
          A secondary search can be performed with each of the
          genes in Table 1. Usually, more stringent criteria are
          required for secondary searches because the genes used
          for secondary queries often have other functions not
          related to the one of interest. Kernel documents need to
          be selected more carefully, and a higher cutoff score
          might be used.
          For example, for XK, if all papers cited in OMIM for
          the particular gene are used as kernel documents, there
          are many high-score hits that do not seem to be directly
          linked to apoptosis. Among the kernel document candidates
          for XK, the title of only one of the papers mentions
          programmed cell death. The majority of the papers discuss
          McLeod syndrome, which is associated with XK, but has no
          recognized relationship with apoptosis.
          Therefore, further inspection is necessary to
          determine whether these hits are really linked to the
          apoptosis pathway. To simplify the process and obtain
          better results, instead of using all reference papers
          cited in OMIM for each of these genes, we manually select
          kernel documents from the list of OMIM reference papers
          for these secondary searches, using the interface
          described before. For example, in a list of more than 20
          papers cited for XK, we choose only one paper, titled
          "The ced-8 gene controls the timing of programmed cell
          death in C. elegans".
          With the results of the initial and secondary
          searches, a network of genes nominally associated with
          apoptosis can be built. The network is shown in Figure
          3.
          If necessary, further searches can be performed with
          the hits from a previous search, so that the network can
          be expanded to include more genes.
587
        
588
      
589
      
590
        Discussion
591
        The similarity score is the only criterion used to
592
        determine whether two documents are related. Any two
593
        documents with a similarity score above the cutoff score
594
        are considered to be related.
595
        Here we discuss how the cutoff score should be
596
        determined. To this end, it is necessary to investigate how
597
        the distribution of similarity scores differs between
598
        related and unrelated document pairs.
599
        To simplify the problem, we assume that any two
600
        documents that are listed as references under the same
601
        OMIMID are related, and that the distribution between such
602
        documents approximates the distribution between two related
603
        documents.
        For any two documents that are not listed under the same
        OMIMID, it is reasonable to assume that they are unrelated,
        because the vast majority of such documents are, in fact,
        unrelated. Therefore, we take the score distribution of
        such pairs as the distribution for unrelated documents. It
        should be emphasized that this assumption is an
        approximation. Indeed, the most interesting document pairs
        are those that are not listed under the same OMIMID, but are
        found through analysis to be related. However, this
        assumption makes finding the distribution of similarity
        scores among unrelated documents much easier.
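
        The labelling rule implied by these two assumptions can be
        sketched as follows. This is a minimal illustration, not our
        actual implementation: the function name, the input mapping
        `refs_by_omim`, and the document and OMIM identifiers are all
        hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

Pair = FrozenSet[str]


def label_pairs(refs_by_omim: Dict[str, Set[str]]) -> Tuple[Set[Pair], Set[Pair]]:
    """Split all document pairs into 'related' and 'unrelated'.

    A pair is labelled related if both documents are listed under at
    least one common OMIM entry; every other pair is labelled unrelated
    (an approximation, as noted in the text).
    """
    all_docs = sorted(set().union(*refs_by_omim.values()))
    related: Set[Pair] = set()
    for docs in refs_by_omim.values():
        for a, b in combinations(sorted(docs), 2):
            related.add(frozenset((a, b)))
    every_pair = {frozenset(p) for p in combinations(all_docs, 2)}
    return related, every_pair - related


# Toy reference lists with placeholder identifiers (not real OMIM data).
toy_refs = {
    "OMIM_1": {"doc1", "doc2", "doc3"},
    "OMIM_2": {"doc3", "doc4"},
}
related_pairs, unrelated_pairs = label_pairs(toy_refs)
```

        Note that a document cited under two entries, such as doc3
        here, is related to the references of both entries under
        this rule.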
        Table 2 is a summary of the score distributions of
        related and unrelated document pairs. Note that for
        unrelated documents, 75% of the scores are less than
        0.03087, while for related documents, only 25% of the
        scores are less than 0.03027.
        The probability $P(S > S_{\mathrm{cutoff}})$ of the score $S$
        being greater than a cutoff score, $S_{\mathrm{cutoff}}$, can be
        easily found:
        $$P(S > S_{\mathrm{cutoff}}) = 1 - \frac{N(S \le S_{\mathrm{cutoff}})}{N}$$
        where $N(S \le S_{\mathrm{cutoff}})$ is the number of document
        pairs whose similarity score is not greater than the cutoff
        score, and $N$ is the total number of document pairs
        considered.
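
        As a minimal sketch, this empirical probability can be
        computed directly from a list of similarity scores. The
        function name and the toy scores below are illustrative
        assumptions, not data from this study.

```python
from typing import Sequence


def prob_above_cutoff(scores: Sequence[float], cutoff: float) -> float:
    """Empirical P(S > cutoff) = 1 - N(S <= cutoff) / N."""
    n_total = len(scores)
    if n_total == 0:
        raise ValueError("at least one score is required")
    n_not_above = sum(1 for s in scores if s <= cutoff)
    return 1.0 - n_not_above / n_total


# Toy scores (illustrative only): three of five are at or below 0.05.
toy_scores = [0.01, 0.02, 0.04, 0.06, 0.30]
p_above = prob_above_cutoff(toy_scores, cutoff=0.05)
```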
        $P(S > S_{\mathrm{cutoff}})$ was calculated separately for
        those pairs in which both documents were listed under the
        same OMIMID, i.e., the "related documents" according to the
        assumption above, and for those pairs in which the two
        documents were not listed under the same OMIMID, i.e., the
        "unrelated documents" corresponding to our definitions. The
        results are shown in Figure 4. The solid curve is the
        probability $P(S > S_{\mathrm{cutoff}})$ for related document
        pairs (true positives), the dotted curve is the probability
        $P(S > S_{\mathrm{cutoff}})$ for unrelated document pairs
        (false positives).
        Using a cutoff score of 0.05, about 61% of the related
        documents will be accepted; these documents are true
        positives. About 39% of the related documents will be
        rejected; these are the false negatives. Only 14% of the
        unrelated documents will be accepted; these are the false
        positives. The remaining 86% of the unrelated documents
        will be rejected; these are the true negatives.
        Based on these results, the sensitivity and specificity
        of the search can be calculated. The sensitivity is the
        proportion of related document pairs that are above the
        cutoff score, and therefore are accepted. Thus, the
        solid curve in Figure 4 is also the sensitivity curve. The
        specificity is the proportion of unrelated document pairs
        that are at or below the cutoff score, and therefore are
        rejected. Specificity is equal to
        $1 - P(S > S_{\mathrm{cutoff}})$, where
        $P(S > S_{\mathrm{cutoff}})$ is the proportion of unrelated
        document pairs that are above the cutoff score
        $S_{\mathrm{cutoff}}$. In Figure 4, the dashed curve is the
        specificity curve.
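
        The definitions above translate into a short calculation.
        The sketch below is illustrative only: the function name and
        the toy score lists are assumptions, and the resulting values
        are not those of Figure 4.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    related_scores: Sequence[float],
    unrelated_scores: Sequence[float],
    cutoff: float,
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of accepting pairs with S > cutoff."""
    if not related_scores or not unrelated_scores:
        raise ValueError("both score lists must be non-empty")
    tp = sum(1 for s in related_scores if s > cutoff)    # related, accepted
    fn = len(related_scores) - tp                        # related, rejected
    fp = sum(1 for s in unrelated_scores if s > cutoff)  # unrelated, accepted
    tn = len(unrelated_scores) - fp                      # unrelated, rejected
    return tp / (tp + fn), tn / (tn + fp)


# Toy scores (illustrative only, not taken from Table 2 or Figure 4).
toy_related = [0.02, 0.06, 0.10, 0.25]
toy_unrelated = [0.01, 0.02, 0.03, 0.04, 0.08]
sens, spec = sensitivity_specificity(toy_related, toy_unrelated, cutoff=0.05)
```

        Evaluating this function over a range of cutoff values
        yields curves of the kind plotted in Figure 4.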
        Figure 4 can be used to determine what cutoff score to
        use for any specific purpose. For example, using a high
        cutoff score such as 0.2, the specificity will be 0.985,
        corresponding to a false positive rate of only 1.5%.
        However, the corresponding sensitivity is 0.248, so that
        about three quarters of the related documents will also be
        rejected. On the other hand, choosing a low cutoff score
        will result in many false positives, while ensuring that
        most related documents are accepted. Using a cutoff score
        of 0.03, both the sensitivity and specificity will be
        around 0.75. However, because there are often many more
        unrelated documents than related documents, the search
        result will still contain many false positives. By
        referring to Figure 4, users can select a cutoff score that
        is best suited to their needs.
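
        The effect of this imbalance can be made concrete with a
        short calculation. The sensitivity and specificity of 0.75
        are the approximate values quoted above for a cutoff of
        0.03; the counts of 100 related and 10,000 unrelated
        documents are hypothetical and chosen only for illustration.

```python
def expected_precision(
    sensitivity: float,
    specificity: float,
    n_related: int,
    n_unrelated: int,
) -> float:
    """Expected fraction of accepted documents that are truly related."""
    true_pos = sensitivity * n_related
    false_pos = (1.0 - specificity) * n_unrelated
    if true_pos + false_pos == 0:
        raise ValueError("no documents are accepted at this operating point")
    return true_pos / (true_pos + false_pos)


# Hypothetical collection: 100 related and 10,000 unrelated documents.
precision = expected_precision(0.75, 0.75, n_related=100, n_unrelated=10_000)
# 75 true positives against 2,500 false positives: 75 / 2575, roughly 3%.
```

        Under these assumed counts, false positives outnumber true
        positives by more than thirty to one, even though both
        sensitivity and specificity are 0.75.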

        Conclusions
        The key to the success of the kernel document method is
        the selection of the kernel documents. However, this is
        also the most difficult and tedious part of the
        implementation. An efficient way to select the kernel
        documents related to gene function is necessary for a
        large-scale literature mining effort using this method. We
        started with all of the reference papers listed in OMIM,
        and applied a filter to exclude those papers that are
        likely to focus primarily on methods and techniques. We can
        either treat the rest of the papers as kernel documents, or
        allow the user to select kernel documents from this small
        pool of papers (usually containing around a dozen papers).
        This process can be fully automated. Furthermore, since
        we are not limited to one kernel document per gene, a gene
        can correspond to multiple kernel documents that capture
        different aspects of its functions. This characteristic of
        the strategy makes it possible to identify genes that are
        related to the query gene through a variety of functional
        mechanisms.
        In contrast to the gene co-occurrence method used by
        Jenssen et al, this approach does not require
        the symbols of two genes to appear in the title or abstract
        of the same paper in order to establish a relationship
        between them. As long as similar or related functions of
        the two genes are described in the literature, the
        relationship between the two genes is likely to be
        revealed. Furthermore, it is easier to identify the related
        functions of these genes because they are precisely those
        functions that relate one gene to another by computation.
        While the co-occurrence method is biased towards revealing
        gene relationships that have been explicitly described in
        the literature, the method we propose is more sensitive to
        implicit relationships between two genes that have not
        necessarily been explicitly identified.
        The process of selecting kernel documents can also be
        improved by taking advantage of user feedback in a
        networked environment. For example, the user can be allowed
        to select kernel documents from a list of candidate papers.
        The papers selected most frequently by users can then be
        used as the bases for subsequent automatic kernel document
        selection in searches related to a specific gene or
        pathway.
        Finally, it is important to take note of the limitations
        of literature mining tools: two genes could be found to be
        related for many reasons, some of which might not be
        biologically meaningful. The identified relationships could
        therefore have different biological meanings, if any.
        Further investigation is always necessary to determine the
        origin of such relatedness. However, bibliographic data
        mining efforts such as ours could shed light on the less
        obvious relationships between two genes. When considered in
        conjunction with other data, such as gene expression
        profiles, the results could lead to biologically meaningful
        conclusions.