CoCalc -- 1471-2105-3-28.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-3-28.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Background
6
        
7
          Step 1
8
          Identify the unique symbols in the analyzed sequences.
9
          For a nucleotide sequence the unique symbols would be A,
10
          G, C, and T for the four nucleotides found in DNA
11
          sequences.
12
        
13
        
14
          Step 2
15
          Map each symbol to a unique corner in the unit
16
          hypercube. The dimension of the unit hypercube, 
17
          n , is chosen as the upper integer
18
          of 
19
          log 
20
          2 ( 
21
          uu ) where 
22
          uu is the number of unique symbols.
23
          Therefore 
24
          n has the value of 2 for DNA and 5
25
          for proteins. For each symbol mapped in such a way, let 
26
          u 
27
          s 
28
          
29
            j 
30
           be the 
31
          j 
32
          th coordinate of the corner of the
33
          hypercube to which symbol 
34
          s is mapped. For DNA one possible
35
          mapping is 
36
          u 
37
          A = [0,0], 
38
          u 
39
          C = [0,1], 
40
          u 
41
          G = [1,0], and 
42
          u 
43
          T = [1,1]. There are sparser
44
          implementations of USM that may use values of 
45
          n up to the number of unique units,
46
          
47
          uu , as detailed in the original
48
          proposition [ 4 ] . However, those solutions are not
49
          essentially different and the implementation presented
50
          here can be straightforwardly ported to sparser USM
51
          representations.
52
        
53
        
54
          Step 3
55
          Iteratively generate the forward USM coordinates for
56
          each of the 
57
          k symbols and each of the 
58
          n coordinates as follows
59
          
60
        
61
        
62
          Step 4
63
          Iteratively generate the backward USM coordinates for
64
          each symbol and each coordinate as follows
65
          
66
          The given procedure results in the 2 
67
          n USM coordinates for each of the 
68
          k symbols in the transformed
69
          sequence. The similarity of two sequences at any pair of
70
          symbols can be measured using the distance measure
71
          defined by Almeida and Vinga [ 4 ] . The measure is
72
          defined by
73
          
74
          D = 
75
          d 
76
          
77
            f 
78
           ( 
79
          a 
80
          
81
            i 
82
           
83
          b 
84
          
85
            j 
86
           ) + 
87
          d 
88
          
89
            b 
90
           ( 
91
          a 
92
          
93
            i 
94
           
95
          b 
96
          
97
            j 
98
           )     (3)
99
          where
100
          
101
          Almeida and Vinga present the distance measure as an
102
          estimator of the length of the similar segment containing
103
          the compared symbols. They also note that the measure, as
104
          defined, necessarily overestimates the length of these
105
          segments.
106
          As can be seen in the USM procedure, each symbol in
107
          the sequence is encoded as a set of USM coordinates.
108
          These coordinates are constructed in such a way as to
109
          encode the symbol itself as well as the preceding symbols
110
          (forward coordinates) and following symbols (backward
111
          coordinates). Scale independence is a property of this
112
          representation that allows the complete recovery of the
113
          encompassing sequence (preceding and following symbols)
114
          from the USM coordinates of any symbol in the sequence to
115
          any resolution (any scale) up to the complete length of
116
          the sequence. In practical implementations we are faced
117
          with the limitations of finite word length
118
          representations of USM coordinates. In these
119
          implementations our ability to recover the encompassing
120
          sequence is bounded by the word length of the coordinate
121
          representation. For this reason, we refer to USM and bUSM
122
          implementations with finite precision coordinates as
123
          bounded scale independent representations. In this paper
124
          we consider the implementation of the USM algorithm and
125
          propose a modification to Almeida and Vinga's approach [
126
          4 ] that eliminates the overestimation and allows
127
          determination of similar segment lengths of bounded
128
          length and offer an algorithm for overcoming the bounded
129
          length restriction.
130
        
131
      
132
      
133
        Results
134
        
135
          Source of the USM distance metric
136
          over-estimation
137
          The application of USM as a tool for measuring
138
          scale-independent discrete sequence similarity and its
139
          particular application to genomics and proteomics
140
          exploits a distance metric providing an estimate of the
141
          length of similar regions surrounding a pair of symbols.
142
          In the approach presented by Almeida and Vinga, this
143
          distance metric is shown to overestimate the true length
144
          of the similar segment. We propose a variation on their
145
          approach that retains the distance property, eliminates
146
          the over-estimation, and uses more computationally
147
          efficient operations. We begin this discussion by first
148
          providing a more complete proof of the distribution of
149
          overestimation in the unidirectional USM. This proof aids
150
          in the illumination of the source of the
151
          over-estimation.
152
          The USM distance metric estimates the length of
153
          ungapped identical segments in the region surrounding the
154
          symbols being compared. As such, we now consider two
155
          sequences, 
156
          V and 
157
          W with 
158
          k symbols in agreement starting at
159
          symbol indices 
160
          m 
161
          
162
            v 
163
           and 
164
          m 
165
          
166
            w 
167
           respectively.
168
          
169
          From the definition of the USM recursion (Equation 1)
170
          we can see that the j thcoordinate at the k thstep can
171
          also be written as
172
          
173
          Where the is the j thcoordinate of the k thsymbol and
174
          the are determined by the initial values of the
175
          coordinates. These values are assigned in the
176
          initialization step of the USM encoding process. We write
177
          a coordinate of the sequence at the 
178
          m 
179
          
180
            v 
181
           thand 
182
          m 
183
          
184
            w 
185
           thstep in the recursion as:
186
          
187
          These representations are given as three summations
188
          corresponding to the 
189
          k symbols in agreement, the symbols
190
          preceding the similar segment back to the beginning of
191
          each sequence, and the initial value of the coordinate.
192
          From our definition of these sequences ( 
193
          k most recent symbols in agreement)
194
          we see that that the first terms (first summation) in
195
          each of the expressions are equal. We can factor a common
196
          term from each of the remaining two summations giving
197
          
198
          By change of index we get
199
          
200
          We let , , and for the non-similar segment be
201
          independent Bernoulli random variables with 
202
          p = 1/2. The term in brackets is
203
          recognized as a uniformly distributed random value on
204
          [0,1). We can rewrite 
205
          USMv and 
206
          USMw as follows
207
          
208
          Where 
209
          Rv 
210
          
211
            j 
212
           and 
213
          Rw 
214
          
215
            j 
216
           are uniformly distributed on [0,1). Next consider
217
          the differences between the USM coordinates for each of
218
          these sequences.
219
          
220
          The terms in the similar segment are eliminated and
221
          the difference becomes a scaled difference between two
222
          uniformly distributed random variables. The scale factor
223
          of this difference gives the length of the similar
224
          segment.
225
          The unidirectional distance metric given by Almeida
226
          and Vinga is defined as
227
          
228
          d = -log 
229
          2 (max|Δ 
230
          USM 
231
          
232
            j 
233
           |) for 
234
          j = 1.. 
235
          n     (12)
236
          where 
237
          n is the number of coordinates in
238
          the USM vector. Exploiting the fact that 
239
          log is monotone increasing we
240
          substitute our expression for the USM difference and
241
          write 
242
          d as:
243
          
244
          where Δ 
245
          R 
246
          
247
            j 
248
           = | 
249
          Rv 
250
          
251
            j 
252
           - 
253
          Rw 
254
          
255
            j 
256
           |. The overestimation is given by φ. Since 
257
          Rv 
258
          
259
            j 
260
           and 
261
          Rw 
262
          
263
            j 
264
           are uniformly distributed on [0,1), the distribution
265
          of Δ 
266
          R 
267
          
268
            j 
269
           is given by:
270
          
271
          And the associated cumulative distribution is given
272
          by:
273
          
274
          Therefore for 
275
          n independent coordinates
276
          distributed as defined, the distribution for the maximum
277
          is
278
          
279
          P ( 
280
          R 
281
          max ≤ 
282
          r ) = 
283
          P (Δ 
284
          R 
285
          1 ≤ 
286
          r , Δ 
287
          R 
288
          2 ≤ 
289
          r ,...,Δ 
290
          R 
291
          
292
            n 
293
           ≤ 
294
          r ) = ( 
295
          P (Δ 
296
          R ≤ 
297
          r )) 
298
          n = ( 
299
          r ·(2 - 
300
          r )) 
301
          n     (16)
302
          Under the transformation
303
          φ( 
304
          r ) = -log 
305
          2 ( 
306
          r )     (17)
307
          the distribution of φ can be determined as follows
308
          
309
          This confirms the result originally reported by
310
          Almeida and Vinga [ 4 ] . We see from this derivation
311
          that the exact length of the similar segment, given by 
312
          k , is determined by the exponent
313
          of the common factor of 1/2 factored from the non-similar
314
          segment. The remaining factor in that term constitutes
315
          the overestimation. Overestimation is, therefore,
316
          determined by the difference between terms in the series
317
          representation of the maximum coordinate difference
318
          beyond the similar segment. The symbol sequence encoded
319
          in this portion of the coordinate provides no information
320
          as to the length of the similar segment and so we wish to
321
          eliminate its effect on the estimation of the similar
322
          length.
323
        
324
        
325
          Boolean USMs
326
          It is clear from the discussion above that
327
          overestimation of the length of similar segments by USM
328
          results from contributions to the coordinate difference
329
          from terms in the coordinate summation preceding the
330
          similar segment. This effect is due to the use of an
331
          arithmetic difference in the computation of Almeida and
332
          Vinga's distance metric. Consider the following
333
          example
334
          
335
          where 
336
          R 
337
          
338
            a 
339
           and 
340
          R 
341
          
342
            b 
343
           are the uniformly distributed random values on [0,1)
344
          described in the previous derivation. Finite length
345
          coordinates are used here for illustration purposes.
346
          These two quantities must differ in the most significant
347
          position but are not constrained in the remaining terms.
348
          In this example we see that for the least significant
349
          subtraction to occur, the difference must borrow from the
350
          next more significant term. The borrow propagates the
351
          length of the sum and the length is overestimated, in
352
          this case, by 5. The overestimation is therefore, due to
353
          the interaction of the symbols prior to the first symbol
354
          at which the sequences differ.
355
          We propose a variation on the USM encoding and
356
          difference metric in which arithmetic operations
357
          (subtraction, maximum, and base 2 logarithm) are replaced
358
          with equivalent Boolean operations in which the values of
359
          individual terms in the above expression do not interact.
360
          We now present the proposed change of arithmetic.
361
          Consider the following representation of a bUSM
362
          (Boolean USM) coordinate
363
          
364
          c = 
365
          R 
366
          i 
367
          a 
368
          
369
            i 
370
           
371
          where 
372
          a 
373
          
374
            i 
375
           ∈ {0,1}     (20)
376
          Where represents the bit-wise logical OR of a series
377
          of terms and 
378
          R 
379
          i is the right shift operator
380
          repeated 
381
          i times. The value 
382
          c is then an infinite bit
383
          representation that encodes one bit from each symbol in
384
          the encoded sequence. The bUSM recursion is then written
385
          as
386
          
387
          The representation of a coordinate of the USM after
388
          the k thstep of the recursion is given by
389
          
390
          Again, consider two sequences, 
391
          V and 
392
          W , as defined previously. We write
393
          the coordinates of these sequences at the 
394
          m 
395
          
396
            v 
397
           thand 
398
          m 
399
          
400
            w 
401
           thstep in the recursion as:
402
          
403
          As before, we break the representation into three
404
          terms; one representing the similar segment, one
405
          representing terms prior to the similar segment, and one
406
          representing the initial value of the coordinate. The
407
          proposed recursion replaces division by 2 with a right
408
          shift operation and addition with a bit-wise logical OR.
409
          Resulting coordinates are histories of one bit of the
410
          symbols preceding the encoded symbol with the most recent
411
          symbol's bit stored in the left-most bit position (most
412
          significant position).
413
          Given the proposed coordinate representation and
414
          recursion, we now examine the bit-wise binary equivalent
415
          to the computation of the distance metric. Consider the
416
          exclusive OR of the coordinates of two symbols being
417
          compared.
418
          
419
          The exclusive OR operation yields true (bit is set) if
420
          the bits differ and false (bit is not set) otherwise.
421
          Based on our definition of sequences 
422
          V and 
423
          W none of the bits in the similar
424
          segment of the newly defined bUSM coordinate difference
425
          are set and the first bit beyond the similar segment must
426
          be set. The exact length of the similar segment is given
427
          by one less than the position of the left-most set bit in
428
          the set of coordinate differences.
429
          Under the original approach, the maximum of the
430
          differences across all coordinates is taken prior to
431
          computing the base 2 logarithm. Under the proposed
432
          approach we replace this operation with the bit-wise OR
433
          of the differences across all coordinates. The left-most
434
          bit set in the result corresponds to the bUSM coordinate
435
          that determines the length of the similar segment
436
          (equivalent to the coordinate winning the max operation
437
          in the standard USM). The computation of the distance
438
          metric in the original approach employs a base-2
439
          logarithm. Under the proposed approach we substitute the
440
          logarithm with a scan for the position of the most
441
          significant bit set in the bit-wise OR of the coordinate
442
          differences. By forming both the forward and reverse bUSM
443
          coordinates and adding the forward and backward
444
          distances, the exact length of the similar segment can be
445
          determined for all pairs within the segment.
446
          For the standard USM, initial values for coordinates
447
          are taken as random draws on [0,1). This allows the
448
          statistical properties of the overestimation to remain
449
          consistent at the beginning and end of the sequences. The
450
          Boolean USM does not overestimate the similar length and
451
          we must, therefore, reexamine the initialization approach
452
          so as to preserve the determination of exact lengths at
453
          the beginning and end of the sequences. This can be
454
          accomplished through the addition of two unique symbols,
455
          a tail symbol for sequence V and a tail symbol for
456
          sequence W. The use of two special symbols to mark the
457
          ends of the sequences results in no change to the
458
          recursion or similar segment length determination. It
459
          does impact the initialization and possibly the
460
          computational costs due to a potential increase in the
461
          number of coordinates required to represent the alphabet
462
          and the tail symbols. The addition of two extra symbols
463
          will, in some cases, increase the number of coordinates
464
          required. If, for example, an alphabet of 4 is required,
465
          the addition of two symbols increases the number of
466
          coordinates from 2 to 3. If, however, an alphabet of 20
467
          is required, the addition of two symbols results in no
468
          increase in the number of coordinates. Naturally, this
469
          would not ever happen for a sparse implementation of USM,
470
          where 
471
          n = 
472
          uu [ 4 ] , and, consequently, the
473
          number of coordinates increases with every new symbol.
474
          Nevertheless, this would not, in any way, change the bUSM
475
          proposition as the sole difference between sparse and
476
          compact representations concerns the binary
477
          representation of each symbol, 
478
          u 
479
          
480
          j = 1,..., 
481
          uu  . It is noteworthy,
482
          however, that the incentive to use sparser USM
483
          representations, which is the smaller extent of
484
          over-determination, does not exist for bUSM, where
485
          determination of length unit identity is exact, as shown
486
          below.
487
          The initial value of the Boolean USM coordinates for
488
          sequence V are set to indicate that an occurrence of the
489
          sequence V tail symbol precedes the first symbol in V. An
490
          instance of the tail symbol is also added to the end of
491
          the sequence (follows the last symbol in V). The initial
492
          value for sequence W's coordinates are set similarly
493
          using the sequence W tail symbol. Since tail symbols
494
          differ from each other and from all non-tail symbols,
495
          similar regions will be terminated at the beginning and
496
          end of the sequence and exact distances will be
497
          determined as required.
498
          Both forms of the USM coordinates can be considered a
499
          form of embedding and reorientation of the sequence data
500
          as illustrated in Figure 1. Instead of coding the
501
          information as a sequence of symbol codes, we code it as
502
          a collection of coordinates containing one code bit for
503
          each symbol in the sequence. Each USM coordinate stores
504
          one bit for each symbol preceding (forward coordinates)
505
          or following (backward coordinates) the symbol associated
506
          with the coordinate. The sequence of coordinates
507
          redundantly embeds the symbols surrounding the current
508
          symbol.
509
          The standard USM coordinates can be directly obtained
510
          from the bUSM form by interpreting the bUSM coordinates
511
          as block floating point representations of the USM
512
          coordinates with the binary decimal point set to the left
513
          of the most significant bit. Dividing the unsigned word
514
          representation of the bUSM coordinate by 2 
515
          W , where 
516
          W is the word length, yields the
517
          equivalent standard USM coordinate with 
518
          W symbols of precision. We also
519
          recognize that for both the standard and Boolean USM
520
          coordinates, the determination of similar segment lengths
521
          is limited by the length (or precision) of the word used
522
          to represent the coordinate. The original implementation
523
          of the standard USM was created in Matlab and used 64-bit
524
          floating point coordinates. As such, lengths for similar
525
          segments longer than 53 symbols (IEEE 754 format provides
526
          53 bits of precision [ 5 ] ) cannot be determined. The
527
          bUSM coordinates are similarly limited. The comparable
528
          implementation encodes bUSM coordinates in 64-bit fixed
529
          point representations and so exact similar segment
530
          lengths up to a maximum length of 63 can be
531
          determined.
532
        
533
        
534
          Overcoming finite word length limitations
535
          In theory, USM encoded sequences could be used to
536
          detect arbitrarily long similar segments. Previously we
537
          discuss the constraint imposed by finite-length binary
538
          representations of USM and bUSM coordinates. An algorithm
539
          for overcoming this limitation will now be presented.
540
          First we define the following two functions
541
          
542
          Under the finite word length limitation we note that 
543
          d 
544
          
545
            f 
546
           and 
547
          d 
548
          
549
            b 
550
           , the true forward and backward distances, cannot be
551
          determined from a single symbol pair comparison. We
552
          define two new functions and which can be determined from
553
          
554
          W -bit coordinate representations.
555
          These functions provide the length of similar segments up
556
          to the length of the underlying representation 
557
          W . The true length of the forward
558
          and backward similar segments are returned for lengths
559
          less than 
560
          W and the word length is returned
561
          otherwise. Next we present the following recursive
562
          functions
563
          
564
          The functions 
565
          D 
566
          
567
            f 
568
           and 
569
          D 
570
          
571
            b 
572
           recursively locate the end of the similar segment by
573
          stepping backward ( 
574
          D 
575
          
576
            f 
577
           ) and forward ( 
578
          D 
579
          
580
            b 
581
           ) through the similar segment until the end of the
582
          region is detected. The exact forward and backward
583
          lengths of similar segments of arbitrary length can be
584
          determined from these recursions. If the similar segment
585
          extends to the end of the sequence, the recursion will
586
          terminate on the last step due to the tail symbols added
587
          to the beginning and end of the sequence. The exact
588
          length of the similar segment containing symbols 
589
          v 
590
          
591
            i 
592
           and 
593
          w 
594
          
595
            j 
596
           is given by
597
          
598
          The proposed recursive definition of similar segment
599
          length overcomes the finite word length limitation and
600
          provides a practical method for recovering exact
601
          distances for arbitrarily long similar segments.
602
        
603
        
604
          Performance comparisons
605
          Computational comparisons of the two approaches were
606
          performed using the C-code implementations developed as
607
          described in Methods. This comparison examines the
608
          performance gains achieved through the use of binary
609
          operations (shift, exclusive OR, OR, bit scan) as
610
          replacements for the equivalent functions in the standard
611
          USM (division by 2, subtraction, max, and log). Since the
612
          Boolean USM requires two additional symbols to represent
613
          the tail symbols for each of the sequences, three
614
          coordinates are required for each of the forward and
615
          reverse directions (six total). Four coordinates were
616
          required for the standard USM. Standard USM coordinates
617
          were stored as 64 bit floating point numbers and bUSM
618
          coordinates as 64 bit unsigned fixed point numbers. The
619
          standard and binary USM implementations were, therefore,
620
          limited to detecting similar segments up to lengths 53
621
          and 63 respectively in a single comparison (not using the
622
          recursive distance algorithm).
623
          Elapsed execution time measurements were made for each
624
          of the 10 test cases and are presented in Table 1. For
625
          small sequence lengths (less than 10,000 symbols) the
626
          symbol-pair distance calculations for the binary USM
627
          achieved approximately 8.3 M comparisons per second on
628
          our test platform while the standard USM achieved rates
629
          of approximately 2.3 M comparisons per second. The binary
630
          USM approach was approximately 3.7 times the speed of the
631
          standard USM. For these small sequence test cases, the
632
          test application and the USM representations of the small
633
          sequences fit within the processor's on-chip cache (512
634
          kilobytes). For cases where the USM encoded sequences
635
          exceed the cache size and require reads from main memory,
636
          the performance decreases as indicated for the larger
637
          cases (10,000 symbols and above). For these cases, the
638
          binary USM executes at 3.3 M comparisons per second and
639
          the standard USM at 1.7 M comparisons per second for a
640
          performance ratio of approximately 1.9.
641
          Two examples applying both the standard and binary USM
642
          approaches were prepared to illustrate the difference in
643
          results obtained from overestimated distance and exact
644
          similar segment length determination. The first case
645
          duplicates the example given by Almeida and Vinga.
646
          Phrases from the Wendy Cope poem were converted to
647
          standard and binary USM for each of the sequences. The
648
          pair-wise comparisons were performed using both
649
          approaches and the results are shown as pixel maps
650
          (Figure 2). In these pixel maps, brighter regions
651
          indicate higher distance values and should correspond to
652
          symbol pairs found in similar segments. It is clear from
653
          the images in this figure that the major similar segments
654
          (lengths 7, 9, and 11) are clearly visible in both
655
          images. However, the exact distances in the Boolean USM
656
          image clearly show the shorter segments (lengths 3, 4,
657
          and 5) that are somewhat hidden by the standard USM
658
          overestimation error (Figure 2A). A similar illustration
659
          is provided using a sample nucleotide sequence. The
660
          sequence coding the human insulin receptor was acquired
661
          through NCBI (XM_048346, INSR) and used in a BLAST search
662
          for similar sequences. The second sequence (M69243,
663
          CTK-1) was taken from that list. A 100 nucleotide segment
664
          of the of the human insulin receptor (XM_048346,
665
          3056-3155) associated with the predicted tyrosine kinase
666
          domain and a 100 nucleotide segment from the chicken
667
          tyrosine kinase (M69243, 51-150) were converted to
668
          standard USM and bUSM coordinates and pair-wise compared
669
          using the associated distance metrics. Pixel maps of the
670
          distance metrics were prepared (Figure 3). Again, the
671
          long similar segments are clearly visible in both images.
672
          The overestimation noise in the standard USM (image A)
673
          again masks many of the shorter similar segments seen in
674
          the Boolean USM (image B).
675
        
676
      
677
      
678
        Discussion
679
        Almeida and Vinga presented a fundamentally interesting
680
        and practically useful extension of the Chaos Game
681
        Representation iterative function (CGR) referred to as
682
        Universal Sequence Maps (USM) and demonstrated the
683
        application of this representation and an associated
684
        distance metric in the identification of similar segments
685
        of discrete sequences. In this report we have presented
686
        considerations for the practical implementation of these
687
        methods and offer an implementation of USM that 1)
688
        eliminates the overestimation of the length of similar
689
        segments, 2) eliminates the inability to recognize similar
690
        segments longer than the word length of the coordinate
691
        representation, 3) can be implemented with more efficient
692
        operations, and 4) provides a simple conversion that
693
        recovers the standard USM coordinates. As currently
694
        defined, the USM distance metric (and associated bUSM
695
        implementation) is limited to the estimation of lengths of
696
        local identity about the pair of symbols being
697
        compared.
698
        The nature of the overestimation by the unidirectional
699
        distance metric was revealed in a proof of the distribution
700
        of the overestimation from the standard method. The
701
        algebraic difference taken in computing the distance
702
        results in the overestimation of length. This observation
703
        leads to a modification of the algorithm that eliminates
704
        the interaction of symbols when computing distance. We also
705
        recognize, in this derivation, that to achieve this result
706
        we must assume that the i thcoordinates for the symbol
707
        representations are equally likely ( 
708
        p = 1/2). The symbol coding
709
        selections for standard USM sequences must be balanced so
710
        as to make the occurrence of 1 and 0 in a standard USM
711
        coordinate equally likely for the given symbol set (and
712
        frequency of occurrence). This indicates that instead of
713
        choosing the first n binary representations of symbols as
714
        suggested in [ 4 ] , the symbols should be chosen from the
715
        2 
716
        n possible values in such a way as to
717
        balance the occurrences of 1 and 0 for each coordinate. The
718
        Boolean USM approach places no constraints other than
719
        uniqueness on the symbol representation.
720
        The Boolean USM approach eliminates the overestimation
721
        problem noted in the standard USM and can recover more
722
        symbols than the standard approach for a given amount of
723
        storage. The binary approach is faster than the standard
724
        approach (based on a straightforward implementation) and
725
        offers the potential for further enhancement through, for
726
        example, the use of processor instructions designed
727
        specifically to find the first or last bit set in a word
728
        (e.g. Pentium Bit Scan Reverse (BSR) instruction [ 6 ] ).
729
        In our test cases the binary approach performed 1.9 to 3.7
730
        times that of the standard approach even though it
731
        processed 6 (3 forward, 3 backward) rather than 4 (2
732
        forward, 2 backward) coordinates. These measurements are
733
        specific to the test platform (processor, OS, compiler,
734
        etc.) and with optimizations these ratios will change
735
        considerably.
736
        The computational cost of preparing the coordinates is
737
        insignificant when compared to the cost of the distance
738
        calculations in cases we examined. This may, in part, be
739
        due to the fact that we performed 
740
        N × 
741
        M comparisons for sequences of length
742
        
743
        N and 
744
        M and with a minimum value for 
745
        N or 
746
        M of 1000. Since the storage
747
        requirements for a USM representation of a sequence are
748
        considerably larger than that of the sequence itself, it
749
        may be more efficient to store the sequence in its native
750
        form (a sequence of symbols) and compute the USM
751
        coordinates just prior to sequence comparison when a large
752
        number of comparisons are being performed. We also observed
753
        that the storage size of the coordinates had an impact on
754
        computational performance. The performance of the algorithm
755
        decreases sharply when the USM coordinate representation
756
        exceeds the on-chip cache of the processor. This may
757
        indicate that we should be considering tradeoffs between
758
        word length and the size of the similar segments we wish to
759
        detect without using the recursive length function. If, for
760
        example, the probability of sequences longer than 16
761
        symbols is sufficiently small, then a 16 bit coordinate
762
        might be used.
763
        A single pairwise comparison grows as the upper integer
764
        of the base 2 logarithm of the number of unique symbols.
765
        Almeida and Vinga note that for a given length of interest 
766
        w , we need to sample the distance
767
        metric at no more than 
768
        N 
769
        
770
          A 
771
         · 
772
        N 
773
        
774
          B 
775
         / 
776
        w indices to locate all sequences of
777
        length 
778
        w or greater. In an application
779
        focused on locating all identical segments of length 
780
        w or greater, the computational cost
781
        therefore grows as [log 
782
        2 ( 
783
        nn )]·( 
784
        N 
785
        
786
          A 
787
         · 
788
        N 
789
        
790
          B 
791
         )/ 
792
        w . The exact distance given in the
793
        binary approach allows us to locate, from this sampling,
794
        the beginning and end of the similar segment from the
795
        sampled symbol indices and the forward and reverse
796
        distances. Subtracting one less than the exact forward
797
        distance from the indices of the symbols being compared and
798
        adding one less than the exact backward distance from the
799
        indices we can identify the start and end of the similar
800
        segment containing the given symbol pair. The Boolean USM,
801
        offers this advantage due to its unique ability to
802
        determine the exact length of the similar segment.
803
      
804
      
805
        Conclusions
806
        USM representations of discrete sequences in genomics
807
        and proteomics offer the possibility of scale-independent
808
        representations of sequence information surrounding points
809
        of comparison in those sequences. In order to maximize the
810
        potential for application of these methods we must consider
811
        both their theoretical properties and the computational
812
        methods for efficient implementation that retain these
813
        properties. We have offered one further step in that
814
        direction by identifying a Boolean implementation of USM
815
        (bUSM) that not only preserves the theoretical properties
816
        of numerical USM but actually uses the binary environment
817
        of the computational implementation to achieve a more exact
818
        logical solution. The proposed implementation leads to a
819
        distance metric that exactly determines similarity length
820
        between sequences. Ultimately, this achievement can be
821
        described as one that replaces the determination of
822
        logarithmic, numerical distance, with computationally more
823
        efficient logic operations. It could then be argued that,
824
        given the discrete nature of biological sequences, new
825
        scale-independent numerical representations, such as USM,
826
        are all but the first step in the identification of more
827
        accurate Boolean equivalents of fundamental relevance.
828
      
829
      
830
        Methods
831
        
832
          Software implementation
833
          Test versions of both the standard USM and the Boolean
834
          USM algorithms were developed in order to facilitate
835
          performance comparisons of practical implementations. A
836
          single "generic" implementation of the framework for both
837
          methods was first developed and then used as the starting
838
          point for developing method-specific implementations.
839
          Care was taken so as to minimize differences in any code
840
          other than that implementing the recursion and the
841
          distance metric calculations. Programs were written in C
842
          and compiled and tested in the same environment
843
          (compiler, linker, libraries, and host) and executed
844
          against the same test cases. No attempt was made to
845
          optimize either implementation (beyond that performed by
846
          the compiler). Both applications were compiled using the
847
          GNU compiler (gcc version 2.95.3) under the cygwin
848
          environment http://www.cygwin.com/on a 1 GHz Pentium
849
          III-based system running Windows 2000. The source code,
850
          Makefiles, and test data are available from
851
          http://bioinformatics.musc.edu/resources.html.
852
        
853
        
854
          Software performance testing
855
          Test cases were produced by replicating two different
856
          1 kbp segments of the e-coli genome to create 2 kbp, 3
857
          kbp, 4 kbp, 5 kbp, 10 kbp, 15 kbp, 17 kbp, 20 kbp, and 40
858
          kbp sequences. Each implementation is run as a separate
859
          program. The programs load the two sequences to be
860
          compared from disk, compute the USM (or bUSM) coordinates
861
          for all symbols in each sequence, and compute the local
862
          distance metric ( or ) for all pairs of symbols. Compute
863
          time was measured from the point following the load of
864
          the sequence from disk to the point at which all of the
865
          distance calculations were completed. Times were measured
866
          using the unix clock() function. Compute time was also
867
          measured from the point at which the USM coordinate
868
          calculations were completed to the end of the distance
869
          calculations.
870
        
871
      
872
      
873
        Acknowledgements
874
        The authors thankfully acknowledge support by the
875
        training grant 1-T15-LM07438-01 "training of toolmakers for
876
        Biomedical Informatics" by the National Library of Medicine
877
        of the National Institutes of Health, USA (NLM/NIH,
878
        http://www.nlm.nih.gov/ep/T15Training.html).
879
      
880
      
881
        Authors' contributions
882
        First author developed bolean implementation of
883
        Universal Sequence Map (bUSM). Second author, the original
884
        proponent of USM [ 4 ] identified theoretical context.
885
      
886
    
887
  
888

889
Product

Resources

Company