Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
7
Step 1
8
Identify the unique symbols in the analyzed sequences.
9
For a nucleotide sequence the unique symbols would be A,
10
G, C, and T for the four nucleotides found in DNA
11
sequences.
12
13
14
Step 2
15
Map each symbol to a unique corner in the unit
16
hypercube. The dimension of the unit hypercube,
17
n , is chosen as the upper integer
18
of
19
log
20
2 (
21
uu ) where
22
uu is the number of unique symbols.
23
Therefore
24
n has the value of 2 for DNA and 5
25
for proteins. For each symbol mapped in such a way, let
26
u
27
s
28
29
j
30
be the
31
j
32
th coordinate of the corner of the
33
hypercube to which symbol
34
s is mapped. For DNA one possible
35
mapping is
36
u
37
A = [0,0],
38
u
39
C = [0,1],
40
u
41
G = [1,0], and
42
u
43
T = [1,1]. There are sparser
44
implementations of USM that may use values of
45
n up to the number of unique units,
46
47
uu , as detailed in the original
48
proposition [ 4 ] . However, those solutions are not
49
essentially different and the implementation presented
50
here can be straightforwardly ported to sparser USM
51
representations.
52
53
54
Step 3
55
Iteratively generate the forward USM coordinates for
56
each of the
57
k symbols and each of the
58
n coordinates as follows
59
60
61
62
Step 4
63
Iteratively generate the backward USM coordinates for
64
each symbol and each coordinate as follows
65
66
The given procedure results in the 2
67
n USM coordinates for each of the
68
k symbols in the transformed
69
sequence. The similarity of two sequences at any pair of
70
symbols can be measured using the distance measure
71
defined by Almeida and Vinga [ 4 ] . The measure is
72
defined by
73
74
D =
75
d
76
77
f
78
(
79
a
80
81
i
82
83
b
84
85
j
86
) +
87
d
88
89
b
90
(
91
a
92
93
i
94
95
b
96
97
j
98
)     (3)
99
where
100
101
Almeida and Vinga present the distance measure as an
102
estimator of the length of the similar segment containing
103
the compared symbols. They also note that the measure, as
104
defined, necessarily overestimates the length of these
105
segments.
106
As can be seen in the USM procedure, each symbol in
107
the sequence is encoded as a set of USM coordinates.
108
These coordinates are constructed in such a way as to
109
encode the symbol itself as well as the preceding symbols
110
(forward coordinates) and following symbols (backward
111
coordinates). Scale independence is a property of this
112
representation that allows the complete recovery of the
113
encompassing sequence (preceding and following symbols)
114
from the USM coordinates of any symbol in the sequence to
115
any resolution (any scale) up to the complete length of
116
the sequence. In practical implementations we are faced
117
with the limitations of finite word length
118
representations of USM coordinates. In these
119
implementations our ability to recover the encompassing
120
sequence is bounded by the word length of the coordinate
121
representation. For this reason, we refer to USM and bUSM
122
implementations with finite precision coordinates as
123
bounded scale independent representations. In this paper
124
we consider the implementation of the USM algorithm and
125
propose a modification to Almeida and Vinga's approach [
126
4 ] that eliminates the overestimation and allows
127
determination of similar segment lengths of bounded
128
length and offer an algorithm for overcoming the bounded
129
length restriction.
130
131
132
133
Results
134
135
Source of the USM distance metric
136
over-estimation
137
The application of USM as a tool for measuring
138
scale-independent discrete sequence similarity and its
139
particular application to genomics and proteomics
140
exploits a distance metric providing an estimate of the
141
length of similar regions surrounding a pair of symbols.
142
In the approach presented by Almeida and Vinga, this
143
distance metric is shown to overestimate the true length
144
of the similar segment. We propose a variation on their
145
approach that retains the distance property, eliminates
146
the over-estimation, and uses more computationally
147
efficient operations. We begin this discussion by first
148
providing a more complete proof of the distribution of
149
overestimation in the unidirectional USM. This proof aids
150
in the illumination of the source of the
151
over-estimation.
152
The USM distance metric estimates the length of
153
ungapped identical segments in the region surrounding the
154
symbols being compared. As such, we now consider two
155
sequences,
156
V and
157
W with
158
k symbols in agreement starting at
159
symbol indices
160
m
161
162
v
163
and
164
m
165
166
w
167
respectively.
168
169
From the definition of the USM recursion (Equation 1)
170
we can see that the j thcoordinate at the k thstep can
171
also be written as
172
173
Where the is the j thcoordinate of the k thsymbol and
174
the are determined by the initial values of the
175
coordinates. These values are assigned in the
176
initialization step of the USM encoding process. We write
177
a coordinate of the sequence at the
178
m
179
180
v
181
thand
182
m
183
184
w
185
thstep in the recursion as:
186
187
These representations are given as three summations
188
corresponding to the
189
k symbols in agreement, the symbols
190
preceding the similar segment back to the beginning of
191
each sequence, and the initial value of the coordinate.
192
From our definition of these sequences (
193
k most recent symbols in agreement)
194
we see that that the first terms (first summation) in
195
each of the expressions are equal. We can factor a common
196
term from each of the remaining two summations giving
197
198
By change of index we get
199
200
We let , , and for the non-similar segment be
201
independent Bernoulli random variables with
202
p = 1/2. The term in brackets is
203
recognized as a uniformly distributed random value on
204
[0,1). We can rewrite
205
USMv and
206
USMw as follows
207
208
Where
209
Rv
210
211
j
212
and
213
Rw
214
215
j
216
are uniformly distributed on [0,1). Next consider
217
the differences between the USM coordinates for each of
218
these sequences.
219
220
The terms in the similar segment are eliminated and
221
the difference becomes a scaled difference between two
222
uniformly distributed random variables. The scale factor
223
of this difference gives the length of the similar
224
segment.
225
The unidirectional distance metric given by Almeida
226
and Vinga is defined as
227
228
d = -log
229
2 (max|Δ
230
USM
231
232
j
233
|) for
234
j = 1..
235
n     (12)
236
where
237
n is the number of coordinates in
238
the USM vector. Exploiting the fact that
239
log is monotone increasing we
240
substitute our expression for the USM difference and
241
write
242
d as:
243
244
where Δ
245
R
246
247
j
248
= |
249
Rv
250
251
j
252
-
253
Rw
254
255
j
256
|. The overestimation is given by φ. Since
257
Rv
258
259
j
260
and
261
Rw
262
263
j
264
are uniformly distributed on [0,1), the distribution
265
of Δ
266
R
267
268
j
269
is given by:
270
271
And the associated cumulative distribution is given
272
by:
273
274
Therefore for
275
n independent coordinates
276
distributed as defined, the distribution for the maximum
277
is
278
279
P (
280
R
281
max ≤
282
r ) =
283
P (Δ
284
R
285
1 ≤
286
r , Δ
287
R
288
2 ≤
289
r ,...,Δ
290
R
291
292
n
293
294
r ) = (
295
P (Δ
296
R ≤
297
r ))
298
n = (
299
r ·(2 -
300
r ))
301
n     (16)
302
Under the transformation
303
φ(
304
r ) = -log
305
2 (
306
r )     (17)
307
the distribution of φ can be determined as follows
308
309
This confirms the result originally reported by
310
Almeida and Vinga [ 4 ] . We see from this derivation
311
that the exact length of the similar segment, given by
312
k , is determined by the exponent
313
of the common factor of 1/2 factored from the non-similar
314
segment. The remaining factor in that term constitutes
315
the overestimation. Overestimation is, therefore,
316
determined by the difference between terms in the series
317
representation of the maximum coordinate difference
318
beyond the similar segment. The symbol sequence encoded
319
in this portion of the coordinate provides no information
320
as to the length of the similar segment and so we wish to
321
eliminate its effect on the estimation of the similar
322
length.
323
324
325
Boolean USMs
326
It is clear from the discussion above that
327
overestimation of the length of similar segments by USM
328
results from contributions to the coordinate difference
329
from terms in the coordinate summation preceding the
330
similar segment. This effect is due to the use of an
331
arithmetic difference in the computation of Almeida and
332
Vinga's distance metric. Consider the following
333
example
334
335
where
336
R
337
338
a
339
and
340
R
341
342
b
343
are the uniformly distributed random values on [0,1)
344
described in the previous derivation. Finite length
345
coordinates are used here for illustration purposes.
346
These two quantities must differ in the most significant
347
position but are not constrained in the remaining terms.
348
In this example we see that for the least significant
349
subtraction to occur, the difference must borrow from the
350
next more significant term. The borrow propagates the
351
length of the sum and the length is overestimated, in
352
this case, by 5. The overestimation is therefore, due to
353
the interaction of the symbols prior to the first symbol
354
at which the sequences differ.
355
We propose a variation on the USM encoding and
356
difference metric in which arithmetic operations
357
(subtraction, maximum, and base 2 logarithm) are replaced
358
with equivalent Boolean operations in which the values of
359
individual terms in the above expression do not interact.
360
We now present the proposed change of arithmetic.
361
Consider the following representation of a bUSM
362
(Boolean USM) coordinate
363
364
c =
365
R
366
i
367
a
368
369
i
370
371
where
372
a
373
374
i
375
∈ {0,1}     (20)
376
Where represents the bit-wise logical OR of a series
377
of terms and
378
R
379
i is the right shift operator
380
repeated
381
i times. The value
382
c is then an infinite bit
383
representation that encodes one bit from each symbol in
384
the encoded sequence. The bUSM recursion is then written
385
as
386
387
The representation of a coordinate of the USM after
388
the k thstep of the recursion is given by
389
390
Again, consider two sequences,
391
V and
392
W , as defined previously. We write
393
the coordinates of these sequences at the
394
m
395
396
v
397
thand
398
m
399
400
w
401
thstep in the recursion as:
402
403
As before, we break the representation into three
404
terms; one representing the similar segment, one
405
representing terms prior to the similar segment, and one
406
representing the initial value of the coordinate. The
407
proposed recursion replaces division by 2 with a right
408
shift operation and addition with a bit-wise logical OR.
409
Resulting coordinates are histories of one bit of the
410
symbols preceding the encoded symbol with the most recent
411
symbol's bit stored in the left-most bit position (most
412
significant position).
413
Given the proposed coordinate representation and
414
recursion, we now examine the bit-wise binary equivalent
415
to the computation of the distance metric. Consider the
416
exclusive OR of the coordinates of two symbols being
417
compared.
418
419
The exclusive OR operation yields true (bit is set) if
420
the bits differ and false (bit is not set) otherwise.
421
Based on our definition of sequences
422
V and
423
W none of the bits in the similar
424
segment of the newly defined bUSM coordinate difference
425
are set and the first bit beyond the similar segment must
426
be set. The exact length of the similar segment is given
427
by one less than the position of the left-most set bit in
428
the set of coordinate differences.
429
Under the original approach, the maximum of the
430
differences across all coordinates is taken prior to
431
computing the base 2 logarithm. Under the proposed
432
approach we replace this operation with the bit-wise OR
433
of the differences across all coordinates. The left-most
434
bit set in the result corresponds to the bUSM coordinate
435
that determines the length of the similar segment
436
(equivalent to the coordinate winning the max operation
437
in the standard USM). The computation of the distance
438
metric in the original approach employs a base-2
439
logarithm. Under the proposed approach we substitute the
440
logarithm with a scan for the position of the most
441
significant bit set in the bit-wise OR of the coordinate
442
differences. By forming both the forward and reverse bUSM
443
coordinates and adding the forward and backward
444
distances, the exact length of the similar segment can be
445
determined for all pairs within the segment.
446
For the standard USM, initial values for coordinates
447
are taken as random draws on [0,1). This allows the
448
statistical properties of the overestimation to remain
449
consistent at the beginning and end of the sequences. The
450
Boolean USM does not overestimate the similar length and
451
we must, therefore, reexamine the initialization approach
452
so as to preserve the determination of exact lengths at
453
the beginning and end of the sequences. This can be
454
accomplished through the addition of two unique symbols,
455
a tail symbol for sequence V and a tail symbol for
456
sequence W. The use of two special symbols to mark the
457
ends of the sequences results in no change to the
458
recursion or similar segment length determination. It
459
does impact the initialization and possibly the
460
computational costs due to a potential increase in the
461
number of coordinates required to represent the alphabet
462
and the tail symbols. The addition of two extra symbols
463
will, in some cases, increase the number of coordinates
464
required. If, for example, an alphabet of 4 is required,
465
the addition of two symbols increases the number of
466
coordinates from 2 to 3. If, however, an alphabet of 20
467
is required, the addition of two symbols results in no
468
increase in the number of coordinates. Naturally, this
469
would not ever happen for a sparse implementation of USM,
470
where
471
n =
472
uu [ 4 ] , and, consequently, the
473
number of coordinates increases with every new symbol.
474
Nevertheless, this would not, in any way, change the bUSM
475
proposition as the sole difference between sparse and
476
compact representations concerns the binary
477
representation of each symbol,
478
u
479
480
j = 1,...,
481
uu . It is noteworthy,
482
however, that the incentive to use sparser USM
483
representations, which is the smaller extent of
484
over-determination, does not exist for bUSM, where
485
determination of length unit identity is exact, as shown
486
below.
487
The initial value of the Boolean USM coordinates for
488
sequence V are set to indicate that an occurrence of the
489
sequence V tail symbol precedes the first symbol in V. An
490
instance of the tail symbol is also added to the end of
491
the sequence (follows the last symbol in V). The initial
492
value for sequence W's coordinates are set similarly
493
using the sequence W tail symbol. Since tail symbols
494
differ from each other and from all non-tail symbols,
495
similar regions will be terminated at the beginning and
496
end of the sequence and exact distances will be
497
determined as required.
498
Both forms of the USM coordinates can be considered a
499
form of embedding and reorientation of the sequence data
500
as illustrated in Figure 1. Instead of coding the
501
information as a sequence of symbol codes, we code it as
502
a collection of coordinates containing one code bit for
503
each symbol in the sequence. Each USM coordinate stores
504
one bit for each symbol preceding (forward coordinates)
505
or following (backward coordinates) the symbol associated
506
with the coordinate. The sequence of coordinates
507
redundantly embeds the symbols surrounding the current
508
symbol.
509
The standard USM coordinates can be directly obtained
510
from the bUSM form by interpreting the bUSM coordinates
511
as block floating point representations of the USM
512
coordinates with the binary decimal point set to the left
513
of the most significant bit. Dividing the unsigned word
514
representation of the bUSM coordinate by 2
515
W , where
516
W is the word length, yields the
517
equivalent standard USM coordinate with
518
W symbols of precision. We also
519
recognize that for both the standard and Boolean USM
520
coordinates, the determination of similar segment lengths
521
is limited by the length (or precision) of the word used
522
to represent the coordinate. The original implementation
523
of the standard USM was created in Matlab and used 64-bit
524
floating point coordinates. As such, lengths for similar
525
segments longer than 53 symbols (IEEE 754 format provides
526
53 bits of precision [ 5 ] ) cannot be determined. The
527
bUSM coordinates are similarly limited. The comparable
528
implementation encodes bUSM coordinates in 64-bit fixed
529
point representations and so exact similar segment
530
lengths up to a maximum length of 63 can be
531
determined.
532
533
534
Overcoming finite word length limitations
535
In theory, USM encoded sequences could be used to
536
detect arbitrarily long similar segments. Previously we
537
discuss the constraint imposed by finite-length binary
538
representations of USM and bUSM coordinates. An algorithm
539
for overcoming this limitation will now be presented.
540
First we define the following two functions
541
542
Under the finite word length limitation we note that
543
d
544
545
f
546
and
547
d
548
549
b
550
, the true forward and backward distances, cannot be
551
determined from a single symbol pair comparison. We
552
define two new functions and which can be determined from
553
554
W -bit coordinate representations.
555
These functions provide the length of similar segments up
556
to the length of the underlying representation
557
W . The true length of the forward
558
and backward similar segments are returned for lengths
559
less than
560
W and the word length is returned
561
otherwise. Next we present the following recursive
562
functions
563
564
The functions
565
D
566
567
f
568
and
569
D
570
571
b
572
recursively locate the end of the similar segment by
573
stepping backward (
574
D
575
576
f
577
) and forward (
578
D
579
580
b
581
) through the similar segment until the end of the
582
region is detected. The exact forward and backward
583
lengths of similar segments of arbitrary length can be
584
determined from these recursions. If the similar segment
585
extends to the end of the sequence, the recursion will
586
terminate on the last step due to the tail symbols added
587
to the beginning and end of the sequence. The exact
588
length of the similar segment containing symbols
589
v
590
591
i
592
and
593
w
594
595
j
596
is given by
597
598
The proposed recursive definition of similar segment
599
length overcomes the finite word length limitation and
600
provides a practical method for recovering exact
601
distances for arbitrarily long similar segments.
602
603
604
Performance comparisons
605
Computational comparisons of the two approaches were
606
performed using the C-code implementations developed as
607
described in Methods. This comparison examines the
608
performance gains achieved through the use of binary
609
operations (shift, exclusive OR, OR, bit scan) as
610
replacements for the equivalent functions in the standard
611
USM (division by 2, subtraction, max, and log). Since the
612
Boolean USM requires two additional symbols to represent
613
the tail symbols for each of the sequences, three
614
coordinates are required for each of the forward and
615
reverse directions (six total). Four coordinates were
616
required for the standard USM. Standard USM coordinates
617
were stored as 64 bit floating point numbers and bUSM
618
coordinates as 64 bit unsigned fixed point numbers. The
619
standard and binary USM implementations were, therefore,
620
limited to detecting similar segments up to lengths 53
621
and 63 respectively in a single comparison (not using the
622
recursive distance algorithm).
623
Elapsed execution time measurements were made for each
624
of the 10 test cases and are presented in Table 1. For
625
small sequence lengths (less than 10,000 symbols) the
626
symbol-pair distance calculations for the binary USM
627
achieved approximately 8.3 M comparisons per second on
628
our test platform while the standard USM achieved rates
629
of approximately 2.3 M comparisons per second. The binary
630
USM approach was approximately 3.7 times the speed of the
631
standard USM. For these small sequence test cases, the
632
test application and the USM representations of the small
633
sequences fit within the processor's on-chip cache (512
634
kilobytes). For cases where the USM encoded sequences
635
exceed the cache size and require reads from main memory,
636
the performance decreases as indicated for the larger
637
cases (10,000 symbols and above). For these cases, the
638
binary USM executes at 3.3 M comparisons per second and
639
the standard USM at 1.7 M comparisons per second for a
640
performance ratio of approximately 1.9.
641
Two examples applying both the standard and binary USM
642
approaches were prepared to illustrate the difference in
643
results obtained from overestimated distance and exact
644
similar segment length determination. The first case
645
duplicates the example given by Almeida and Vinga.
646
Phrases from the Wendy Cope poem were converted to
647
standard and binary USM for each of the sequences. The
648
pair-wise comparisons were performed using both
649
approaches and the results are shown as pixel maps
650
(Figure 2). In these pixel maps, brighter regions
651
indicate higher distance values and should correspond to
652
symbol pairs found in similar segments. It is clear from
653
the images in this figure that the major similar segments
654
(lengths 7, 9, and 11) are clearly visible in both
655
images. However, the exact distances in the Boolean USM
656
image clearly show the shorter segments (lengths 3, 4,
657
and 5) that are somewhat hidden by the standard USM
658
overestimation error (Figure 2A). A similar illustration
659
is provided using a sample nucleotide sequence. The
660
sequence coding the human insulin receptor was acquired
661
through NCBI (XM_048346, INSR) and used in a BLAST search
662
for similar sequences. The second sequence (M69243,
663
CTK-1) was taken from that list. A 100 nucleotide segment
664
of the of the human insulin receptor (XM_048346,
665
3056-3155) associated with the predicted tyrosine kinase
666
domain and a 100 nucleotide segment from the chicken
667
tyrosine kinase (M69243, 51-150) were converted to
668
standard USM and bUSM coordinates and pair-wise compared
669
using the associated distance metrics. Pixel maps of the
670
distance metrics were prepared (Figure 3). Again, the
671
long similar segments are clearly visible in both images.
672
The overestimation noise in the standard USM (image A)
673
again masks many of the shorter similar segments seen in
674
the Boolean USM (image B).
675
676
677
678
Discussion
679
Almeida and Vinga presented a fundamentally interesting
680
and practically useful extension of the Chaos Game
681
Representation iterative function (CGR) referred to as
682
Universal Sequence Maps (USM) and demonstrated the
683
application of this representation and an associated
684
distance metric in the identification of similar segments
685
of discrete sequences. In this report we have presented
686
considerations for the practical implementation of these
687
methods and offer an implementation of USM that 1)
688
eliminates the overestimation of the length of similar
689
segments, 2) eliminates the inability to recognize similar
690
segments longer than the word length of the coordinate
691
representation, 3) can be implemented with more efficient
692
operations, and 4) provides a simple conversion that
693
recovers the standard USM coordinates. As currently
694
defined, the USM distance metric (and associated bUSM
695
implementation) is limited to the estimation of lengths of
696
local identity about the pair of symbols being
697
compared.
698
The nature of the overestimation by the unidirectional
699
distance metric was revealed in a proof of the distribution
700
of the overestimation from the standard method. The
701
algebraic difference taken in computing the distance
702
results in the overestimation of length. This observation
703
leads to a modification of the algorithm that eliminates
704
the interaction of symbols when computing distance. We also
705
recognize, in this derivation, that to achieve this result
706
we must assume that the i thcoordinates for the symbol
707
representations are equally likely (
708
p = 1/2). The symbol coding
709
selections for standard USM sequences must be balanced so
710
as to make the occurrence of 1 and 0 in a standard USM
711
coordinate equally likely for the given symbol set (and
712
frequency of occurrence). This indicates that instead of
713
choosing the first n binary representations of symbols as
714
suggested in [ 4 ] , the symbols should be chosen from the
715
2
716
n possible values in such a way as to
717
balance the occurrences of 1 and 0 for each coordinate. The
718
Boolean USM approach places no constraints other than
719
uniqueness on the symbol representation.
720
The Boolean USM approach eliminates the overestimation
721
problem noted in the standard USM and can recover more
722
symbols than the standard approach for a given amount of
723
storage. The binary approach is faster than the standard
724
approach (based on a straightforward implementation) and
725
offers the potential for further enhancement through, for
726
example, the use of processor instructions designed
727
specifically to find the first or last bit set in a word
728
(e.g. Pentium Bit Scan Reverse (BSR) instruction [ 6 ] ).
729
In our test cases the binary approach performed 1.9 to 3.7
730
times that of the standard approach even though it
731
processed 6 (3 forward, 3 backward) rather than 4 (2
732
forward, 2 backward) coordinates. These measurements are
733
specific to the test platform (processor, OS, compiler,
734
etc.) and with optimizations these ratios will change
735
considerably.
736
The computational cost of preparing the coordinates is
737
insignificant when compared to the cost of the distance
738
calculations in cases we examined. This may, in part, be
739
due to the fact that we performed
740
N ×
741
M comparisons for sequences of length
742
743
N and
744
M and with a minimum value for
745
N or
746
M of 1000. Since the storage
747
requirements for a USM representation of a sequence are
748
considerably larger than that of the sequence itself, it
749
may be more efficient to store the sequence in its native
750
form (a sequence of symbols) and compute the USM
751
coordinates just prior to sequence comparison when a large
752
number of comparisons are being performed. We also observed
753
that the storage size of the coordinates had an impact on
754
computational performance. The performance of the algorithm
755
decreases sharply when the USM coordinate representation
756
exceeds the on-chip cache of the processor. This may
757
indicate that we should be considering tradeoffs between
758
word length and the size of the similar segments we wish to
759
detect without using the recursive length function. If, for
760
example, the probability of sequences longer than 16
761
symbols is sufficiently small, then a 16 bit coordinate
762
might be used.
763
A single pairwise comparison grows as the upper integer
764
of the base 2 logarithm of the number of unique symbols.
765
Almeida and Vinga note that for a given length of interest
766
w , we need to sample the distance
767
metric at no more than
768
N
769
770
A
771
·
772
N
773
774
B
775
/
776
w indices to locate all sequences of
777
length
778
w or greater. In an application
779
focused on locating all identical segments of length
780
w or greater, the computational cost
781
therefore grows as [log
782
2 (
783
nn )]·(
784
N
785
786
A
787
·
788
N
789
790
B
791
)/
792
w . The exact distance given in the
793
binary approach allows us to locate, from this sampling,
794
the beginning and end of the similar segment from the
795
sampled symbol indices and the forward and reverse
796
distances. Subtracting one less than the exact forward
797
distance from the indices of the symbols being compared and
798
adding one less than the exact backward distance from the
799
indices we can identify the start and end of the similar
800
segment containing the given symbol pair. The Boolean USM,
801
offers this advantage due to its unique ability to
802
determine the exact length of the similar segment.
803
804
805
Conclusions
806
USM representations of discrete sequences in genomics
807
and proteomics offer the possibility of scale-independent
808
representations of sequence information surrounding points
809
of comparison in those sequences. In order to maximize the
810
potential for application of these methods we must consider
811
both their theoretical properties and the computational
812
methods for efficient implementation that retain these
813
properties. We have offered one further step in that
814
direction by identifying a Boolean implementation of USM
815
(bUSM) that not only preserves the theoretical properties
816
of numerical USM but actually uses the binary environment
817
of the computational implementation to achieve a more exact
818
logical solution. The proposed implementation leads to a
819
distance metric that exactly determines similarity length
820
between sequences. Ultimately, this achievement can be
821
described as one that replaces the determination of
822
logarithmic, numerical distance, with computationally more
823
efficient logic operations. It could then be argued that,
824
given the discrete nature of biological sequences, new
825
scale-independent numerical representations, such as USM,
826
are all but the first step in the identification of more
827
accurate Boolean equivalents of fundamental relevance.
828
829
830
Methods
831
832
Software implementation
833
Test versions of both the standard USM and the Boolean
834
USM algorithms were developed in order to facilitate
835
performance comparisons of practical implementations. A
836
single "generic" implementation of the framework for both
837
methods was first developed and then used as the starting
838
point for developing method-specific implementations.
839
Care was taken so as to minimize differences in any code
840
other than that implementing the recursion and the
841
distance metric calculations. Programs were written in C
842
and compiled and tested in the same environment
843
(compiler, linker, libraries, and host) and executed
844
against the same test cases. No attempt was made to
845
optimize either implementation (beyond that performed by
846
the compiler). Both applications were compiled using the
847
GNU compiler (gcc version 2.95.3) under the cygwin
848
environment http://www.cygwin.com/on a 1 GHz Pentium
849
III-based system running Windows 2000. The source code,
850
Makefiles, and test data are available from
851
http://bioinformatics.musc.edu/resources.html.
852
853
854
Software performance testing
855
Test cases were produced by replicating two different
856
1 kbp segments of the e-coli genome to create 2 kbp, 3
857
kbp, 4 kbp, 5 kbp, 10 kbp, 15 kbp, 17 kbp, 20 kbp, and 40
858
kbp sequences. Each implementation is run as a separate
859
program. The programs load the two sequences to be
860
compared from disk, compute the USM (or bUSM) coordinates
861
for all symbols in each sequence, and compute the local
862
distance metric ( or ) for all pairs of symbols. Compute
863
time was measured from the point following the load of
864
the sequence from disk to the point at which all of the
865
distance calculations were completed. Times were measured
866
using the unix clock() function. Compute time was also
867
measured from the point at which the USM coordinate
868
calculations were completed to the end of the distance
869
calculations.
870
871
872
873
Acknowledgements
874
The authors thankfully acknowledge support by the
875
training grant 1-T15-LM07438-01 "training of toolmakers for
876
Biomedical Informatics" by the National Library of Medicine
877
of the National Institutes of Health, USA (NLM/NIH,
878
http://www.nlm.nih.gov/ep/T15Training.html).
879
880
881
Authors' contributions
882
First author developed bolean implementation of
883
Universal Sequence Map (bUSM). Second author, the original
884
proponent of USM [ 4 ] identified theoretical context.
885
886
887
888
889