Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
Download
29547 views
1
2
3
4
5
Background
6
For over a decade the idea of representing biological
7
sequences in a continuous coordinate space has maintained
8
its appeal but not been fully realized [ 1 2 3 ] . The
9
basic idea is that sequences of symbols, such as
10
nucleotides in genomes, aminoacids in proteomes, repeated
11
sequences in MLST [Multi Locus Sequence Typing, 4], words
12
in languages or letters in words, would define trajectories
13
in this continuous space conserving the statistical
14
properties of the original sequences [ 3 5 6 7 8 9 ] .
15
Accordingly, the coordinate position of each unit would
16
uniquely encode for both its identity and its context, i.e.
17
the identity of its neighbors [ 10 ] . Ideally, the
18
position should be scale-independent, such that the
19
extraction of the encompassing sequence can be performed
20
with any resolution, leading to an oligomer of arbitrary
21
length. The pioneer work by Jeffrey published in 1990 [ 5 ]
22
achieved this for genomic sequences by using the Chaos Game
23
Representation technique (CGR), defining a unit-square
24
where each corner corresponds to one of the 4 possible
25
nucleotides. Subsequent work further explored the
26
properties of CGR of biological sequences, but two main
27
obstacles prevented the realization of its early promise -
28
lack of scalability with regard to the number of possible
29
unique units and inability to represent succession schemes.
30
Meanwhile, Markov Chain theory already offered a solid
31
foundation for the identification of discrete spaces to
32
represent sequences as cross-tabulated conditional
33
probabilities - Markov transition tables. This Bayesian
34
technique is widely explored in bioinformatic applications
35
seeking to measure homology and align sequences [ 11 ] . In
36
a recent report [ 12 ] we have shown that, for genomic
37
sequences, Markov tables are in fact a special case of CGR,
38
contrary to what had been suggested previously [ 13 ] .
39
This raised the prospect of an advantageous use of
40
iterative maps as state spaces not only for representation
41
of sequences but also to identify scale independent
42
stochastic models of the succession scheme. That work [ 12
43
] is hereby extended and further generalized to be
44
applicable to sequences with arbitrary numbers of unique
45
component units, without sacrificing the inverse
46
correlation between distance in the map and sequence
47
similarity independent of position. Accordingly, the
48
technique is named Universal Sequence Map (USM).
49
50
51
Results
52
53
Conceptual foundations
54
The USM generalization proposed here is achieved by
55
observing two stipulations: A-alternative units in the
56
iterative map are positioned in distinct corners of
57
unit block structures', and B -
58
sequence processing is bi-directional.
59
Basis for USM generalization:
60
A. Each unique unit is referenced in the map for
61
positions that are at equal
62
n-distances from each other, and
63
possibly, but not necessarily, defining a complete
64
block structure [ 3 ] .
65
n-distances are defined as the
66
maximum distance along any dimension, e.g.
67
n-distance between [
68
a
69
1 ,
70
a
71
2 , ...,
72
a
73
74
n
75
] and [
76
b
77
1 ,
78
b
79
2 , ...,
80
b
81
82
n
83
] is
84
max(|b
85
1 -
86
a
87
1 |, |
88
b
89
2 -
90
a
91
2 |, ..., |
92
b
93
94
n
95
-
96
a
97
98
n
99
|), see also Equation 3. It will be shown that this
100
stipulation leads to the definition of spaces where
101
distance is inversely proportional to sequence
102
similarity, independent of position. In this respect, USM
103
departs from previous attempts to generalize Chaos Game
104
Representation that conserve the bi-dimensionality of the
105
original CGR representation [ 8 15 16 17 ] .
106
B. The iterative positioning is performed in both
107
directions. Therefore, there will be two sets of
108
coordinates, the result of forward and backward iterative
109
operations. It will be shown that, by adding backward and
110
forward map distances between two positions, the number
111
of identical units in the encompassing sequences can be
112
extracted directly from the USM coordinates. As a
113
consequence, two arbitrary positions can be compared, and
114
the number of contiguous similar units is extracted by an
115
algebraic operation that relies solely on the USM
116
coordinates of those very two positions.
117
118
119
Implementation of USM algorithm
120
The algorithm will be first illustrated for the first
121
and last stanzas of Wendy Cope's poem "The Uncertainty of
122
the Poet" (14), respectively, "I am a poet. I am very
123
fond of bananas." and "I am of very fond bananas. Am I a
124
poet?". The procedure includes four steps:
125
1. Identification of unique sequence units - e.g.
126
these two stanzas have 19 unique characters, (table 1),
127
i.e.
128
uu = 19.
129
2. Replacement of each unique unit (in this case units
130
are alphabetic characters) by a unique binary number -
131
e.g. in table 1each of the 19 unique units is replaced by
132
its rank order minus one, represented as a binary number.
133
Other arrangements are possible leading to the same final
134
result as discussed below. The minimum number of
135
dimensions necessary to accommodate
136
uu unique units,
137
n, is the upper integer of the
138
length of its binary representation:
139
n = ceil(log
140
2
141
(uu)). For W. Cope's stanzas,
142
n = ceil(log
143
2
144
(19)) = 5. The binary reference
145
coordinates for the unique units are defined by the
146
numerals of the binary code - for example,
147
a will be assigned to the position
148
U
149
'
150
a ' =
151
[0,0,1,0,1]. Each symbol is
152
represented as a corner in a n-dimensional cube (Table
153
1). The purpose of these first two steps is to guarantee
154
that the reference positions for each unique sequence
155
unit component are equidistant (stipulation A) in the
156
n-metric defined above. Any other
157
procedure resulting in equidistant unique positions will
158
lead to the same final results independently of the
159
actual binary numbers used or the number of dimensions
160
used to contain them.
161
3. The CGR procedure [ 5 ] (Eq. 1) is applied
162
independently to each coordinate,
163
j = 1,2, ...,n, for each unit,
164
i, in the sequence of length
165
k, u
166
167
j
168
169
(i) with
170
i = 1,2, ...,k, and starting with a
171
random map position taken from a uniform distribution in
172
[0,1]
173
n , i.e.
174
Unif([0,1]
175
n ). The random seed is not
176
fundamentally different from using the middle position in
177
the map as is conventional in CGR and it has the added
178
feature that it prevents the invalidation of the inverse
179
logarithmic proportionality of
180
n-distance to sequence similarity [
181
12 ] for sequences that start or end with the same
182
motif.
183
For a sequence with
184
k units, the USM positions
185
i = 1,..., k for the
186
j = 1,..., n dimensions are
187
determined as follows:
188
189
4. The previous step generated
190
k positions in a
191
n -dimension space by processing
192
the sequence forward (Eq. 1). This subsequent step adds
193
an additional set of
194
n dimensions by implementing the
195
same procedure backward (Eq. 2), again starting at random
196
positions for each coordinate. Consequently the first
197
n dimensions of USM will be
198
referred as defining
199
a forward map and the second set of
200
201
n dimensions will define a
202
backward map. Put together, the
203
bidirectional USM map defines a
204
2n-unit block structure.
205
The
206
n additional backward coordinates
207
are determined as follows:
208
209
The forward USM map for genomic sequences, where
210
uu = 4, and, consequently,
211
n = 2, is the same as the result
212
generated by CGR. However, by freeing the iterative map
213
from the dual-dimensional constraint of conventional CGR,
214
the USM forward map alone achieved the goal of producing
215
a scale independent representation of sequences of
216
arbitrary number of unique units. These properties will
217
be briefly illustrated with W Cope's example. The 16
218
thunit of the first stanza, "I am a poet. I am very fond
219
of bananas.", has USM coordinates
220
USM
221
222
[1,...,2n]
223
224
(16) =
225
[0.02 0.01 0.63 0.00 0.53 0.07 0.30
226
0.52 0.27 0.57]. The first
227
n = 5 coordinates, the position in
228
the forward map, can now be used, by reversing equation 1
229
[ 12 13 ] , not only to extract the identity the unit
230
i = 16 but also the identity of the
231
preceding units:
232
- using forward coordinates alone
233
[0.0156 0.0138 0.6314 0.0001
234
0.5338]
235
236
The same procedure can be applied to the remaining
237
n = 5 coordinates, the position in
238
the backward map, to extract the identity of the
239
succeeding units, now ordered backwards.
240
- using backward coordinates alone
241
[0.0703 0.3004 0.5169 0.2742
242
0.5652]
243
244
The length of the sequence that can be recovered from
245
a position in the CGR or USM space is only as long as the
246
resolution, in bits, of the coordinates themselves. In
247
addition, the relevance of these iterative techniques is
248
not associated with the property of recovering sequences
249
as much as with the ability to recover the succession
250
schemes, e.g. the Markov probability tables. It has been
251
recognized for almost a decade that the density of
252
positions in unidirectional, bi-dimensional, iterated CGR
253
maps (e.g. of genomic sequences,
254
uu = 4 ->
255
n = 2 ) defines a Markov table [ 12
256
13 ] . The complete accommodation of Markov chains in
257
unidirectional USM (i.e. either forward or backward,
258
which is an equivalent to a multidimensional solution for
259
CGR) can be quickly established by noting that the
260
identity of a quadrant is set by its middle coordinates [
261
13 ] . In order to extract the Markov format, for an
262
arbitrary integer order
263
ord, each of the two n-unit
264
hypercubes, the set of
265
n forward or backward coordinates,
266
would be divided in
267
q = 2
268
n.(ord+1) equal quadrants and the
269
quadrant frequencies rearranged [ 12 ] . The use of
270
quadrant to designate what is in
271
fact a sub-unit hypercube is a consonance with the
272
preceding work on bidimensional CGR maps [ 12 ] , where
273
it was shown that since any number of subdivisions can be
274
considered in a continuous domain, the density
275
distribution becomes an order-free Markov Table that
276
accommodates both integer and fractal memory lengths. The
277
extraction of Markov chain transition tables from USM
278
representations, both forward and backward, is included
279
in the accompanying web-based application (see
280
Abstract).
281
Above, the USM procedure was shown to allow for the
282
representation of sequences as multidimensional objects
283
without loss of identity or context. These objects can
284
now be analyzed to characterize the sequences for
285
quantities such as similarity between segments or entropy
286
[ 18 19 ] within the sequence. In figure 1the
287
10-dimensional object defined by the USM positions of the
288
two stanzas was projected in 3-dimensions by principal
289
component analysis. The dimensionality reduction by
290
principal factor extraction has visualization purposes
291
only. As established above, the minimum necessary
292
dimensionality of the USM state space is set by the
293
binary logarithm of the number of unique units.
294
Nevertheless, the sequence variance associated with each
295
component is provided in the figure legend. In figure 1a,
296
the segments "
297
very fond of" in the two stanzas
298
are linked by solid lines to highlight the fact that
299
sequence similarity is reflected by spatial proximity of
300
USM coordinates. The representation is repeated in Figure
301
1bwith solid lining of the segment "
302
bananas ". The matching of the two
303
segments of the second stanza (light) to the similar
304
segments of the first stanza (dark) is, again, visually
305
apparent.
306
The USM algorithm determines that similar sequences,
307
or segments of sequences, will have converging iterated
308
trajectories: the distance will be cut in half for every
309
consecutive similar unit. This property was notice before
310
for CGR of genomic sequences [ 12 ] , and will be further
311
explored here for USM generalization. In that preceding
312
work it was shown that the number of similar consecutive
313
units can be approximated by a symmetrical logarithmic
314
transformation of the maximum distance between two
315
positions in either of the dimensions (
316
n-distance ),
317
d.
318
319
d = -log
320
2 (Max|Δ
321
USM
322
323
undirectional
324
|)     (Eq. 3)
325
Since the USM coordinates include two CGR iterations
326
per dimension, one forward and another backward, two
327
distances can be extracted. The first 1,...,
328
n coordinates define a forward
329
similarity estimate,
330
d
331
332
f
333
, and the second
334
n+1,..., 2n coordinates can be used
335
to estimate backward similarity,
336
d
337
338
b
339
. The former measures similarity with regard to the
340
units preceding the one being compared and the latter
341
does the same for those the succeeding that same units.
342
Therefore, the forward and backward distances between the
343
positions
344
i and
345
j of two sequences,
346
a and
347
b, with a length of
348
k
349
350
a
351
and
352
k
353
354
b
355
, respectively, would be calculated as described by
356
equation 4, defining two rectangular matrices,
357
d
358
359
f
360
and
361
d
362
363
b
364
, of size
365
k
366
367
a
368
×
369
k
370
371
b
372
(Fig. 2a,2b).
373
374
However, the values of d necessarily overestimate the
375
number of similar contiguous units preceding (
376
d
377
378
f
379
, illustration for stanza comparison in Fig. 2a) or
380
succeeding (
381
d
382
383
b
384
, illustration for stanza comparison in Fig. 2b) the
385
positions being compared. The value of
386
d would be the exact number of
387
contiguous similar units,
388
h, if the starting positions for
389
the similar segments where at a
390
n-distance of 1, e.g. if they were
391
in different corners of the unit hyper-dimensional USM
392
cube. Since the initial distance is always somewhat
393
smaller, the homology,
394
h, measured as the number of
395
consecutive similar units, will be smaller than
396
d (Eq. 5).
397
398
The contribution of φ to the similarity distance,
399
d, can be estimated from the
400
distribution of positions in the USM map of a random
401
sequence. A uniformly random sequence [ 3 19 20 ] will
402
occupy the USM space uniformly, and, for that matter, so
403
will the random seed of forward and backward iterative
404
mapping, respectively equations 1 and 2. Therefore, a
405
uniform distribution is an appropriated starting point to
406
estimate the effect of (
407
p, the over-determination of
408
h by
409
d (Eq.5). Accordingly, for a given
410
x∈ [0,1], the probability,
411
P
412
413
o
414
, that any two coordinates,
415
x
416
1 and
417
x
418
2 , are located within a radius
419
r ∈ (0,1) is given by Equation
420
6.
421
422
Since
423
P
424
425
o
426
(
427
r ) is the probability of two
428
points chosen randomly from a uniform distribution
429
Unif([0,1]) being at a distance
430
less than
431
r from each other, for any set of
432
n coordinates in the USM, the
433
likelihood of finding another position within a block
434
distance of
435
r would be described by raising
436
equation 6 to the
437
n exponent. Finally, recalling from
438
equation 3 that sequence similarity can be obtained by a
439
logarithmic transformation of r, the probability that the
440
unidirectional coordinates of two random sequences are at
441
a similar length
442
d > φ is described by equation
443
7. The simplicity of the expansion for higher dimensions
444
highlights the order-statistics properties [ 21 ] of the
445
n-metric introduced above (Eq. 3).
446
It is noteworthy that the model for the likelihood of
447
over-determination is the null-model, e.g. the comparison
448
of actual sequences is evaluated against the hypothesis
449
that the similarity observed happened by chance
450
alone.
451
452
Finally, it is also relevant to recall that the null
453
model for
454
d (Eq.7 for unidirectional
455
comparisons, bi-directional null models are derived
456
below) allows the generalization for non-integer
457
dimensions. For example, the 19 unique unites found in
458
the two stanzas (Table 1), define forward and backward
459
USM maps in 5 dimensions each. However the 5 thdimension
460
is not fully utilized, as that would require
461
2 5=
462
32 unique units. Therefore, if
463
there is no requirement for an integer result, the
464
effective value of
465
n for the two stanzas can be
466
refined as being
467
n = log
468
2
469
(19) = 4.25.
470
An estimation of bi-directional similarity will now be
471
introduced that adds the forward and backward distance
472
measures
473
d
474
475
f
476
and
477
d
478
479
b
480
. The motivation for this new estimate is the the
481
determination of the similar length of the entire similar
482
segment between two sequences solely by comparing any two
483
homologous units. Accordingly, since
484
d
485
486
f
487
is an estimate of preceding similarity and
488
d
489
490
b
491
provides the succeeding similarity equivalent the
492
sum of the two similar distances,
493
D, (Eq. 8) will estimate of the
494
bi-directional similarity, e.g. the length of the similar
495
segment,
496
H.
497
498
As illustrated later in the implementation, for
499
pairwise comparisons of homologous units of similar
500
segments, all values of
501
D and, consequently, of φ, are
502
exactly the same. This result could possibly have been
503
anticipated from the preceding work [ 12 ] by noting that
504
the value of
505
d between two adjacent homologous
506
units differs exactly by one unit. However, this result
507
was in fact a surprise and one with far reaching
508
fundamental and practical implications.
509
Similarly to unidirectional similarity estimation,
510
d, the bi-directional estimate,
511
D, being the sum of two
512
overestimates, is also overestimated by a quantity to be
513
defined, φ (Eq. 8). The derivation of an expression for
514
the bi-directional overestimation will require the
515
decomposition of
516
P
517
1 (Eq. 7) for two cases, comparison
518
between unidirectional coordinates of similar quadrants,
519
P
520
1
521
a , and of opposite quadrants,
522
523
P
524
1
525
b , as described in equation
526
9. Recalling from equation 2, positions in the same
527
quadrant correspond to sequence units with the same
528
identity, and positions in opposite quadrants correspond
529
to comparison between coordinates of units with a
530
different identity.
531
532
The need for the distinction between same and opposite
533
quadrant comparison, which is to say between similar and
534
between dissimilar sequence units, is caused by the fact
535
that same quadrant comparisons are more likely to lead to
536
higher values of
537
d. As illustrated above for the 16
538
thunit of the first stanza, the forward and backward
539
coordinates must fall in the same quadrant. Consequently,
540
the similar pattern of same and opposite quadrant
541
comparisons for each dimension will be reflected as a
542
bias in the bi-directional overestimation. The
543
determination of probability,
544
P
545
2 , of over-determination between sums
546
of independent unidirectional similarity estimates is
547
derived in equation 10.
548
549
The probability of bi-directional over-determination,
550
can now be established by using the same and opposite
551
unidirectional comparison expressions presented in
552
Equation 9. The resulting expression for similarity
553
over-determination by the distance between bidirectional
554
USM coordinates,
555
P
556
3 , is presented in equation 11.
557
558
In figure 3, the probability distribution for both
559
unidirectional (
560
P
561
1 , in gray) and bidirectional (
562
P
563
3 , in black) comparisons is
564
represented for different dimensions,
565
n. It is clearly apparent that the
566
over-determination becomes much less significant as
567
dimensionality increases. From a practical point of view,
568
the over-determination is of little consequence because
569
the computational load of comparing sequences corresponds
570
mostly to the identification of candidate pairing
571
combinations. The fact that the n-metric unidirectional
572
distances,
573
d
574
575
f
576
and
577
d
578
579
b
580
, defined in Equation 4, and bidirectional
581
D, defined in Eq. 8, are
582
over-determined implies that the identification of
583
similar segments between two sequences will include false
584
positives but will not generate false negatives. The
585
false positive identifications can be readily recognized
586
by comparing the sequences extracted from the
587
coordinates, as demonstrated above for the 16 thunit of
588
the first stanza. Nevertheless, since over-determination
589
will necessarily occur, its probability distribution was
590
identified (Eq. 11, Fig. 3). This can also be achieved
591
for individual values by solving Eq. 11 for the value of
592
φ observed. For example, for the conditions of the two
593
stanzas, the value of (φ
594
595
p 1 = 0.5,
596
n = 4.25 is 0.71 sequence units,
597
which is the expected median unidirectional
598
over-determination,
599
P
600
1 , of
601
d
602
603
f
604
and
605
d
606
607
b
608
(Eqs. 5, 7). The corresponding probability of
609
bi-directional overdetermination,
610
P
611
3 , should be somewhat above twice
612
that value. Using equation 11, the value obtained is 1.67
613
similar units. Finally, it is worthy to stress that the
614
expressions for calculation of likelihood of arbitrary
615
levels of over-determination (Eq. 5-11) can be inverted
616
to anticipate the level of over-determination for
617
arbitrary probability levels. This use of the null random
618
model is also included in the accompanying online tool
619
(see Abstract for URL).
620
621
622
623
Discussion
624
625
USM of biological sequences
626
The representation of biological information as
627
discrete sequences is dominated by the fact that genomes
628
are sequences of discrete units and so are the products
629
of its transcription and translation. However, not all
630
biological sequences are composed of units that are
631
functionally equally distinct from each other, as is the
632
case of proteomic data and Multi-locus sequence typing
633
[MLST, 4]. To avoid the issue of unit inequality and
634
highlight the general applicability of the USM procedure,
635
stanzas of a poem were used to illustrate the
636
implementation instead. Nevertheless the original
637
motivation of analyzing biological sequences is now
638
recalled.
639
In the preceding report the authors have illustrated
640
the properties of unidirectional
641
n-metric estimation of similarity
642
for the threonine operon of
643
E. coli [ 12 ] . The same two two
644
regions of
645
thrA and
646
thrB sequences of
647
E. coli K-12 MG1655 are compared in
648
Figure 5to highlight the advancement achieved by USM. It
649
should be recalled that the particular dimensionality of
650
DNA sequences,
651
n = 2, allows a very convenient
652
unidirectional bi-dimensional representation, which is in
653
fact the Chaos Game Representation procedure (CGR) [ 5 ]
654
. Consequently, CGR is a particular case of USM, obtained
655
when
656
n = 2 and only the forward
657
coordinates are determined. This can also be verified by
658
comparing Figure 5awith a similar representation reported
659
before [ 12 ] , obtained with the same data using CGR
660
[Fig. 10 of that report]. The advantageous properties of
661
full (bi-directional) USM become apparent when Fig. 5ais
662
compared with Fig. 5b. It is clearly apparent for
663
bi-directional USM (Fig. 5b) that all pair-wise
664
comparisons of units of identical segments now have the
665
same
666
D values. This coverts any
667
individual homologous pair-wise comparison into an
668
estimation of the length of the entire similar segment.
669
The conservation of statistical properties by the
670
distances obtained,
671
D, can also be confirmed by
672
comparing observed values with the corresponding null
673
models (Fig. 4). For the analysis of this figure it is
674
noteworthy to recall that the statistical properties of
675
prokaryote DNA are often undistinguishable from uniform
676
randomness [ 12 18 19 ] . The genomic sequence of the
677
first gene of the threonine operon of
678
E. coli, thrA, is compared with
679
that of the second,
680
thrB. The distribution of the
681
resulting
682
D values is represented in figure
683
4(solid black line), alongside with the null model for
684
that dimensionality (Eq. 11, with
685
n = log
686
2
687
(4) = 2 , gray dotted line). The
688
genomic sequences of
689
thrA and
690
thrB were translated into proteomic
691
sequences using SwissProt's on line translator, applied
692
to the 5'3' first frame
693
http://www.expasy.ch/tools/dna.html. Similarly, the
694
distribution of
695
D values for the comparison of the
696
proteomic
697
thrA and
698
thrB sequences is also represented
699
in Figure 4, alongside with the null model, Eq. 11, for
700
its dimensionality (
701
n = log
702
2 (
703
uu = 20 possible aminoacids) =
704
4.32 ), which is graphically nearly undistiguishable
705
from that of the comparison between the stanzas, with
706
n = log
707
2 (
708
uu = 19 possible letters) =
709
4.25 (dotted gray line for the rounded value,
710
n = 4.3). Both the genomic and the
711
proteomic distribution of
712
D values is observed to be
713
contained by the null model, unlike the comparison
714
between the stanzas discussed above, where the existence
715
of structure is clearly reflected by its distribution.
716
The genomic and proteomic of
717
thrA and
718
thraB, used to illustrate this
719
discussion, are provided with the web-based
720
implementation of USM (see Methods for URL).
721
722
723
724
Conclusions
725
The mounting quantity and complexity of biological
726
sequence data being produced [ 22 ] commands the
727
investigation of new approaches to sequence analysis. In
728
particular, the need for scale independent methodologies
729
becomes even more necessary as the limitations of
730
conventional Markov chains are increasingly noted [ 6 ] .
731
These limitations are bound to become overwhelming when
732
signals such as succession schemes of the expression of
733
over 30,000 human genes [ 23 ] become available. This
734
particular signal would be conveniently packaged within a
735
30 dimension USM unit block (
736
n = ceil(log
737
2 (
738
3 10 3) = 15).
739
In addition, the advances in statistical mechanics for
740
the study of complex systems, particularly in non-linear
741
dynamics, have not been fully utilizable for the analysis
742
of sequences due to the missing formal link between
743
discrete sequences and trajectories in continuous spaces.
744
The properties of USM reported above suggest that this may
745
indeed be such a bridge. For example, the embedding of
746
dimensions, a technique at the foundations of many
747
time-series analysis techniques offers a good example of
748
the completeness of USM representation of sequences. By
749
embedding the forward and backward coordinates separately,
750
at the relevant memory length, the resulting embedded USM
751
is exactly what would be obtained by applying USM technique
752
to the embedded dimeric sequence itself.
753
754
755
Methods
756
757
Computation
758
The algorithms described in this manuscript were coded
759
using MATLAB™ 6.0 language (Release 12), licensed by The
760
MathWorks Inc http://www.mathworks.com. An internet
761
interface was also developed to make them freely
762
accessible through user-friendly web-pages
763
http://bioinformatics.musc.edu/~jonas/usm/.
764
765
766
Source code
767
In order to facilitate the development of sequence
768
analysis applications based on the USM state space, the
769
software library of functions written to calculate the
770
USM coordinates is provided with the web-based
771
implementation (see address above). The code is provided
772
in MATLAB format, which is general enough as to be easily
773
ported into other environments. These functions process
774
sequences provided as text files in FASTA format. In
775
addition to the functions, the test datasets and a brief
776
readme.txt documentation file are also included.
777
778
779
Test data
780
The USM mapping proposed is applicable to any discrete
781
sequence, even if the primary goal is the analysis of
782
biological sequences. For ease of illustration and to
783
emphasize USM's general validity, the test dataset used
784
to describe implementation of the algorithm consists of
785
two stanzas of a Poem by Wendy Cope, "The Uncertainty of
786
the Poet" [ 14 ] . In the Discussion section, USM was
787
also applied to the DNA sequence of the threonine operon
788
of
789
Escherichia coli K-12 MG1655,
790
obtained from the University of Winsconsin
791
E. coli Genome Project
792
http:/www.genetics.wisc.edu, and to its 5'3' first frame
793
proteomic translation obtained by using SwissProt on line
794
translator http://www.expasy.ch/tools/dna.html. The three
795
test sequence datasets are also included in the web-based
796
USM application.
797
798
799
800
801
802