CoCalc -- 1471-2105-3-6.txt

OANC_GrAF / data / written_2 / technical / biomed / 1471-2105-3-6.txt
³⁹⁶⁷³ views
1

2
  
3
    
4
      
5
        Background
6
        For over a decade the idea of representing biological
7
        sequences in a continuous coordinate space has maintained
8
        its appeal but not been fully realized [ 1 2 3 ] . The
9
        basic idea is that sequences of symbols, such as
10
        nucleotides in genomes, aminoacids in proteomes, repeated
11
        sequences in MLST [Multi Locus Sequence Typing, 4], words
12
        in languages or letters in words, would define trajectories
13
        in this continuous space conserving the statistical
14
        properties of the original sequences [ 3 5 6 7 8 9 ] .
15
        Accordingly, the coordinate position of each unit would
16
        uniquely encode for both its identity and its context, i.e.
17
        the identity of its neighbors [ 10 ] . Ideally, the
18
        position should be scale-independent, such that the
19
        extraction of the encompassing sequence can be performed
20
        with any resolution, leading to an oligomer of arbitrary
21
        length. The pioneer work by Jeffrey published in 1990 [ 5 ]
22
        achieved this for genomic sequences by using the Chaos Game
23
        Representation technique (CGR), defining a unit-square
24
        where each corner corresponds to one of the 4 possible
25
        nucleotides. Subsequent work further explored the
26
        properties of CGR of biological sequences, but two main
27
        obstacles prevented the realization of its early promise -
28
        lack of scalability with regard to the number of possible
29
        unique units and inability to represent succession schemes.
30
        Meanwhile, Markov Chain theory already offered a solid
31
        foundation for the identification of discrete spaces to
32
        represent sequences as cross-tabulated conditional
33
        probabilities - Markov transition tables. This Bayesian
34
        technique is widely explored in bioinformatic applications
35
        seeking to measure homology and align sequences [ 11 ] . In
36
        a recent report [ 12 ] we have shown that, for genomic
37
        sequences, Markov tables are in fact a special case of CGR,
38
        contrary to what had been suggested previously [ 13 ] .
39
        This raised the prospect of an advantageous use of
40
        iterative maps as state spaces not only for representation
41
        of sequences but also to identify scale independent
42
        stochastic models of the succession scheme. That work [ 12
43
        ] is hereby extended and further generalized to be
44
        applicable to sequences with arbitrary numbers of unique
45
        component units, without sacrificing the inverse
46
        correlation between distance in the map and sequence
47
        similarity independent of position. Accordingly, the
48
        technique is named Universal Sequence Map (USM).
49
      
50
      
51
        Results
52
        
53
          Conceptual foundations
54
          The USM generalization proposed here is achieved by
55
          observing two stipulations: A-alternative units in the
56
          iterative map are positioned in distinct corners of 
57
          unit block structures', and B -
58
          sequence processing is bi-directional.
59
          Basis for USM generalization:
60
          A. Each unique unit is referenced in the map for
61
          positions that are at equal 
62
          n-distances from each other, and
63
          possibly, but not necessarily, defining a complete 
64
          block structure [ 3 ] . 
65
          n-distances are defined as the
66
          maximum distance along any dimension, e.g. 
67
          n-distance between [ 
68
          a 
69
          1 , 
70
          a 
71
          2 , ..., 
72
          a 
73
          
74
            n 
75
           ] and [ 
76
          b 
77
          1 , 
78
          b 
79
          2 , ..., 
80
          b 
81
          
82
            n 
83
           ] is 
84
          max(|b 
85
          1 - 
86
          a 
87
          1 |, | 
88
          b 
89
          2 - 
90
          a 
91
          2 |, ..., | 
92
          b 
93
          
94
            n 
95
           - 
96
          a 
97
          
98
            n 
99
           |), see also Equation 3. It will be shown that this
100
          stipulation leads to the definition of spaces where
101
          distance is inversely proportional to sequence
102
          similarity, independent of position. In this respect, USM
103
          departs from previous attempts to generalize Chaos Game
104
          Representation that conserve the bi-dimensionality of the
105
          original CGR representation [ 8 15 16 17 ] .
106
          B. The iterative positioning is performed in both
107
          directions. Therefore, there will be two sets of
108
          coordinates, the result of forward and backward iterative
109
          operations. It will be shown that, by adding backward and
110
          forward map distances between two positions, the number
111
          of identical units in the encompassing sequences can be
112
          extracted directly from the USM coordinates. As a
113
          consequence, two arbitrary positions can be compared, and
114
          the number of contiguous similar units is extracted by an
115
          algebraic operation that relies solely on the USM
116
          coordinates of those very two positions.
117
        
118
        
119
          Implementation of USM algorithm
120
          The algorithm will be first illustrated for the first
121
          and last stanzas of Wendy Cope's poem "The Uncertainty of
122
          the Poet" (14), respectively, "I am a poet. I am very
123
          fond of bananas." and "I am of very fond bananas. Am I a
124
          poet?". The procedure includes four steps:
125
          1. Identification of unique sequence units - e.g.
126
          these two stanzas have 19 unique characters, (table 1),
127
          i.e. 
128
          uu = 19. 
129
          2. Replacement of each unique unit (in this case units
130
          are alphabetic characters) by a unique binary number -
131
          e.g. in table 1each of the 19 unique units is replaced by
132
          its rank order minus one, represented as a binary number.
133
          Other arrangements are possible leading to the same final
134
          result as discussed below. The minimum number of
135
          dimensions necessary to accommodate 
136
          uu unique units, 
137
          n, is the upper integer of the
138
          length of its binary representation: 
139
          n = ceil(log 
140
          2 
141
          (uu)). For W. Cope's stanzas, 
142
          n = ceil(log 
143
          2 
144
          (19)) = 5. The binary reference
145
          coordinates for the unique units are defined by the
146
          numerals of the binary code - for example, 
147
          a will be assigned to the position 
148
          U 
149
          ' 
150
          a ' = 
151
          [0,0,1,0,1]. Each symbol is
152
          represented as a corner in a n-dimensional cube (Table
153
          1). The purpose of these first two steps is to guarantee
154
          that the reference positions for each unique sequence
155
          unit component are equidistant (stipulation A) in the 
156
          n-metric defined above. Any other
157
          procedure resulting in equidistant unique positions will
158
          lead to the same final results independently of the
159
          actual binary numbers used or the number of dimensions
160
          used to contain them.
161
          3. The CGR procedure [ 5 ] (Eq. 1) is applied
162
          independently to each coordinate, 
163
          j = 1,2, ...,n, for each unit, 
164
          i, in the sequence of length 
165
          k, u 
166
          
167
            j 
168
           
169
          (i) with 
170
          i = 1,2, ...,k, and starting with a
171
          random map position taken from a uniform distribution in
172
          [0,1] 
173
          n , i.e. 
174
          Unif([0,1] 
175
          n ). The random seed is not
176
          fundamentally different from using the middle position in
177
          the map as is conventional in CGR and it has the added
178
          feature that it prevents the invalidation of the inverse
179
          logarithmic proportionality of 
180
          n-distance to sequence similarity [
181
          12 ] for sequences that start or end with the same
182
          motif.
183
          For a sequence with 
184
          k units, the USM positions 
185
          i = 1,..., k for the 
186
          j = 1,..., n dimensions are
187
          determined as follows:
188
          
189
          4. The previous step generated 
190
          k positions in a 
191
          n -dimension space by processing
192
          the sequence forward (Eq. 1). This subsequent step adds
193
          an additional set of 
194
          n dimensions by implementing the
195
          same procedure backward (Eq. 2), again starting at random
196
          positions for each coordinate. Consequently the first 
197
          n dimensions of USM will be
198
          referred as defining 
199
          a forward map and the second set of
200
          
201
          n dimensions will define a 
202
          backward map. Put together, the
203
          bidirectional USM map defines a 
204
          2n-unit block structure. 
205
          The 
206
          n additional backward coordinates
207
          are determined as follows:
208
          
209
          The forward USM map for genomic sequences, where 
210
          uu = 4, and, consequently, 
211
          n = 2, is the same as the result
212
          generated by CGR. However, by freeing the iterative map
213
          from the dual-dimensional constraint of conventional CGR,
214
          the USM forward map alone achieved the goal of producing
215
          a scale independent representation of sequences of
216
          arbitrary number of unique units. These properties will
217
          be briefly illustrated with W Cope's example. The 16
218
          thunit of the first stanza, "I am a poet. I am very fond
219
          of bananas.", has USM coordinates 
220
          USM 
221
          
222
            [1,...,2n] 
223
           
224
          (16) = 
225
          [0.02 0.01 0.63 0.00 0.53 0.07 0.30
226
          0.52 0.27 0.57]. The first 
227
          n = 5 coordinates, the position in
228
          the forward map, can now be used, by reversing equation 1
229
          [ 12 13 ] , not only to extract the identity the unit 
230
          i = 16 but also the identity of the
231
          preceding units:
232
          - using forward coordinates alone 
233
          [0.0156 0.0138 0.6314 0.0001
234
          0.5338] 
235
          
236
          The same procedure can be applied to the remaining 
237
          n = 5 coordinates, the position in
238
          the backward map, to extract the identity of the
239
          succeeding units, now ordered backwards.
240
          - using backward coordinates alone 
241
          [0.0703 0.3004 0.5169 0.2742
242
          0.5652] 
243
          
244
          The length of the sequence that can be recovered from
245
          a position in the CGR or USM space is only as long as the
246
          resolution, in bits, of the coordinates themselves. In
247
          addition, the relevance of these iterative techniques is
248
          not associated with the property of recovering sequences
249
          as much as with the ability to recover the succession
250
          schemes, e.g. the Markov probability tables. It has been
251
          recognized for almost a decade that the density of
252
          positions in unidirectional, bi-dimensional, iterated CGR
253
          maps (e.g. of genomic sequences, 
254
          uu = 4 -> 
255
          n = 2 ) defines a Markov table [ 12
256
          13 ] . The complete accommodation of Markov chains in
257
          unidirectional USM (i.e. either forward or backward,
258
          which is an equivalent to a multidimensional solution for
259
          CGR) can be quickly established by noting that the
260
          identity of a quadrant is set by its middle coordinates [
261
          13 ] . In order to extract the Markov format, for an
262
          arbitrary integer order 
263
          ord, each of the two n-unit
264
          hypercubes, the set of 
265
          n forward or backward coordinates,
266
          would be divided in 
267
          q = 2 
268
          n.(ord+1) equal quadrants and the
269
          quadrant frequencies rearranged [ 12 ] . The use of 
270
          quadrant to designate what is in
271
          fact a sub-unit hypercube is a consonance with the
272
          preceding work on bidimensional CGR maps [ 12 ] , where
273
          it was shown that since any number of subdivisions can be
274
          considered in a continuous domain, the density
275
          distribution becomes an order-free Markov Table that
276
          accommodates both integer and fractal memory lengths. The
277
          extraction of Markov chain transition tables from USM
278
          representations, both forward and backward, is included
279
          in the accompanying web-based application (see
280
          Abstract).
281
          Above, the USM procedure was shown to allow for the
282
          representation of sequences as multidimensional objects
283
          without loss of identity or context. These objects can
284
          now be analyzed to characterize the sequences for
285
          quantities such as similarity between segments or entropy
286
          [ 18 19 ] within the sequence. In figure 1the
287
          10-dimensional object defined by the USM positions of the
288
          two stanzas was projected in 3-dimensions by principal
289
          component analysis. The dimensionality reduction by
290
          principal factor extraction has visualization purposes
291
          only. As established above, the minimum necessary
292
          dimensionality of the USM state space is set by the
293
          binary logarithm of the number of unique units.
294
          Nevertheless, the sequence variance associated with each
295
          component is provided in the figure legend. In figure 1a,
296
          the segments " 
297
          very fond of" in the two stanzas
298
          are linked by solid lines to highlight the fact that
299
          sequence similarity is reflected by spatial proximity of
300
          USM coordinates. The representation is repeated in Figure
301
          1bwith solid lining of the segment " 
302
          bananas ". The matching of the two
303
          segments of the second stanza (light) to the similar
304
          segments of the first stanza (dark) is, again, visually
305
          apparent.
306
          The USM algorithm determines that similar sequences,
307
          or segments of sequences, will have converging iterated
308
          trajectories: the distance will be cut in half for every
309
          consecutive similar unit. This property was notice before
310
          for CGR of genomic sequences [ 12 ] , and will be further
311
          explored here for USM generalization. In that preceding
312
          work it was shown that the number of similar consecutive
313
          units can be approximated by a symmetrical logarithmic
314
          transformation of the maximum distance between two
315
          positions in either of the dimensions ( 
316
          n-distance ), 
317
          d. 
318
          
319
          d = -log 
320
          2 (Max|Δ 
321
          USM 
322
          
323
            undirectional 
324
           |)     (Eq. 3)
325
          Since the USM coordinates include two CGR iterations
326
          per dimension, one forward and another backward, two
327
          distances can be extracted. The first 1,..., 
328
          n coordinates define a forward
329
          similarity estimate, 
330
          d 
331
          
332
            f 
333
           , and the second 
334
          n+1,..., 2n coordinates can be used
335
          to estimate backward similarity, 
336
          d 
337
          
338
            b 
339
           . The former measures similarity with regard to the
340
          units preceding the one being compared and the latter
341
          does the same for those the succeeding that same units.
342
          Therefore, the forward and backward distances between the
343
          positions 
344
          i and 
345
          j of two sequences, 
346
          a and 
347
          b, with a length of 
348
          k 
349
          
350
            a 
351
           and 
352
          k 
353
          
354
            b 
355
           , respectively, would be calculated as described by
356
          equation 4, defining two rectangular matrices, 
357
          d 
358
          
359
            f 
360
           and 
361
          d 
362
          
363
            b 
364
           , of size 
365
          k 
366
          
367
            a 
368
           × 
369
          k 
370
          
371
            b 
372
           (Fig. 2a,2b).
373
          
374
          However, the values of d necessarily overestimate the
375
          number of similar contiguous units preceding ( 
376
          d 
377
          
378
            f 
379
           , illustration for stanza comparison in Fig. 2a) or
380
          succeeding ( 
381
          d 
382
          
383
            b 
384
           , illustration for stanza comparison in Fig. 2b) the
385
          positions being compared. The value of 
386
          d would be the exact number of
387
          contiguous similar units, 
388
          h, if the starting positions for
389
          the similar segments where at a 
390
          n-distance of 1, e.g. if they were
391
          in different corners of the unit hyper-dimensional USM
392
          cube. Since the initial distance is always somewhat
393
          smaller, the homology, 
394
          h, measured as the number of
395
          consecutive similar units, will be smaller than 
396
          d (Eq. 5).
397
          
398
          The contribution of φ to the similarity distance, 
399
          d, can be estimated from the
400
          distribution of positions in the USM map of a random
401
          sequence. A uniformly random sequence [ 3 19 20 ] will
402
          occupy the USM space uniformly, and, for that matter, so
403
          will the random seed of forward and backward iterative
404
          mapping, respectively equations 1 and 2. Therefore, a
405
          uniform distribution is an appropriated starting point to
406
          estimate the effect of ( 
407
          p, the over-determination of 
408
          h by 
409
          d (Eq.5). Accordingly, for a given 
410
          x∈ [0,1], the probability, 
411
          P 
412
          
413
            o 
414
           , that any two coordinates, 
415
          x 
416
          1 and 
417
          x 
418
          2 , are located within a radius 
419
          r ∈ (0,1) is given by Equation
420
          6.
421
          
422
          Since 
423
          P 
424
          
425
            o 
426
           ( 
427
          r ) is the probability of two
428
          points chosen randomly from a uniform distribution 
429
          Unif([0,1]) being at a distance
430
          less than 
431
          r from each other, for any set of 
432
          n coordinates in the USM, the
433
          likelihood of finding another position within a block
434
          distance of 
435
          r would be described by raising
436
          equation 6 to the 
437
          n exponent. Finally, recalling from
438
          equation 3 that sequence similarity can be obtained by a
439
          logarithmic transformation of r, the probability that the
440
          unidirectional coordinates of two random sequences are at
441
          a similar length 
442
          d > φ is described by equation
443
          7. The simplicity of the expansion for higher dimensions
444
          highlights the order-statistics properties [ 21 ] of the 
445
          n-metric introduced above (Eq. 3).
446
          It is noteworthy that the model for the likelihood of
447
          over-determination is the null-model, e.g. the comparison
448
          of actual sequences is evaluated against the hypothesis
449
          that the similarity observed happened by chance
450
          alone.
451
          
452
          Finally, it is also relevant to recall that the null
453
          model for 
454
          d (Eq.7 for unidirectional
455
          comparisons, bi-directional null models are derived
456
          below) allows the generalization for non-integer
457
          dimensions. For example, the 19 unique unites found in
458
          the two stanzas (Table 1), define forward and backward
459
          USM maps in 5 dimensions each. However the 5 thdimension
460
          is not fully utilized, as that would require 
461
          2 5= 
462
          32 unique units. Therefore, if
463
          there is no requirement for an integer result, the
464
          effective value of 
465
          n for the two stanzas can be
466
          refined as being 
467
          n = log 
468
          2 
469
          (19) = 4.25. 
470
          An estimation of bi-directional similarity will now be
471
          introduced that adds the forward and backward distance
472
          measures 
473
          d 
474
          
475
            f 
476
           and 
477
          d 
478
          
479
            b 
480
           . The motivation for this new estimate is the the
481
          determination of the similar length of the entire similar
482
          segment between two sequences solely by comparing any two
483
          homologous units. Accordingly, since 
484
          d 
485
          
486
            f 
487
           is an estimate of preceding similarity and 
488
          d 
489
          
490
            b 
491
           provides the succeeding similarity equivalent the
492
          sum of the two similar distances, 
493
          D, (Eq. 8) will estimate of the
494
          bi-directional similarity, e.g. the length of the similar
495
          segment, 
496
          H. 
497
          
498
          As illustrated later in the implementation, for
499
          pairwise comparisons of homologous units of similar
500
          segments, all values of 
501
          D and, consequently, of φ, are
502
          exactly the same. This result could possibly have been
503
          anticipated from the preceding work [ 12 ] by noting that
504
          the value of 
505
          d between two adjacent homologous
506
          units differs exactly by one unit. However, this result
507
          was in fact a surprise and one with far reaching
508
          fundamental and practical implications.
509
          Similarly to unidirectional similarity estimation, 
510
          d, the bi-directional estimate, 
511
          D, being the sum of two
512
          overestimates, is also overestimated by a quantity to be
513
          defined, φ (Eq. 8). The derivation of an expression for
514
          the bi-directional overestimation will require the
515
          decomposition of 
516
          P 
517
          1 (Eq. 7) for two cases, comparison
518
          between unidirectional coordinates of similar quadrants, 
519
          P 
520
          1 
521
          a  , and of opposite quadrants,
522
          
523
          P 
524
          1 
525
          b  , as described in equation
526
          9. Recalling from equation 2, positions in the same
527
          quadrant correspond to sequence units with the same
528
          identity, and positions in opposite quadrants correspond
529
          to comparison between coordinates of units with a
530
          different identity.
531
          
532
          The need for the distinction between same and opposite
533
          quadrant comparison, which is to say between similar and
534
          between dissimilar sequence units, is caused by the fact
535
          that same quadrant comparisons are more likely to lead to
536
          higher values of 
537
          d. As illustrated above for the 16
538
          thunit of the first stanza, the forward and backward
539
          coordinates must fall in the same quadrant. Consequently,
540
          the similar pattern of same and opposite quadrant
541
          comparisons for each dimension will be reflected as a
542
          bias in the bi-directional overestimation. The
543
          determination of probability, 
544
          P 
545
          2 , of over-determination between sums
546
          of independent unidirectional similarity estimates is
547
          derived in equation 10.
548
          
549
          The probability of bi-directional over-determination,
550
          can now be established by using the same and opposite
551
          unidirectional comparison expressions presented in
552
          Equation 9. The resulting expression for similarity
553
          over-determination by the distance between bidirectional
554
          USM coordinates, 
555
          P 
556
          3 , is presented in equation 11.
557
          
558
          In figure 3, the probability distribution for both
559
          unidirectional ( 
560
          P 
561
          1 , in gray) and bidirectional ( 
562
          P 
563
          3 , in black) comparisons is
564
          represented for different dimensions, 
565
          n. It is clearly apparent that the
566
          over-determination becomes much less significant as
567
          dimensionality increases. From a practical point of view,
568
          the over-determination is of little consequence because
569
          the computational load of comparing sequences corresponds
570
          mostly to the identification of candidate pairing
571
          combinations. The fact that the n-metric unidirectional
572
          distances, 
573
          d 
574
          
575
            f 
576
           and 
577
          d 
578
          
579
            b 
580
           , defined in Equation 4, and bidirectional 
581
          D, defined in Eq. 8, are
582
          over-determined implies that the identification of
583
          similar segments between two sequences will include false
584
          positives but will not generate false negatives. The
585
          false positive identifications can be readily recognized
586
          by comparing the sequences extracted from the
587
          coordinates, as demonstrated above for the 16 thunit of
588
          the first stanza. Nevertheless, since over-determination
589
          will necessarily occur, its probability distribution was
590
          identified (Eq. 11, Fig. 3). This can also be achieved
591
          for individual values by solving Eq. 11 for the value of
592
          φ observed. For example, for the conditions of the two
593
          stanzas, the value of (φ 
594
          
595
          p 1 = 0.5, 
596
          n = 4.25 is 0.71 sequence units,
597
          which is the expected median unidirectional
598
          over-determination, 
599
          P 
600
          1 , of 
601
          d 
602
          
603
            f 
604
           and 
605
          d 
606
          
607
            b 
608
           (Eqs. 5, 7). The corresponding probability of
609
          bi-directional overdetermination, 
610
          P 
611
          3 , should be somewhat above twice
612
          that value. Using equation 11, the value obtained is 1.67
613
          similar units. Finally, it is worthy to stress that the
614
          expressions for calculation of likelihood of arbitrary
615
          levels of over-determination (Eq. 5-11) can be inverted
616
          to anticipate the level of over-determination for
617
          arbitrary probability levels. This use of the null random
618
          model is also included in the accompanying online tool
619
          (see Abstract for URL).
620
        
621
      
622
      
623
        Discussion
624
        
625
          USM of biological sequences
626
          The representation of biological information as
627
          discrete sequences is dominated by the fact that genomes
628
          are sequences of discrete units and so are the products
629
          of its transcription and translation. However, not all
630
          biological sequences are composed of units that are
631
          functionally equally distinct from each other, as is the
632
          case of proteomic data and Multi-locus sequence typing
633
          [MLST, 4]. To avoid the issue of unit inequality and
634
          highlight the general applicability of the USM procedure,
635
          stanzas of a poem were used to illustrate the
636
          implementation instead. Nevertheless the original
637
          motivation of analyzing biological sequences is now
638
          recalled.
639
          In the preceding report the authors have illustrated
640
          the properties of unidirectional 
641
          n-metric estimation of similarity
642
          for the threonine operon of 
643
          E. coli [ 12 ] . The same two two
644
          regions of 
645
          thrA and 
646
          thrB sequences of 
647
          E. coli K-12 MG1655 are compared in
648
          Figure 5to highlight the advancement achieved by USM. It
649
          should be recalled that the particular dimensionality of
650
          DNA sequences, 
651
          n = 2, allows a very convenient
652
          unidirectional bi-dimensional representation, which is in
653
          fact the Chaos Game Representation procedure (CGR) [ 5 ]
654
          . Consequently, CGR is a particular case of USM, obtained
655
          when 
656
          n = 2 and only the forward
657
          coordinates are determined. This can also be verified by
658
          comparing Figure 5awith a similar representation reported
659
          before [ 12 ] , obtained with the same data using CGR
660
          [Fig. 10 of that report]. The advantageous properties of
661
          full (bi-directional) USM become apparent when Fig. 5ais
662
          compared with Fig. 5b. It is clearly apparent for
663
          bi-directional USM (Fig. 5b) that all pair-wise
664
          comparisons of units of identical segments now have the
665
          same 
666
          D values. This coverts any
667
          individual homologous pair-wise comparison into an
668
          estimation of the length of the entire similar segment.
669
          The conservation of statistical properties by the
670
          distances obtained, 
671
          D, can also be confirmed by
672
          comparing observed values with the corresponding null
673
          models (Fig. 4). For the analysis of this figure it is
674
          noteworthy to recall that the statistical properties of
675
          prokaryote DNA are often undistinguishable from uniform
676
          randomness [ 12 18 19 ] . The genomic sequence of the
677
          first gene of the threonine operon of 
678
          E. coli, thrA, is compared with
679
          that of the second, 
680
          thrB. The distribution of the
681
          resulting 
682
          D values is represented in figure
683
          4(solid black line), alongside with the null model for
684
          that dimensionality (Eq. 11, with 
685
          n = log 
686
          2 
687
          (4) = 2 , gray dotted line). The
688
          genomic sequences of 
689
          thrA and 
690
          thrB were translated into proteomic
691
          sequences using SwissProt's on line translator, applied
692
          to the 5'3' first frame
693
          http://www.expasy.ch/tools/dna.html. Similarly, the
694
          distribution of 
695
          D values for the comparison of the
696
          proteomic 
697
          thrA and 
698
          thrB sequences is also represented
699
          in Figure 4, alongside with the null model, Eq. 11, for
700
          its dimensionality ( 
701
          n = log 
702
          2 ( 
703
          uu = 20 possible aminoacids) =
704
          4.32 ), which is graphically nearly undistiguishable
705
          from that of the comparison between the stanzas, with 
706
          n = log 
707
          2 ( 
708
          uu = 19 possible letters) =
709
          4.25 (dotted gray line for the rounded value, 
710
          n = 4.3). Both the genomic and the
711
          proteomic distribution of 
712
          D values is observed to be
713
          contained by the null model, unlike the comparison
714
          between the stanzas discussed above, where the existence
715
          of structure is clearly reflected by its distribution.
716
          The genomic and proteomic of 
717
          thrA and 
718
          thraB, used to illustrate this
719
          discussion, are provided with the web-based
720
          implementation of USM (see Methods for URL).
721
        
722
      
723
      
724
        Conclusions
725
        The mounting quantity and complexity of biological
726
        sequence data being produced [ 22 ] commands the
727
        investigation of new approaches to sequence analysis. In
728
        particular, the need for scale independent methodologies
729
        becomes even more necessary as the limitations of
730
        conventional Markov chains are increasingly noted [ 6 ] .
731
        These limitations are bound to become overwhelming when
732
        signals such as succession schemes of the expression of
733
        over 30,000 human genes [ 23 ] become available. This
734
        particular signal would be conveniently packaged within a
735
        30 dimension USM unit block ( 
736
        n = ceil(log 
737
        2 ( 
738
        3 10 3) = 15).
739
        In addition, the advances in statistical mechanics for
740
        the study of complex systems, particularly in non-linear
741
        dynamics, have not been fully utilizable for the analysis
742
        of sequences due to the missing formal link between
743
        discrete sequences and trajectories in continuous spaces.
744
        The properties of USM reported above suggest that this may
745
        indeed be such a bridge. For example, the embedding of
746
        dimensions, a technique at the foundations of many
747
        time-series analysis techniques offers a good example of
748
        the completeness of USM representation of sequences. By
749
        embedding the forward and backward coordinates separately,
750
        at the relevant memory length, the resulting embedded USM
751
        is exactly what would be obtained by applying USM technique
752
        to the embedded dimeric sequence itself.
753
      
754
      
755
        Methods
756
        
757
          Computation
758
          The algorithms described in this manuscript were coded
759
          using MATLAB™ 6.0 language (Release 12), licensed by The
760
          MathWorks Inc http://www.mathworks.com. An internet
761
          interface was also developed to make them freely
762
          accessible through user-friendly web-pages
763
          http://bioinformatics.musc.edu/~jonas/usm/.
764
        
765
        
766
          Source code
767
          In order to facilitate the development of sequence
768
          analysis applications based on the USM state space, the
769
          software library of functions written to calculate the
770
          USM coordinates is provided with the web-based
771
          implementation (see address above). The code is provided
772
          in MATLAB format, which is general enough as to be easily
773
          ported into other environments. These functions process
774
          sequences provided as text files in FASTA format. In
775
          addition to the functions, the test datasets and a brief
776
          readme.txt documentation file are also included.
777
        
778
        
779
          Test data
780
          The USM mapping proposed is applicable to any discrete
781
          sequence, even if the primary goal is the analysis of
782
          biological sequences. For ease of illustration and to
783
          emphasize USM's general validity, the test dataset used
784
          to describe implementation of the algorithm consists of
785
          two stanzas of a Poem by Wendy Cope, "The Uncertainty of
786
          the Poet" [ 14 ] . In the Discussion section, USM was
787
          also applied to the DNA sequence of the threonine operon
788
          of 
789
          Escherichia coli K-12 MG1655,
790
          obtained from the University of Winsconsin 
791
          E. coli Genome Project
792
          http:/www.genetics.wisc.edu, and to its 5'3' first frame
793
          proteomic translation obtained by using SwissProt on line
794
          translator http://www.expasy.ch/tools/dna.html. The three
795
          test sequence datasets are also included in the web-based
796
          USM application.
797
        
798
      
799
    
800
  
801

802
Product

Resources

Company