CoCalc -- stringlib_find_two_way

GitHub Repository: allendowney/cpython
Path: blob/main/Objects/stringlib/stringlib_find_two_way_notes.txt
¹² views
1
This document explains Crochemore and Perrin's Two-Way string matching
2
algorithm, in which a smaller string (the "pattern" or "needle")
3
is searched for in a longer string (the "text" or "haystack"),
4
determining whether the needle is a substring of the haystack, and if
5
so, at what index(es). It is to be used by Python's string
6
(and bytes-like) objects when calling `find`, `index`, `__contains__`,
7
or implicitly in methods like `replace` or `partition`.
8

9
This is essentially a re-telling of the paper
10

11
    Crochemore M., Perrin D., 1991, Two-way string-matching,
12
        Journal of the ACM 38(3):651-675.
13

14
focused more on understanding and examples than on rigor. See also
15
the code sample here:
16

17
    http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
18

19
The algorithm runs in O(len(needle) + len(haystack)) time and with
20
O(1) space. However, since there is a larger preprocessing cost than
21
simpler algorithms, this Two-Way algorithm is to be used only when the
22
needle and haystack lengths meet certain thresholds.
23

24

25
These are the basic steps of the algorithm:
26

27
    * "Very carefully" cut the needle in two.
28
    * For each alignment attempted:
29
        1. match the right part
30
            * On failure, jump by the amount matched + 1
31
        2. then match the left part.
32
            * On failure jump by max(len(left), len(right)) + 1
33
    * If the needle is periodic, don't re-do comparisons; maintain
34
      a "memory" of how many characters you already know match.
35

36

37
-------- Matching the right part --------
38

39
We first scan the right part of the needle to check if it matches the
40
the aligned characters in the haystack. We scan left-to-right,
41
and if a mismatch occurs, we jump ahead by the amount matched plus 1.
42

43
Example:
44

45
       text:    ........EFGX...................
46
    pattern:    ....abcdEFGH....
47
        cut:        <<<<>>>>
48

49
Matched 3, so jump ahead by 4:
50

51
       text:    ........EFGX...................
52
    pattern:        ....abcdEFGH....
53
        cut:            <<<<>>>>
54

55
Why are we allowed to do this? Because we cut the needle very
56
carefully, in such a way that if the cut is ...abcd + EFGH... then
57
we have
58

59
        d != E
60
       cd != EF
61
      bcd != EFG
62
     abcd != EFGH
63
          ... and so on.
64

65
If this is true for every pair of equal-length substrings around the
66
cut, then the following alignments do not work, so we can skip them:
67

68
       text:    ........EFG....................
69
    pattern:     ....abcdEFGH....
70
                        ^   (Bad because d != E)
71
       text:    ........EFG....................
72
    pattern:      ....abcdEFGH....
73
                        ^^   (Bad because cd != EF)
74
       text:    ........EFG....................
75
    pattern:       ....abcdEFGH....
76
                        ^^^   (Bad because bcd != EFG)
77

78
Skip 3 alignments => increment alignment by 4.
79

80

81
-------- If len(left_part) < len(right_part) --------
82

83
Above is the core idea, and it begins to suggest how the algorithm can
84
be linear-time. There is one bit of subtlety involving what to do
85
around the end of the needle: if the left half is shorter than the
86
right, then we could run into something like this:
87

88
       text:    .....EFG......
89
    pattern:       cdEFGH
90

91
The same argument holds that we can skip ahead by 4, so long as
92

93
       d != E
94
      cd != EF
95
     ?cd != EFG
96
    ??cd != EFGH
97
         etc.
98

99
The question marks represent "wildcards" that always match; they're
100
outside the limits of the needle, so there's no way for them to
101
invalidate a match. To ensure that the inequalities above are always
102
true, we need them to be true for all possible '?' values. We thus
103
need cd != FG and cd != GH, etc.
104

105

106
-------- Matching the left part --------
107

108
Once we have ensured the right part matches, we scan the left part
109
(order doesn't matter, but traditionally right-to-left), and if we
110
find a mismatch, we jump ahead by
111
max(len(left_part), len(right_part)) + 1. That we can jump by
112
at least len(right_part) + 1 we have already seen:
113

114
       text: .....EFG.....
115
    pattern:  abcdEFG
116
    Matched 3, so jump by 4,
117
    using the fact that d != E, cd != EF, and bcd != EFG.
118

119
But we can also jump by at least len(left_part) + 1:
120

121
       text: ....cdEF.....
122
    pattern:   abcdEF
123
    Jump by len('abcd') + 1 = 5.
124

125
    Skip the alignments:
126
       text: ....cdEF.....
127
    pattern:    abcdEF
128
       text: ....cdEF.....
129
    pattern:     abcdEF
130
       text: ....cdEF.....
131
    pattern:      abcdEF
132
       text: ....cdEF.....
133
    pattern:       abcdEF
134

135
This requires the following facts:
136
       d != E
137
      cd != EF
138
     bcd != EF?
139
    abcd != EF??
140
         etc., for all values of ?s, as above.
141

142
If we have both sets of inequalities, then we can indeed jump by
143
max(len(left_part), len(right_part)) + 1. Under the assumption of such
144
a nice splitting of the needle, we now have enough to prove linear
145
time for the search: consider the forward-progress/comparisons ratio
146
at each alignment position. If a mismatch occurs in the right part,
147
the ratio is 1 position forward per comparison. On the other hand,
148
if a mismatch occurs in the left half, we advance by more than
149
len(needle)//2 positions for at most len(needle) comparisons,
150
so this ratio is more than 1/2. This average "movement speed" is
151
bounded below by the constant "1 position per 2 comparisons", so we
152
have linear time.
153

154

155
-------- The periodic case --------
156

157
The sets of inequalities listed so far seem too good to be true in
158
the general case. Indeed, they fail when a needle is periodic:
159
there's no way to split 'AAbAAbAAbA' in two such that
160

161
    (the stuff n characters to the left of the split)
162
    cannot equal
163
    (the stuff n characters to the right of the split)
164
    for all n.
165

166
This is because no matter how you cut it, you'll get
167
s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
168
needle in two so that n can be as big as possible. If we were to
169
split it as
170

171
    AAbA + AbAAbA
172

173
then A == A at the split, so this is bad (we failed at length 1), but
174
if we split it as
175

176
    AA + bAAbAAbA
177

178
we at least have A != b and AA != bA, and we fail at length 3
179
since ?AA == bAA. We already knew that a cut to make length-3
180
mismatch was impossible due to the period, but we now see that the
181
bound is sharp; we can get length-1 and length-2 to mismatch.
182

183
This is exactly the content of the *critical factorization theorem*:
184
that no matter the period of the original needle, you can cut it in
185
such a way that (with the appropriate question marks),
186
needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.
187

188
Even "non-periodic" strings are periodic with a period equal to
189
their length, so for such needles, the CFT already guarantees that
190
the algorithm described so far will work, since we can cut the needle
191
so that the length-k chunks on either side of the cut mismatch for all
192
k < len(needle). Looking closer at the algorithm, we only actually
193
require that k go up to max(len(left_part), len(right_part)).
194
So long as the period exceeds that, we're good.
195

196
The more general shorter-period case is a bit harder. The essentials
197
are the same, except we use the periodicity to our advantage by
198
"remembering" periods that we've already compared. In our running
199
example, say we're computing
200

201
    "AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".
202

203
We cut as AA + bAAbAAbA, and then the algorithm runs as follows:
204

205
    First alignment:
206
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
207
    AAbAAbAAbA
208
      ^^X
209
    - Mismatch at third position, so jump by 3.
210
    - This requires that A!=b and AA != bA.
211

212
    Second alignment:
213
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
214
       AAbAAbAAbA
215
         ^^^^^^^^
216
        X
217
    - Matched entire right part
218
    - Mismatch at left part.
219
    - Jump forward a period, remembering the existing comparisons
220

221
    Third alignment:
222
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
223
          AAbAAbAAbA
224
          mmmmmmm^^X
225
    - There's "memory": a bunch of characters were already matched.
226
    - Two more characters match beyond that.
227
    - The 8th character of the right part mismatched, so jump by 8
228
    - The above rule is more complicated than usual: we don't have
229
      the right inequalities for lengths 1 through 7, but we do have
230
      shifted copies of the length-1 and length-2 inequalities,
231
      along with knowledge of the mismatch. We can skip all of these
232
      alignments at once:
233

234
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
235
               AAbAAbAAbA
236
                ~                   A != b at the cut
237
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
238
                AAbAAbAAbA
239
                ~~                  AA != bA at the cut
240
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
241
                 AAbAAbAAbA
242
                   ^^^^X            7-3=4 match, and the 5th misses.
243
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
244
                  AAbAAbAAbA
245
                   ~                A != b at the cut
246
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
247
                   AAbAAbAAbA
248
                   ~~               AA != bA at the cut
249
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
250
                    AAbAAbAAbA
251
                      ^X            7-3-3=1 match and the 2nd misses.
252
        bbbAbbAAbAAbAAbbbAAbAAbAAbAA
253
                     AAbAAbAAbA
254
                      ~             A != b at the cut
255

256
    Fourth alignment:
257
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
258
                 AAbAAbAAbA
259
                   ^X
260
    - Second character mismatches, so jump by 2.
261

262
    Fifth alignment:
263
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
264
                  AAbAAbAAbA
265
                    ^^^^^^^^
266
                   X
267
    - Right half matches, so use memory and skip ahead by period=3
268

269
    Sixth alignment:
270
    bbbAbbAAbAAbAAbbbAAbAAbAAbAA
271
                     AAbAAbAAbA
272
                     mmmmmmmm^^
273
    - Right part matches, left part is remembered, found a match!
274

275
The one tricky skip by 8 here generalizes: if we have a period of p,
276
then the CFT says we can ensure the cut has the inequality property
277
for lengths 1 through p-1, and jumping by p would line up the
278
matching characters and mismatched character one period earlier.
279
Inductively, this proves that we can skip by the number of characters
280
matched in the right half, plus 1, just as in the original algorithm.
281

282
To make it explicit, the memory is set whenever the entire right part
283
is matched and is then used as a starting point in the next alignment.
284
In such a case, the alignment jumps forward one period, and the right
285
half matches all except possibly the last period. Additionally,
286
if we cut so that the left part has a length strictly less than the
287
period (we always can!), then we can know that the left part already
288
matches. The memory is reset to 0 whenever there is a mismatch in the
289
right part.
290

291
To prove linearity for the periodic case, note that if a right-part
292
character mismatches, then we advance forward 1 unit per comparison.
293
On the other hand, if the entire right part matches, then the skipping
294
forward by one period "defers" some of the comparisons to the next
295
alignment, where they will then be spent at the usual rate of
296
one comparison per step forward. Even if left-half comparisons
297
are always "wasted", they constitute less than half of all
298
comparisons, so the average rate is certainly at least 1 move forward
299
per 2 comparisons.
300

301

302
-------- When to choose the periodic algorithm ---------
303

304
The periodic algorithm is always valid but has an overhead of one
305
more "memory" register and some memory computation steps, so the
306
here-described-first non-periodic/long-period algorithm -- skipping by
307
max(len(left_part), len(right_part)) + 1 rather than the period --
308
should be preferred when possible.
309

310
Interestingly, the long-period algorithm does not require an exact
311
computation of the period; it works even with some long-period, but
312
undeniably "periodic" needles:
313

314
    Cut: AbcdefAbc == Abcde + fAbc
315

316
This cut gives these inequalities:
317

318
                 e != f
319
                de != fA
320
               cde != fAb
321
              bcde != fAbc
322
             Abcde != fAbc?
323
    The first failure is a period long, per the CFT:
324
            ?Abcde == fAbc??
325

326
A sufficient condition for using the long-period algorithm is having
327
the period of the needle be greater than
328
max(len(left_part), len(right_part)). This way, after choosing a good
329
split, we get all of the max(len(left_part), len(right_part))
330
inequalities around the cut that were required in the long-period
331
version of the algorithm.
332

333
With all of this in mind, here's how we choose:
334

335
    (1) Choose a "critical factorization" of the needle -- a cut
336
        where we have period minus 1 inequalities in a row.
337
        More specifically, choose a cut so that the left_part
338
        is less than one period long.
339
    (2) Determine the period P_r of the right_part.
340
    (3) Check if the left part is just an extension of the pattern of
341
        the right part, so that the whole needle has period P_r.
342
        Explicitly, check if
343
            needle[0:cut] == needle[0+P_r:cut+P_r]
344
        If so, we use the periodic algorithm. If not equal, we use the
345
        long-period algorithm.
346

347
Note that if equality holds in (3), then the period of the whole
348
string is P_r. On the other hand, suppose equality does not hold.
349
The period of the needle is then strictly greater than P_r. Here's
350
a general fact:
351

352
    If p is a substring of s and p has period r, then the period
353
    of s is either equal to r or greater than len(p).
354

355
We know that needle_period != P_r,
356
and therefore needle_period > len(right_part).
357
Additionally, we'll choose the cut (see below)
358
so that len(left_part) < needle_period.
359

360
Thus, in the case where equality does not hold, we have that
361
needle_period >= max(len(left_part), len(right_part)) + 1,
362
so the long-period algorithm works, but otherwise, we know the period
363
of the needle.
364

365
Note that this decision process doesn't always require an exact
366
computation of the period -- we can get away with only computing P_r!
367

368

369
-------- Computing the cut --------
370

371
Our remaining tasks are now to compute a cut of the needle with as
372
many inequalities as possible, ensuring that cut < needle_period.
373
Meanwhile, we must also compute the period P_r of the right_part.
374

375
The computation is relatively simple, essentially doing this:
376

377
    suffix1 = max(needle[i:] for i in range(len(needle)))
378
    suffix2 = ... # the same as above, but invert the alphabet
379
    cut1 = len(needle) - len(suffix1)
380
    cut2 = len(needle) - len(suffix2)
381
    cut = max(cut1, cut2) # the later cut
382

383
For cut2, "invert the alphabet" is different than saying min(...),
384
since in lexicographic order, we still put "py" < "python", even
385
if the alphabet is inverted. Computing these, along with the method
386
of computing the period of the right half, is easiest to read directly
387
from the source code in fastsearch.h, in which these are computed
388
in linear time.
389

390
Crochemore & Perrin's Theorem 3.1 give that "cut" above is a
391
critical factorization less than the period, but a very brief sketch
392
of their proof goes something like this (this is far from complete):
393

394
    * If this cut splits the needle as some
395
      needle == (a + w) + (w + b), meaning there's a bad equality
396
      w == w, it's impossible for w + b to be bigger than both
397
      b and w + w + b, so this can't happen. We thus have all of
398
      the inequalities with no question marks.
399
    * By maximality, the right part is not a substring of the left
400
      part. Thus, we have all of the inequalities involving no
401
      left-side question marks.
402
    * If you have all of the inequalities without right-side question
403
      marks, we have a critical factorization.
404
    * If one such inequality fails, then there's a smaller period,
405
      but the factorization is nonetheless critical. Here's where
406
      you need the redundancy coming from computing both cuts and
407
      choosing the later one.
408

409

410
-------- Some more Bells and Whistles --------
411

412
Beyond Crochemore & Perrin's original algorithm, we can use a couple
413
more tricks for speed in fastsearch.h:
414

415
    1. Even though C&P has a best-case O(n/m) time, this doesn't occur
416
       very often, so we add a Boyer-Moore bad character table to
417
       achieve sublinear time in more cases.
418

419
    2. The prework of computing the cut/period is expensive per
420
       needle character, so we shouldn't do it if it won't pay off.
421
       For this reason, if the needle and haystack are long enough,
422
       only automatically start with two-way if the needle's length
423
       is a small percentage of the length of the haystack.
424

425
    3. In cases where the needle and haystack are large but the needle
426
       makes up a significant percentage of the length of the
427
       haystack, don't pay the expensive two-way preprocessing cost
428
       if you don't need to. Instead, keep track of how many
429
       character comparisons are equal, and if that exceeds
430
       O(len(needle)), then pay that cost, since the simpler algorithm
431
       isn't doing very well.
432

433
Product

Resources

Company