Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/cpython
Path: blob/main/Objects/stringlib/stringlib_find_two_way_notes.txt
12 views
1
This document explains Crochemore and Perrin's Two-Way string matching
2
algorithm, in which a smaller string (the "pattern" or "needle")
3
is searched for in a longer string (the "text" or "haystack"),
4
determining whether the needle is a substring of the haystack, and if
5
so, at what index(es). It is to be used by Python's string
6
(and bytes-like) objects when calling `find`, `index`, `__contains__`,
7
or implicitly in methods like `replace` or `partition`.
8
9
This is essentially a re-telling of the paper
10
11
Crochemore M., Perrin D., 1991, Two-way string-matching,
12
Journal of the ACM 38(3):651-675.
13
14
focused more on understanding and examples than on rigor. See also
15
the code sample here:
16
17
http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
18
19
The algorithm runs in O(len(needle) + len(haystack)) time and with
20
O(1) space. However, since there is a larger preprocessing cost than
21
simpler algorithms, this Two-Way algorithm is to be used only when the
22
needle and haystack lengths meet certain thresholds.
23
24
25
These are the basic steps of the algorithm:
26
27
* "Very carefully" cut the needle in two.
28
* For each alignment attempted:
29
1. match the right part
30
* On failure, jump by the amount matched + 1
31
2. then match the left part.
32
* On failure jump by max(len(left), len(right)) + 1
33
* If the needle is periodic, don't re-do comparisons; maintain
34
a "memory" of how many characters you already know match.
35
36
37
-------- Matching the right part --------
38
39
We first scan the right part of the needle to check if it matches the
40
the aligned characters in the haystack. We scan left-to-right,
41
and if a mismatch occurs, we jump ahead by the amount matched plus 1.
42
43
Example:
44
45
text: ........EFGX...................
46
pattern: ....abcdEFGH....
47
cut: <<<<>>>>
48
49
Matched 3, so jump ahead by 4:
50
51
text: ........EFGX...................
52
pattern: ....abcdEFGH....
53
cut: <<<<>>>>
54
55
Why are we allowed to do this? Because we cut the needle very
56
carefully, in such a way that if the cut is ...abcd + EFGH... then
57
we have
58
59
d != E
60
cd != EF
61
bcd != EFG
62
abcd != EFGH
63
... and so on.
64
65
If this is true for every pair of equal-length substrings around the
66
cut, then the following alignments do not work, so we can skip them:
67
68
text: ........EFG....................
69
pattern: ....abcdEFGH....
70
^ (Bad because d != E)
71
text: ........EFG....................
72
pattern: ....abcdEFGH....
73
^^ (Bad because cd != EF)
74
text: ........EFG....................
75
pattern: ....abcdEFGH....
76
^^^ (Bad because bcd != EFG)
77
78
Skip 3 alignments => increment alignment by 4.
79
80
81
-------- If len(left_part) < len(right_part) --------
82
83
Above is the core idea, and it begins to suggest how the algorithm can
84
be linear-time. There is one bit of subtlety involving what to do
85
around the end of the needle: if the left half is shorter than the
86
right, then we could run into something like this:
87
88
text: .....EFG......
89
pattern: cdEFGH
90
91
The same argument holds that we can skip ahead by 4, so long as
92
93
d != E
94
cd != EF
95
?cd != EFG
96
??cd != EFGH
97
etc.
98
99
The question marks represent "wildcards" that always match; they're
100
outside the limits of the needle, so there's no way for them to
101
invalidate a match. To ensure that the inequalities above are always
102
true, we need them to be true for all possible '?' values. We thus
103
need cd != FG and cd != GH, etc.
104
105
106
-------- Matching the left part --------
107
108
Once we have ensured the right part matches, we scan the left part
109
(order doesn't matter, but traditionally right-to-left), and if we
110
find a mismatch, we jump ahead by
111
max(len(left_part), len(right_part)) + 1. That we can jump by
112
at least len(right_part) + 1 we have already seen:
113
114
text: .....EFG.....
115
pattern: abcdEFG
116
Matched 3, so jump by 4,
117
using the fact that d != E, cd != EF, and bcd != EFG.
118
119
But we can also jump by at least len(left_part) + 1:
120
121
text: ....cdEF.....
122
pattern: abcdEF
123
Jump by len('abcd') + 1 = 5.
124
125
Skip the alignments:
126
text: ....cdEF.....
127
pattern: abcdEF
128
text: ....cdEF.....
129
pattern: abcdEF
130
text: ....cdEF.....
131
pattern: abcdEF
132
text: ....cdEF.....
133
pattern: abcdEF
134
135
This requires the following facts:
136
d != E
137
cd != EF
138
bcd != EF?
139
abcd != EF??
140
etc., for all values of ?s, as above.
141
142
If we have both sets of inequalities, then we can indeed jump by
143
max(len(left_part), len(right_part)) + 1. Under the assumption of such
144
a nice splitting of the needle, we now have enough to prove linear
145
time for the search: consider the forward-progress/comparisons ratio
146
at each alignment position. If a mismatch occurs in the right part,
147
the ratio is 1 position forward per comparison. On the other hand,
148
if a mismatch occurs in the left half, we advance by more than
149
len(needle)//2 positions for at most len(needle) comparisons,
150
so this ratio is more than 1/2. This average "movement speed" is
151
bounded below by the constant "1 position per 2 comparisons", so we
152
have linear time.
153
154
155
-------- The periodic case --------
156
157
The sets of inequalities listed so far seem too good to be true in
158
the general case. Indeed, they fail when a needle is periodic:
159
there's no way to split 'AAbAAbAAbA' in two such that
160
161
(the stuff n characters to the left of the split)
162
cannot equal
163
(the stuff n characters to the right of the split)
164
for all n.
165
166
This is because no matter how you cut it, you'll get
167
s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the
168
needle in two so that n can be as big as possible. If we were to
169
split it as
170
171
AAbA + AbAAbA
172
173
then A == A at the split, so this is bad (we failed at length 1), but
174
if we split it as
175
176
AA + bAAbAAbA
177
178
we at least have A != b and AA != bA, and we fail at length 3
179
since ?AA == bAA. We already knew that a cut to make length-3
180
mismatch was impossible due to the period, but we now see that the
181
bound is sharp; we can get length-1 and length-2 to mismatch.
182
183
This is exactly the content of the *critical factorization theorem*:
184
that no matter the period of the original needle, you can cut it in
185
such a way that (with the appropriate question marks),
186
needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.
187
188
Even "non-periodic" strings are periodic with a period equal to
189
their length, so for such needles, the CFT already guarantees that
190
the algorithm described so far will work, since we can cut the needle
191
so that the length-k chunks on either side of the cut mismatch for all
192
k < len(needle). Looking closer at the algorithm, we only actually
193
require that k go up to max(len(left_part), len(right_part)).
194
So long as the period exceeds that, we're good.
195
196
The more general shorter-period case is a bit harder. The essentials
197
are the same, except we use the periodicity to our advantage by
198
"remembering" periods that we've already compared. In our running
199
example, say we're computing
200
201
"AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".
202
203
We cut as AA + bAAbAAbA, and then the algorithm runs as follows:
204
205
First alignment:
206
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
207
AAbAAbAAbA
208
^^X
209
- Mismatch at third position, so jump by 3.
210
- This requires that A!=b and AA != bA.
211
212
Second alignment:
213
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
214
AAbAAbAAbA
215
^^^^^^^^
216
X
217
- Matched entire right part
218
- Mismatch at left part.
219
- Jump forward a period, remembering the existing comparisons
220
221
Third alignment:
222
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
223
AAbAAbAAbA
224
mmmmmmm^^X
225
- There's "memory": a bunch of characters were already matched.
226
- Two more characters match beyond that.
227
- The 8th character of the right part mismatched, so jump by 8
228
- The above rule is more complicated than usual: we don't have
229
the right inequalities for lengths 1 through 7, but we do have
230
shifted copies of the length-1 and length-2 inequalities,
231
along with knowledge of the mismatch. We can skip all of these
232
alignments at once:
233
234
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
235
AAbAAbAAbA
236
~ A != b at the cut
237
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
238
AAbAAbAAbA
239
~~ AA != bA at the cut
240
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
241
AAbAAbAAbA
242
^^^^X 7-3=4 match, and the 5th misses.
243
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
244
AAbAAbAAbA
245
~ A != b at the cut
246
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
247
AAbAAbAAbA
248
~~ AA != bA at the cut
249
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
250
AAbAAbAAbA
251
^X 7-3-3=1 match and the 2nd misses.
252
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
253
AAbAAbAAbA
254
~ A != b at the cut
255
256
Fourth alignment:
257
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
258
AAbAAbAAbA
259
^X
260
- Second character mismatches, so jump by 2.
261
262
Fifth alignment:
263
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
264
AAbAAbAAbA
265
^^^^^^^^
266
X
267
- Right half matches, so use memory and skip ahead by period=3
268
269
Sixth alignment:
270
bbbAbbAAbAAbAAbbbAAbAAbAAbAA
271
AAbAAbAAbA
272
mmmmmmmm^^
273
- Right part matches, left part is remembered, found a match!
274
275
The one tricky skip by 8 here generalizes: if we have a period of p,
276
then the CFT says we can ensure the cut has the inequality property
277
for lengths 1 through p-1, and jumping by p would line up the
278
matching characters and mismatched character one period earlier.
279
Inductively, this proves that we can skip by the number of characters
280
matched in the right half, plus 1, just as in the original algorithm.
281
282
To make it explicit, the memory is set whenever the entire right part
283
is matched and is then used as a starting point in the next alignment.
284
In such a case, the alignment jumps forward one period, and the right
285
half matches all except possibly the last period. Additionally,
286
if we cut so that the left part has a length strictly less than the
287
period (we always can!), then we can know that the left part already
288
matches. The memory is reset to 0 whenever there is a mismatch in the
289
right part.
290
291
To prove linearity for the periodic case, note that if a right-part
292
character mismatches, then we advance forward 1 unit per comparison.
293
On the other hand, if the entire right part matches, then the skipping
294
forward by one period "defers" some of the comparisons to the next
295
alignment, where they will then be spent at the usual rate of
296
one comparison per step forward. Even if left-half comparisons
297
are always "wasted", they constitute less than half of all
298
comparisons, so the average rate is certainly at least 1 move forward
299
per 2 comparisons.
300
301
302
-------- When to choose the periodic algorithm ---------
303
304
The periodic algorithm is always valid but has an overhead of one
305
more "memory" register and some memory computation steps, so the
306
here-described-first non-periodic/long-period algorithm -- skipping by
307
max(len(left_part), len(right_part)) + 1 rather than the period --
308
should be preferred when possible.
309
310
Interestingly, the long-period algorithm does not require an exact
311
computation of the period; it works even with some long-period, but
312
undeniably "periodic" needles:
313
314
Cut: AbcdefAbc == Abcde + fAbc
315
316
This cut gives these inequalities:
317
318
e != f
319
de != fA
320
cde != fAb
321
bcde != fAbc
322
Abcde != fAbc?
323
The first failure is a period long, per the CFT:
324
?Abcde == fAbc??
325
326
A sufficient condition for using the long-period algorithm is having
327
the period of the needle be greater than
328
max(len(left_part), len(right_part)). This way, after choosing a good
329
split, we get all of the max(len(left_part), len(right_part))
330
inequalities around the cut that were required in the long-period
331
version of the algorithm.
332
333
With all of this in mind, here's how we choose:
334
335
(1) Choose a "critical factorization" of the needle -- a cut
336
where we have period minus 1 inequalities in a row.
337
More specifically, choose a cut so that the left_part
338
is less than one period long.
339
(2) Determine the period P_r of the right_part.
340
(3) Check if the left part is just an extension of the pattern of
341
the right part, so that the whole needle has period P_r.
342
Explicitly, check if
343
needle[0:cut] == needle[0+P_r:cut+P_r]
344
If so, we use the periodic algorithm. If not equal, we use the
345
long-period algorithm.
346
347
Note that if equality holds in (3), then the period of the whole
348
string is P_r. On the other hand, suppose equality does not hold.
349
The period of the needle is then strictly greater than P_r. Here's
350
a general fact:
351
352
If p is a substring of s and p has period r, then the period
353
of s is either equal to r or greater than len(p).
354
355
We know that needle_period != P_r,
356
and therefore needle_period > len(right_part).
357
Additionally, we'll choose the cut (see below)
358
so that len(left_part) < needle_period.
359
360
Thus, in the case where equality does not hold, we have that
361
needle_period >= max(len(left_part), len(right_part)) + 1,
362
so the long-period algorithm works, but otherwise, we know the period
363
of the needle.
364
365
Note that this decision process doesn't always require an exact
366
computation of the period -- we can get away with only computing P_r!
367
368
369
-------- Computing the cut --------
370
371
Our remaining tasks are now to compute a cut of the needle with as
372
many inequalities as possible, ensuring that cut < needle_period.
373
Meanwhile, we must also compute the period P_r of the right_part.
374
375
The computation is relatively simple, essentially doing this:
376
377
suffix1 = max(needle[i:] for i in range(len(needle)))
378
suffix2 = ... # the same as above, but invert the alphabet
379
cut1 = len(needle) - len(suffix1)
380
cut2 = len(needle) - len(suffix2)
381
cut = max(cut1, cut2) # the later cut
382
383
For cut2, "invert the alphabet" is different than saying min(...),
384
since in lexicographic order, we still put "py" < "python", even
385
if the alphabet is inverted. Computing these, along with the method
386
of computing the period of the right half, is easiest to read directly
387
from the source code in fastsearch.h, in which these are computed
388
in linear time.
389
390
Crochemore & Perrin's Theorem 3.1 give that "cut" above is a
391
critical factorization less than the period, but a very brief sketch
392
of their proof goes something like this (this is far from complete):
393
394
* If this cut splits the needle as some
395
needle == (a + w) + (w + b), meaning there's a bad equality
396
w == w, it's impossible for w + b to be bigger than both
397
b and w + w + b, so this can't happen. We thus have all of
398
the inequalities with no question marks.
399
* By maximality, the right part is not a substring of the left
400
part. Thus, we have all of the inequalities involving no
401
left-side question marks.
402
* If you have all of the inequalities without right-side question
403
marks, we have a critical factorization.
404
* If one such inequality fails, then there's a smaller period,
405
but the factorization is nonetheless critical. Here's where
406
you need the redundancy coming from computing both cuts and
407
choosing the later one.
408
409
410
-------- Some more Bells and Whistles --------
411
412
Beyond Crochemore & Perrin's original algorithm, we can use a couple
413
more tricks for speed in fastsearch.h:
414
415
1. Even though C&P has a best-case O(n/m) time, this doesn't occur
416
very often, so we add a Boyer-Moore bad character table to
417
achieve sublinear time in more cases.
418
419
2. The prework of computing the cut/period is expensive per
420
needle character, so we shouldn't do it if it won't pay off.
421
For this reason, if the needle and haystack are long enough,
422
only automatically start with two-way if the needle's length
423
is a small percentage of the length of the haystack.
424
425
3. In cases where the needle and haystack are large but the needle
426
makes up a significant percentage of the length of the
427
haystack, don't pay the expensive two-way preprocessing cost
428
if you don't need to. Instead, keep track of how many
429
character comparisons are equal, and if that exceeds
430
O(len(needle)), then pay that cost, since the simpler algorithm
431
isn't doing very well.
432
433