Path: blob/main/Objects/stringlib/stringlib_find_two_way_notes.txt
12 views
This document explains Crochemore and Perrin's Two-Way string matching1algorithm, in which a smaller string (the "pattern" or "needle")2is searched for in a longer string (the "text" or "haystack"),3determining whether the needle is a substring of the haystack, and if4so, at what index(es). It is to be used by Python's string5(and bytes-like) objects when calling `find`, `index`, `__contains__`,6or implicitly in methods like `replace` or `partition`.78This is essentially a re-telling of the paper910Crochemore M., Perrin D., 1991, Two-way string-matching,11Journal of the ACM 38(3):651-675.1213focused more on understanding and examples than on rigor. See also14the code sample here:1516http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION002601718The algorithm runs in O(len(needle) + len(haystack)) time and with19O(1) space. However, since there is a larger preprocessing cost than20simpler algorithms, this Two-Way algorithm is to be used only when the21needle and haystack lengths meet certain thresholds.222324These are the basic steps of the algorithm:2526* "Very carefully" cut the needle in two.27* For each alignment attempted:281. match the right part29* On failure, jump by the amount matched + 1302. then match the left part.31* On failure jump by max(len(left), len(right)) + 132* If the needle is periodic, don't re-do comparisons; maintain33a "memory" of how many characters you already know match.343536-------- Matching the right part --------3738We first scan the right part of the needle to check if it matches the39the aligned characters in the haystack. We scan left-to-right,40and if a mismatch occurs, we jump ahead by the amount matched plus 1.4142Example:4344text: ........EFGX...................45pattern: ....abcdEFGH....46cut: <<<<>>>>4748Matched 3, so jump ahead by 4:4950text: ........EFGX...................51pattern: ....abcdEFGH....52cut: <<<<>>>>5354Why are we allowed to do this? Because we cut the needle very55carefully, in such a way that if the cut is ...abcd + EFGH... then56we have5758d != E59cd != EF60bcd != EFG61abcd != EFGH62... and so on.6364If this is true for every pair of equal-length substrings around the65cut, then the following alignments do not work, so we can skip them:6667text: ........EFG....................68pattern: ....abcdEFGH....69^ (Bad because d != E)70text: ........EFG....................71pattern: ....abcdEFGH....72^^ (Bad because cd != EF)73text: ........EFG....................74pattern: ....abcdEFGH....75^^^ (Bad because bcd != EFG)7677Skip 3 alignments => increment alignment by 4.787980-------- If len(left_part) < len(right_part) --------8182Above is the core idea, and it begins to suggest how the algorithm can83be linear-time. There is one bit of subtlety involving what to do84around the end of the needle: if the left half is shorter than the85right, then we could run into something like this:8687text: .....EFG......88pattern: cdEFGH8990The same argument holds that we can skip ahead by 4, so long as9192d != E93cd != EF94?cd != EFG95??cd != EFGH96etc.9798The question marks represent "wildcards" that always match; they're99outside the limits of the needle, so there's no way for them to100invalidate a match. To ensure that the inequalities above are always101true, we need them to be true for all possible '?' values. We thus102need cd != FG and cd != GH, etc.103104105-------- Matching the left part --------106107Once we have ensured the right part matches, we scan the left part108(order doesn't matter, but traditionally right-to-left), and if we109find a mismatch, we jump ahead by110max(len(left_part), len(right_part)) + 1. That we can jump by111at least len(right_part) + 1 we have already seen:112113text: .....EFG.....114pattern: abcdEFG115Matched 3, so jump by 4,116using the fact that d != E, cd != EF, and bcd != EFG.117118But we can also jump by at least len(left_part) + 1:119120text: ....cdEF.....121pattern: abcdEF122Jump by len('abcd') + 1 = 5.123124Skip the alignments:125text: ....cdEF.....126pattern: abcdEF127text: ....cdEF.....128pattern: abcdEF129text: ....cdEF.....130pattern: abcdEF131text: ....cdEF.....132pattern: abcdEF133134This requires the following facts:135d != E136cd != EF137bcd != EF?138abcd != EF??139etc., for all values of ?s, as above.140141If we have both sets of inequalities, then we can indeed jump by142max(len(left_part), len(right_part)) + 1. Under the assumption of such143a nice splitting of the needle, we now have enough to prove linear144time for the search: consider the forward-progress/comparisons ratio145at each alignment position. If a mismatch occurs in the right part,146the ratio is 1 position forward per comparison. On the other hand,147if a mismatch occurs in the left half, we advance by more than148len(needle)//2 positions for at most len(needle) comparisons,149so this ratio is more than 1/2. This average "movement speed" is150bounded below by the constant "1 position per 2 comparisons", so we151have linear time.152153154-------- The periodic case --------155156The sets of inequalities listed so far seem too good to be true in157the general case. Indeed, they fail when a needle is periodic:158there's no way to split 'AAbAAbAAbA' in two such that159160(the stuff n characters to the left of the split)161cannot equal162(the stuff n characters to the right of the split)163for all n.164165This is because no matter how you cut it, you'll get166s[cut-3:cut] == s[cut:cut+3]. So what do we do? We still cut the167needle in two so that n can be as big as possible. If we were to168split it as169170AAbA + AbAAbA171172then A == A at the split, so this is bad (we failed at length 1), but173if we split it as174175AA + bAAbAAbA176177we at least have A != b and AA != bA, and we fail at length 3178since ?AA == bAA. We already knew that a cut to make length-3179mismatch was impossible due to the period, but we now see that the180bound is sharp; we can get length-1 and length-2 to mismatch.181182This is exactly the content of the *critical factorization theorem*:183that no matter the period of the original needle, you can cut it in184such a way that (with the appropriate question marks),185needle[cut-k:cut] mismatches needle[cut:cut+k] for all k < the period.186187Even "non-periodic" strings are periodic with a period equal to188their length, so for such needles, the CFT already guarantees that189the algorithm described so far will work, since we can cut the needle190so that the length-k chunks on either side of the cut mismatch for all191k < len(needle). Looking closer at the algorithm, we only actually192require that k go up to max(len(left_part), len(right_part)).193So long as the period exceeds that, we're good.194195The more general shorter-period case is a bit harder. The essentials196are the same, except we use the periodicity to our advantage by197"remembering" periods that we've already compared. In our running198example, say we're computing199200"AAbAAbAAbA" in "bbbAbbAAbAAbAAbbbAAbAAbAAbAA".201202We cut as AA + bAAbAAbA, and then the algorithm runs as follows:203204First alignment:205bbbAbbAAbAAbAAbbbAAbAAbAAbAA206AAbAAbAAbA207^^X208- Mismatch at third position, so jump by 3.209- This requires that A!=b and AA != bA.210211Second alignment:212bbbAbbAAbAAbAAbbbAAbAAbAAbAA213AAbAAbAAbA214^^^^^^^^215X216- Matched entire right part217- Mismatch at left part.218- Jump forward a period, remembering the existing comparisons219220Third alignment:221bbbAbbAAbAAbAAbbbAAbAAbAAbAA222AAbAAbAAbA223mmmmmmm^^X224- There's "memory": a bunch of characters were already matched.225- Two more characters match beyond that.226- The 8th character of the right part mismatched, so jump by 8227- The above rule is more complicated than usual: we don't have228the right inequalities for lengths 1 through 7, but we do have229shifted copies of the length-1 and length-2 inequalities,230along with knowledge of the mismatch. We can skip all of these231alignments at once:232233bbbAbbAAbAAbAAbbbAAbAAbAAbAA234AAbAAbAAbA235~ A != b at the cut236bbbAbbAAbAAbAAbbbAAbAAbAAbAA237AAbAAbAAbA238~~ AA != bA at the cut239bbbAbbAAbAAbAAbbbAAbAAbAAbAA240AAbAAbAAbA241^^^^X 7-3=4 match, and the 5th misses.242bbbAbbAAbAAbAAbbbAAbAAbAAbAA243AAbAAbAAbA244~ A != b at the cut245bbbAbbAAbAAbAAbbbAAbAAbAAbAA246AAbAAbAAbA247~~ AA != bA at the cut248bbbAbbAAbAAbAAbbbAAbAAbAAbAA249AAbAAbAAbA250^X 7-3-3=1 match and the 2nd misses.251bbbAbbAAbAAbAAbbbAAbAAbAAbAA252AAbAAbAAbA253~ A != b at the cut254255Fourth alignment:256bbbAbbAAbAAbAAbbbAAbAAbAAbAA257AAbAAbAAbA258^X259- Second character mismatches, so jump by 2.260261Fifth alignment:262bbbAbbAAbAAbAAbbbAAbAAbAAbAA263AAbAAbAAbA264^^^^^^^^265X266- Right half matches, so use memory and skip ahead by period=3267268Sixth alignment:269bbbAbbAAbAAbAAbbbAAbAAbAAbAA270AAbAAbAAbA271mmmmmmmm^^272- Right part matches, left part is remembered, found a match!273274The one tricky skip by 8 here generalizes: if we have a period of p,275then the CFT says we can ensure the cut has the inequality property276for lengths 1 through p-1, and jumping by p would line up the277matching characters and mismatched character one period earlier.278Inductively, this proves that we can skip by the number of characters279matched in the right half, plus 1, just as in the original algorithm.280281To make it explicit, the memory is set whenever the entire right part282is matched and is then used as a starting point in the next alignment.283In such a case, the alignment jumps forward one period, and the right284half matches all except possibly the last period. Additionally,285if we cut so that the left part has a length strictly less than the286period (we always can!), then we can know that the left part already287matches. The memory is reset to 0 whenever there is a mismatch in the288right part.289290To prove linearity for the periodic case, note that if a right-part291character mismatches, then we advance forward 1 unit per comparison.292On the other hand, if the entire right part matches, then the skipping293forward by one period "defers" some of the comparisons to the next294alignment, where they will then be spent at the usual rate of295one comparison per step forward. Even if left-half comparisons296are always "wasted", they constitute less than half of all297comparisons, so the average rate is certainly at least 1 move forward298per 2 comparisons.299300301-------- When to choose the periodic algorithm ---------302303The periodic algorithm is always valid but has an overhead of one304more "memory" register and some memory computation steps, so the305here-described-first non-periodic/long-period algorithm -- skipping by306max(len(left_part), len(right_part)) + 1 rather than the period --307should be preferred when possible.308309Interestingly, the long-period algorithm does not require an exact310computation of the period; it works even with some long-period, but311undeniably "periodic" needles:312313Cut: AbcdefAbc == Abcde + fAbc314315This cut gives these inequalities:316317e != f318de != fA319cde != fAb320bcde != fAbc321Abcde != fAbc?322The first failure is a period long, per the CFT:323?Abcde == fAbc??324325A sufficient condition for using the long-period algorithm is having326the period of the needle be greater than327max(len(left_part), len(right_part)). This way, after choosing a good328split, we get all of the max(len(left_part), len(right_part))329inequalities around the cut that were required in the long-period330version of the algorithm.331332With all of this in mind, here's how we choose:333334(1) Choose a "critical factorization" of the needle -- a cut335where we have period minus 1 inequalities in a row.336More specifically, choose a cut so that the left_part337is less than one period long.338(2) Determine the period P_r of the right_part.339(3) Check if the left part is just an extension of the pattern of340the right part, so that the whole needle has period P_r.341Explicitly, check if342needle[0:cut] == needle[0+P_r:cut+P_r]343If so, we use the periodic algorithm. If not equal, we use the344long-period algorithm.345346Note that if equality holds in (3), then the period of the whole347string is P_r. On the other hand, suppose equality does not hold.348The period of the needle is then strictly greater than P_r. Here's349a general fact:350351If p is a substring of s and p has period r, then the period352of s is either equal to r or greater than len(p).353354We know that needle_period != P_r,355and therefore needle_period > len(right_part).356Additionally, we'll choose the cut (see below)357so that len(left_part) < needle_period.358359Thus, in the case where equality does not hold, we have that360needle_period >= max(len(left_part), len(right_part)) + 1,361so the long-period algorithm works, but otherwise, we know the period362of the needle.363364Note that this decision process doesn't always require an exact365computation of the period -- we can get away with only computing P_r!366367368-------- Computing the cut --------369370Our remaining tasks are now to compute a cut of the needle with as371many inequalities as possible, ensuring that cut < needle_period.372Meanwhile, we must also compute the period P_r of the right_part.373374The computation is relatively simple, essentially doing this:375376suffix1 = max(needle[i:] for i in range(len(needle)))377suffix2 = ... # the same as above, but invert the alphabet378cut1 = len(needle) - len(suffix1)379cut2 = len(needle) - len(suffix2)380cut = max(cut1, cut2) # the later cut381382For cut2, "invert the alphabet" is different than saying min(...),383since in lexicographic order, we still put "py" < "python", even384if the alphabet is inverted. Computing these, along with the method385of computing the period of the right half, is easiest to read directly386from the source code in fastsearch.h, in which these are computed387in linear time.388389Crochemore & Perrin's Theorem 3.1 give that "cut" above is a390critical factorization less than the period, but a very brief sketch391of their proof goes something like this (this is far from complete):392393* If this cut splits the needle as some394needle == (a + w) + (w + b), meaning there's a bad equality395w == w, it's impossible for w + b to be bigger than both396b and w + w + b, so this can't happen. We thus have all of397the inequalities with no question marks.398* By maximality, the right part is not a substring of the left399part. Thus, we have all of the inequalities involving no400left-side question marks.401* If you have all of the inequalities without right-side question402marks, we have a critical factorization.403* If one such inequality fails, then there's a smaller period,404but the factorization is nonetheless critical. Here's where405you need the redundancy coming from computing both cuts and406choosing the later one.407408409-------- Some more Bells and Whistles --------410411Beyond Crochemore & Perrin's original algorithm, we can use a couple412more tricks for speed in fastsearch.h:4134141. Even though C&P has a best-case O(n/m) time, this doesn't occur415very often, so we add a Boyer-Moore bad character table to416achieve sublinear time in more cases.4174182. The prework of computing the cut/period is expensive per419needle character, so we shouldn't do it if it won't pay off.420For this reason, if the needle and haystack are long enough,421only automatically start with two-way if the needle's length422is a small percentage of the length of the haystack.4234243. In cases where the needle and haystack are large but the needle425makes up a significant percentage of the length of the426haystack, don't pay the expensive two-way preprocessing cost427if you don't need to. Instead, keep track of how many428character comparisons are equal, and if that exceeds429O(len(needle)), then pay that cost, since the simpler algorithm430isn't doing very well.431432433