Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
allendowney
GitHub Repository: allendowney/cpython
Path: blob/main/Objects/listsort.txt
12 views
1
Intro
2
-----
3
This describes an adaptive, stable, natural mergesort, modestly called
4
timsort (hey, I earned it <wink>). It has supernatural performance on many
5
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
6
as few as N-1), yet as fast as Python's previous highly tuned samplesort
7
hybrid on random arrays.
8
9
In a nutshell, the main routine marches over the array once, left to right,
10
alternately identifying the next run, then merging it into the previous
11
runs "intelligently". Everything else is complication for speed, and some
12
hard-won measure of memory efficiency.
13
14
15
Comparison with Python's Samplesort Hybrid
16
------------------------------------------
17
+ timsort can require a temp array containing as many as N//2 pointers,
18
which means as many as 2*N extra bytes on 32-bit boxes. It can be
19
expected to require a temp array this large when sorting random data; on
20
data with significant structure, it may get away without using any extra
21
heap memory. This appears to be the strongest argument against it, but
22
compared to the size of an object, 2 temp bytes worst-case (also expected-
23
case for random data) doesn't scare me much.
24
25
It turns out that Perl is moving to a stable mergesort, and the code for
26
that appears always to require a temp array with room for at least N
27
pointers. (Note that I wouldn't want to do that even if space weren't an
28
issue; I believe its efforts at memory frugality also save timsort
29
significant pointer-copying costs, and allow it to have a smaller working
30
set.)
31
32
+ Across about four hours of generating random arrays, and sorting them
33
under both methods, samplesort required about 1.5% more comparisons
34
(the program is at the end of this file).
35
36
+ In real life, this may be faster or slower on random arrays than
37
samplesort was, depending on platform quirks. Since it does fewer
38
comparisons on average, it can be expected to do better the more
39
expensive a comparison function is. OTOH, it does more data movement
40
(pointer copying) than samplesort, and that may negate its small
41
comparison advantage (depending on platform quirks) unless comparison
42
is very expensive.
43
44
+ On arrays with many kinds of pre-existing order, this blows samplesort out
45
of the water. It's significantly faster than samplesort even on some
46
cases samplesort was special-casing the snot out of. I believe that lists
47
very often do have exploitable partial order in real life, and this is the
48
strongest argument in favor of timsort (indeed, samplesort's special cases
49
for extreme partial order are appreciated by real users, and timsort goes
50
much deeper than those, in particular naturally covering every case where
51
someone has suggested "and it would be cool if list.sort() had a special
52
case for this too ... and for that ...").
53
54
+ Here are exact comparison counts across all the tests in sortperf.py,
55
when run with arguments "15 20 1".
56
57
Column Key:
58
*sort: random data
59
\sort: descending data
60
/sort: ascending data
61
3sort: ascending, then 3 random exchanges
62
+sort: ascending, then 10 random at the end
63
%sort: ascending, then randomly replace 1% of elements w/ random values
64
~sort: many duplicates
65
=sort: all equal
66
!sort: worst case scenario
67
68
First the trivial cases, trivial for samplesort because it special-cased
69
them, and trivial for timsort because it naturally works on runs. Within
70
an "n" block, the first line gives the # of compares done by samplesort,
71
the second line by timsort, and the third line is the percentage by
72
which the samplesort count exceeds the timsort count:
73
74
n \sort /sort =sort
75
------- ------ ------ ------
76
32768 32768 32767 32767 samplesort
77
32767 32767 32767 timsort
78
0.00% 0.00% 0.00% (samplesort - timsort) / timsort
79
80
65536 65536 65535 65535
81
65535 65535 65535
82
0.00% 0.00% 0.00%
83
84
131072 131072 131071 131071
85
131071 131071 131071
86
0.00% 0.00% 0.00%
87
88
262144 262144 262143 262143
89
262143 262143 262143
90
0.00% 0.00% 0.00%
91
92
524288 524288 524287 524287
93
524287 524287 524287
94
0.00% 0.00% 0.00%
95
96
1048576 1048576 1048575 1048575
97
1048575 1048575 1048575
98
0.00% 0.00% 0.00%
99
100
The algorithms are effectively identical in these cases, except that
101
timsort does one less compare in \sort.
102
103
Now for the more interesting cases. Where lg(x) is the logarithm of x to
104
the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
105
the best any comparison-based sorting algorithm can do on average (across
106
all permutations). When a method gets significantly below that, it's
107
either astronomically lucky, or is finding exploitable structure in the
108
data.
109
110
111
n lg(n!) *sort 3sort +sort %sort ~sort !sort
112
------- ------- ------ ------- ------- ------ ------- --------
113
32768 444255 453096 453614 32908 452871 130491 469141 old
114
448885 33016 33007 50426 182083 65534 new
115
0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new
116
117
65536 954037 972699 981940 65686 973104 260029 1004607
118
962991 65821 65808 101667 364341 131070
119
1.01% 1391.83% -0.19% 857.15% -28.63% 666.47%
120
121
131072 2039137 2101881 2091491 131232 2092894 554790 2161379
122
2057533 131410 131361 206193 728871 262142
123
2.16% 1491.58% -0.10% 915.02% -23.88% 724.51%
124
125
262144 4340409 4464460 4403233 262314 4445884 1107842 4584560
126
4377402 262437 262459 416347 1457945 524286
127
1.99% 1577.82% -0.06% 967.83% -24.01% 774.44%
128
129
524288 9205096 9453356 9408463 524468 9441930 2218577 9692015
130
9278734 524580 524633 837947 2916107 1048574
131
1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30%
132
133
1048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212
134
19606028 1048958 1048941 1694896 5832445 2097150
135
1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38%
136
137
Discussion of cases:
138
139
*sort: There's no structure in random data to exploit, so the theoretical
140
limit is lg(n!). Both methods get close to that, and timsort is hugging
141
it (indeed, in a *marginal* sense, it's a spectacular improvement --
142
there's only about 1% left before hitting the wall, and timsort knows
143
darned well it's doing compares that won't pay on random data -- but so
144
does the samplesort hybrid). For contrast, Hoare's original random-pivot
145
quicksort does about 39% more compares than the limit, and the median-of-3
146
variant about 19% more.
147
148
3sort, %sort, and !sort: No contest; there's structure in this data, but
149
not of the specific kinds samplesort special-cases. Note that structure
150
in !sort wasn't put there on purpose -- it was crafted as a worst case for
151
a previous quicksort implementation. That timsort nails it came as a
152
surprise to me (although it's obvious in retrospect).
153
154
+sort: samplesort special-cases this data, and does a few less compares
155
than timsort. However, timsort runs this case significantly faster on all
156
boxes we have timings for, because timsort is in the business of merging
157
runs efficiently, while samplesort does much more data movement in this
158
(for it) special case.
159
160
~sort: samplesort's special cases for large masses of equal elements are
161
extremely effective on ~sort's specific data pattern, and timsort just
162
isn't going to get close to that, despite that it's clearly getting a
163
great deal of benefit out of the duplicates (the # of compares is much less
164
than lg(n!)). ~sort has a perfectly uniform distribution of just 4
165
distinct values, and as the distribution gets more skewed, samplesort's
166
equal-element gimmicks become less effective, while timsort's adaptive
167
strategies find more to exploit; in a database supplied by Kevin Altis, a
168
sort on its highly skewed "on which stock exchange does this company's
169
stock trade?" field ran over twice as fast under timsort.
170
171
However, despite that timsort does many more comparisons on ~sort, and
172
that on several platforms ~sort runs highly significantly slower under
173
timsort, on other platforms ~sort runs highly significantly faster under
174
timsort. No other kind of data has shown this wild x-platform behavior,
175
and we don't have an explanation for it. The only thing I can think of
176
that could transform what "should be" highly significant slowdowns into
177
highly significant speedups on some boxes are catastrophic cache effects
178
in samplesort.
179
180
But timsort "should be" slower than samplesort on ~sort, so it's hard
181
to count that it isn't on some boxes as a strike against it <wink>.
182
183
+ Here's the highwater mark for the number of heap-based temp slots (4
184
bytes each on this box) needed by each test, again with arguments
185
"15 20 1":
186
187
2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort
188
32768 16384 0 0 6256 0 10821 12288 0 16383
189
65536 32766 0 0 21652 0 31276 24576 0 32767
190
131072 65534 0 0 17258 0 58112 49152 0 65535
191
262144 131072 0 0 35660 0 123561 98304 0 131071
192
524288 262142 0 0 31302 0 212057 196608 0 262143
193
1048576 524286 0 0 312438 0 484942 393216 0 524287
194
195
Discussion: The tests that end up doing (close to) perfectly balanced
196
merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort
197
also ends up doing balanced merges, but systematically benefits a lot from
198
the preliminary pre-merge searches described under "Merge Memory" later.
199
%sort approaches having a balanced merge at the end because the random
200
selection of elements to replace is expected to produce an out-of-order
201
element near the midpoint. \sort, /sort, =sort are the trivial one-run
202
cases, needing no merging at all. +sort ends up having one very long run
203
and one very short, and so gets all the temp space it needs from the small
204
temparray member of the MergeState struct (note that the same would be
205
true if the new random elements were prefixed to the sorted list instead,
206
but not if they appeared "in the middle"). 3sort approaches N//3 temp
207
slots twice, but the run lengths that remain after 3 random exchanges
208
clearly has very high variance.
209
210
211
A detailed description of timsort follows.
212
213
Runs
214
----
215
count_run() returns the # of elements in the next run. A run is either
216
"ascending", which means non-decreasing:
217
218
a0 <= a1 <= a2 <= ...
219
220
or "descending", which means strictly decreasing:
221
222
a0 > a1 > a2 > ...
223
224
Note that a run is always at least 2 long, unless we start at the array's
225
last element.
226
227
The definition of descending is strict, because the main routine reverses
228
a descending run in-place, transforming a descending run into an ascending
229
run. Reversal is done via the obvious fast "swap elements starting at each
230
end, and converge at the middle" method, and that can violate stability if
231
the slice contains any equal elements. Using a strict definition of
232
descending ensures that a descending run contains distinct elements.
233
234
If an array is random, it's very unlikely we'll see long runs. If a natural
235
run contains less than minrun elements (see next section), the main loop
236
artificially boosts it to minrun elements, via a stable binary insertion sort
237
applied to the right number of array elements following the short natural
238
run. In a random array, *all* runs are likely to be minrun long as a
239
result. This has two primary good effects:
240
241
1. Random data strongly tends then toward perfectly balanced (both runs have
242
the same length) merges, which is the most efficient way to proceed when
243
data is random.
244
245
2. Because runs are never very short, the rest of the code doesn't make
246
heroic efforts to shave a few cycles off per-merge overheads. For
247
example, reasonable use of function calls is made, rather than trying to
248
inline everything. Since there are no more than N/minrun runs to begin
249
with, a few "extra" function calls per merge is barely measurable.
250
251
252
Computing minrun
253
----------------
254
If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
255
array then; it's hard to beat that given the overheads of trying something
256
fancier (see note BINSORT).
257
258
When N is a power of 2, testing on random data showed that minrun values of
259
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
260
in binary insertion sort clearly hurt, and at 8 the increase in the number
261
of function calls clearly hurt. Picking *some* power of 2 is important
262
here, so that the merges end up perfectly balanced (see next section). We
263
pick 32 as a good value in the sweet range; picking a value at the low end
264
allows the adaptive gimmicks more opportunity to exploit shorter natural
265
runs.
266
267
Because sortperf.py only tries powers of 2, it took a long time to notice
268
that 32 isn't a good choice for the general case! Consider N=2112:
269
270
>>> divmod(2112, 32)
271
(66, 0)
272
>>>
273
274
If the data is randomly ordered, we're very likely to end up with 66 runs
275
each of length 32. The first 64 of these trigger a sequence of perfectly
276
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
277
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
278
compares, but it's still more compares than necessary, and-- mergesort's
279
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
280
to get 64 elements into place).
281
282
If we take minrun=33 in this case, then we're very likely to end up with 64
283
runs each of length 33, and then all merges are perfectly balanced. Better!
284
285
What we want to avoid is picking minrun such that in
286
287
q, r = divmod(N, minrun)
288
289
q is a power of 2 and r>0 (then the last merge only gets r elements into
290
place, and r < minrun is small compared to N), or q a little larger than a
291
power of 2 regardless of r (then we've got a case similar to "2112", again
292
leaving too little work for the last merge to do).
293
294
Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
295
power of 2, or if that isn't possible, is close to, but strictly less than,
296
a power of 2. This is easier to do than it may sound: take the first 6
297
bits of N, and add 1 if any of the remaining bits are set. In fact, that
298
rule covers every case in this section, including small N and exact powers
299
of 2; merge_compute_minrun() is a deceptively simple function.
300
301
302
The Merge Pattern
303
-----------------
304
In order to exploit regularities in the data, we're merging on natural
305
run lengths, and they can become wildly unbalanced. That's a Good Thing
306
for this sort! It means we have to find a way to manage an assortment of
307
potentially very different run lengths, though.
308
309
Stability constrains permissible merging patterns. For example, if we have
310
3 consecutive runs of lengths
311
312
A:10000 B:20000 C:10000
313
314
we dare not merge A with C first, because if A, B and C happen to contain
315
a common element, it would get out of order wrt its occurrence(s) in B. The
316
merging must be done as (A+B)+C or A+(B+C) instead.
317
318
So merging is always done on two consecutive runs at a time, and in-place,
319
although this may require some temp memory (more on that later).
320
321
When a run is identified, its length is passed to found_new_run() to
322
potentially merge runs on a stack of pending runs. We would like to delay
323
merging as long as possible in order to exploit patterns that may come up
324
later, but we like even more to do merging as soon as possible to exploit
325
that the run just found is still high in the memory hierarchy. We also can't
326
delay merging "too long" because it consumes memory to remember the runs that
327
are still unmerged, and the stack has a fixed size.
328
329
The original version of this code used the first thing I made up that didn't
330
obviously suck ;-) It was loosely based on invariants involving the Fibonacci
331
sequence.
332
333
It worked OK, but it was hard to reason about, and was subtle enough that the
334
intended invariants weren't actually preserved. Researchers discovered that
335
when trying to complete a computer-generated correctness proof. That was
336
easily-enough repaired, but the discovery spurred quite a bit of academic
337
interest in truly good ways to manage incremental merging on the fly.
338
339
At least a dozen different approaches were developed, some provably having
340
near-optimal worst case behavior with respect to the entropy of the
341
distribution of run lengths. Some details can be found in bpo-34561.
342
343
The code now uses the "powersort" merge strategy from:
344
345
"Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
346
That Optimally Adapt to Existing Runs"
347
J. Ian Munro and Sebastian Wild
348
349
The code is pretty simple, but the justification is quite involved, as it's
350
based on fast approximations to optimal binary search trees, which are
351
substantial topics on their own.
352
353
Here we'll just cover some pragmatic details:
354
355
The `powerloop()` function computes a run's "power". Say two adjacent runs
356
begin at index s1. The first run has length n1, and the second run (starting
357
at index s1+n1, called "s2" below) has length n2. The list has total length n.
358
The "power" of the first run is a small integer, the depth of the node
359
connecting the two runs in an ideal binary merge tree, where power 1 is the
360
root node, and the power increases by 1 for each level deeper in the tree.
361
362
The power is the least integer L such that the "midpoint interval" contains
363
a rational number of the form J/2**L. The midpoint interval is the semi-
364
closed interval:
365
366
((s1 + n1/2)/n, (s2 + n2/2)/n]
367
368
Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and
369
(s2 + n2/2)/n are computed to infinite precision in binary, the power L is
370
the first position at which the 2**-L bit differs between the expansions.
371
Since the left end of the interval is less than the right end, the first
372
differing bit must be a 0 bit in the left quotient and a 1 bit in the right
373
quotient.
374
375
`powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
376
subtractions, and shifts in a loop.
377
378
You'll notice the paper uses an O(1) method instead, but that relies on two
379
things we don't have:
380
381
- An O(1) "count leading zeroes" primitive. We can find such a thing as a C
382
extension on most platforms, but not all, and there's no uniform spelling
383
on the platforms that support it.
384
385
- Integer division on an integer type twice as wide as needed to hold the
386
list length. But the latter is Py_ssize_t for us, and is typically the
387
widest native signed integer type the platform supports.
388
389
But since runs in our algorithm are almost never very short, the once-per-run
390
overhead of `powerloop()` seems lost in the noise.
391
392
Detail: why is Py_ssize_t "wide enough" in `powerloop()`? We do, after all,
393
shift integers of that width left by 1. How do we know that won't spill into
394
the sign bit? The trick is that we have some slop. `n` (the total list
395
length) is the number of list elements, which is at most 4 times (on a 32-box,
396
with 4-byte pointers) smaller than than the largest size_t. So at least the
397
leading two bits of the integers we're using are clear.
398
399
Since we can't compute a run's power before seeing the run that follows it,
400
the most-recently identified run is never merged by `found_new_run()`.
401
Instead a new run is only used to compute the 2nd-most-recent run's power.
402
Then adjacent runs are merged so long as their saved power (tree depth) is
403
greater than that newly computed power. When found_new_run() returns, only
404
then is a new run pushed on to the stack of pending runs.
405
406
A key invariant is that powers on the run stack are strictly decreasing
407
(starting from the run at the top of the stack).
408
409
Note that even powersort's strategy isn't always truly optimal. It can't be.
410
Computing an optimal merge sequence can be done in time quadratic in the
411
number of runs, which is very much slower, and also requires finding &
412
remembering _all_ the runs' lengths (of which there may be billions) in
413
advance. It's remarkable, though, how close to optimal this strategy gets.
414
415
Curious factoid: of all the alternatives I've seen in the literature,
416
powersort's is the only one that's always truly optimal for a collection of 3
417
run lengths (for three lengths A B C, it's always optimal to first merge the
418
shorter of A and C with B).
419
420
421
Merge Memory
422
------------
423
Merging adjacent runs of lengths A and B in-place, and in linear time, is
424
difficult. Theoretical constructions are known that can do it, but they're
425
too difficult and slow for practical use. But if we have temp memory equal
426
to min(A, B), it's easy.
427
428
If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
429
and then we can do the obvious merge algorithm left to right, from the temp
430
area and B, starting the stores into where A used to live. There's always a
431
free area in the original area comprising a number of elements equal to the
432
number not yet merged from the temp array (trivially true at the start;
433
proceed by induction). The only tricky bit is that if a comparison raises an
434
exception, we have to remember to copy the remaining elements back in from
435
the temp area, lest the array end up with duplicate entries from B. But
436
that's exactly the same thing we need to do if we reach the end of B first,
437
so the exit code is pleasantly common to both the normal and error cases.
438
439
If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
440
much the same, except that we need to merge right to left, copying B into a
441
temp array and starting the stores at the right end of where B used to live.
442
443
A refinement: When we're about to merge adjacent runs A and B, we first do
444
a form of binary search (more on that later) to see where B[0] should end up
445
in A. Elements in A preceding that point are already in their final
446
positions, effectively shrinking the size of A. Likewise we also search to
447
see where A[-1] should end up in B, and elements of B after that point can
448
also be ignored. This cuts the amount of temp memory needed by the same
449
amount.
450
451
These preliminary searches may not pay off, and can be expected *not* to
452
repay their cost if the data is random. But they can win huge in all of
453
time, copying, and memory savings when they do pay, so this is one of the
454
"per-merge overheads" mentioned above that we're happy to endure because
455
there is at most one very short run. It's generally true in this algorithm
456
that we're willing to gamble a little to win a lot, even though the net
457
expectation is negative for random data.
458
459
460
Merge Algorithms
461
----------------
462
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
463
deals with runs where A <= B, and merge_hi where A > B. They don't know
464
whether the data is clustered or uniform, but a lovely thing about merging
465
is that many kinds of clustering "reveal themselves" by how many times in a
466
row the winning merge element comes from the same run. We'll only discuss
467
merge_lo here; merge_hi is exactly analogous.
468
469
Merging begins in the usual, obvious way, comparing the first element of A
470
to the first of B, and moving B[0] to the merge area if it's less than A[0],
471
else moving A[0] to the merge area. Call that the "one pair at a time"
472
mode. The only twist here is keeping track of how many times in a row "the
473
winner" comes from the same run.
474
475
If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
476
we *search* B for where A[0] belongs, and move over all the B's before
477
that point in one chunk to the merge area, then move A[0] to the merge
478
area. Then we search A for where B[0] belongs, and similarly move a
479
slice of A in one chunk. Then back to searching B for where A[0] belongs,
480
etc. We stay in galloping mode until both searches find slices to copy
481
less than MIN_GALLOP elements long, at which point we go back to one-pair-
482
at-a-time mode.
483
484
A refinement: The MergeState struct contains the value of min_gallop that
485
controls when we enter galloping mode, initialized to MIN_GALLOP.
486
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
487
off, and lower when it is.
488
489
490
Galloping
491
---------
492
Still without loss of generality, assume A is the shorter run. In galloping
493
mode, we first look for A[0] in B. We do this via "galloping", comparing
494
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
495
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
496
roughly lg(B) comparisons, and, unlike a straight binary search, favors
497
finding the right spot early in B (more on that later).
498
499
After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
500
consecutive elements, and a straight binary search requires exactly k-1
501
additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we
502
copy all the B's up to that point in one chunk, and then copy A[0]. Note
503
that no matter where A[0] belongs in B, the combination of galloping + binary
504
search finds it in no more than about 2*lg(B) comparisons.
505
506
If we did a straight binary search, we could find it in no more than
507
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
508
comparisons no matter where A[0] belongs. Straight binary search thus loses
509
to galloping unless the run is quite long, and we simply can't guess
510
whether it is in advance.
511
512
If data is random and runs have the same length, A[0] belongs at B[0] half
513
the time, at B[1] a quarter of the time, and so on: a consecutive winning
514
sub-run in B of length k occurs with probability 1/2**(k+1). So long
515
winning sub-runs are extremely unlikely in random data, and guessing that a
516
winning sub-run is going to be long is a dangerous game.
517
518
OTOH, if data is lopsided or lumpy or contains many duplicates, long
519
stretches of winning sub-runs are very likely, and cutting the number of
520
comparisons needed to find one from O(B) to O(log B) is a huge win.
521
522
Galloping compromises by getting out fast if there isn't a long winning
523
sub-run, yet finding such very efficiently when they exist.
524
525
I first learned about the galloping strategy in a related context; see:
526
527
"Adaptive Set Intersections, Unions, and Differences" (2000)
528
Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
529
530
and its followup(s). An earlier paper called the same strategy
531
"exponential search":
532
533
"Optimistic Sorting and Information Theoretic Complexity"
534
Peter McIlroy
535
SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
536
467-474, Austin, Texas, 25-27 January 1993.
537
538
and it probably dates back to an earlier paper by Bentley and Yao. The
539
McIlroy paper in particular has good analysis of a mergesort that's
540
probably strongly related to this one in its galloping strategy.
541
542
543
Galloping with a Broken Leg
544
---------------------------
545
So why don't we always gallop? Because it can lose, on two counts:
546
547
1. While we're willing to endure small per-merge overheads, per-comparison
548
overheads are a different story. Calling Yet Another Function per
549
comparison is expensive, and gallop_left() and gallop_right() are
550
too long-winded for sane inlining.
551
552
2. Galloping can-- alas --require more comparisons than linear one-at-time
553
search, depending on the data.
554
555
#2 requires details. If A[0] belongs before B[0], galloping requires 1
556
compare to determine that, same as linear search, except it costs more
557
to call the gallop function. If A[0] belongs right before B[1], galloping
558
requires 2 compares, again same as linear search. On the third compare,
559
galloping checks A[0] against B[3], and if it's <=, requires one more
560
compare to determine whether A[0] belongs at B[2] or B[3]. That's a total
561
of 4 compares, but if A[0] does belong at B[2], linear search would have
562
discovered that in only 3 compares, and that's a huge loss! Really. It's
563
an increase of 33% in the number of compares needed, and comparisons are
564
expensive in Python.
565
566
index in B where # compares linear # gallop # binary gallop
567
A[0] belongs search needs compares compares total
568
---------------- ----------------- -------- -------- ------
569
0 1 1 0 1
570
571
1 2 2 0 2
572
573
2 3 3 1 4
574
3 4 3 1 4
575
576
4 5 4 2 6
577
5 6 4 2 6
578
6 7 4 2 6
579
7 8 4 2 6
580
581
8 9 5 3 8
582
9 10 5 3 8
583
10 11 5 3 8
584
11 12 5 3 8
585
...
586
587
In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
588
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
589
The advantage of galloping is unbounded as i grows, but it doesn't win at
590
all until i=6. Before then, it loses twice (at i=2 and i=4), and ties
591
at the other values. At and after i=6, galloping always wins.
592
593
We can't guess in advance when it's going to win, though, so we do one pair
594
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
595
is 7, and that's pretty strong evidence. However, if the data is random, it
596
simply will trigger galloping mode purely by luck every now and again, and
597
it's quite likely to hit one of the losing cases next. On the other hand,
598
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
599
"should be" then. So the MergeState struct keeps a min_gallop variable
600
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
601
the smaller min_gallop gets, making it easier to transition back to
602
galloping mode (if we ever leave it in the current merge, and at the
603
start of the next merge). But whenever the gallop loop doesn't pay,
604
min_gallop is increased by one, making it harder to transition back
605
to galloping mode (and again both within a merge and across merges). For
606
random data, this all but eliminates the gallop penalty: min_gallop grows
607
large enough that we almost never get into galloping mode. And for cases
608
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
609
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
610
611
612
Galloping Complication
613
----------------------
614
The description above was for merge_lo. merge_hi has to merge "from the
615
other end", and really needs to gallop starting at the last element in a run
616
instead of the first. Galloping from the first still works, but does more
617
comparisons than it should (this is significant -- I timed it both ways). For
618
this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
619
functions have a "hint" argument, which is the index at which galloping
620
should begin. So galloping can actually start at any index, and proceed at
621
offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
622
623
In the code as I type it's always called with either 0 or n-1 (where n is
624
the # of elements in a run). It's tempting to try to do something fancier,
625
melding galloping with some form of interpolation search; for example, if
626
we're merging a run of length 1 with a run of length 10000, index 5000 is
627
probably a better guess at the final result than either 0 or 9999. But
628
it's unclear how to generalize that intuition usefully, and merging of
629
wildly unbalanced runs already enjoys excellent performance.
630
631
~sort is a good example of when balanced runs could benefit from a better
632
hint value: to the extent possible, this would like to use a starting
633
offset equal to the previous value of acount/bcount. Doing so saves about
634
10% of the compares in ~sort. However, doing so is also a mixed bag,
635
hurting other cases.
636
637
638
Comparing Average # of Compares on Random Arrays
639
------------------------------------------------
640
[NOTE: This was done when the new algorithm used about 0.1% more compares
641
on random data than does its current incarnation.]
642
643
Here list.sort() is samplesort, and list.msort() this sort:
644
645
"""
646
import random
647
from time import clock as now
648
649
def fill(n):
650
from random import random
651
return [random() for i in range(n)]
652
653
def mycmp(x, y):
654
global ncmp
655
ncmp += 1
656
return cmp(x, y)
657
658
def timeit(values, method):
659
global ncmp
660
X = values[:]
661
bound = getattr(X, method)
662
ncmp = 0
663
t1 = now()
664
bound(mycmp)
665
t2 = now()
666
return t2-t1, ncmp
667
668
format = "%5s %9.2f %11d"
669
f2 = "%5s %9.2f %11.2f"
670
671
def drive():
672
count = sst = sscmp = mst = mscmp = nelts = 0
673
while True:
674
n = random.randrange(100000)
675
nelts += n
676
x = fill(n)
677
678
t, c = timeit(x, 'sort')
679
sst += t
680
sscmp += c
681
682
t, c = timeit(x, 'msort')
683
mst += t
684
mscmp += c
685
686
count += 1
687
if count % 10:
688
continue
689
690
print "count", count, "nelts", nelts
691
print format % ("sort", sst, sscmp)
692
print format % ("msort", mst, mscmp)
693
print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)
694
695
drive()
696
"""
697
698
I ran this on Windows and kept using the computer lightly while it was
699
running. time.clock() is wall-clock time on Windows, with better than
700
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
701
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
702
range. Here's the last chunk of output before I killed the job:
703
704
count 2630 nelts 130906543
705
sort 6110.80 1937887573
706
msort 6002.78 1909389381
707
1.80 1.49
708
709
We've done nearly 2 billion comparisons apiece at Python speed there, and
710
that's enough <wink>.
711
712
For random arrays of size 2 (yes, there are only 2 interesting ones),
713
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
714
samplesort special-casing at most one ascending run at the start, then
715
falling back to the general case if it doesn't find an ascending run
716
immediately. The consequence is that it ends up using two compares to sort
717
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
718
taught how to deal with mixtures of ascending and descending runs
719
efficiently in all cases.
720
721
722
NOTES
723
-----
724
725
BINSORT
726
A "binary insertion sort" is just like a textbook insertion sort, but instead
727
of locating the correct position of the next item via linear (one at a time)
728
search, an equivalent to Python's bisect.bisect_right is used to find the
729
correct position in logarithmic time. Most texts don't mention this
730
variation, and those that do usually say it's not worth the bother: insertion
731
sort remains quadratic (expected and worst cases) either way. Speeding the
732
search doesn't reduce the quadratic data movement costs.
733
734
But in CPython's case, comparisons are extraordinarily expensive compared to
735
moving data, and the details matter. Moving objects is just copying
736
pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary
737
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
738
decisions are made at runtime: what's the type of the left comparand? the
739
type of the right? do they need to be coerced to a common type? where's the
740
code to compare these types? And so on. Even the simplest Python comparison
741
triggers a large pile of C-level pointer dereferences, conditionals, and
742
function calls.
743
744
So cutting the number of compares is almost always measurably helpful in
745
CPython, and the savings swamp the quadratic-time data movement costs for
746
reasonable minrun values.
747
748
749
LEFT OR RIGHT
750
gallop_left() and gallop_right() are akin to the Python bisect module's
751
bisect_left() and bisect_right(): they're the same unless the slice they're
752
searching contains a (at least one) value equal to the value being searched
753
for. In that case, gallop_left() returns the position immediately before the
754
leftmost equal value, and gallop_right() the position immediately after the
755
rightmost equal value. The distinction is needed to preserve stability. In
756
general, when merging adjacent runs A and B, gallop_left is used to search
757
thru B for where an element from A belongs, and gallop_right to search thru A
758
for where an element from B belongs.
759
760
761
REGION OF UNCERTAINTY
762
Two kinds of confusion seem to be common about the claim that after finding
763
a k such that
764
765
B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
766
767
then a binary search requires exactly k-1 tries to find A[0]'s proper
768
location. For concreteness, say k=3, so B[3] < A[0] <= B[7].
769
770
The first confusion takes the form "OK, then the region of uncertainty is at
771
indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
772
3"; or the region is viewed as a Python slice and the objection is "but that's
773
the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already
774
compared A[0] against B[3] and against B[7], so A[0]'s correct location is
775
already known wrt _both_ endpoints. What remains is to find A[0]'s correct
776
location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the
777
slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
778
inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
779
(2**k-1)-1 - 2**(k-1) + 1 =
780
2**k-1 - 2**(k-1) =
781
2*2**(k-1)-1 - 2**(k-1) =
782
(2-1)*2**(k-1) - 1 =
783
2**(k-1) - 1
784
elements.
785
786
The second confusion: "k-1 = 2 binary searches can find the correct location
787
among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
788
could make this more efficient by arranging for the region of uncertainty to
789
span 2**(k-1) elements." Resolution: that confuses "elements" with
790
"locations". In a slice with N elements, there are N+1 _locations_. In the
791
example, with the region of uncertainty B[4], B[5], B[6], there are 4
792
locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
793
after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1)
794
locations. That's why k-1 binary searches are necessary and sufficient.
795
796
OPTIMIZATION OF INDIVIDUAL COMPARISONS
797
As noted above, even the simplest Python comparison triggers a large pile of
798
C-level pointer dereferences, conditionals, and function calls. This can be
799
partially mitigated by pre-scanning the data to determine whether the data is
800
homogeneous with respect to type. If so, it is sometimes possible to
801
substitute faster type-specific comparisons for the slower, generic
802
PyObject_RichCompareBool.
803
804