mirror of
https://github.com/python/cpython
synced 2024-11-02 09:15:30 +00:00
8383915031
Rewrote binarysort() for clarity. Also changed the signature to be more coherent (it was mixing sortslice with raw pointers). No change in method or functionality. However, I left some experiments in, disabled for now via `#if` tricks. Since this code was first written, some kinds of comparisons have gotten enormously faster (like for lists of floats), which changes the tradeoffs. For example, plain insertion sort's simpler innermost loop and highly predictable branches leave it very competitive (even beating, by a bit) binary insertion when comparisons are very cheap, despite that it can do many more compares. And it wins big on runs that are already sorted (moving the next one in takes only 1 compare then). So I left code for a plain insertion sort, to make future experimenting easier. Also made the maximum value of minrun a `#define` (``MAX_MINRUN`) to make experimenting with that easier too. And another bit of `#if``-disabled code rewrites binary insertion's innermost loop to remove its unpredictable branch. Surprisingly, this doesn't really seem to help overall. I'm unclear on why not. It certainly adds more instructions, but they're very simple, and it's hard to be believe they cost as much as a branch miss.
822 lines
39 KiB
Text
822 lines
39 KiB
Text
Intro
|
|
-----
|
|
This describes an adaptive, stable, natural mergesort, modestly called
|
|
timsort (hey, I earned it <wink>). It has supernatural performance on many
|
|
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
|
|
as few as N-1), yet as fast as Python's previous highly tuned samplesort
|
|
hybrid on random arrays.
|
|
|
|
In a nutshell, the main routine marches over the array once, left to right,
|
|
alternately identifying the next run, then merging it into the previous
|
|
runs "intelligently". Everything else is complication for speed, and some
|
|
hard-won measure of memory efficiency.
|
|
|
|
|
|
Comparison with Python's Samplesort Hybrid
|
|
------------------------------------------
|
|
+ timsort can require a temp array containing as many as N//2 pointers,
|
|
which means as many as 2*N extra bytes on 32-bit boxes. It can be
|
|
expected to require a temp array this large when sorting random data; on
|
|
data with significant structure, it may get away without using any extra
|
|
heap memory. This appears to be the strongest argument against it, but
|
|
compared to the size of an object, 2 temp bytes worst-case (also expected-
|
|
case for random data) doesn't scare me much.
|
|
|
|
It turns out that Perl is moving to a stable mergesort, and the code for
|
|
that appears always to require a temp array with room for at least N
|
|
pointers. (Note that I wouldn't want to do that even if space weren't an
|
|
issue; I believe its efforts at memory frugality also save timsort
|
|
significant pointer-copying costs, and allow it to have a smaller working
|
|
set.)
|
|
|
|
+ Across about four hours of generating random arrays, and sorting them
|
|
under both methods, samplesort required about 1.5% more comparisons
|
|
(the program is at the end of this file).
|
|
|
|
+ In real life, this may be faster or slower on random arrays than
|
|
samplesort was, depending on platform quirks. Since it does fewer
|
|
comparisons on average, it can be expected to do better the more
|
|
expensive a comparison function is. OTOH, it does more data movement
|
|
(pointer copying) than samplesort, and that may negate its small
|
|
comparison advantage (depending on platform quirks) unless comparison
|
|
is very expensive.
|
|
|
|
+ On arrays with many kinds of pre-existing order, this blows samplesort out
|
|
of the water. It's significantly faster than samplesort even on some
|
|
cases samplesort was special-casing the snot out of. I believe that lists
|
|
very often do have exploitable partial order in real life, and this is the
|
|
strongest argument in favor of timsort (indeed, samplesort's special cases
|
|
for extreme partial order are appreciated by real users, and timsort goes
|
|
much deeper than those, in particular naturally covering every case where
|
|
someone has suggested "and it would be cool if list.sort() had a special
|
|
case for this too ... and for that ...").
|
|
|
|
+ Here are exact comparison counts across all the tests in sortperf.py,
|
|
when run with arguments "15 20 1".
|
|
|
|
Column Key:
|
|
*sort: random data
|
|
\sort: descending data
|
|
/sort: ascending data
|
|
3sort: ascending, then 3 random exchanges
|
|
+sort: ascending, then 10 random at the end
|
|
%sort: ascending, then randomly replace 1% of elements w/ random values
|
|
~sort: many duplicates
|
|
=sort: all equal
|
|
!sort: worst case scenario
|
|
|
|
First the trivial cases, trivial for samplesort because it special-cased
|
|
them, and trivial for timsort because it naturally works on runs. Within
|
|
an "n" block, the first line gives the # of compares done by samplesort,
|
|
the second line by timsort, and the third line is the percentage by
|
|
which the samplesort count exceeds the timsort count:
|
|
|
|
n \sort /sort =sort
|
|
------- ------ ------ ------
|
|
32768 32768 32767 32767 samplesort
|
|
32767 32767 32767 timsort
|
|
0.00% 0.00% 0.00% (samplesort - timsort) / timsort
|
|
|
|
65536 65536 65535 65535
|
|
65535 65535 65535
|
|
0.00% 0.00% 0.00%
|
|
|
|
131072 131072 131071 131071
|
|
131071 131071 131071
|
|
0.00% 0.00% 0.00%
|
|
|
|
262144 262144 262143 262143
|
|
262143 262143 262143
|
|
0.00% 0.00% 0.00%
|
|
|
|
524288 524288 524287 524287
|
|
524287 524287 524287
|
|
0.00% 0.00% 0.00%
|
|
|
|
1048576 1048576 1048575 1048575
|
|
1048575 1048575 1048575
|
|
0.00% 0.00% 0.00%
|
|
|
|
The algorithms are effectively identical in these cases, except that
|
|
timsort does one less compare in \sort.
|
|
|
|
Now for the more interesting cases. Where lg(x) is the logarithm of x to
|
|
the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
|
|
the best any comparison-based sorting algorithm can do on average (across
|
|
all permutations). When a method gets significantly below that, it's
|
|
either astronomically lucky, or is finding exploitable structure in the
|
|
data.
|
|
|
|
|
|
n lg(n!) *sort 3sort +sort %sort ~sort !sort
|
|
------- ------- ------ ------- ------- ------ ------- --------
|
|
32768 444255 453096 453614 32908 452871 130491 469141 old
|
|
448885 33016 33007 50426 182083 65534 new
|
|
0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new
|
|
|
|
65536 954037 972699 981940 65686 973104 260029 1004607
|
|
962991 65821 65808 101667 364341 131070
|
|
1.01% 1391.83% -0.19% 857.15% -28.63% 666.47%
|
|
|
|
131072 2039137 2101881 2091491 131232 2092894 554790 2161379
|
|
2057533 131410 131361 206193 728871 262142
|
|
2.16% 1491.58% -0.10% 915.02% -23.88% 724.51%
|
|
|
|
262144 4340409 4464460 4403233 262314 4445884 1107842 4584560
|
|
4377402 262437 262459 416347 1457945 524286
|
|
1.99% 1577.82% -0.06% 967.83% -24.01% 774.44%
|
|
|
|
524288 9205096 9453356 9408463 524468 9441930 2218577 9692015
|
|
9278734 524580 524633 837947 2916107 1048574
|
|
1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30%
|
|
|
|
1048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212
|
|
19606028 1048958 1048941 1694896 5832445 2097150
|
|
1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38%
|
|
|
|
Discussion of cases:
|
|
|
|
*sort: There's no structure in random data to exploit, so the theoretical
|
|
limit is lg(n!). Both methods get close to that, and timsort is hugging
|
|
it (indeed, in a *marginal* sense, it's a spectacular improvement --
|
|
there's only about 1% left before hitting the wall, and timsort knows
|
|
darned well it's doing compares that won't pay on random data -- but so
|
|
does the samplesort hybrid). For contrast, Hoare's original random-pivot
|
|
quicksort does about 39% more compares than the limit, and the median-of-3
|
|
variant about 19% more.
|
|
|
|
3sort, %sort, and !sort: No contest; there's structure in this data, but
|
|
not of the specific kinds samplesort special-cases. Note that structure
|
|
in !sort wasn't put there on purpose -- it was crafted as a worst case for
|
|
a previous quicksort implementation. That timsort nails it came as a
|
|
surprise to me (although it's obvious in retrospect).
|
|
|
|
+sort: samplesort special-cases this data, and does a few less compares
|
|
than timsort. However, timsort runs this case significantly faster on all
|
|
boxes we have timings for, because timsort is in the business of merging
|
|
runs efficiently, while samplesort does much more data movement in this
|
|
(for it) special case.
|
|
|
|
~sort: samplesort's special cases for large masses of equal elements are
|
|
extremely effective on ~sort's specific data pattern, and timsort just
|
|
isn't going to get close to that, despite that it's clearly getting a
|
|
great deal of benefit out of the duplicates (the # of compares is much less
|
|
than lg(n!)). ~sort has a perfectly uniform distribution of just 4
|
|
distinct values, and as the distribution gets more skewed, samplesort's
|
|
equal-element gimmicks become less effective, while timsort's adaptive
|
|
strategies find more to exploit; in a database supplied by Kevin Altis, a
|
|
sort on its highly skewed "on which stock exchange does this company's
|
|
stock trade?" field ran over twice as fast under timsort.
|
|
|
|
However, despite that timsort does many more comparisons on ~sort, and
|
|
that on several platforms ~sort runs highly significantly slower under
|
|
timsort, on other platforms ~sort runs highly significantly faster under
|
|
timsort. No other kind of data has shown this wild x-platform behavior,
|
|
and we don't have an explanation for it. The only thing I can think of
|
|
that could transform what "should be" highly significant slowdowns into
|
|
highly significant speedups on some boxes are catastrophic cache effects
|
|
in samplesort.
|
|
|
|
But timsort "should be" slower than samplesort on ~sort, so it's hard
|
|
to count that it isn't on some boxes as a strike against it <wink>.
|
|
|
|
+ Here's the highwater mark for the number of heap-based temp slots (4
|
|
bytes each on this box) needed by each test, again with arguments
|
|
"15 20 1":
|
|
|
|
2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort
|
|
32768 16384 0 0 6256 0 10821 12288 0 16383
|
|
65536 32766 0 0 21652 0 31276 24576 0 32767
|
|
131072 65534 0 0 17258 0 58112 49152 0 65535
|
|
262144 131072 0 0 35660 0 123561 98304 0 131071
|
|
524288 262142 0 0 31302 0 212057 196608 0 262143
|
|
1048576 524286 0 0 312438 0 484942 393216 0 524287
|
|
|
|
Discussion: The tests that end up doing (close to) perfectly balanced
|
|
merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort
|
|
also ends up doing balanced merges, but systematically benefits a lot from
|
|
the preliminary pre-merge searches described under "Merge Memory" later.
|
|
%sort approaches having a balanced merge at the end because the random
|
|
selection of elements to replace is expected to produce an out-of-order
|
|
element near the midpoint. \sort, /sort, =sort are the trivial one-run
|
|
cases, needing no merging at all. +sort ends up having one very long run
|
|
and one very short, and so gets all the temp space it needs from the small
|
|
temparray member of the MergeState struct (note that the same would be
|
|
true if the new random elements were prefixed to the sorted list instead,
|
|
but not if they appeared "in the middle"). 3sort approaches N//3 temp
|
|
slots twice, but the run lengths that remain after 3 random exchanges
|
|
clearly has very high variance.
|
|
|
|
|
|
A detailed description of timsort follows.
|
|
|
|
Runs
|
|
----
|
|
count_run() returns the # of elements in the next run, and, if it's a
|
|
descending run, reverses it in-place. A run is either "ascending", which
|
|
means non-decreasing:
|
|
|
|
a0 <= a1 <= a2 <= ...
|
|
|
|
or "descending", which means non-increasing:
|
|
|
|
a0 >= a1 >= a2 >= ...
|
|
|
|
Note that a run is always at least 2 long, unless we start at the array's
|
|
last element. If all elements in the array are equal, it can be viewed as
|
|
both ascending and descending. Upon return, the run count_run() identifies
|
|
is always ascending.
|
|
|
|
Reversal is done via the obvious fast "swap elements starting at each
|
|
end, and converge at the middle" method. That can violate stability if
|
|
the slice contains any equal elements. For that reason, for a long time
|
|
the code used strict inequality (">" rather than ">=") in its definition
|
|
of descending.
|
|
|
|
Removing that restriction required some complication: when processing a
|
|
descending run, all-equal sub-runs of elements are reversed in-place, on the
|
|
fly. Their original relative order is restored "by magic" via the final
|
|
"reverse the entire run" step.
|
|
|
|
This makes processing descending runs a little more costly. We only use
|
|
`__lt__` comparisons, so that `x == y` has to be deduced from
|
|
`not x < y and not y < x`. But so long as a run remains strictly decreasing,
|
|
only one of those compares needs to be done per loop iteration. So the primsry
|
|
extra cost is paid only when there are equal elements, and they get some
|
|
compensating benefit by not needing to end the descending run.
|
|
|
|
There's one more trick added since the original: after reversing a descending
|
|
run, it's possible that it can be extended by an adjacent ascending run. For
|
|
example, given [3, 2, 1, 3, 4, 5, 0], the 3-element descending prefix is
|
|
reversed in-place, and then extended by [3, 4, 5].
|
|
|
|
If an array is random, it's very unlikely we'll see long runs. If a natural
|
|
run contains less than minrun elements (see next section), the main loop
|
|
artificially boosts it to minrun elements, via a stable binary insertion sort
|
|
applied to the right number of array elements following the short natural
|
|
run. In a random array, *all* runs are likely to be minrun long as a
|
|
result. This has two primary good effects:
|
|
|
|
1. Random data strongly tends then toward perfectly balanced (both runs have
|
|
the same length) merges, which is the most efficient way to proceed when
|
|
data is random.
|
|
|
|
2. Because runs are never very short, the rest of the code doesn't make
|
|
heroic efforts to shave a few cycles off per-merge overheads. For
|
|
example, reasonable use of function calls is made, rather than trying to
|
|
inline everything. Since there are no more than N/minrun runs to begin
|
|
with, a few "extra" function calls per merge is barely measurable.
|
|
|
|
|
|
Computing minrun
|
|
----------------
|
|
If N < MAX_MINRUN, minrun is N. IOW, binary insertion sort is used for the
|
|
whole array then; it's hard to beat that given the overheads of trying
|
|
something fancier (see note BINSORT).
|
|
|
|
When N is a power of 2, testing on random data showed that minrun values of
|
|
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
|
|
in binary insertion sort clearly hurt, and at 8 the increase in the number
|
|
of function calls clearly hurt. Picking *some* power of 2 is important
|
|
here, so that the merges end up perfectly balanced (see next section). We
|
|
pick 32 as a good value in the sweet range; picking a value at the low end
|
|
allows the adaptive gimmicks more opportunity to exploit shorter natural
|
|
runs.
|
|
|
|
Because sortperf.py only tries powers of 2, it took a long time to notice
|
|
that 32 isn't a good choice for the general case! Consider N=2112:
|
|
|
|
>>> divmod(2112, 32)
|
|
(66, 0)
|
|
>>>
|
|
|
|
If the data is randomly ordered, we're very likely to end up with 66 runs
|
|
each of length 32. The first 64 of these trigger a sequence of perfectly
|
|
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
|
|
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
|
|
compares, but it's still more compares than necessary, and-- mergesort's
|
|
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
|
|
to get 64 elements into place).
|
|
|
|
If we take minrun=33 in this case, then we're very likely to end up with 64
|
|
runs each of length 33, and then all merges are perfectly balanced. Better!
|
|
|
|
What we want to avoid is picking minrun such that in
|
|
|
|
q, r = divmod(N, minrun)
|
|
|
|
q is a power of 2 and r>0 (then the last merge only gets r elements into
|
|
place, and r < minrun is small compared to N), or q a little larger than a
|
|
power of 2 regardless of r (then we've got a case similar to "2112", again
|
|
leaving too little work for the last merge to do).
|
|
|
|
Instead we pick a minrun in range(MAX_MINRUN / 2, MAX_MINRUN + 1) such that
|
|
N/minrun is exactly a power of 2, or if that isn't possible, is close to, but
|
|
strictly less than, a power of 2. This is easier to do than it may sound:
|
|
take the first log2(MAX_MINRUN) bits of N, and add 1 if any of the remaining
|
|
bits are set. In fact, that rule covers every case in this section, including
|
|
small N and exact powers of 2; merge_compute_minrun() is a deceptively simple
|
|
function.
|
|
|
|
|
|
The Merge Pattern
|
|
-----------------
|
|
In order to exploit regularities in the data, we're merging on natural
|
|
run lengths, and they can become wildly unbalanced. That's a Good Thing
|
|
for this sort! It means we have to find a way to manage an assortment of
|
|
potentially very different run lengths, though.
|
|
|
|
Stability constrains permissible merging patterns. For example, if we have
|
|
3 consecutive runs of lengths
|
|
|
|
A:10000 B:20000 C:10000
|
|
|
|
we dare not merge A with C first, because if A, B and C happen to contain
|
|
a common element, it would get out of order wrt its occurrence(s) in B. The
|
|
merging must be done as (A+B)+C or A+(B+C) instead.
|
|
|
|
So merging is always done on two consecutive runs at a time, and in-place,
|
|
although this may require some temp memory (more on that later).
|
|
|
|
When a run is identified, its length is passed to found_new_run() to
|
|
potentially merge runs on a stack of pending runs. We would like to delay
|
|
merging as long as possible in order to exploit patterns that may come up
|
|
later, but we like even more to do merging as soon as possible to exploit
|
|
that the run just found is still high in the memory hierarchy. We also can't
|
|
delay merging "too long" because it consumes memory to remember the runs that
|
|
are still unmerged, and the stack has a fixed size.
|
|
|
|
The original version of this code used the first thing I made up that didn't
|
|
obviously suck ;-) It was loosely based on invariants involving the Fibonacci
|
|
sequence.
|
|
|
|
It worked OK, but it was hard to reason about, and was subtle enough that the
|
|
intended invariants weren't actually preserved. Researchers discovered that
|
|
when trying to complete a computer-generated correctness proof. That was
|
|
easily-enough repaired, but the discovery spurred quite a bit of academic
|
|
interest in truly good ways to manage incremental merging on the fly.
|
|
|
|
At least a dozen different approaches were developed, some provably having
|
|
near-optimal worst case behavior with respect to the entropy of the
|
|
distribution of run lengths. Some details can be found in bpo-34561.
|
|
|
|
The code now uses the "powersort" merge strategy from:
|
|
|
|
"Nearly-Optimal Mergesorts: Fast, Practical Sorting Methods
|
|
That Optimally Adapt to Existing Runs"
|
|
J. Ian Munro and Sebastian Wild
|
|
|
|
The code is pretty simple, but the justification is quite involved, as it's
|
|
based on fast approximations to optimal binary search trees, which are
|
|
substantial topics on their own.
|
|
|
|
Here we'll just cover some pragmatic details:
|
|
|
|
The `powerloop()` function computes a run's "power". Say two adjacent runs
|
|
begin at index s1. The first run has length n1, and the second run (starting
|
|
at index s1+n1, called "s2" below) has length n2. The list has total length n.
|
|
The "power" of the first run is a small integer, the depth of the node
|
|
connecting the two runs in an ideal binary merge tree, where power 1 is the
|
|
root node, and the power increases by 1 for each level deeper in the tree.
|
|
|
|
The power is the least integer L such that the "midpoint interval" contains
|
|
a rational number of the form J/2**L. The midpoint interval is the semi-
|
|
closed interval:
|
|
|
|
((s1 + n1/2)/n, (s2 + n2/2)/n]
|
|
|
|
Yes, that's brain-busting at first ;-) Concretely, if (s1 + n1/2)/n and
|
|
(s2 + n2/2)/n are computed to infinite precision in binary, the power L is
|
|
the first position at which the 2**-L bit differs between the expansions.
|
|
Since the left end of the interval is less than the right end, the first
|
|
differing bit must be a 0 bit in the left quotient and a 1 bit in the right
|
|
quotient.
|
|
|
|
`powerloop()` emulates these divisions, 1 bit at a time, using comparisons,
|
|
subtractions, and shifts in a loop.
|
|
|
|
You'll notice the paper uses an O(1) method instead, but that relies on two
|
|
things we don't have:
|
|
|
|
- An O(1) "count leading zeroes" primitive. We can find such a thing as a C
|
|
extension on most platforms, but not all, and there's no uniform spelling
|
|
on the platforms that support it.
|
|
|
|
- Integer division on an integer type twice as wide as needed to hold the
|
|
list length. But the latter is Py_ssize_t for us, and is typically the
|
|
widest native signed integer type the platform supports.
|
|
|
|
But since runs in our algorithm are almost never very short, the once-per-run
|
|
overhead of `powerloop()` seems lost in the noise.
|
|
|
|
Detail: why is Py_ssize_t "wide enough" in `powerloop()`? We do, after all,
|
|
shift integers of that width left by 1. How do we know that won't spill into
|
|
the sign bit? The trick is that we have some slop. `n` (the total list
|
|
length) is the number of list elements, which is at most 4 times (on a 32-box,
|
|
with 4-byte pointers) smaller than than the largest size_t. So at least the
|
|
leading two bits of the integers we're using are clear.
|
|
|
|
Since we can't compute a run's power before seeing the run that follows it,
|
|
the most-recently identified run is never merged by `found_new_run()`.
|
|
Instead a new run is only used to compute the 2nd-most-recent run's power.
|
|
Then adjacent runs are merged so long as their saved power (tree depth) is
|
|
greater than that newly computed power. When found_new_run() returns, only
|
|
then is a new run pushed on to the stack of pending runs.
|
|
|
|
A key invariant is that powers on the run stack are strictly decreasing
|
|
(starting from the run at the top of the stack).
|
|
|
|
Note that even powersort's strategy isn't always truly optimal. It can't be.
|
|
Computing an optimal merge sequence can be done in time quadratic in the
|
|
number of runs, which is very much slower, and also requires finding &
|
|
remembering _all_ the runs' lengths (of which there may be billions) in
|
|
advance. It's remarkable, though, how close to optimal this strategy gets.
|
|
|
|
Curious factoid: of all the alternatives I've seen in the literature,
|
|
powersort's is the only one that's always truly optimal for a collection of 3
|
|
run lengths (for three lengths A B C, it's always optimal to first merge the
|
|
shorter of A and C with B).
|
|
|
|
|
|
Merge Memory
|
|
------------
|
|
Merging adjacent runs of lengths A and B in-place, and in linear time, is
|
|
difficult. Theoretical constructions are known that can do it, but they're
|
|
too difficult and slow for practical use. But if we have temp memory equal
|
|
to min(A, B), it's easy.
|
|
|
|
If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
|
|
and then we can do the obvious merge algorithm left to right, from the temp
|
|
area and B, starting the stores into where A used to live. There's always a
|
|
free area in the original area comprising a number of elements equal to the
|
|
number not yet merged from the temp array (trivially true at the start;
|
|
proceed by induction). The only tricky bit is that if a comparison raises an
|
|
exception, we have to remember to copy the remaining elements back in from
|
|
the temp area, lest the array end up with duplicate entries from B. But
|
|
that's exactly the same thing we need to do if we reach the end of B first,
|
|
so the exit code is pleasantly common to both the normal and error cases.
|
|
|
|
If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
|
|
much the same, except that we need to merge right to left, copying B into a
|
|
temp array and starting the stores at the right end of where B used to live.
|
|
|
|
A refinement: When we're about to merge adjacent runs A and B, we first do
|
|
a form of binary search (more on that later) to see where B[0] should end up
|
|
in A. Elements in A preceding that point are already in their final
|
|
positions, effectively shrinking the size of A. Likewise we also search to
|
|
see where A[-1] should end up in B, and elements of B after that point can
|
|
also be ignored. This cuts the amount of temp memory needed by the same
|
|
amount.
|
|
|
|
These preliminary searches may not pay off, and can be expected *not* to
|
|
repay their cost if the data is random. But they can win huge in all of
|
|
time, copying, and memory savings when they do pay, so this is one of the
|
|
"per-merge overheads" mentioned above that we're happy to endure because
|
|
there is at most one very short run. It's generally true in this algorithm
|
|
that we're willing to gamble a little to win a lot, even though the net
|
|
expectation is negative for random data.
|
|
|
|
|
|
Merge Algorithms
|
|
----------------
|
|
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
|
|
deals with runs where A <= B, and merge_hi where A > B. They don't know
|
|
whether the data is clustered or uniform, but a lovely thing about merging
|
|
is that many kinds of clustering "reveal themselves" by how many times in a
|
|
row the winning merge element comes from the same run. We'll only discuss
|
|
merge_lo here; merge_hi is exactly analogous.
|
|
|
|
Merging begins in the usual, obvious way, comparing the first element of A
|
|
to the first of B, and moving B[0] to the merge area if it's less than A[0],
|
|
else moving A[0] to the merge area. Call that the "one pair at a time"
|
|
mode. The only twist here is keeping track of how many times in a row "the
|
|
winner" comes from the same run.
|
|
|
|
If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
|
|
we *search* B for where A[0] belongs, and move over all the B's before
|
|
that point in one chunk to the merge area, then move A[0] to the merge
|
|
area. Then we search A for where B[0] belongs, and similarly move a
|
|
slice of A in one chunk. Then back to searching B for where A[0] belongs,
|
|
etc. We stay in galloping mode until both searches find slices to copy
|
|
less than MIN_GALLOP elements long, at which point we go back to one-pair-
|
|
at-a-time mode.
|
|
|
|
A refinement: The MergeState struct contains the value of min_gallop that
|
|
controls when we enter galloping mode, initialized to MIN_GALLOP.
|
|
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
|
|
off, and lower when it is.
|
|
|
|
|
|
Galloping
|
|
---------
|
|
Still without loss of generality, assume A is the shorter run. In galloping
|
|
mode, we first look for A[0] in B. We do this via "galloping", comparing
|
|
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
|
|
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
|
|
roughly lg(B) comparisons, and, unlike a straight binary search, favors
|
|
finding the right spot early in B (more on that later).
|
|
|
|
After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
|
|
consecutive elements, and a straight binary search requires exactly k-1
|
|
additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we
|
|
copy all the B's up to that point in one chunk, and then copy A[0]. Note
|
|
that no matter where A[0] belongs in B, the combination of galloping + binary
|
|
search finds it in no more than about 2*lg(B) comparisons.
|
|
|
|
If we did a straight binary search, we could find it in no more than
|
|
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
|
|
comparisons no matter where A[0] belongs. Straight binary search thus loses
|
|
to galloping unless the run is quite long, and we simply can't guess
|
|
whether it is in advance.
|
|
|
|
If data is random and runs have the same length, A[0] belongs at B[0] half
|
|
the time, at B[1] a quarter of the time, and so on: a consecutive winning
|
|
sub-run in B of length k occurs with probability 1/2**(k+1). So long
|
|
winning sub-runs are extremely unlikely in random data, and guessing that a
|
|
winning sub-run is going to be long is a dangerous game.
|
|
|
|
OTOH, if data is lopsided or lumpy or contains many duplicates, long
|
|
stretches of winning sub-runs are very likely, and cutting the number of
|
|
comparisons needed to find one from O(B) to O(log B) is a huge win.
|
|
|
|
Galloping compromises by getting out fast if there isn't a long winning
|
|
sub-run, yet finding such very efficiently when they exist.
|
|
|
|
I first learned about the galloping strategy in a related context; see:
|
|
|
|
"Adaptive Set Intersections, Unions, and Differences" (2000)
|
|
Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
|
|
|
|
and its followup(s). An earlier paper called the same strategy
|
|
"exponential search":
|
|
|
|
"Optimistic Sorting and Information Theoretic Complexity"
|
|
Peter McIlroy
|
|
SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
|
|
467-474, Austin, Texas, 25-27 January 1993.
|
|
|
|
and it probably dates back to an earlier paper by Bentley and Yao. The
|
|
McIlroy paper in particular has good analysis of a mergesort that's
|
|
probably strongly related to this one in its galloping strategy.
|
|
|
|
|
|
Galloping with a Broken Leg
|
|
---------------------------
|
|
So why don't we always gallop? Because it can lose, on two counts:
|
|
|
|
1. While we're willing to endure small per-merge overheads, per-comparison
|
|
overheads are a different story. Calling Yet Another Function per
|
|
comparison is expensive, and gallop_left() and gallop_right() are
|
|
too long-winded for sane inlining.
|
|
|
|
2. Galloping can-- alas --require more comparisons than linear one-at-time
|
|
search, depending on the data.
|
|
|
|
#2 requires details. If A[0] belongs before B[0], galloping requires 1
|
|
compare to determine that, same as linear search, except it costs more
|
|
to call the gallop function. If A[0] belongs right before B[1], galloping
|
|
requires 2 compares, again same as linear search. On the third compare,
|
|
galloping checks A[0] against B[3], and if it's <=, requires one more
|
|
compare to determine whether A[0] belongs at B[2] or B[3]. That's a total
|
|
of 4 compares, but if A[0] does belong at B[2], linear search would have
|
|
discovered that in only 3 compares, and that's a huge loss! Really. It's
|
|
an increase of 33% in the number of compares needed, and comparisons are
|
|
expensive in Python.
|
|
|
|
index in B where # compares linear # gallop # binary gallop
|
|
A[0] belongs search needs compares compares total
|
|
---------------- ----------------- -------- -------- ------
|
|
0 1 1 0 1
|
|
|
|
1 2 2 0 2
|
|
|
|
2 3 3 1 4
|
|
3 4 3 1 4
|
|
|
|
4 5 4 2 6
|
|
5 6 4 2 6
|
|
6 7 4 2 6
|
|
7 8 4 2 6
|
|
|
|
8 9 5 3 8
|
|
9 10 5 3 8
|
|
10 11 5 3 8
|
|
11 12 5 3 8
|
|
...
|
|
|
|
In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
|
|
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
|
|
The advantage of galloping is unbounded as i grows, but it doesn't win at
|
|
all until i=6. Before then, it loses twice (at i=2 and i=4), and ties
|
|
at the other values. At and after i=6, galloping always wins.
|
|
|
|
We can't guess in advance when it's going to win, though, so we do one pair
|
|
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
|
|
is 7, and that's pretty strong evidence. However, if the data is random, it
|
|
simply will trigger galloping mode purely by luck every now and again, and
|
|
it's quite likely to hit one of the losing cases next. On the other hand,
|
|
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
|
|
"should be" then. So the MergeState struct keeps a min_gallop variable
|
|
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
|
|
the smaller min_gallop gets, making it easier to transition back to
|
|
galloping mode (if we ever leave it in the current merge, and at the
|
|
start of the next merge). But whenever the gallop loop doesn't pay,
|
|
min_gallop is increased by one, making it harder to transition back
|
|
to galloping mode (and again both within a merge and across merges). For
|
|
random data, this all but eliminates the gallop penalty: min_gallop grows
|
|
large enough that we almost never get into galloping mode. And for cases
|
|
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
|
|
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
|
|
|
|
|
|
Galloping Complication
|
|
----------------------
|
|
The description above was for merge_lo. merge_hi has to merge "from the
|
|
other end", and really needs to gallop starting at the last element in a run
|
|
instead of the first. Galloping from the first still works, but does more
|
|
comparisons than it should (this is significant -- I timed it both ways). For
|
|
this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
|
|
functions have a "hint" argument, which is the index at which galloping
|
|
should begin. So galloping can actually start at any index, and proceed at
|
|
offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
|
|
|
|
In the code as I type it's always called with either 0 or n-1 (where n is
|
|
the # of elements in a run). It's tempting to try to do something fancier,
|
|
melding galloping with some form of interpolation search; for example, if
|
|
we're merging a run of length 1 with a run of length 10000, index 5000 is
|
|
probably a better guess at the final result than either 0 or 9999. But
|
|
it's unclear how to generalize that intuition usefully, and merging of
|
|
wildly unbalanced runs already enjoys excellent performance.
|
|
|
|
~sort is a good example of when balanced runs could benefit from a better
|
|
hint value: to the extent possible, this would like to use a starting
|
|
offset equal to the previous value of acount/bcount. Doing so saves about
|
|
10% of the compares in ~sort. However, doing so is also a mixed bag,
|
|
hurting other cases.
|
|
|
|
|
|
Comparing Average # of Compares on Random Arrays
|
|
------------------------------------------------
|
|
[NOTE: This was done when the new algorithm used about 0.1% more compares
|
|
on random data than does its current incarnation.]
|
|
|
|
Here list.sort() is samplesort, and list.msort() this sort:
|
|
|
|
"""
|
|
import random
|
|
from time import clock as now
|
|
|
|
def fill(n):
|
|
from random import random
|
|
return [random() for i in range(n)]
|
|
|
|
def mycmp(x, y):
|
|
global ncmp
|
|
ncmp += 1
|
|
return cmp(x, y)
|
|
|
|
def timeit(values, method):
|
|
global ncmp
|
|
X = values[:]
|
|
bound = getattr(X, method)
|
|
ncmp = 0
|
|
t1 = now()
|
|
bound(mycmp)
|
|
t2 = now()
|
|
return t2-t1, ncmp
|
|
|
|
format = "%5s %9.2f %11d"
|
|
f2 = "%5s %9.2f %11.2f"
|
|
|
|
def drive():
|
|
count = sst = sscmp = mst = mscmp = nelts = 0
|
|
while True:
|
|
n = random.randrange(100000)
|
|
nelts += n
|
|
x = fill(n)
|
|
|
|
t, c = timeit(x, 'sort')
|
|
sst += t
|
|
sscmp += c
|
|
|
|
t, c = timeit(x, 'msort')
|
|
mst += t
|
|
mscmp += c
|
|
|
|
count += 1
|
|
if count % 10:
|
|
continue
|
|
|
|
print "count", count, "nelts", nelts
|
|
print format % ("sort", sst, sscmp)
|
|
print format % ("msort", mst, mscmp)
|
|
print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)
|
|
|
|
drive()
|
|
"""
|
|
|
|
I ran this on Windows and kept using the computer lightly while it was
|
|
running. time.clock() is wall-clock time on Windows, with better than
|
|
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
|
|
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
|
|
range. Here's the last chunk of output before I killed the job:
|
|
|
|
count 2630 nelts 130906543
|
|
sort 6110.80 1937887573
|
|
msort 6002.78 1909389381
|
|
1.80 1.49
|
|
|
|
We've done nearly 2 billion comparisons apiece at Python speed there, and
|
|
that's enough <wink>.
|
|
|
|
For random arrays of size 2 (yes, there are only 2 interesting ones),
|
|
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
|
|
samplesort special-casing at most one ascending run at the start, then
|
|
falling back to the general case if it doesn't find an ascending run
|
|
immediately. The consequence is that it ends up using two compares to sort
|
|
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
|
|
taught how to deal with mixtures of ascending and descending runs
|
|
efficiently in all cases.
|
|
|
|
|
|
NOTES
|
|
-----
|
|
|
|
BINSORT
|
|
A "binary insertion sort" is just like a textbook insertion sort, but instead
|
|
of locating the correct position of the next item via linear (one at a time)
|
|
search, an equivalent to Python's bisect.bisect_right is used to find the
|
|
correct position in logarithmic time. Most texts don't mention this
|
|
variation, and those that do usually say it's not worth the bother: insertion
|
|
sort remains quadratic (expected and worst cases) either way. Speeding the
|
|
search doesn't reduce the quadratic data movement costs.
|
|
|
|
But in CPython's case, comparisons are extraordinarily expensive compared to
|
|
moving data, and the details matter. Moving objects is just copying
|
|
pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary
|
|
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
|
|
decisions are made at runtime: what's the type of the left comparand? the
|
|
type of the right? do they need to be coerced to a common type? where's the
|
|
code to compare these types? And so on. Even the simplest Python comparison
|
|
triggers a large pile of C-level pointer dereferences, conditionals, and
|
|
function calls.
|
|
|
|
So cutting the number of compares is almost always measurably helpful in
|
|
CPython, and the savings swamp the quadratic-time data movement costs for
|
|
reasonable minrun values.
|
|
|
|
|
|
LEFT OR RIGHT
|
|
gallop_left() and gallop_right() are akin to the Python bisect module's
|
|
bisect_left() and bisect_right(): they're the same unless the slice they're
|
|
searching contains a (at least one) value equal to the value being searched
|
|
for. In that case, gallop_left() returns the position immediately before the
|
|
leftmost equal value, and gallop_right() the position immediately after the
|
|
rightmost equal value. The distinction is needed to preserve stability. In
|
|
general, when merging adjacent runs A and B, gallop_left is used to search
|
|
thru B for where an element from A belongs, and gallop_right to search thru A
|
|
for where an element from B belongs.
|
|
|
|
|
|
REGION OF UNCERTAINTY
|
|
Two kinds of confusion seem to be common about the claim that after finding
|
|
a k such that
|
|
|
|
B[2**(k-1) - 1] < A[0] <= B[2**k - 1]
|
|
|
|
then a binary search requires exactly k-1 tries to find A[0]'s proper
|
|
location. For concreteness, say k=3, so B[3] < A[0] <= B[7].
|
|
|
|
The first confusion takes the form "OK, then the region of uncertainty is at
|
|
indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
|
|
3"; or the region is viewed as a Python slice and the objection is "but that's
|
|
the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already
|
|
compared A[0] against B[3] and against B[7], so A[0]'s correct location is
|
|
already known wrt _both_ endpoints. What remains is to find A[0]'s correct
|
|
location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the
|
|
slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
|
|
inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
|
|
(2**k-1)-1 - 2**(k-1) + 1 =
|
|
2**k-1 - 2**(k-1) =
|
|
2*2**(k-1)-1 - 2**(k-1) =
|
|
(2-1)*2**(k-1) - 1 =
|
|
2**(k-1) - 1
|
|
elements.
|
|
|
|
The second confusion: "k-1 = 2 binary searches can find the correct location
|
|
among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
|
|
could make this more efficient by arranging for the region of uncertainty to
|
|
span 2**(k-1) elements." Resolution: that confuses "elements" with
|
|
"locations". In a slice with N elements, there are N+1 _locations_. In the
|
|
example, with the region of uncertainty B[4], B[5], B[6], there are 4
|
|
locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
|
|
after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1)
|
|
locations. That's why k-1 binary searches are necessary and sufficient.
|
|
|
|
OPTIMIZATION OF INDIVIDUAL COMPARISONS
|
|
As noted above, even the simplest Python comparison triggers a large pile of
|
|
C-level pointer dereferences, conditionals, and function calls. This can be
|
|
partially mitigated by pre-scanning the data to determine whether the data is
|
|
homogeneous with respect to type. If so, it is sometimes possible to
|
|
substitute faster type-specific comparisons for the slower, generic
|
|
PyObject_RichCompareBool.
|