Objects/listsort.txt

   1 Intro
   2 -----
   3 This describes an adaptive, stable, natural mergesort, modestly called
   4 timsort (hey, I earned it <wink>).  It has supernatural performance on many
   5 kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
   6 as few as N-1), yet as fast as Python's previous highly tuned samplesort
   7 hybrid on random arrays.
   8
   9 In a nutshell, the main routine marches over the array once, left to right,
  10 alternately identifying the next run, then merging it into the previous
  11 runs "intelligently".  Everything else is complication for speed, and some
  12 hard-won measure of memory efficiency.
  13
  14
  15 Comparison with Python's Samplesort Hybrid
  16 ------------------------------------------
  17 + timsort can require a temp array containing as many as N//2 pointers,
  18   which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  19   expected to require a temp array this large when sorting random data; on
  20   data with significant structure, it may get away without using any extra
  21   heap memory.  This appears to be the strongest argument against it, but
  22   compared to the size of an object, 2 temp bytes per element worst-case (also
  23   expected-case for random data) doesn't scare me much.
  24
  25   It turns out that Perl is moving to a stable mergesort, and the code for
  26   that appears always to require a temp array with room for at least N
  27   pointers. (Note that I wouldn't want to do that even if space weren't an
  28   issue; I believe its efforts at memory frugality also save timsort
  29   significant pointer-copying costs, and allow it to have a smaller working
  30   set.)
  31
  32 + Across about four hours of generating random arrays, and sorting them
  33   under both methods, samplesort required about 1.5% more comparisons
  34   (the program is at the end of this file).
  35
  36 + In real life, this may be faster or slower on random arrays than
  37   samplesort was, depending on platform quirks.  Since it does fewer
  38   comparisons on average, it can be expected to do better the more
  39   expensive a comparison function is.  OTOH, it does more data movement
  40   (pointer copying) than samplesort, and that may negate its small
  41   comparison advantage (depending on platform quirks) unless comparison
  42   is very expensive.
  43
  44 + On arrays with many kinds of pre-existing order, this blows samplesort out
  45   of the water.  It's significantly faster than samplesort even on some
  46   cases samplesort was special-casing the snot out of.  I believe that lists
  47   very often do have exploitable partial order in real life, and this is the
  48   strongest argument in favor of timsort (indeed, samplesort's special cases
  49   for extreme partial order are appreciated by real users, and timsort goes
  50   much deeper than those, in particular naturally covering every case where
  51   someone has suggested "and it would be cool if list.sort() had a special
  52   case for this too ... and for that ...").
  53
  54 + Here are exact comparison counts across all the tests in sortperf.py,
  55   when run with arguments "15 20 1".
  56
  57   Column Key:
  58       *sort: random data
  59       \sort: descending data
  60       /sort: ascending data
  61       3sort: ascending, then 3 random exchanges
  62       +sort: ascending, then 10 random at the end
           %sort: ascending, then randomly replace 1% of elements w/ random values
  63       ~sort: many duplicates
  64       =sort: all equal
  65       !sort: worst case scenario
  66
  67   First the trivial cases, trivial for samplesort because it special-cased
  68   them, and trivial for timsort because it naturally works on runs.  Within
  69   an "n" block, the first line gives the # of compares done by samplesort,
  70   the second line by timsort, and the third line is the percentage by
  71   which the samplesort count exceeds the timsort count:
  72
  73       n   \sort   /sort   =sort
  74 -------  ------  ------  ------
  75   32768   32768   32767   32767  samplesort
  76           32767   32767   32767  timsort
  77           0.00%   0.00%   0.00%  (samplesort - timsort) / timsort
  78
  79   65536   65536   65535   65535
  80           65535   65535   65535
  81           0.00%   0.00%   0.00%
  82
  83  131072  131072  131071  131071
  84          131071  131071  131071
  85           0.00%   0.00%   0.00%
  86
  87  262144  262144  262143  262143
  88          262143  262143  262143
  89           0.00%   0.00%   0.00%
  90
  91  524288  524288  524287  524287
  92          524287  524287  524287
  93           0.00%   0.00%   0.00%
  94
  95 1048576 1048576 1048575 1048575
  96         1048575 1048575 1048575
  97           0.00%   0.00%   0.00%
  98
  99   The algorithms are effectively identical in these cases, except that
 100   timsort does one less compare in \sort.
 101
 102   Now for the more interesting cases.  lg(n!) is the information-theoretic
 103   limit for the best any comparison-based sorting algorithm can do on
 104   average (across all permutations).  When a method gets significantly
 105   below that, it's either astronomically lucky, or is finding exploitable
 106   structure in the data.
 107
 108       n   lg(n!)    *sort    3sort     +sort   %sort    ~sort     !sort
 109 -------  -------   ------   -------  -------  ------  -------  --------
 110   32768   444255   453096   453614    32908   452871   130491    469141 old
 111                    448885    33016    33007    50426   182083     65534 new
 112                     0.94% 1273.92%   -0.30%  798.09%  -28.33%   615.87% %ch from new
 113
 114   65536   954037   972699   981940    65686   973104   260029   1004607
 115                    962991    65821    65808   101667   364341    131070
 116                     1.01% 1391.83%   -0.19%  857.15%  -28.63%   666.47%
 117
 118  131072  2039137  2101881  2091491   131232  2092894   554790   2161379
 119                   2057533   131410   131361   206193   728871    262142
 120                     2.16% 1491.58%   -0.10%  915.02%  -23.88%   724.51%
 121
 122  262144  4340409  4464460  4403233   262314  4445884  1107842   4584560
 123                   4377402   262437   262459   416347  1457945    524286
 124                     1.99% 1577.82%   -0.06%  967.83%  -24.01%   774.44%
 125
 126  524288  9205096  9453356  9408463   524468  9441930  2218577   9692015
 127                   9278734   524580   524633   837947  2916107   1048574
 128                    1.88%  1693.52%   -0.03% 1026.79%  -23.92%   824.30%
 129
 130 1048576 19458756 19950272 19838588  1048766 19912134  4430649  20434212
 131                  19606028  1048958  1048941  1694896  5832445   2097150
 132                     1.76% 1791.27%   -0.02% 1074.83%  -24.03%   874.38%
 133
 134   Discussion of cases:
 135
 136   *sort:  There's no structure in random data to exploit, so the theoretical
 137   limit is lg(n!).  Both methods get close to that, and timsort is hugging
 138   it (indeed, in a *marginal* sense, it's a spectacular improvement --
 139   there's only about 1% left before hitting the wall, and timsort knows
 140   darned well it's doing compares that won't pay on random data -- but so
 141   does the samplesort hybrid).  For contrast, Hoare's original random-pivot
 142   quicksort does about 39% more compares than the limit, and the median-of-3
 143   variant about 19% more.
 144
 145   3sort, %sort, and !sort:  No contest; there's structure in this data, but
 146   not of the specific kinds samplesort special-cases.  Note that structure
 147   in !sort wasn't put there on purpose -- it was crafted as a worst case for
 148   a previous quicksort implementation.  That timsort nails it came as a
 149   surprise to me (although it's obvious in retrospect).
 150
 151   +sort:  samplesort special-cases this data, and does a few less compares
 152   than timsort.  However, timsort runs this case significantly faster on all
 153   boxes we have timings for, because timsort is in the business of merging
 154   runs efficiently, while samplesort does much more data movement in this
 155   (for it) special case.
 156
 157   ~sort:  samplesort's special cases for large masses of equal elements are
 158   extremely effective on ~sort's specific data pattern, and timsort just
 159   isn't going to get close to that, despite that it's clearly getting a
 160   great deal of benefit out of the duplicates (the # of compares is much less
 161   than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
 162   distinct values, and as the distribution gets more skewed, samplesort's
 163   equal-element gimmicks become less effective, while timsort's adaptive
 164   strategies find more to exploit; in a database supplied by Kevin Altis, a
 165   sort on its highly skewed "on which stock exchange does this company's
 166   stock trade?" field ran over twice as fast under timsort.
 167
 168   However, despite that timsort does many more comparisons on ~sort, and
 169   that on several platforms ~sort runs highly significantly slower under
 170   timsort, on other platforms ~sort runs highly significantly faster under
 171   timsort.  No other kind of data has shown this wild x-platform behavior,
 172   and we don't have an explanation for it.  The only thing I can think of
 173   that could transform what "should be" highly significant slowdowns into
 174   highly significant speedups on some boxes is catastrophic cache effects
 175   in samplesort.
 176
 177   But timsort "should be" slower than samplesort on ~sort, so it's hard
 178   to count that it isn't on some boxes as a strike against it <wink>.
 179
 180 + Here's the highwater mark for the number of heap-based temp slots (4
 181   bytes each on this box) needed by each test, again with arguments
 182   "15 20 1":
 183
 184    2**i  *sort \sort /sort  3sort  +sort  %sort  ~sort  =sort  !sort
 185   32768  16384     0     0   6256      0  10821  12288      0  16383
 186   65536  32766     0     0  21652      0  31276  24576      0  32767
 187  131072  65534     0     0  17258      0  58112  49152      0  65535
 188  262144 131072     0     0  35660      0 123561  98304      0 131071
 189  524288 262142     0     0  31302      0 212057 196608      0 262143
 190 1048576 524286     0     0 312438      0 484942 393216      0 524287
 191
 192   Discussion:  The tests that end up doing (close to) perfectly balanced
 193   merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
 194   also ends up doing balanced merges, but systematically benefits a lot from
 195   the preliminary pre-merge searches described under "Merge Memory" later.
 196   %sort approaches having a balanced merge at the end because the random
 197   selection of elements to replace is expected to produce an out-of-order
 198   element near the midpoint.  \sort, /sort, =sort are the trivial one-run
 199   cases, needing no merging at all.  +sort ends up having one very long run
 200   and one very short, and so gets all the temp space it needs from the small
 201   temparray member of the MergeState struct (note that the same would be
 202   true if the new random elements were prefixed to the sorted list instead,
 203   but not if they appeared "in the middle").  3sort approaches N//3 temp
 204   slots twice, but the run lengths that remain after 3 random exchanges
 205   clearly have very high variance.
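
The lg(n!) column in the comparison-count tables above can be reproduced
with the standard library.  This is a minimal sketch, assuming the
floating-point math.lgamma is accurate enough at these sizes; the name
lg_factorial is made up for illustration and appears nowhere in the sort
code.

```python
from math import lgamma, log

def lg_factorial(n):
    """Return lg(n!) = log2(n!), computed as ln(Gamma(n + 1)) / ln(2)."""
    return lgamma(n + 1) / log(2)

# Print the rounded bound for a few of the table's array sizes.
for n in (32768, 65536, 1048576):
    print(n, round(lg_factorial(n)))
```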
 206
 207
 208 A detailed description of timsort follows.
 209
 210 Runs
 211 ----
 212 count_run() returns the # of elements in the next run.  A run is either
 213 "ascending", which means non-decreasing:
 214
 215     a0 <= a1 <= a2 <= ...
 216
 217 or "descending", which means strictly decreasing:
 218
 219     a0 > a1 > a2 > ...
 220
 221 Note that a run is always at least 2 long, unless we start at the array's
 222 last element.
 223
 224 The definition of descending is strict, because the main routine reverses
 225 a descending run in-place, transforming a descending run into an ascending
 226 run.  Reversal is done via the obvious fast "swap elements starting at each
 227 end, and converge at the middle" method, and that can violate stability if
 228 the slice contains any equal elements.  Using a strict definition of
 229 descending ensures that a descending run contains distinct elements.
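
The run definitions above can be sketched in Python.  This is an
illustration only, not the C routine: the (length, descending) return
shape and the use of list indices instead of pointers are assumptions
made for the sketch.

```python
def count_run(a, lo):
    """Return (length, descending) for the run starting at a[lo].

    "Ascending" means non-decreasing; "descending" means strictly
    decreasing, so a descending run never contains equal elements.
    """
    hi = len(a)
    if lo >= hi:
        return 0, False
    if lo == hi - 1:
        return 1, False              # started at the array's last element
    i = lo + 2
    if a[lo + 1] < a[lo]:            # strictly decreasing so far
        while i < hi and a[i] < a[i - 1]:
            i += 1
        return i - lo, True
    while i < hi and not (a[i] < a[i - 1]):   # non-decreasing so far
        i += 1
    return i - lo, False

data = [5, 4, 4, 3]
n, descending = count_run(data, 0)
if descending:
    data[:n] = data[:n][::-1]        # safe for stability: elements are distinct
```

Only the < operator is used, so ties never extend a descending run; in
the example the run stops before the second 4.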
 230
 231 If an array is random, it's very unlikely we'll see long runs.  If a natural
 232 run contains less than minrun elements (see next section), the main loop
 233 artificially boosts it to minrun elements, via a stable binary insertion sort
 234 applied to the right number of array elements following the short natural
 235 run.  In a random array, *all* runs are likely to be minrun long as a
 236 result.  This has two primary good effects:
 237
 238 1. Random data strongly tends then toward perfectly balanced (both runs have
 239    the same length) merges, which is the most efficient way to proceed when
 240    data is random.
 241
 242 2. Because runs are never very short, the rest of the code doesn't make
 243    heroic efforts to shave a few cycles off per-merge overheads.  For
 244    example, reasonable use of function calls is made, rather than trying to
 245    inline everything.  Since there are no more than N/minrun runs to begin
 246    with, a few "extra" function calls per merge is barely measurable.
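
Boosting a short natural run to minrun elements can be sketched as below.
This is a hedged illustration, assuming Python's bisect module as a
stand-in for the hand-written binary search; the function name and
argument order are hypothetical.

```python
from bisect import bisect_right

def binary_insertion_sort(a, lo, hi, start):
    """Sort a[lo:hi] in place, given that a[lo:start] is already sorted.

    bisect_right places each new element after any equal elements
    already present, which keeps the sort stable.
    """
    for i in range(start, hi):
        pivot = a[i]
        pos = bisect_right(a, pivot, lo, i)
        a[pos + 1:i + 1] = a[pos:i]   # slide the tail right by one slot
        a[pos] = pivot

values = [1, 3, 5, 2, 4, 0]           # a[0:3] is the natural run
binary_insertion_sort(values, 0, len(values), 3)
```

The binary search keeps the number of comparisons low, but the slide
still moves O(n) elements per insertion, which is why large minrun
values hurt.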
 247
 248
 249 Computing minrun
 250 ----------------
 251 If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
 252 array then; it's hard to beat that given the overheads of trying something
 253 fancier.
 254
 255 When N is a power of 2, testing on random data showed that minrun values of
 256 16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
 257 in binary insertion sort clearly hurt, and at 8 the increase in the number
 258 of function calls clearly hurt.  Picking *some* power of 2 is important
 259 here, so that the merges end up perfectly balanced (see next section).  We
 260 pick 32 as a good value in the sweet range; picking a value at the low end
 261 allows the adaptive gimmicks more opportunity to exploit shorter natural
 262 runs.
 263
 264 Because sortperf.py only tries powers of 2, it took a long time to notice
 265 that 32 isn't a good choice for the general case!  Consider N=2112:
 266
 267 >>> divmod(2112, 32)
 268 (66, 0)
 269 >>>
 270
 271 If the data is randomly ordered, we're very likely to end up with 66 runs
 272 each of length 32.  The first 64 of these trigger a sequence of perfectly
 273 balanced merges (see next section), leaving runs of lengths 2048 and 64 to
 274 merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
 275 compares, but it's still more compares than necessary, and-- mergesort's
 276 bugaboo relative to samplesort --a lot more data movement (O(N) copies just
 277 to get 64 elements into place).
 278
 279 If we take minrun=33 in this case, then we're very likely to end up with 64
 280 runs each of length 33, and then all merges are perfectly balanced.  Better!
 281
 282 What we want to avoid is picking minrun such that in
 283
 284     q, r = divmod(N, minrun)
 285
 286 q is a power of 2 and r>0 (then the last merge only gets r elements into
 287 place, and r < minrun is small compared to N), or q a little larger than a
 288 power of 2 regardless of r (then we've got a case similar to "2112", again
 289 leaving too little work for the last merge to do).
 290
 291 Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
 292 power of 2, or if that isn't possible, is close to, but strictly less than,
 293 a power of 2.  This is easier to do than it may sound:  take the first 6
 294 bits of N, and add 1 if any of the remaining bits are set.  In fact, that
 295 rule covers every case in this section, including small N and exact powers
 296 of 2; merge_compute_minrun() is a deceptively simple function.
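
The "first 6 bits, plus 1 if any remaining bit is set" rule can be
sketched in a few lines of Python.  This is an illustration of the rule
as stated above, not a copy of the C function; compute_minrun is a name
chosen for the sketch.

```python
def compute_minrun(n):
    """Take the 6 most significant bits of n; add 1 if any other bit is set."""
    r = 0                  # becomes 1 if any 1 bit is shifted off
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r

for n in (63, 64, 65, 2112, 2**20):
    print(n, compute_minrun(n), divmod(n, compute_minrun(n)))
```

For N < 64 the loop never runs, so the result is N itself, which matches
the small-N rule at the start of this section.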
 297
 298
 299 The Merge Pattern
 300 -----------------
 301 In order to exploit regularities in the data, we're merging on natural
 302 run lengths, and they can become wildly unbalanced.  That's a Good Thing
 303 for this sort!  It means we have to find a way to manage an assortment of
 304 potentially very different run lengths, though.
 305
 306 Stability constrains permissible merging patterns.  For example, if we have
 307 3 consecutive runs of lengths
 308
 309     A:10000  B:20000  C:10000
 310
 311 we dare not merge A with C first, because if A, B and C happen to contain
 312 a common element, it would get out of order wrt its occurrence(s) in B.  The
 313 merging must be done as (A+B)+C or A+(B+C) instead.
 314
 315 So merging is always done on two consecutive runs at a time, and in-place,
 316 although this may require some temp memory (more on that later).
 317
 318 When a run is identified, its base address and length are pushed on a stack
 319 in the MergeState struct.  merge_collapse() is then called to see whether it
 320 should merge it with preceding run(s).  We would like to delay merging as
 321 long as possible in order to exploit patterns that may come up later, but we
 322 like even more to do merging as soon as possible to exploit that the run just
 323 found is still high in the memory hierarchy.  We also can't delay merging
 324 "too long" because it consumes memory to remember the runs that are still
 325 unmerged, and the stack has a fixed size.
 326
 327 What turned out to be a good compromise maintains two invariants on the
 328 stack entries, where A, B and C are the lengths of the three rightmost not-yet
 329 merged slices:
 330
 331 1.  A > B+C
 332 2.  B > C
 333
 334 Note that, by induction, #2 implies the lengths of pending runs form a
 335 decreasing sequence.  #1 implies that, reading the lengths right to left,
 336 the pending-run lengths grow at least as fast as the Fibonacci numbers.
 337 Therefore the stack can never grow larger than about log_base_phi(N) entries,
 338 where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
 339 for very large arrays.
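
To get a feel for how small that bound is, it can be evaluated directly.
This sketch only computes the rough log_base_phi(N) figure quoted above;
it says nothing about the actual stack size used by the C code, and
max_pending_runs is a hypothetical name.

```python
from math import log, sqrt

PHI = (1 + sqrt(5)) / 2

def max_pending_runs(n):
    """Rough bound log_base_phi(n) on the number of pending stack entries."""
    return log(n) / log(PHI)

for bits in (32, 64):
    print(bits, max_pending_runs(2 ** bits))
```

Even for N = 2**64 the bound is a little over 92 entries.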
 340
 341 If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
 342 freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
 343 e.g., if the last 3 entries are
 344
 345     A:30  B:20  C:10
 346
 347 then B is merged with C, leaving
 348
 349     A:30  BC:30
 350
 351 on the stack.  Or if they were
 352
 353     A:500  B:400  C:1000
 354
 355 then A is merged with B, leaving
 356
 357     AB:900  C:1000
 358
 359 on the stack.
 360
 361 In both examples, the stack configuration after the merge still violates
 362 invariant #2, and merge_collapse() goes on to continue merging runs until
 363 both invariants are satisfied.  As an extreme case, suppose we didn't do the
 364 minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
 365 and 2.  Nothing would get merged until the final 2 was seen, and that would
 366 trigger 7 perfectly balanced merges.
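
The merge rules can be simulated on run lengths alone.  This is a
simplified model under stated assumptions: it tracks only lengths (no
data is moved), the list holds pending runs leftmost first, and the
names merge_collapse and simulate merely mirror the text rather than
reproduce the C code.

```python
def merge_collapse(runs, merges):
    """Merge pending runs until A > B+C and B > C hold for the rightmost entries."""
    while len(runs) > 1:
        n = len(runs) - 2                       # index of B; C is at n + 1
        if n > 0 and runs[n - 1] <= runs[n] + runs[n + 1]:
            if runs[n - 1] < runs[n + 1]:       # A is strictly smaller than C
                n -= 1                          # so merge A with B
            # otherwise (including ties) merge B with C
        elif runs[n] > runs[n + 1]:
            break                               # both invariants hold
        merges.append((runs[n], runs[n + 1]))
        runs[n:n + 2] = [runs[n] + runs[n + 1]]

def simulate(run_lengths):
    """Push each run length in turn, collapsing after every push."""
    runs, merges = [], []
    for length in run_lengths:
        runs.append(length)
        merge_collapse(runs, merges)
    return runs, merges

runs, merges = simulate([128, 64, 32, 16, 8, 4, 2, 2])
print(runs, merges)
```

In the extreme case, nothing is merged until the final 2 is pushed, and
every recorded merge then pairs two runs of equal length.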
 367
 368 The thrust of these rules when they trigger merging is to balance the run
 369 lengths as closely as possible, while keeping a low bound on the number of
 370 runs we have to remember.  This is maximally effective for random data,
 371 where all runs are likely to be of (artificially forced) length minrun, and
 372 then we get a sequence of perfectly balanced merges (with, perhaps, some
 373 oddballs at the end).
 374
 375 OTOH, one reason this sort is so good for partly ordered data has to do
 376 with wildly unbalanced run lengths.
 377
 378
 379 Merge Memory
 380 ------------
 381 Merging adjacent runs of lengths A and B in-place is very difficult.
 382 Theoretical constructions are known that can do it, but they're too difficult
 383 and slow for practical use.  But if we have temp memory equal to min(A, B),
 384 it's easy.
 385
 386 If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
 387 and then we can do the obvious merge algorithm left to right, from the temp
 388 area and B, starting the stores into where A used to live.  There's always a
 389 free area in the original area comprising a number of elements equal to the
 390 number not yet merged from the temp array (trivially true at the start;
 391 proceed by induction).  The only tricky bit is that if a comparison raises an
 392 exception, we have to remember to copy the remaining elements back in from
 393 the temp area, lest the array end up with duplicate entries from B.  But
 394 that's exactly the same thing we need to do if we reach the end of B first,
 395 so the exit code is pleasantly common to both the normal and error cases.
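
The left-to-right case can be sketched in Python as follows.  This is an
illustration of the idea only: it works on list indices, does one pair
at a time with no galloping, and uses try/finally to stand in for the
shared exit code; the signature is an assumption.

```python
def merge_lo(a, lo, mid, hi):
    """Merge sorted a[lo:mid] (run A) and a[mid:hi] (run B) in place.

    Only A is copied to temp memory.  Stores start where A used to live.
    """
    tmp = a[lo:mid]            # temp copy of A
    i = 0                      # next unmerged element of tmp
    j = mid                    # next unmerged element of B
    dest = lo                  # next slot to store into
    try:
        while i < len(tmp) and j < hi:
            if a[j] < tmp[i]:  # strict <: ties take from A, keeping stability
                a[dest] = a[j]
                j += 1
            else:
                a[dest] = tmp[i]
                i += 1
            dest += 1
    finally:
        # Common exit: the free area a[dest:j] has exactly len(tmp) - i
        # slots, whether B ran out or a comparison raised.
        a[dest:dest + len(tmp) - i] = tmp[i:]

x = [1, 4, 7, 2, 3, 9, 10]
merge_lo(x, 0, 3, len(x))
```

If A runs out first, the finally clause copies nothing and the rest of B
is already in place.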
 396
 397 If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
 398 much the same, except that we need to merge right to left, copying B into a
 399 temp array and starting the stores at the right end of where B used to live.
 400
 401 A refinement:  When we're about to merge adjacent runs A and B, we first do
 402 a form of binary search (more on that later) to see where B[0] should end up
 403 in A.  Elements in A preceding that point are already in their final
 404 positions, effectively shrinking the size of A.  Likewise we also search to
 405 see where A[-1] should end up in B, and elements of B after that point can
 406 also be ignored.  This cuts the amount of temp memory needed by the same
 407 amount.
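
The two preliminary searches can be sketched with the bisect module.
This is a hedged illustration: bisect stands in for the galloping
searches described later, and the tie-breaking sides (right for B[0] in
A, left for A[-1] in B) are chosen here so that equal elements keep
their order; trim_runs is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

def trim_runs(a, b):
    """Return (k, nb) such that a[:k] and b[nb:] are already in final position."""
    k = bisect_right(a, b[0])     # where B[0] belongs in A
    nb = bisect_left(b, a[-1])    # where A[-1] belongs in B
    return k, nb

A = [1, 2, 3, 10, 11]
B = [5, 6, 12, 13]
k, nb = trim_runs(A, B)
temp_needed = min(len(A) - k, nb)
print(k, nb, temp_needed)
```

Here only A[3:] and B[:2] still need merging, so 2 temp slots suffice
instead of the 4 that min(len(A), len(B)) would suggest.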
 408
 409 These preliminary searches may not pay off, and can be expected *not* to
 410 repay their cost if the data is random.  But they can win huge in all of
 411 time, copying, and memory savings when they do pay, so this is one of the
 412 "per-merge overheads" mentioned above that we're happy to endure because
 413 there is at most one very short run.  It's generally true in this algorithm
 414 that we're willing to gamble a little to win a lot, even though the net
 415 expectation is negative for random data.
 416
 417
 418 Merge Algorithms
 419 ----------------
 420 merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
 421 deals with runs where A <= B, and merge_hi where A > B.  They don't know
 422 whether the data is clustered or uniform, but a lovely thing about merging
 423 is that many kinds of clustering "reveal themselves" by how many times in a
 424 row the winning merge element comes from the same run.  We'll only discuss
 425 merge_lo here; merge_hi is exactly analogous.
 426
 427 Merging begins in the usual, obvious way, comparing the first element of A
 428 to the first of B, and moving B[0] to the merge area if it's less than A[0],
 429 else moving A[0] to the merge area.  Call that the "one pair at a time"
 430 mode.  The only twist here is keeping track of how many times in a row "the
 431 winner" comes from the same run.
 432
 433 If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
 434 we *search* B for where A[0] belongs, and move over all the B's before
 435 that point in one chunk to the merge area, then move A[0] to the merge
 436 area.  Then we search A for where B[0] belongs, and similarly move a
 437 slice of A in one chunk.  Then back to searching B for where A[0] belongs,
 438 etc.  We stay in galloping mode until both searches find slices to copy
 439 less than MIN_GALLOP elements long, at which point we go back to one-pair-
 440 at-a-time mode.
 441
 442 A refinement:  The MergeState struct contains the value of min_gallop that
 443 controls when we enter galloping mode, initialized to MIN_GALLOP.
 444 merge_lo() and merge_hi() adjust this higher when galloping isn't paying
 445 off, and lower when it is.
 446
 447
 448 Galloping
 449 ---------
 450 Still without loss of generality, assume A is the shorter run.  In galloping
 451 mode, we first look for A[0] in B.  We do this via "galloping", comparing
 452 A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
 453 the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
 454 roughly lg(B) comparisons, and, unlike a straight binary search, favors
 455 finding the right spot early in B (more on that later).
 456
 457 After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
 458 consecutive elements, and a straight binary search requires exactly k-1
 459 additional comparisons to nail it.  Then we copy all the B's up to that
 460 point in one chunk, and then copy A[0].  Note that no matter where A[0]
 461 belongs in B, the combination of galloping + binary search finds it in no
 462 more than about 2*lg(B) comparisons.
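
The gallop-then-binary-search combination can be sketched as below.
This is a simplified stand-in, not the C gallop_left(): it always starts
at index 0 (no hint), takes a Python list, and delegates the final
binary search to bisect_left.

```python
from bisect import bisect_left

def gallop_left(key, b):
    """Return the leftmost index i with b[i-1] < key <= b[i], for sorted b.

    Probes b[0], b[1], b[3], b[7], ... and then binary searches the
    bracketed region.
    """
    n = len(b)
    if n == 0 or not (b[0] < key):
        return 0
    last = 0                       # b[last] < key is known
    ofs = 1
    while ofs < n and b[ofs] < key:
        last = ofs
        ofs = 2 * ofs + 1          # next probe: 3, 7, 15, ...
    ofs = min(ofs, n)              # clamp if we ran off the end
    # Now b[last] < key, and key <= b[ofs] unless ofs == n.
    return bisect_left(b, key, last + 1, ofs)

B = list(range(0, 200, 2))
print(gallop_left(7, B), gallop_left(150, B))
```

The result always agrees with a plain bisect_left over the whole run;
only the number of comparisons differs.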
 463
 464 If we did a straight binary search, we could find it in no more than
 465 ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
 466 comparisons no matter where A[0] belongs.  Straight binary search thus loses
 467 to galloping unless the run is quite long, and we simply can't guess
 468 whether it is in advance.
 469
 470 If data is random and runs have the same length, A[0] belongs at B[0] half
 471 the time, at B[1] a quarter of the time, and so on:  a consecutive winning
 472 sub-run in B of length k occurs with probability 1/2**(k+1).  So long
 473 winning sub-runs are extremely unlikely in random data, and guessing that a
 474 winning sub-run is going to be long is a dangerous game.
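
Under that idealized model, the chance of seeing a long winning sub-run
is easy to tabulate.  This is a worked example using exact fractions;
the function names are made up for illustration, and the model is the
random-data, equal-length-runs assumption stated above.

```python
from fractions import Fraction

def p_subrun(k):
    """Probability of a winning sub-run of length exactly k."""
    return Fraction(1, 2 ** (k + 1))

def p_subrun_at_least(k):
    """Probability of length k or more: the tail of the geometric series."""
    return Fraction(1, 2 ** k)

for k in (1, 3, 7):
    print(k, p_subrun(k), p_subrun_at_least(k))
```

So a sub-run of seven or more turns up only once in 128 tries under
this model.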
 475
 476 OTOH, if data is lopsided or lumpy or contains many duplicates, long
 477 stretches of winning sub-runs are very likely, and cutting the number of
 478 comparisons needed to find one from O(B) to O(log B) is a huge win.
 479
 480 Galloping compromises by getting out fast if there isn't a long winning
 481 sub-run, yet finding such very efficiently when they exist.
 482
 483 I first learned about the galloping strategy in a related context; see:
 484
 485     "Adaptive Set Intersections, Unions, and Differences" (2000)
 486     Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
 487
 488 and its followup(s).  An earlier paper called the same strategy
 489 "exponential search":
 490
 491    "Optimistic Sorting and Information Theoretic Complexity"
 492    Peter McIlroy
 493    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
 494    467-474, Austin, Texas, 25-27 January 1993.
 495
 496 and it probably dates back to an earlier paper by Bentley and Yao.  The
 497 McIlroy paper in particular has good analysis of a mergesort that's
 498 probably strongly related to this one in its galloping strategy.
 499
 500
 501 Galloping with a Broken Leg
 502 ---------------------------
 503 So why don't we always gallop?  Because it can lose, on two counts:
 504
 505 1. While we're willing to endure small per-merge overheads, per-comparison
 506    overheads are a different story.  Calling Yet Another Function per
 507    comparison is expensive, and gallop_left() and gallop_right() are
 508    too long-winded for sane inlining.
 509
 510 2. Galloping can-- alas --require more comparisons than linear one-at-a-time
 511    search, depending on the data.
 512
 513 #2 requires details.  If A[0] belongs before B[0], galloping requires 1
 514 compare to determine that, same as linear search, except it costs more
 515 to call the gallop function.  If A[0] belongs right before B[1], galloping
 516 requires 2 compares, again same as linear search.  On the third compare,
 517 galloping checks A[0] against B[3], and if it's <=, requires one more
 518 compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
 519 of 4 compares, but if A[0] does belong at B[2], linear search would have
 520 discovered that in only 3 compares, and that's a huge loss!  Really.  It's
 521 an increase of 33% in the number of compares needed, and comparisons are
 522 expensive in Python.
 523
 524 index in B where    # compares linear  # gallop  # binary  gallop
 525 A[0] belongs        search needs       compares  compares  total
 526 ----------------    -----------------  --------  --------  ------
 527                0                    1         1         0       1
 528
 529                1                    2         2         0       2
 530
 531                2                    3         3         1       4
 532                3                    4         3         1       4
 533
 534                4                    5         4         2       6
 535                5                    6         4         2       6
 536                6                    7         4         2       6
 537                7                    8         4         2       6
 538
 539                8                    9         5         3       8
 540                9                   10         5         3       8
 541               10                   11         5         3       8
 542               11                   12         5         3       8
 543                                         ...
 544
 545 In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
 546 to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
 547 The advantage of galloping is unbounded as i grows, but it doesn't win at
 548 all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
 549 at the other values.  At and after i=6, galloping always wins.
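
The win/lose/tie pattern can be checked mechanically from the two cost
formulas.  This is a small sketch that only evaluates the formulas given
above (with the i=0 row taken from the table); it does not count
comparisons in a real search.

```python
def linear_cost(i):
    """Compares needed by linear search when A[0] belongs at B[i]."""
    return i + 1

def gallop_cost(i):
    """Gallop total: 1 at i=0, else 2*floor(lg(i)) + 2."""
    if i == 0:
        return 1
    return 2 * (i.bit_length() - 1) + 2    # bit_length() - 1 == floor(lg(i))

losses = [i for i in range(12) if gallop_cost(i) > linear_cost(i)]
ties = [i for i in range(12) if gallop_cost(i) == linear_cost(i)]
wins = [i for i in range(12) if gallop_cost(i) < linear_cost(i)]
print(losses, ties, wins)
```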
 550
 551 We can't guess in advance when it's going to win, though, so we do one pair
 552 at a time until the evidence seems strong that galloping may pay.  MIN_GALLOP
 553 is 7, and that's pretty strong evidence.  However, if the data is random, it
 554 simply will trigger galloping mode purely by luck every now and again, and
 555 it's quite likely to hit one of the losing cases next.  On the other hand,
 556 in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
 557 "should be" then.  So the MergeState struct keeps a min_gallop variable
 558 that merge_lo and merge_hi adjust:  the longer we stay in galloping mode,
 559 the smaller min_gallop gets, making it easier to transition back to
 560 galloping mode (if we ever leave it in the current merge, and at the
 561 start of the next merge).  But whenever the gallop loop doesn't pay,
 562 min_gallop is increased by one, making it harder to transition back
 563 to galloping mode (and again both within a merge and across merges).  For
 564 random data, this all but eliminates the gallop penalty:  min_gallop grows
 565 large enough that we almost never get into galloping mode.  And for cases
 566 like ~sort, min_gallop can fall to as low as 1.  This seems to work well,
 567 but in all it's a minor improvement over using a fixed MIN_GALLOP value.
 568
 569
 570 Galloping Complication
 571 ----------------------
 572 The description above was for merge_lo.  merge_hi has to merge "from the
 573 other end", and really needs to gallop starting at the last element in a run
 574 instead of the first.  Galloping from the first still works, but does more
 575 comparisons than it should (this is significant -- I timed it both ways).
 576 For this reason, the gallop_left() and gallop_right() functions have a
 577 "hint" argument, which is the index at which galloping should begin.  So
 578 galloping can actually start at any index, and proceed at offsets of 1, 3,
 579 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
 580
 581 In the code as I type it's always called with either 0 or n-1 (where n is
 582 the # of elements in a run).  It's tempting to try to do something fancier,
 583 melding galloping with some form of interpolation search; for example, if
 584 we're merging a run of length 1 with a run of length 10000, index 5000 is
 585 probably a better guess at the final result than either 0 or 9999.  But
 586 it's unclear how to generalize that intuition usefully, and merging of
 587 wildly unbalanced runs already enjoys excellent performance.
 588
 589 ~sort is a good example of when balanced runs could benefit from a better
 590 hint value:  to the extent possible, this would like to use a starting
 591 offset equal to the previous value of acount/bcount.  Doing so saves about
 592 10% of the compares in ~sort.  However, doing so is also a mixed bag,
 593 hurting other cases.
 594
 595
 596 Comparing Average # of Compares on Random Arrays
 597 ------------------------------------------------
 598 [NOTE:  This was done when the new algorithm used about 0.1% more compares
 599  on random data than does its current incarnation.]
 600
 601 Here list.sort() is samplesort, and list.msort() this sort:
 602
 603 """
 604 import random
 605 from time import clock as now
 606
 607 def fill(n):
 608     from random import random
 609     return [random() for i in xrange(n)]
 610
 611 def mycmp(x, y):
 612     global ncmp
 613     ncmp += 1
 614     return cmp(x, y)
 615
 616 def timeit(values, method):
 617     global ncmp
 618     X = values[:]
 619     bound = getattr(X, method)
 620     ncmp = 0
 621     t1 = now()
 622     bound(mycmp)
 623     t2 = now()
 624     return t2-t1, ncmp
 625
 626 format = "%5s  %9.2f  %11d"
 627 f2     = "%5s  %9.2f  %11.2f"
 628
 629 def drive():
 630     count = sst = sscmp = mst = mscmp = nelts = 0
 631     while True:
 632         n = random.randrange(100000)
 633         nelts += n
 634         x = fill(n)
 635
 636         t, c = timeit(x, 'sort')
 637         sst += t
 638         sscmp += c
 639
 640         t, c = timeit(x, 'msort')
 641         mst += t
 642         mscmp += c
 643
 644         count += 1
 645         if count % 10:
 646             continue
 647
 648         print "count", count, "nelts", nelts
 649         print format % ("sort",  sst, sscmp)
 650         print format % ("msort", mst, mscmp)
 651         print f2     % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)
 652
 653 drive()
 654 """
 655
 656 I ran this on Windows and kept using the computer lightly while it was
 657 running.  time.clock() is wall-clock time on Windows, with better than
 658 microsecond resolution.  samplesort started with a 1.52% #-of-comparisons
 659 disadvantage, fell quickly to 1.48%, and then fluctuated within that small
 660 range.  Here's the last chunk of output before I killed the job:
 661
 662 count 2630 nelts 130906543
 663  sort    6110.80   1937887573
 664 msort    6002.78   1909389381
 665             1.80         1.49
 666
 667 We've done nearly 2 billion comparisons apiece at Python speed there, and
 668 that's enough <wink>.
 669
 670 For random arrays of size 2 (yes, there are only 2 interesting ones),
 671 samplesort has a 50%(!) comparison disadvantage.  This is a consequence of
 672 samplesort special-casing at most one ascending run at the start, then
 673 falling back to the general case if it doesn't find an ascending run
 674 immediately.  The consequence is that it ends up using two compares to sort
 675 [2, 1].  Gratifyingly, timsort doesn't do any special-casing, so had to be
 676 taught how to deal with mixtures of ascending and descending runs
 677 efficiently in all cases.