How to watch Timsort move elements around?
Asked Answered
N

2

6

I'd like to sort a list and observe/visualize how Python's sorting algorithm Timsort is moving the elements around.

First attempt: A subclass of list which prints itself after each change:

class List(list):
    def __setitem__(self, index, value):
        list.__setitem__(self, index, value)
        print(self)

That works when I change elements myself, but during sort ... nothing:

>>> a = List([None] * 2)
>>> a[0] = 'S'
['S', None]
>>> a[1] = 'P'
['S', 'P']
>>> a.sort()
>>>

Second attempt: Print the list (in global variable a) at each comparison of elements:

class Str(str):
    def __lt__(self, other):
        print(a)
        return other > self

That does something, but the list is always... empty?

>>> a = list(map(Str, 'test'))
>>> a.sort()
[]
[]
[]
[]
[]
[]
>>>

Why do these attempts fail, and is there a way to observe what Timsort is doing?

Numismatics answered 5/8, 2020 at 22:31 Comment(0)
N
11

Yes, it can be observed. I did, here's a visualization:

Timsort visualization

And another one with more elements:

Timsort visualization

Why do your attempts fail?

  • The list subclass fails because sort works "inside" the list at C code level, not going through the public Python __setitem__ interface.
  • The comparison attempt fails because the list is indeed made empty during the sort, as the source code explains:
     /* The list is temporarily made empty, so that mutations performed
     * by comparison functions can't affect the slice of memory we're
     * sorting (allowing mutations during sorting is a core-dump
     * factory, since ob_item may change).
    

But then how can we observe the list during the sorting, if it's empty? Easy: We don't. Not during the sorting. But after partial sortings.

Timsort is:

  • Deterministic, so sorting a certain input always results in the same steps.
  • Nice enough to not destroy the list if an exception occurs. Instead it just "stops where it is" and leaves the list in a not fully sorted state. As the code says:
    /* [...]  Even in case of error, the
     * list will be some permutation of its input state (nothing is lost or
     * duplicated).
    

So the idea is:

  • Sort, and raise an exception at the first comparison so that Timsort stops there. Catch the exception and print the partially sorted list.
  • Sort again, from the same initial state, and raise an exception at the second comparison so that Timsort stops there. Catch the exception and print the partially sorted list.
  • Again, but with an exception at the third comparison.
  • And so on, allowing more and more comparisons, until the list is fully sorted.

Code doing that (except it doesn't show duplicate states, which happens when Timsort makes several comparisons before the next move):

import string
from random import shuffle
from itertools import count

class Str(str):
    def __lt__(self, other):
        global allowed_comparisons
        if allowed_comparisons == 0:
            raise Exception
        allowed_comparisons -= 1
        return other > self

a = list(string.ascii_letters + string.digits + '<>')
shuffle(a)

shown = None
for allowed_comparisons in count():
    try:
        b = list(map(Str, a))
        b.sort()
    except:
        pass
    state = ''.join(b)
    if state != shown:
        print(state)
        shown = state
    if list(state) == sorted(state):
        break

Output, discussed below:

k1z5Qi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1kz5Qi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15kzQi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15Qkzi>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15Qikz>jCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>QikzjCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>QijkzCVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQijkzVsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQVijkzsfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQVijkszfRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQVfijkszRbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQRVfijkszbBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>CQRVbfijkszBWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRVbfijkszWSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRVWbfijkszSnJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRSVWbfijksznJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCQRSVWbfijknszJOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCJQRSVWbfijknszOD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCJOQRSVWbfijknszD6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
15>BCDJOQRSVWbfijknsz6AZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
156>BCDJOQRSVWbfijknszAZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
156>ABCDJOQRSVWbfijknszZ3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
156>ABCDJOQRSVWZbfijknsz3cGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDJOQRSVWZbfijknszcGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDJOQRSVWZbcfijknszGTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGJOQRSVWZbcfijknszTaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGJOQRSTVWZbcfijknszaIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGJOQRSTVWZabcfijknszIrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknszrvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrszvw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvzw0pePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0wpePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0pwePH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0epwPH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0PepwH94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0HPepw94UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz09HPepw4UgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPepwUgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPUepwgqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPUegpwqEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049HPUegpqwEK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049EHPUegpqwK2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz049EHKPUegpqw2FNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EHKPUegpqwFNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKPUegpqwNYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKNPUegpqwYL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKNPUYegpqwL7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz0249EFHKLNPUYegpqw7Xdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUYegpqwXdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYegpqwdltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdegpqwltoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdeglpqwtoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdeglpqtwoxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz02479EFHKLNPUXYdeglopqtwxy8uM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789EFHKLNPUXYdeglopqtwxyuM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789EFHKLNPUXYdeglopqtuwxyM<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789EFHKLMNPUXYdeglopqtuwxy<hm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeglopqtuwxyhm
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeghlopqtuwxym
1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeghlmopqtuwxy
01356>ABCDGIJOQRSTVWZabcfijknrsvz24789<EFHKLMNPUXYdeghlmopqtuwxy
012356>ABCDGIJOQRSTVWZabcfijknrsvz4789<EFHKLMNPUXYdeghlmopqtuwxy
0123456>ABCDGIJOQRSTVWZabcfijknrsvz789<EFHKLMNPUXYdeghlmopqtuwxy
01234567>ABCDGIJOQRSTVWZabcfijknrsvz89<EFHKLMNPUXYdeghlmopqtuwxy
012345678>ABCDGIJOQRSTVWZabcfijknrsvz9<EFHKLMNPUXYdeghlmopqtuwxy
0123456789>ABCDGIJOQRSTVWZabcfijknrsvz<EFHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDGIJOQRSTVWZabcfijknrsvzEFHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEGIJOQRSTVWZabcfijknrsvzFHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGIJOQRSTVWZabcfijknrsvzHKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJOQRSTVWZabcfijknrsvzKLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKOQRSTVWZabcfijknrsvzLMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLOQRSTVWZabcfijknrsvzMNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMOQRSTVWZabcfijknrsvzNPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOQRSTVWZabcfijknrsvzPUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTVWZabcfijknrsvzUXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWZabcfijknrsvzXYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXZabcfijknrsvzYdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcfijknrsvzdeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdfijknrsvzeghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefijknrsvzghlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefgijknrsvzhlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijknrsvzlmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklnrsvzmopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnrsvzopqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnorsvzpqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnoprsvzqtuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrsvztuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstvzuwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvzwxy
0123456789<>ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

In the above output, note there are three stages:

  1. The first half (32 elements) gets sorted until it's 1356>ABCDGIJOQRSTVWZabcfijknrsvz. Timsort does that with binary insertion sort. Each line in the output corresponds to one insertion.
  2. The second half (32 elements) gets sorted until it's 024789<EFHKLMNPUXYdeghlmopqtuwxy. Again with binary insertion sort.
  3. Timsort merges the two halves. These lines in the output don't quite show real states during the sort. Let's look at the first step:
    1356>ABCDGIJOQRSTVWZabcfijknrsvz024789<EFHKLMNPUXYdeghlmopqtuwxy
    01356>ABCDGIJOQRSTVWZabcfijknrsvz24789<EFHKLMNPUXYdeghlmopqtuwxy
    The way Timsort really merges the two halves is to move the first half out, to temporary space. And then merge the two halves into the given list from left to right. So really after the 0 from the second half gets moved to the front, the list looks like this:
    0--------------------------------24789<EFHKLMNPUXYdeghlmopqtuwxy
    With all the dashes being unoccupied space. Now when I raise my exception, Timsort doesn't just leave the list like that, but nicely at least moves the first halve's remaining elements back into that unoccupied space. And that's why in my output, it looks like Timsort is moving the 0 to the left and shifting the entire first half to the right by one index. That would be inefficient, and it's not what happens when Timsort runs normally, without my exceptions.

You can also see these three stages in my animated image at the top. And in this "screensaver" version, where I scroll down through the above output. I think it looks cool (was hoping it would look somewhat like Matrix code rain) but I find it less clear:

Timsort visualization

Note that here the third stage merges from right to left (with the right half moved out to temp space), which Timsort does when that's better.

Code for generating that image using Pillow, after storing the states in list states instead of printing them:

from PIL import Image, ImageDraw

images = []
for i in range(len(states) - 14):
    img = Image.new('RGB', (384, 225), (0, 0, 0))
    text = '\n'.join(states[i:i+15])
    d = ImageDraw.Draw(img)
    d.multiline_text((0,0), text, fill=(0, 200, 0))
    img = img.resize((384*2, 225*2), 0)
    images.append(img)
images[0].save('timatrix.png', save_all=True, append_images=images[1:],
               optimize=False, duration=100, loop=0)
Numismatics answered 5/8, 2020 at 22:31 Comment(0)
F
1

If your goal is to gain some insight into the operation of the algorithm, then knowing the specific value and place of the elements is less important than watching the run lengths on the stack. These run lengths are visible in the saw tooths that appear in Stefan Pochmann's animation.

These run lengths reveal the logarithmic limit on growth of the stack of pointers (indices) into the data array which is an important feature of the basic timsort algorithm. The following is an example of a clearer presentation of this, that just shows the stack indices:

  entry [0, 53, 106]
   exit [0, 106]
  entry [0, 106, 159, 212]
   exit [0, 106, 212]
  entry [0, 106, 212]
   exit [0, 212]
  entry [0, 212, 265, 318]
   exit [0, 212, 318]
  entry [0, 212, 318, 371, 424]
   exit [0, 212, 318, 424]
  entry [0, 212, 318, 424]
   exit [0, 212, 424]
  entry [0, 212, 424]
   exit [0, 424]
  entry [0, 424, 477, 530]
   exit [0, 424, 530]
  entry [0, 424, 530, 583, 636]
   exit [0, 424, 530, 636]
  entry [0, 424, 530, 636]
   exit [0, 424, 636]
  entry [0, 424, 636, 689, 742]
   exit [0, 424, 636, 742]
  entry [0, 424, 636, 742, 795, 836]
   exit [0, 424, 636, 742, 836]
  entry [0, 424, 636, 742, 836]
   exit [0, 424, 636, 836]
  entry [0, 424, 636, 836]
   exit [0, 424, 836]
  entry [0, 424, 836]
   exit [0, 836]
Functional trial 0 sorted 836 random data items.

Data between indices has been sorted in ascending order. The label entry shows the stack state at the merge routine entry point and exit shows the state on exit. When entry equals the previous exit, multiple runs are being merged due to repeated failure of the invariant prior to proceeding with creation of the next run. Data from the stack top index to the size of the array is data that timsort has not yet examined in its linear scan of the array. When the stack top index equals the array size, timsort has finished scanning, but may need to finalize with some merging. In this example, timsort generates two runs before the invariant fails for the first time and a merge commences, and finalizes by merging five runs.

While this type of output could be generated by a more complex postmortem using Stefan Pochmann's code, this data was generated using the implementation of timsort found here:

https://gist.github.com/ruminations/89a045dc0ef7edfb92304a0de0752ee0

by uncommenting the two ##print ... statements in the merge method (lines 263 and 303) and adding a break statement after line 391 in the test suite at the end. The code is useful for playing around with the algorithm and gaining some intuitive understanding of its operation.

Fulmination answered 4/12, 2020 at 20:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.