Python hangs indefinitely trying to delete deeply recursive object
Asked Answered
S

2

13

I wrote a ternary search tree in Python and I've noticed that when the tree gets very deep, attempting to delete it causes Python to hang indefinitely. Here is a stripped version of the code that produces this behaviour:

import random
import sys
from collections import deque


class Node():
    __slots__ = ("char", "count", "lo", "eq", "hi")

    def __init__(self, char):
        self.char = char
        self.count = 0

        self.lo = None
        self.eq = None
        self.hi = None


class TernarySearchTree():
    """Ternary search tree that stores counts for n-grams
    and their subsequences.
    """

    def __init__(self, splitchar=None):
        self.root = None
        self.splitchar = splitchar

    def insert(self, string):
        self.root = self._insert(string, self.root)

    def _insert(self, string, node):
        """Insert string at a given node.
        """
        if not string:
            return node

        char, *rest = string

        if node is None:
            node = Node(char)

        if char == node.char:
            if not rest:
                node.count += 1
                return node
            else:
                if rest[0] == self.splitchar:
                    node.count += 1
                node.eq = self._insert(rest, node.eq)

        elif char < node.char:
            node.lo = self._insert(string, node.lo)

        else:
            node.hi = self._insert(string, node.hi)

        return node


def random_strings(num_strings):
    random.seed(2)
    symbols = "abcdefghijklmnopqrstuvwxyz"

    for i in range(num_strings):
        length = random.randint(5, 15)
        yield "".join(random.choices(symbols, k=length))


def train():
    tree = TernarySearchTree("#")
    grams = deque(maxlen=4)

    for token in random_strings(27_000_000):
        grams.append(token)
        tree.insert("#".join(grams))

    sys.stdout.write("This gets printed!\n")
    sys.stdout.flush()


def main():
    train()

    sys.stdout.write("This doesn't get printed\n")
    sys.stdout.flush()


if __name__ == "__main__":
    main()

This prints out "This gets printed", but not "This doesn't get printed". Trying to delete the object manually has the same effect. If I reduce the number of strings inserted from 27 million to 25 million, everything is fine - Python prints out both statements and then exits immediately. I tried to run GDB and this is the backtrace that I get:

#0  pymalloc_free.isra.0 (p=0x2ab537a4d580) at /tmp/build/80754af9/python_1546061345851/work/Objects/obmalloc.c:1797
#1  _PyObject_Free (ctx=<optimized out>, p=0x2ab537a4d580)
    at /tmp/build/80754af9/python_1546061345851/work/Objects/obmalloc.c:1834
#2  0x0000555555701c18 in subtype_dealloc ()
    at /tmp/build/80754af9/python_1546061345851/work/Objects/typeobject.c:1256
#3  0x0000555555738ce6 in _PyTrash_thread_destroy_chain ()
    at /tmp/build/80754af9/python_1546061345851/work/Objects/object.c:2212
#4  0x00005555556cd24f in frame_dealloc (f=<optimized out>)
    at /tmp/build/80754af9/python_1546061345851/work/Objects/frameobject.c:492
#5  function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1546061345851/work/Objects/call.c:291
#6  _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1546061345851/work/Objects/call.c:408
#7  0x00005555557241a6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1546061345851/work/Python/ceval.c:4616
#8  _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546061345851/work/Python/ceval.c:3124
#9  0x00005555556ccecb in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, 
    co=<optimized out>) at /tmp/build/80754af9/python_1546061345851/work/Objects/call.c:283
#10 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1546061345851/work/Objects/call.c:408
#11 0x00005555557241a6 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1546061345851/work/Python/ceval.c:4616
#12 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1546061345851/work/Python/ceval.c:3124
#13 0x00005555556690d9 in _PyEval_EvalCodeWithName ()
    at /tmp/build/80754af9/python_1546061345851/work/Python/ceval.c:3930
#14 0x0000555555669fa4 in PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1546061345851/work/Python/ceval.c:3959
#15 0x0000555555669fcc in PyEval_EvalCode (co=co@entry=0x2aaaaac08300, globals=globals@entry=0x2aaaaaba8168, 
    locals=locals@entry=0x2aaaaaba8168) at /tmp/build/80754af9/python_1546061345851/work/Python/ceval.c:524
#16 0x0000555555783664 in run_mod () at /tmp/build/80754af9/python_1546061345851/work/Python/pythonrun.c:1035
#17 0x000055555578d881 in PyRun_FileExFlags ()
    at /tmp/build/80754af9/python_1546061345851/work/Python/pythonrun.c:988
#18 0x000055555578da73 in PyRun_SimpleFileExFlags ()
    at /tmp/build/80754af9/python_1546061345851/work/Python/pythonrun.c:429
#19 0x000055555578db3d in PyRun_AnyFileExFlags ()
    at /tmp/build/80754af9/python_1546061345851/work/Python/pythonrun.c:84
#20 0x000055555578eb2f in pymain_run_file (p_cf=0x7fffffffd240, filename=0x5555558c5440 L"minimal.py", 
    fp=0x5555559059a0) at /tmp/build/80754af9/python_1546061345851/work/Modules/main.c:427
#21 pymain_run_filename (cf=0x7fffffffd240, pymain=0x7fffffffd350)
    at /tmp/build/80754af9/python_1546061345851/work/Modules/main.c:1627
#22 pymain_run_python (pymain=0x7fffffffd350) at /tmp/build/80754af9/python_1546061345851/work/Modules/main.c:2876
#23 pymain_main () at /tmp/build/80754af9/python_1546061345851/work/Modules/main.c:3037
#24 0x000055555578ec4c in _Py_UnixMain () at /tmp/build/80754af9/python_1546061345851/work/Modules/main.c:3072
#25 0x00002aaaaaf0d3d5 in __libc_start_main () from /lib64/libc.so.6
#26 0x0000555555733982 in _start () at ../sysdeps/x86_64/elf/start.S:103

If I try to step through from that point on, execution loops through three lines in obmalloc.c - GDB says on lines 1796-98, but the numbers seem to be off and the file in the traceback (in /tmp/) doesn't exist.

This happens on both Python 3.7.3 and 3.6, though the number of strings required to cause the hang-up differs (27 million was where it happened for both versions). The required memory at that point is about 80 gigabytes and it takes 45 minutes until it prints out the first statement. The version I actually use is written in cython, which reduces memory and runtime, but faces the same issue.

Using the object works as intended, even if a billion strings are inserted. Keeping the object alive (returning it from the function or putting it into globals()) postpones the problem until Python exits - so at least I can ensure that all work is done at that point, but I would really like to know what is going wrong here.

Edit: I installed python via conda (4.6.2) and I am on a linux server node:

> uname -a
Linux computingnodeX 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Stockpile answered 20/5, 2019 at 22:9 Comment(6)
What does "indefinitely" mean (how long did you wait)? Is the process consuming CPU cycles?Barleycorn
That version I posted I let run for several hours, the process is never idle. The one I actually use ran for days and never terminated. As I said, if the number of strings is just slightly lower, it exits immediately.Stockpile
/tmp/build/80754af9/python_1546061345851/work looks like the path embedded into the executable -- where the sources were on a build server when and where your Python was built.Grantor
How much physical RAM do you actually have? This issue could be exacerbated if you're swapping to disk.Pestalozzi
I'm on a server and I've run the script numerous times, at times it had 150GB allocated to it, so I don't think it ever went to swap. If I remember right, the job allocator doesn't even allow using swap space, it just kills the job if it uses too much memory.Stockpile
@Grantor Yes, I'm on a server and installed python through conda. I suppose that path is where conda put the source files during installation and then it removed them. I just learned how to build from source - with that version I will probably get pointed to the right lines numbers in files I actually have. (:Stockpile
F
4

Could you try recompiling Python?

In obmalloc.c, there is ARENA_SIZE macro defined as:

#define ARENA_SIZE              (256 << 10)     /* 256KB */

This default value is not optimized for very large memory systems.

Your script takes long time for sorting arenas by number of free pools in it. It can be O(N^2) in worst case, when many arenas has same number of free pools.

Your script frees memory blocks in random order, it is near to the worst case.

N is number of arenas here. When you change ARENA_SIZE to (1024 << 10), size of arena is 4x, N become 1/4, and N^2 become 1/16.


If you can not recompile Python, you can use malloc instead of pymalloc.

$ PYTHONMALLOC=malloc python3 yourscript.py

You can override malloc with jemalloc or tcmalloc using LD_PRELOAD environment variable.

Foliole answered 23/5, 2019 at 10:50 Comment(7)
How do I recompile python with the ARENA_SIZE changed?Stockpile
Do you know how to build Python from source (without making any changes)? If so, do the same thing but first change "256" to "1024" in the line he pointed out. But if not, read the docs for the operating system you're using.Barleycorn
I thought I needed sudo rights to build, but that's not the case. I just submitted the script using the changed build, will report.Stockpile
@methane, FYI, in my much (10x) smaller case, making this change cut total deallocation time from about 45 seconds to about 10. On python-dev, I suggested an approach to replace "sorting" with constant-time linked list operations (which would require changing a lot of code).Barleycorn
I can confirm this. After changing ARENA_SIZE to (256 << 12), deallocation takes 13 minutes. The version using malloc is running now.Stockpile
With PYTHONMALLOC=malloc, insertion takes nearly twice as long, but deallocation only takes 2 minutes!Stockpile
One last observation: For the cython version, insertion time remains identical if malloc is used, deallocation drops to a bit over one minute.Stockpile
B
7

Update

On the bug report, a run on a giant machine showed that the time to reclaim the tree storage fell from almost 5 hours to about 70 seconds:

master:

build time 0:48:53.664428
teardown time 4:58:20.132930

patched:

build time 0:48:08.485639
teardown time 0:01:10.46670

(Proposed fix)

Here's a pull request against the CPython project that proposes to "fix this" by removing the searches entirely. It works fine for my 10x smaller test case, but I don't have access to a machine with anywhere near enough RAM to run the original. So I'm waiting for someone who does before merging the PR (who knows? there may be more than one "huge number of objects" design flaw here).

Original reply

Thank you for the nice job of providing an executable sample reproducing your problem! Alas, I can't run it - requires far more memory than I have. If I cut the number of strings by a factor of ten, I end up with about 100,000,000 Node instances in about 8GB of RAM, and it takes about 45 seconds for garbage collection to tear down the tree (Python 3.7.3). So I'm guessing you have about a billion Node instances.

I expect you're not getting responses because there's no "general problem" known here, and it requires such a hefty machine to even try it. The python-dev mailing list may be a better place to ask, or open an issue on https://bugs.python.org.

The usual cause for very slow garbage collection at the end of a run is that memory got swapped out to disk, and then it takes thousands of times longer "than normal" to read objects back into RAM, in "random" order, to tear them down. I'm assuming that's not happening here, though. If it were, CPU usage usually drops to near 0, as the process spends most of its time waiting for disk reads.

Less often, some bad pattern is hit in the underlying C library's malloc/free implementation. But that also seems unlikely here, because these objects are small enough that Python only asks C for "big chunks" of RAM and carves them up itself.

So I don't know. Because nothing can be ruled out, you should also give details about the OS you're using, and how Python was built.

Just for fun, you could try this to get some sense of how far things get before it stalls out. First add this method to Node:

def delete(self):
    global killed
    if self.lo:
        self.lo.delete()
        self.lo = None
    if self.eq:
        self.eq.delete()
        self.eq = None
    if self.hi:
        self.hi.delete()
        self.hi = None
    killed += 1
    if killed % 100000 == 0:
        print(f"{killed:,} deleted")

At the end of train(), add this:

tree.root.delete()

And replace the call to main() with:

killed = 0
main()
print(killed, "killed")

Which may or may not reveal something interesting.

DIDN'T HANG FOR SOMEONE ELSE

I posted a note about this to the python-dev mailing list, and one person so far replied privately:

I started this using Python 3.7.3 | packaged by conda-forge | (default, Mar 27 2019, 23:01:00) [GCC 7.3.0] :: Anaconda, Inc. on linux

$ python fooz.py
This gets printed!
This doesn't get printed

It did take ~80 GB of RAM and several hours, but did not get stuck.

So, unless someone else pops up who can reproduce it, we're probably out of luck here. You at least need to give more info about exactly which OS you're using, and how Python was built.

Barleycorn answered 21/5, 2019 at 22:4 Comment(12)
Thanks you! I can confirm this: I gave the script more time now and it does terminate after a bit over four hours. It is indeed a bit over one billion nodes that are created. The full script I am using inserts nearly a billion strings and requires about four times as much memory - that one had three days from the point where the tree object gets dereferenced and never terminated. So it seems that the time deleting the tree takes increases discontinuously at around 1 billion and does not scale linearly from there on. So, what do I do next?Stockpile
Oh and I also tried tracking the deletion (and creation) manually as you suggested, which went through until the end. It seems that the deletion is much faster at the beginning (setting the first 100 million nodes to None take a fraction of the next 100 million, I guess that GC kicking in) and that it may even get progressively worse. Not sure about that, I haven't come around to plotting the number of nodes deleted over the time stamps I put in, but it looked like that scrolling through the log.Stockpile
Thanks, Christian! Do follow up on @methane's suggestion, if at all possible. There's a more technical discussion in the python-dev thread I linked to, and it's looking quite likely now that you're getting burned by quadratic-time sorting of a great many "arenas" (no, I don't expect anyone to know what that means unless they're intimately familiar with Python's memory internals - it's the "quadratic-time" that's important).Barleycorn
Yes, I'm just waiting for methane to respond. The problem of course is my code will be used by others and telling them to recompile python with a different arena size or using a different allocator is not really an option. I think I know what arenas are, but I don't understand why they need to be sorted for deallocation (or does that happen after every couple of blocks, so as to prepare for new allocations?). And the quadratic time taken for sorting does not explain the sudden increase when I go from 25 to 27 million inserts, right?Stockpile
Sorting time is probabilistic, and has worst-case quadratic time. Any change in the data could make any individual case anywhere from linear to quadratic time. In any case, Stackoverflow comments aren't suitable for long explanations of technical details - go to python-dev if you want that. Arenas are sorted by increasing number of free pools, to take new pools from the busiest arenas, both to increase memory locality and to give lightly used arenas the best chance of becoming entirely empty (so they can be returned to the system).Barleycorn
And there's really nothing clean you can do about it. It doesn't have to do with the tree structure, or with recursion, it has to do with a very large number of objects being freed in (pseudo)random order. This specific part of Python's memory management was designed when the biggest boxes had far less RAM, and doesn't scale well. You're a pioneer :-)Barleycorn
I understand why sorted arenas are useful for allocating economically. What I don't understand is why sorting would happen while all those small objects are freed. I take it that the sorting happens numerous times, hence the slowdown. But apparently the script never gets to the next statement until the memory for all these objects is released, so there is no allocation in between. Why would there be any sorting of arenas during a period of no allocations (as this wouldn't change which arenas are freed, just the order)? Is that because objects might have destructors that allocate new memory?Stockpile
Sorting never happens during allocation. A new pool, when needed, is taken from the arena at the head of the list, which has the fewest free pools. Then it has even fewer free pools, so remains at the head the list. It's only _de_allocating, which frees pools, that can require changing the order. When an arena's count of free pools increases, it may need to "slide to the right" in the list. There is no "sorting step" as such. Don't look for a shallow answer here - there isn't one ;-) But if you want to try, take it to python-dev. Stackoverflow comments are utterly horrible for this.Barleycorn
BTW, note that it's the sheer number of objects here, not their sizes. So you may do much better with a different data structure. For example, perhaps a multiway trie implemented via dicts mapping a character to a trie.Barleycorn
Ah, I see. I forgot about pools and didn't realize that memory in arenas will only ever be requested in single pools. I thought that an allocation might require a block larger than what is free in the arena heading the list, so it would be taken from another, which then might need to move up.Stockpile
I don't see how a multiway trie would help, though, as the number of edges/leaves would stay the same. A radix/patricia tree or using words instead of characters would reduce the number of objects, but I find it hard to accept that this straightforward implementation won't work in python past over a certain input size.Stockpile
@TimPeters Any ideas on this? :- #56549658Convoke
F
4

Could you try recompiling Python?

In obmalloc.c, there is ARENA_SIZE macro defined as:

#define ARENA_SIZE              (256 << 10)     /* 256KB */

This default value is not optimized for very large memory systems.

Your script takes long time for sorting arenas by number of free pools in it. It can be O(N^2) in worst case, when many arenas has same number of free pools.

Your script frees memory blocks in random order, it is near to the worst case.

N is number of arenas here. When you change ARENA_SIZE to (1024 << 10), size of arena is 4x, N become 1/4, and N^2 become 1/16.


If you can not recompile Python, you can use malloc instead of pymalloc.

$ PYTHONMALLOC=malloc python3 yourscript.py

You can override malloc with jemalloc or tcmalloc using LD_PRELOAD environment variable.

Foliole answered 23/5, 2019 at 10:50 Comment(7)
How do I recompile python with the ARENA_SIZE changed?Stockpile
Do you know how to build Python from source (without making any changes)? If so, do the same thing but first change "256" to "1024" in the line he pointed out. But if not, read the docs for the operating system you're using.Barleycorn
I thought I needed sudo rights to build, but that's not the case. I just submitted the script using the changed build, will report.Stockpile
@methane, FYI, in my much (10x) smaller case, making this change cut total deallocation time from about 45 seconds to about 10. On python-dev, I suggested an approach to replace "sorting" with constant-time linked list operations (which would require changing a lot of code).Barleycorn
I can confirm this. After changing ARENA_SIZE to (256 << 12), deallocation takes 13 minutes. The version using malloc is running now.Stockpile
With PYTHONMALLOC=malloc, insertion takes nearly twice as long, but deallocation only takes 2 minutes!Stockpile
One last observation: For the cython version, insertion time remains identical if malloc is used, deallocation drops to a bit over one minute.Stockpile

© 2022 - 2025 — McMap. All rights reserved.