I will expand on Brett's answer from my recent experience. The Dozer package is well maintained, and despite advancements like the addition of tracemalloc to the stdlib in Python 3.4, its gc.get_objects counting chart is my go-to tool for tackling memory leaks. Below I use dozer > 0.7, which had not been released at the time of writing (well, because I contributed a couple of fixes there recently).
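To give an idea of what that counting chart is built on, here's a minimal sketch of the same approach: snapshot all objects tracked by the garbage collector and count them by type. Dozer does this periodically and charts the counts over time; this one-off snippet (and its count_objects_by_type helper) is just an illustration of mine, not Dozer's API.

import gc
from collections import Counter

def count_objects_by_type(top=10):
    # Group every GC-tracked object by its fully-qualified type name
    counts = Counter(
        f'{type(o).__module__}.{type(o).__qualname__}' for o in gc.get_objects()
    )
    return counts.most_common(top)

for name, count in count_objects_by_type():
    print(f'{count:8d}  {name}')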
Example
Let's look at a non-trivial memory leak. I'll use Celery 4.4 here and will eventually uncover a feature which causes the leak (and because it's a bug/feature kind of thing, it can be called mere misconfiguration, caused by ignorance). So there's a Python 3.6 venv where I pip install celery < 4.5. In it I have the following module.
demo.py
import time

import celery

redis_dsn = 'redis://localhost'
app = celery.Celery('demo', broker=redis_dsn, backend=redis_dsn)

@app.task
def subtask():
    pass

@app.task
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)

if __name__ == '__main__':
    task.delay().get()
Basically a task which schedules a bunch of subtasks. What can go wrong?
I'll use procpath to analyse the Celery node's memory consumption: pip install procpath. I have 4 terminals:
1. procpath record -d celery.sqlite -i1 "$..children[?('celery' in @.cmdline)]" to record the Celery node's process tree stats
2. docker run --rm -it -p 6379:6379 redis to run Redis, which will serve as the Celery broker and result backend
3. celery -A demo worker --concurrency 2 to run the node with 2 workers
4. python demo.py to finally run the example
(4) will finish in under 2 minutes.
Then I use sqliteviz (pre-built version) to visualise what procpath has recorded. I drop celery.sqlite there and use this query:
SELECT datetime(ts, 'unixepoch', 'localtime') ts, stat_pid, stat_rss / 256.0 rss
FROM record
And in sqliteviz I create a line chart trace with X=ts, Y=rss, and add a split transform By=stat_pid. The resulting chart is:
This shape is likely pretty familiar to anyone who has fought memory leaks.
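If you'd rather skip the browser tool, the same chart can be drawn straight from celery.sqlite. Here's a minimal sketch using the stdlib sqlite3 module and matplotlib (matplotlib is an extra dependency I'm assuming here; it's not required by procpath or sqliteviz):

import sqlite3

import matplotlib.pyplot as plt

conn = sqlite3.connect('celery.sqlite')
rows = conn.execute(
    'SELECT ts, stat_pid, stat_rss / 256.0 FROM record ORDER BY ts'
).fetchall()
conn.close()

# Group the RSS series by PID, same as the By=stat_pid split transform
series = {}
for ts, pid, rss in rows:
    series.setdefault(pid, ([], []))
    series[pid][0].append(ts)
    series[pid][1].append(rss)

for pid, (xs, ys) in series.items():
    plt.plot(xs, ys, label=f'PID {pid}')
plt.xlabel('ts')
plt.ylabel('RSS, MiB')
plt.legend()
plt.show()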
Finding leaking objects
Now it's time for dozer. I'll show the non-instrumented case (and you can instrument your code in a similar way if you can). To inject the Dozer server into the target process I'll use Pyrasite (update: try a fork, Pyrasite-ng, if Pyrasite doesn't work). There are two things to know about it:
- To run it, ptrace has to be configured as "classic ptrace permissions": echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope, which may be a security risk
- There are non-zero chances that your target Python process will crash
With those caveats in mind, I:
- pip install https://github.com/mgedmin/dozer/archive/3ca74bd8.zip (that's the to-be-released 0.8 I mentioned above)
- pip install pillow (which dozer uses for charting)
- pip install pyrasite
After that I can get a Python shell in the target process:
pyrasite-shell 26572
And inject the following, which will run Dozer's WSGI application using the stdlib's wsgiref server.
import threading
import wsgiref.simple_server

import dozer

def run_dozer():
    app = dozer.Dozer(app=None, path='/')
    with wsgiref.simple_server.make_server('', 8000, app) as httpd:
        print('Serving Dozer on port 8000...')
        httpd.serve_forever()

threading.Thread(target=run_dozer, daemon=True).start()
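For completeness, here's roughly what the instrumented case mentioned earlier could look like: instead of injecting it, you wrap your own WSGI application with Dozer as middleware at startup. A sketch only; my_app and port 8080 are made up for illustration, and I pass path explicitly just like in the injected snippet above.

import wsgiref.simple_server

import dozer

def my_app(environ, start_response):
    # Stand-in for your real WSGI application
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello, world!']

# Wrap the application so Dozer's UI is served under /_dozer
app = dozer.Dozer(my_app, path='/_dozer')

if __name__ == '__main__':
    with wsgiref.simple_server.make_server('', 8080, app) as httpd:
        httpd.serve_forever()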
Back in the injected setup, opening http://localhost:8000 in a browser, you should see something like:
After that I run python demo.py
from (4) again and wait for it to finish. Then in Dozer I set "Floor" to 5000, and here's what I see:
Two types related to Celery grow as the subtasks are scheduled:
- celery.result.AsyncResult
- vine.promises.promise
weakref.WeakMethod has the same shape and numbers and must be caused by the same thing.
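To double-check those numbers independently of Dozer's chart, you can count the suspicious type right in the Pyrasite shell, along the lines of the counting approach from the beginning of this answer:

import gc

from celery.result import AsyncResult

# Count live AsyncResult instances currently tracked by the garbage collector
print(sum(1 for o in gc.get_objects() if isinstance(o, AsyncResult)))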
Finding root cause
At this point, from the leaking types and the trends, it may already be clear what's going on in your case. If it's not, Dozer has a "TRACE" link per type, which allows tracing (e.g. seeing the object's attributes) the chosen object's referrers (gc.get_referrers) and referents (gc.get_referents), and continuing the process, traversing the graph further.
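The same walk can be done by hand from the Pyrasite shell. Here's a rough sketch of what that TRACE view is built on, where obj stands for one of the leaking objects you already hold a reference to (describe_referrers and describe_referents are just illustrative helpers of mine):

import gc

def describe_referrers(obj, limit=10):
    # What holds a reference to obj, i.e. following referrers in TRACE
    for referrer in gc.get_referrers(obj)[:limit]:
        print(type(referrer), repr(referrer)[:120])

def describe_referents(obj, limit=10):
    # What obj itself references
    for referent in gc.get_referents(obj)[:limit]:
        print(type(referent), repr(referent)[:120])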
But a picture is worth a thousand words, right? So I'll show how to use objgraph to render the chosen object's dependency graph.
pip install objgraph
apt-get install graphviz
Then:
- I run python demo.py from (4) again
- in Dozer I set floor=0, filter=AsyncResult
- and click "TRACE", which should yield
Then in the Pyrasite shell I run:
import objgraph
objgraph.show_backrefs([objgraph.at(140254427663376)], filename='backref.png')
The PNG file should contain:
Basically there's some Context object containing a list called _children that in turn contains many instances of celery.result.AsyncResult, which leak. Changing Filter=celery.*context in Dozer, here's what I see:
So the culprit is celery.app.task.Context. Searching for that type would certainly lead you to the Celery task page. Quickly searching for "children" there, here's what it says:
trail = True
If enabled the request will keep track of subtasks started by this task, and this information will be sent with the result (result.children
).
Disabling the trail by setting trail=False like:
@app.task(trail=False)
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)
Then restarting the Celery node from (3) and running python demo.py from (4) yet again shows this memory consumption:
Problem solved!