I have a huge celery task that works basically like this:
    @task
    def my_task(id):
        if settings.DEBUG:
            print "Don't run this with debug on."
            return False

        related_ids = get_related_ids(id)
        chunk_size = 500
        for i in xrange(0, len(related_ids), chunk_size):
            ids = related_ids[i:i+chunk_size]
            MyModel.objects.filter(pk__in=ids).delete()
            print_memory_usage()
I also have a manage.py command that just runs my_task(int(args[0])), so this can either be queued or run on the command line.
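For reference, the wrapper command is essentially just this (a minimal sketch; the class structure and import path are assumptions, not the exact project code):

    from django.core.management.base import BaseCommand

    from myapp.tasks import my_task  # hypothetical import path


    class Command(BaseCommand):
        args = "<id>"

        def handle(self, *args, **options):
            # Calling the task function directly runs it in-process,
            # without going through the celery worker at all.
            my_task(int(args[0]))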
When run on the command line, print_memory_usage() reveals a relatively constant amount of memory used.
When run inside celery, print_memory_usage() reveals an ever-increasing amount of memory, continuing until the process is killed (I'm using Heroku with a 1 GB memory limit, but other hosts would have a similar problem). The memory growth appears to correspond to the chunk_size: if I increase the chunk_size, the per-print memory consumption increases. This seems to suggest that either celery is logging queries itself, or something else in my stack is.
Does celery log queries somewhere else?
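One check that might narrow it down (a diagnostic sketch layered on the loop above, not the real task code): Django appends every executed query to django.db.connection.queries whenever settings.DEBUG is true, so if something in the worker environment effectively turns DEBUG on, that list grows with every chunk. Printing its length and clearing it per chunk would show whether that's what is accumulating:

    from django import db

    for i in xrange(0, len(related_ids), chunk_size):
        ids = related_ids[i:i+chunk_size]
        MyModel.objects.filter(pk__in=ids).delete()
        # Should stay at 0 unless query logging is somehow enabled in the worker.
        print len(db.connection.queries)
        db.reset_queries()  # clear any logged queries either way
        print_memory_usage()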
Other notes:
- DEBUG is off.
- This happens with both RabbitMQ and Amazon SQS as the queue.
- This happens both locally and on Heroku (though locally the process isn't killed, since the machine has 16 GB of RAM).
- The task actually goes on to do more things than just deleting objects. Later it creates new objects via MyModel.objects.get_or_create(). This also exhibits the same behavior (memory grows under celery, doesn't grow under manage.py).
Comments:

- itertools.islice(related_ids, i, i + chunk_size) instead of related_ids[i:i+chunk_size]. It's probably not the only factor, but this might reduce some copying. – Extragalactic
- QuerySet.delete always loads instances into memory before deleting them. I'd try replacing that with a raw SQL DELETE statement and see what happens. – Hebe
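For reference, the islice suggestion from the first comment would look roughly like this (a sketch: it avoids copying a new 500-element list per chunk, though the full related_ids list is still held in memory, and islice re-iterates from the start of the sequence on each pass):

    from itertools import islice

    for i in xrange(0, len(related_ids), chunk_size):
        # islice yields the chunk lazily instead of copying a slice;
        # Django's __in lookup accepts any iterable.
        ids = islice(related_ids, i, i + chunk_size)
        MyModel.objects.filter(pk__in=ids).delete()
        print_memory_usage()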
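And a raw DELETE along the lines of the second comment might look like this (a sketch assuming a simple integer primary key; unlike QuerySet.delete it loads no instances, fires no pre_delete/post_delete signals, and handles no cascades, so it is only appropriate if none of those are needed):

    from django.db import connection

    from myapp.models import MyModel  # the model from the question


    def delete_chunk(ids):
        ids = list(ids)
        if not ids:
            return
        # One placeholder per id keeps the query parameterized.
        placeholders = ", ".join(["%s"] * len(ids))
        sql = "DELETE FROM %s WHERE %s IN (%s)" % (
            MyModel._meta.db_table,
            MyModel._meta.pk.column,
            placeholders,
        )
        cursor = connection.cursor()
        cursor.execute(sql, ids)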