Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks, and I daemonized it using Supervisor. The problem now is that one of the periodic tasks I defined is failing (after a week in which it worked properly). The error I got is:

[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

All the processes managed by Supervisor are up and running properly (supervisorctl status says RUNNING).

I tried to read several logs on my EC2 instance, but none of them helped me find out what is causing the SIGKILL. What should I do? How can I investigate?

These are my celery settings:

CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False

And this is my supervisord.conf:

[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

Edit 1

After restarting celery beat, the problem remains.

Edit 2

I changed killasgroup=true to killasgroup=false, and the problem remains.

Glandule answered 2/4, 2014 at 8:3 Comment(1)
Hint: most probably it's due to low memory/RAM on your server. If you're running containers through the docker command, you can see the memory consumption of each container using docker stats.Ampereturn

The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and killasgroup would only affect a Supervisor-initiated kill (e.g. via supervisorctl or a plugin); without that setting it would have sent the signal to the dispatcher anyway, not the child.

Most likely you have a memory leak and the OS's OOM killer is assassinating your process for bad behavior.

Run grep oom /var/log/messages. If you see messages, that's your problem.

If you don't find anything, try running the periodic process manually in a shell:

MyPeriodicTask().run()

And see what happens. I'd monitor system and process metrics with top in another terminal, if you don't have good instrumentation like Cacti, Ganglia, etc. for this host.
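
For a rough way to see the growth without extra tooling, here is a minimal sketch, assuming MyPeriodicTask is your periodic task class from above: it runs the task body synchronously and prints the process's peak RSS before and after, using only the standard library.

import resource

def run_and_report():
    # ru_maxrss is the peak resident set size, in kilobytes on Linux
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    MyPeriodicTask().run()  # call the task directly, no worker involved
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print("peak RSS went from %d kB to %d kB" % (before, after))

run_and_report()

If the peak keeps climbing across repeated calls, the leak is in the task itself rather than in the worker.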

Macleod answered 3/4, 2014 at 16:52 Comment(8)
You are right: "celery invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0" ...now I have to find out why this happens, because previously it was running as expected :P Thank you very much!Glandule
@Glandule I think Lewis Carroll once wrote, "Beware the oom-killer, my son! The jaws that bite, the claws that catch!"Macleod
On my Ubuntu box the log to check is /var/log/kern.log, not /var/log/messagesRosenblatt
On my Ubuntu box it is /var/log/syslog (so much for consistency)Grovel
@Glandule how did you go about finding why that happens? I'm also stuck in a similar position, and the problem is it's happening for only one task, and according to Compute Engine everything about CPU usage and memory seems fineBackstitch
I figured out it was a memory issue.Mixup
I was running celery workers on ECS with too little RAM per task and I also saw the OOM killer killing processes. So it's not always related to memory leaks; it can also be caused by simply not having enough RAM.Mendelson
@JSelecta Thanks, it really helped. The same issue was with my server: I was running multiple containers on a 1GB RAM server, and the Celery worker needed 400MB for a specific task. When I upgraded to 2GB, it worked fine.Ampereturn

This error occurs when an asynchronous task (e.g. using Celery) or the script you are using consumes a large amount of memory due to a memory leak.

In my case, I was getting data from another system and saving it in a variable, in order to export all the data (into a Django model / Excel file) after the process finished.

Here is the catch: my script was gathering 10 million records, and it was leaking memory while gathering the data. This resulted in the exception being raised.

To overcome the issue, I divided the 10 million records into 20 parts (half a million each). Every time the gathered data reached 500,000 items, I stored it in my preferred local file / Django model and started a fresh batch, repeating this for every batch of 500k items.

There is no need to use that exact number of partitions. The idea is to solve a complex problem by splitting it into multiple subproblems and solving them one by one. :D
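
As a minimal sketch of that idea (the fetch_records() generator and save_batch() helper are hypothetical placeholders for however you gather and persist the data):

BATCH_SIZE = 500000
batch = []

for record in fetch_records():      # hypothetical generator yielding one record at a time
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        save_batch(batch)           # e.g. bulk_create into a Django model, or append to a file
        batch = []                  # drop the reference so the memory can be reclaimed

if batch:                           # flush the final partial batch
    save_batch(batch)

The important part is that the list is emptied after each flush, so the process never holds more than one batch in memory.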

3d answered 14/6, 2021 at 20:38 Comment(1)
Solved this first your way, by splitting the number of objects processed... then I tried gathering PKs using values_list for about 3,000,000 PKs and grabbed each entry with get, and everything runs smoothly... not the most optimized, but it is an async task so speed isn't the biggest issue (memory leaks were)Dunedin

The OOM killer was also the culprit in our setup with an ECS EC2 cluster. We used Datadog for monitoring, and you can search for messages like this from the aws-ecs-agent:

level=info time=2024-04-03T14:36:38Z msg="DockerGoClient: process within container <container_id> (name: \"ecs-production-core-worker-393-production-core-worker-a89898f398d0c4fd3c00\") died due to OOM" module=docker_client.go

In our case it happened while SQLAlchemy was querying the Postgres DB; serialization took so much memory that we had clear spikes whenever a certain Celery task was executed.

Check the memory spikes below:

[Screenshot: memory task monitoring screen]

The solution was easy: do the whole work in chunks! In our case, we didn't predict that big a data increase for one client, and it started happening all of a sudden!
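
A minimal sketch of that chunking, assuming a classic SQLAlchemy Session named session and a hypothetical Order model; yield_per streams rows from the server in batches instead of materializing the whole result set, and each serialized chunk is flushed out before the next one is built:

CHUNK_SIZE = 1000
chunk = []

for row in session.query(Order).yield_per(CHUNK_SIZE):  # stream rows, don't load them all
    chunk.append(serialize(row))                        # serialize() is a hypothetical helper
    if len(chunk) >= CHUNK_SIZE:
        export(chunk)                                   # hypothetical: write/send the chunk
        chunk = []                                      # free the serialized objects

if chunk:
    export(chunk)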

Isbell answered 3/4, 2024 at 14:53 Comment(0)

Our cause was a (synchronous) task that was downloading a zipped JSON file and unpacking/iterating over it in memory, on a newly migrated server.

The solution was to enable a swap partition, but optimization (streaming the JSON with json-stream) and monitoring (Prometheus + Grafana) are highly recommended.
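
A minimal sketch of the streaming approach, assuming the archive contains a single file holding one large top-level JSON array, and a hypothetical handle() function for the per-item work; json_stream.load parses the document lazily, so only one item needs to be in memory at a time:

import zipfile
import json_stream  # pip install json-stream

def process(path):
    with zipfile.ZipFile(path) as archive:
        name = archive.namelist()[0]            # assumed: a single JSON file in the archive
        with archive.open(name) as raw:
            for item in json_stream.load(raw):  # items are parsed lazily, one at a time
                handle(item)                    # hypothetical per-item processing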

Pericranium answered 9/4, 2024 at 10:47 Comment(0)
