In Celery, how to ensure tasks are retried when a worker crashes
First of all, please don't consider this question a duplicate of this question.

I have set up an environment which uses Celery with Redis as the broker and result backend. My question is: how can I make sure that when the Celery workers crash, all the scheduled tasks are retried once the workers are back up?

I have seen advice on using CELERY_ACKS_LATE = True, so that the broker will redeliver a task until it gets an ACK, but in my case it's not working. Whenever I schedule a task, it immediately goes to the worker, which holds on to it until the scheduled time of execution. Let me give an example:

I am scheduling a task like this: res = test_task.apply_async(countdown=600), but immediately in the Celery worker logs I can see something like: Got task from broker: test_task[a137c44e-b08e-4569-8677-f84070873fc0] eta:[2013-01-...]. Now when I kill the Celery worker, these scheduled tasks are lost. My settings:

BROKER_URL = "redis://localhost:6379/0"  
CELERY_ALWAYS_EAGER = False  
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"  
CELERY_ACKS_LATE = True
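
For reference, a minimal, self-contained sketch of this setup using the same old-style (Celery 3.x) setting names; the app name and the task body here are just placeholders:

from celery import Celery

app = Celery("tasks")
app.conf.update(
    BROKER_URL="redis://localhost:6379/0",
    CELERY_ALWAYS_EAGER=False,
    CELERY_RESULT_BACKEND="redis://localhost:6379/0",
    CELERY_ACKS_LATE=True,
)

@app.task
def test_task():
    return "done"

# Schedule the task 10 minutes out. The worker log shows it is received
# immediately with an ETA, which is why killing the worker before the ETA
# appears to lose it.
res = test_task.apply_async(countdown=600)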
Chalmer answered 18/1, 2013 at 17:40 Comment(4)
Which Celery version are you using? Look at this thread, it might help you: groups.google.com/forum/#!topic/celery-users/eVl6CHHJ0Jc – Practical
Thanks for the link. The Celery version is v3.0.12 (Chiastic Slide). – Chalmer
Now I have realized that the tasks are not exactly lost. They were indeed delivered to Celery exactly at the ETA. So I guess the behavior is something like this: when a task is scheduled, it is delivered to the worker immediately, and the broker waits for an ACK. If the worker doesn't ACK before the ETA, the broker tries to resend it and thus ensures the task is executed. – Chalmer
@aqs but if "When a task is scheduled it is delivered to the worker immediately" is true, this implies that the choice of worker is made well before actual execution begins. That hardly seems right for a scheduler in a distributed system, which is supposed to dynamically factor in load before scheduling tasks across workers. – Sedillo
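
What the last two comments describe matches the Redis transport's visibility timeout: an unacknowledged message is redelivered to a worker once that timeout expires. A minimal sketch, assuming the Redis broker and old-style setting names, of raising the timeout so it covers the longest countdown/ETA in use (the 12-hour value is only an example):

BROKER_URL = "redis://localhost:6379/0"
CELERY_ACKS_LATE = True
# With the Redis broker, an unacked message is redelivered after the visibility
# timeout, so it should be longer than the longest ETA/countdown being scheduled.
BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 43200}  # 12 hours
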
Apparently this is how Celery behaves. When a worker process is abruptly killed (but the dispatching process isn't), the message will be considered 'failed' even though you have acks_late=True.

The motivation (to my understanding) is that if the consumer was killed by the OS due to an out-of-memory condition, there is no point in redelivering the same task.

You may see the exact issue here: https://github.com/celery/celery/issues/1628

I actually disagree with this behaviour. IMO it would make more sense not to acknowledge.
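
Newer Celery versions (4.x+) appear to expose a setting for exactly this case, task_reject_on_worker_lost; a minimal sketch assuming such a version (the app name and broker URL are placeholders):

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")
# With both options set, a task whose worker process dies abruptly is requeued
# and retried instead of being marked as failed.
app.conf.task_acks_late = True
app.conf.task_reject_on_worker_lost = True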

Baskett answered 15/11, 2018 at 18:44 Comment(0)
I've had this issue where I was using some open-source C libraries that went totally amok and crashed my worker ungracefully, without throwing an exception. For cases like this, one can simply wrap the content of a task in a child process and check its status in the parent.

import os

n = os.fork()
if n > 0:  # inside the parent process
    status = os.wait()  # wait until the child terminates; returns (pid, wait status)
    print("Wait status of the child process:", status[1])
    if status[1] > 0:  # the child was killed by a signal or exited with an error
        # here one can do whatever they want, like retry or raise an exception
        self.retry(exc=SomeException(), countdown=2 ** self.request.retries)
else:  # here comes the actual task content with its respective return
    return myResult  # make sure the child and the parent don't both return a value
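
A fuller sketch of how this fork-and-wait pattern might be wrapped in a bound task; app, do_risky_work, and WorkerCrashed are placeholder names, and the pipe is only one way to get the child's result back to the parent:

import os
import pickle

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

class WorkerCrashed(Exception):
    pass

def do_risky_work(payload):
    # Stand-in for the call into the crash-prone C library.
    return payload

@app.task(bind=True, max_retries=5)
def guarded_task(self, payload):
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: run the risky code, send the result back through the pipe,
        # then exit immediately so the child never falls back into the worker loop.
        os.close(read_fd)
        result = do_risky_work(payload)
        with os.fdopen(write_fd, "wb") as pipe:
            pickle.dump(result, pipe)
        os._exit(0)
    # Parent: collect the result (EOF once the child closes the pipe or dies),
    # then inspect how the child terminated.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status) or os.WEXITSTATUS(status) != 0:
        # Killed by a signal (e.g. a segfault in the C library) or a non-zero exit:
        # retry with exponential backoff instead of silently losing the task.
        raise self.retry(exc=WorkerCrashed(), countdown=2 ** self.request.retries)
    return pickle.loads(data)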
Femoral answered 27/4, 2022 at 14:43 Comment(0)
