Airflow scheduler stuck

I'm testing the use of Airflow, and after triggering a (seemingly) large number of DAGs at the same time, it seems to just fail to schedule anything and starts killing processes. These are the logs the scheduler prints:

[2019-08-29 11:17:13,542] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:13,544] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:18:15,692] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:15,693] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:46,765] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:18:46,766] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:19:17,845] {scheduler_job.py:214} WARNING - Killing PID 42177
[2019-08-29 11:19:17,846] {scheduler_job.py:214} WARNING - Killing PID 42177
...

I'm using a LocalExecutor with a PostgreSQL backend DB. It seems to happen only when I trigger a large number (>100) of DAGs at about the same time via external triggering, as in:

airflow trigger_dag DAG_NAME
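
(Roughly speaking, the runs are fired off back to back from a shell loop like the one below; the DAG names are placeholders for the ~100 real ones.)

for i in $(seq 1 100); do
    airflow trigger_dag "my_dag_$i"    # placeholder names; each real DAG is triggered once
done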

After it finishes killing whatever processes it's killing, it starts executing all of the tasks properly. I don't even know what those processes were, as I can't really see them after they're killed...

Has anyone encountered this kind of behavior? Any idea why it would happen?

Postboy asked 29/8, 2019 at 15:26 Comment(5)
What's your concurrency setting for the dag?Cord
Do you mean the max active runs per DAG? The settings there are quite unclear as to what they affect, and the online docs aren't much clearer. Is there a specific setting I should look at?Postboy
Maybe it's easier if you can share the DAG file? The default is 16 concurrent tasks, but you can bump it up (the relevant settings are sketched right after these comments). github.com/apache/airflow/blob/master/airflow/models/…Cord
We seem to be experiencing a similar issue since upgrading to Airflow 1.10.5, but we haven't been able to get to the bottom of it. What version of Airflow are you running?Fahey
@LouisSimoneau what version does not have the issue?Hodges
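
For context on the settings discussed in the comments above: in Airflow 1.10.x the concurrency knobs live in airflow.cfg under [core], shown here as the equivalent environment variables, with what I believe are the default values:

export AIRFLOW__CORE__PARALLELISM=32              # max task instances running at once across the whole installation
export AIRFLOW__CORE__DAG_CONCURRENCY=16          # max task instances running at once per DAG (the "16" mentioned above)
export AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=16  # max concurrent DAG runs per DAG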

In my case, the reason for this behavior was that I had a DAG file creating a very large number of DAGs dynamically.

The "dagbag_import_timeout" config variable which controls "How long before timing out a python file import while filling the DagBag" was set to the default value of 30. Thus the process filling the DagBag kept timing out.

Postboy answered 18/9, 2019 at 9:39 Comment(2)
This answer just saved me; we don't have control over this variable in AWS MWAA, but it did help me realize that our DAG generator was taking too long, and splitting it up fixed the problem!Fountain
For me increasing the value of dag_file_processor_timeout fixed the problem.Nephrolith

I've had a very similar issue. My DAG was of the same nature (a file that generates many DAGs dynamically). I tried the suggested solution, but it didn't work: the value was already set fairly high at 60 seconds, and increasing it to 120 didn't resolve the issue.

Posting what worked for me in case someone else has a similar issue.

I came across this JIRA ticket: https://issues.apache.org/jira/browse/AIRFLOW-5506

which helped me resolve my issue: I disabled the SLA configuration, and then all my tasks started to run!
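
For reference, newer Airflow releases also expose a global switch for the SLA check; assuming your version has the core check_slas option (the airflow.cfg equivalent is check_slas = False under [core]), something along these lines turns it off:

export AIRFLOW__CORE__CHECK_SLAS=False    # assumes your Airflow version has the core check_slas option
# restart the scheduler afterwards; otherwise, removing sla / sla_miss_callback from the DAG definitions has a similar effect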

There can also be other solutions, as other comments in this ticket suggest.

For the record, my issue started to occur after I enabled lots of such DAGs (around 60?) that had been disabled for a few months. Not sure how the SLA affects this from a technical perspective, TBH, but it did.

Hynes answered 27/7, 2020 at 12:57 Comment(1)
This answer worked for me. I recently updated the SLA configuration to add a Slack callback for missed SLAs. Soon after that, this specific DAG started to get halted in the middle of a run for hours. After removing the recently configured sla_miss_callback, things went back to normal. This could be happening because we had thousands of SLA miss records before adding the callback, and soon after adding it, it started bombarding the Slack channel with all the previously missed SLA notifications.Gaberdine

I had a similar issue on airflow 1.10 on top of kubernetes.

Restarting all the management and worker nodes solved the issue. They had been running for a year without a reboot. It seems we need regular maintenance reboots of all Kubernetes nodes to prevent such issues.

Necklace answered 29/8, 2019 at 15:26 Comment(0)

I faced a similar issue, and in my case it was due to 100% CPU utilization.

Ganof answered 31/10, 2023 at 1:0 Comment(0)
