airflow - how to do 'Filling up the DagBag' once only

My DAG takes about 50 seconds to parse, and I only use external triggers to start DAG runs, no schedules. I notice Airflow wants to fill the DagBag a lot: on every trigger_dag command, and also in the background, where it keeps re-scanning the dags folder and creating .pyc files seemingly instantly once a new .py is deployed.

Is there any way I can deploy my cluster and get the DagBag filled once, then for the next 2 weeks have dag runs start instantly on any trigger_dag? (Right now it takes 50 seconds just to fill the DagBag before a run starts.) I have no need to update DAG definitions within those 2 weeks.

Harlow answered 25/4, 2019 at 15:40 Comment(0)

50 seconds is an incredibly long time for DAG instantiation. It looks like you are running a big (or just slow) piece of code at the top level of your DAG file, which is very bad practice:

Note: This means all top level code (i.e. anything that isn't defining the DAG) in a DAG file will get run each scheduler heartbeat. Try to avoid top level code in your DAG file unless absolutely necessary.

Airflow works exactly as you described, which is why you should treat the Python files in your DAG folder mostly as configuration files (with some programmatic capabilities). You can't change this with any magic config key or the like; this behaviour is at the core of Airflow.
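
To make that concrete, here is a minimal sketch of the fix (not from the original post; `fetch_config_from_db` is a hypothetical stand-in for the slow code, and the imports assume Airflow 2-style paths). Moving the expensive call from module level into the task callable means it runs only at execution time, not on every parse:

```python
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_config_from_db():
    """Hypothetical slow operation, e.g. a long database query."""
    time.sleep(50)
    return {"param": 1}


# BAD: top-level code runs on every parse, i.e. on each scheduler
# heartbeat and on each trigger_dag:
# config = fetch_config_from_db()


def run_job():
    # GOOD: the expensive work happens only when the task executes.
    config = fetch_config_from_db()
    print(config)


with DAG(
    dag_id="externally_triggered_job",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,  # external triggers only, as in the question
) as dag:
    PythonOperator(task_id="run_job", python_callable=run_job)
```

With the slow call kept out of module level, parsing this file takes milliseconds; the 50 seconds are paid only inside the running task.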

Exaggeration answered 25/4, 2019 at 15:57 Comment(4)
why does it need to parse the DAG if it has not changed? – Harlow
Because Airflow doesn't know whether it changed or not. Many DAGs in Airflow are created programmatically, and Airflow can't see changes until it re-creates the DAG (see the sketch below). Moreover, ALL DAGs are re-created each heartbeat, even static ones. That is a huge drawback of Airflow, IMO. – Exaggeration
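
For illustration, here is a sketch of the programmatic pattern meant above (the tenant list and DAG ids are hypothetical; imports assume Airflow 2-style paths). These DAGs only come into existence when the file is executed, so Airflow has no way to detect changes without re-parsing it:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

for tenant in ["alpha", "beta", "gamma"]:  # hypothetical tenant list
    with DAG(
        dag_id=f"export_{tenant}",
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
    ) as dag:
        BashOperator(task_id="export", bash_command=f"echo exporting {tenant}")
    # expose each DAG in module globals so the DagBag picks it up
    globals()[f"export_{tenant}"] = dag
```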
Yes, in particular when using it in a container: we re-deploy when DAGs change, and it still keeps using up CPU and log rows when idling. – Kukri
There are some options (see below), but they don't seem to be effective. – Kukri
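
For reference, the options usually meant here are the scheduler's file-processing intervals in airflow.cfg (a hedged sketch; exact defaults vary by Airflow version). Raising them throttles the background re-parsing but does not disable it:

```ini
[scheduler]
# minimum seconds before the same DAG file is re-parsed
min_file_process_interval = 300
# seconds between scans of the dags folder for new/deleted files
dag_dir_list_interval = 600
```

The .pyc files themselves are ordinary Python bytecode caching and can be suppressed with PYTHONDONTWRITEBYTECODE=1 in the environment, though that has no real effect on parse time.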
