What is the difference between min_file_process_interval and dag_dir_list_interval in Apache Airflow 1.9.0?
We are using Airflow v1.9.0. We have 100+ DAGs and the instance is really slow: the scheduler is only launching some of the tasks.

In order to reduce CPU usage, we want to tweak two configuration parameters: min_file_process_interval and dag_dir_list_interval. The documentation is not really clear about the difference between the two.

Centrepiece answered 27/7, 2018 at 12:50 Comment(0)
min_file_process_interval:

In cases where there are only a small number of DAG definition files, the loop could potentially process the DAG definition files many times a minute. To control the rate of DAG file processing, the min_file_process_interval can be set to a higher value. This parameter ensures that a DAG definition file is not processed more often than once every min_file_process_interval seconds.

dag_dir_list_interval:

Since the scheduler can run indefinitely, it's necessary to periodically refresh the list of files in the DAG definition directory. The refresh interval is controlled with the dag_dir_list_interval configuration parameter.

Source: a Google search on both terms leads to this as the first result: https://cwiki.apache.org/confluence/display/AIRFLOW/Scheduler+Basics
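Both parameters live in the [scheduler] section of airflow.cfg. A sketch of what that section might look like (the values here are illustrative, not recommendations):

```ini
[scheduler]
# Minimum number of seconds between two parses of the same DAG file.
min_file_process_interval = 30

# Number of seconds between listings of the dags folder
# (how quickly brand-new DAG files are discovered).
dag_dir_list_interval = 300
```

Raising min_file_process_interval trades scheduling latency for lower CPU usage, since each DAG file is parsed less often.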

Ierna answered 28/7, 2018 at 7:43 Comment(4)
correct me if I'm wrong: every dag_dir_list_interval the scheduler lists the DAG definition files, and those files are processed every min_file_process_interval – Centrepiece
That makes sense. – Ierna
Thanks, @tobi6. Could you please elaborate on what exactly happens when listing the DAGs? What I understood from processing the files is that the scheduler will start considering new DAGs and will also pick up updates to existing DAGs for the next DagRun. Is this a correct understanding? – Sympathin
I am not a developer of Airflow. It might be more appropriate to ask this in the Airflow chat: gitter.im/apache/incubator-airflow – Ierna
I did some manual experiments and found the following; hope this clarifies.

min_file_process_interval: say this is set to 10 seconds. This is the minimum interval at which a DAG file is re-parsed. It also means that, between the completion of a task in any DAG and the triggering of its dependent task, there can be up to a 10-second delay, because Airflow checks every 10 seconds whether upstream tasks have completed.

If this value is higher, tasks in your DAGs will take longer to trigger, but Airflow will consume less CPU.

See also: Airbnb Airflow using all system resources

dag_dir_list_interval: any new Python DAG file that you put in the dags folder can take up to this long to be picked up by Airflow and show up in the UI.
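The interaction between the two intervals can be sketched as a toy loop (this is a hypothetical illustration with made-up names and values, not Airflow's actual scheduler code):

```python
# Toy illustration of how the two intervals interact in the scheduler loop.
# Names and values are hypothetical; this is NOT Airflow's actual code.

DAG_DIR_LIST_INTERVAL = 300     # seconds between listings of the dags folder
MIN_FILE_PROCESS_INTERVAL = 10  # minimum seconds between parses of one file

def simulate(now, last_dir_list, last_parsed, dag_files):
    """One pass of a simplified scheduler loop at fake clock time `now`.

    last_dir_list: when the dags folder was last listed
    last_parsed:   {filename: time the file was last parsed}
    Returns (files_to_parse, dir_relisted).
    """
    # The folder is re-listed (picking up brand-new .py files) only
    # every DAG_DIR_LIST_INTERVAL seconds.
    dir_relisted = now - last_dir_list >= DAG_DIR_LIST_INTERVAL

    # An already-known file is re-parsed only once its per-file
    # cooldown, MIN_FILE_PROCESS_INTERVAL, has elapsed.
    files_to_parse = [
        f for f in dag_files
        if now - last_parsed.get(f, float("-inf")) >= MIN_FILE_PROCESS_INTERVAL
    ]
    return files_to_parse, dir_relisted
```

So a new file waits on the directory listing (dag_dir_list_interval), while an existing file waits on its own parse cooldown (min_file_process_interval).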

Rendon answered 4/3, 2020 at 4:50 Comment(0)