We have an Airflow deployment with Celery executors.
Many of our DAGs require a local processing step on some file in a BashOperator or PythonOperator.
However, as we understand it, the tasks of a given DAG may not always be scheduled on the same machine.
These are the options for sharing state between tasks that I've gathered so far:
- Use Local Executors - this may suffice for one team, depending on the load, but may not scale to the wider company.
- Use XCom - does this have a size limit? Probably unsuitable for large files (see the path-via-XCom sketch below this list).
- Write custom Operators for every combination of tasks that need local processing in between - this reduces the modularity of tasks and requires replicating existing operators' code (a brief sketch of this follows below the list).
- Use Celery queues to route DAGs to the same worker (docs) - this option seems attractive at first, but what would be an appropriate way to set it up in order to avoid routing everything to one executor, or crafting a million queues? (A queue-routing sketch follows below.)
- Use shared network storage on all machines that run executors - seems like an additional infrastructure burden, but is a possibility (a combined sketch with XCom appears below).
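To illustrate option 3, here is a minimal sketch (Airflow 2.x import paths assumed; the operator name, URL, and scratch path are made up) of what a fused operator might look like. The point is that it has to re-implement pieces of BashOperator/PythonOperator for every combination of tasks:

```python
# Sketch of option 3: fuse "download" and "process" into one task so both
# steps run on the same worker. Note how it duplicates what BashOperator
# and PythonOperator already provide.
import subprocess

from airflow.models.baseoperator import BaseOperator


class DownloadAndProcessOperator(BaseOperator):
    """Downloads a file and processes it locally within a single task."""

    def __init__(self, url: str, processor, **kwargs):
        super().__init__(**kwargs)
        self.url = url
        self.processor = processor  # a plain Python callable

    def execute(self, context):
        local_path = "/tmp/downloaded_file"  # hypothetical scratch path
        # Re-implements what a BashOperator running curl would normally do.
        subprocess.run(["curl", "-sSfo", local_path, self.url], check=True)
        # Re-implements what a PythonOperator would normally do.
        return self.processor(local_path)
```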
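For option 4, a hedged sketch of queue routing, assuming Airflow 2.x and a made-up queue name `team_a_local`. Every operator accepts a `queue` argument (honoured by the Celery executor), and a worker started with `airflow celery worker --queues team_a_local` subscribes only to that queue, so one queue per team keeps the count manageable. Caveat: if several workers listen on the same queue, the two tasks can still land on different hosts; strict co-location would need one queue per worker.

```python
# Sketch of option 4: pin both tasks of a DAG to one team-specific Celery
# queue so they run on the same pool of workers. Queue name and URL are
# assumptions for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="team_a_local_processing",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download",
        bash_command="curl -sSfo /tmp/data.csv https://example.com/data.csv",
        queue="team_a_local",  # `queue` is a BaseOperator argument
    )
    process = BashOperator(
        task_id="process",
        bash_command="wc -l /tmp/data.csv",
        queue="team_a_local",
    )
    download >> process
```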
What is the recommended way to share large intermediate state, such as files, between tasks in Airflow?
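For reference, a sketch of how options 2 and 5 might be combined (Airflow 2.x assumed; `/mnt/shared` is a hypothetical mount visible to every Celery worker): the file lives on shared storage and only its path travels through XCom. Since XCom values are stored in the metadata database, their practical size limit is whatever a single row in that database can hold, which is why passing paths rather than payloads is the usual pattern.

```python
# Sketch of options 2 + 5: the data stays on shared storage, XCom carries
# only a small path string between the two tasks.
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

SHARED_DIR = "/mnt/shared/intermediate"  # hypothetical shared mount


def produce(**context):
    os.makedirs(SHARED_DIR, exist_ok=True)
    path = os.path.join(SHARED_DIR, f"{context['run_id']}.csv")
    with open(path, "w") as f:
        f.write("a,b,c\n1,2,3\n")
    # Push only the path, not the data itself.
    context["ti"].xcom_push(key="path", value=path)


def consume(**context):
    path = context["ti"].xcom_pull(task_ids="produce", key="path")
    with open(path) as f:
        print(f.read())


with DAG(
    dag_id="shared_storage_handoff",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    produce_task = PythonOperator(task_id="produce", python_callable=produce)
    consume_task = PythonOperator(task_id="consume", python_callable=consume)
    produce_task >> consume_task
```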
"Airflow is not a data streaming solution. Tasks do not move data from one to the other (though tasks can exchange metadata!)" (based on the docs: airflow.apache.org) – Oof