Handling Airflow DAG changes through time (DAG Versioning)

We have a relatively complex dynamic DAG as part of our ETL. The DAG contains hundreds of transformations and is created programmatically from a set of YAML files. It changes over time: new tasks are added, the queries executed by tasks change, and even the relationships between tasks change.

I know that a new DAG should be created each time it changes in this way, and that Airflow does not support DAG versioning, but this is a real use case and I would like to hear any suggestions on how to handle it.

One of the most important requirements, and the reason we want to tackle this, is that we must be aware of DAG versions when we clear or backfill runs for some moment in the past. In practice, this means that when the DAG is executed for a past moment, it must be the version of the DAG from that moment, not the newest one.

Any suggestions are more than welcome.

Sinistrad answered 4/6, 2021 at 7:6

Comments (8)
Airflow does not support versioning natively. Personally, I would be in favour of generating DAG code for each particular run and then running it, something like generate_dag_code >> trigger_this_new_dag_once >> wait >> disable_the_dag. If this DAG runs every minute, this can get noisy in terms of the number of DAGs, but otherwise I think it would serve all your needs. - Celerity
Yeah, nice idea. It is scheduled daily, so it is acceptable in terms of the number of runs. Other than that, there may be some issues with backfilling or clearing old runs, but it may even work. I will definitely analyze this idea more deeply. - Sinistrad
If you name your DAGs with the date on which they were generated or supposed to run, it should be quite easy to use them and do reruns. Other options, based for example on dag_run config or some custom logic, seem a) non-trivial, b) hard to implement, and c) possibly still problematic. - Celerity
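A rough sketch of the date-stamped naming idea, assuming Airflow 2.x import paths and tasks that are plain shell commands. The helper names (`versioned_dag_id`, `render_dag_file`), the `etl` base name, and the `__YYYYMMDD` suffix are made up for illustration. The code only renders the text of a DAG file; writing it into the DAGs folder and triggering it is left out.

```python
from datetime import date

# Template for one generated DAG file. Assumes Airflow 2.x import paths;
# newer Airflow releases name the `schedule_interval` argument `schedule`.
DAG_TEMPLATE = '''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id={dag_id!r},
    start_date=datetime({year}, {month}, {day}),
    schedule_interval="@once",
    catchup=False,
) as dag:
    tasks = {{}}
{task_lines}
{edge_lines}
'''


def versioned_dag_id(base_name, run_date):
    """Build a dag_id that carries the date this DAG version belongs to."""
    return f"{base_name}__{run_date:%Y%m%d}"


def render_dag_file(base_name, run_date, commands, edges):
    """Return Python source for a DAG frozen at `run_date`.

    commands: dict mapping task_id -> shell command (hypothetical task shape)
    edges:    list of (upstream_task_id, downstream_task_id) pairs
    """
    task_lines = "\n".join(
        f"    tasks[{tid!r}] = BashOperator(task_id={tid!r}, bash_command={cmd!r})"
        for tid, cmd in commands.items()
    )
    edge_lines = "\n".join(
        f"    tasks[{up!r}] >> tasks[{down!r}]" for up, down in edges
    )
    return DAG_TEMPLATE.format(
        dag_id=versioned_dag_id(base_name, run_date),
        year=run_date.year,
        month=run_date.month,
        day=run_date.day,
        task_lines=task_lines,
        edge_lines=edge_lines,
    )


source = render_dag_file(
    "etl",
    date(2021, 6, 4),
    {"extract": "echo extract", "load": "echo load"},
    [("extract", "load")],
)
print(versioned_dag_id("etl", date(2021, 6, 4)))  # etl__20210604
```

Because each generated file has its own dag_id, rerunning a past date means clearing the DAG that carries that date, which still has the task set it was generated with.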
Yes, there is also the possibility of serializing these DAGs to some storage (like Google Cloud) and fetching the proper DAG from there to run it. - Sinistrad
If I'm not mistaken, in Composer the DAG folder is synchronised bidirectionally, so creating a file locally should result in it being visible in the DAGs GCS bucket, but I'm not 100% sure. - Celerity
Yes, I just need to keep the version for each execution date so I can reference it easily. - Sinistrad
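A minimal sketch of the "version for each execution date" lookup, assuming each stored DAG version is recorded with the date from which it took effect. The `DagVersionIndex` class and the `gs://example-bucket/...` paths are hypothetical, and this is plain Python, not an Airflow feature. Resolving an execution date returns the latest version whose effective date is not after it.

```python
from bisect import bisect_right
from datetime import date


class DagVersionIndex:
    """Maps an execution date to the DAG version that was current on that date."""

    def __init__(self):
        self._effective_dates = []  # kept sorted ascending
        self._locations = []        # parallel list of stored DAG locations

    def register(self, effective_from, location):
        """Record that the DAG stored at `location` applies from `effective_from` on."""
        i = bisect_right(self._effective_dates, effective_from)
        self._effective_dates.insert(i, effective_from)
        self._locations.insert(i, location)

    def resolve(self, execution_date):
        """Return the location of the version in effect on `execution_date`."""
        i = bisect_right(self._effective_dates, execution_date)
        if i == 0:
            raise LookupError(f"no DAG version registered on or before {execution_date}")
        return self._locations[i - 1]


index = DagVersionIndex()
# Hypothetical storage locations; any path or object key would do.
index.register(date(2021, 1, 1), "gs://example-bucket/dag_versions/etl/2021-01-01.py")
index.register(date(2021, 3, 1), "gs://example-bucket/dag_versions/etl/2021-03-01.py")

# A backfill for mid-February picks the January version, not the newest one.
print(index.resolve(date(2021, 2, 15)))
```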
There is a proposal to address that, but it appears to have somewhat low priority: cwiki.apache.org/confluence/display/AIRFLOW/… - Mccormick
@ChiboleteSophos But the scope of this proposal is not quite the same (in fact, it misses the main part). The proposal contains this sentence in bold: "The scope of this AIP to make sure that the visibility behavior of Airflow is correct, without changing the execution behaviour which will continue to be based on the most recent version of the DAG." This means that if I rewind the DAG, it will still execute only the latest version, not the version from that date in the past. - Sinistrad
