Airflow - Different schedule interval for backfilling

What's the best way to handle having a different schedule interval for backfilling and ongoing running?

For backfilling I want to use a daily interval, but for ongoing running I want to use an hourly interval.

I can think of three approaches to this:

  1. The easiest approach I see is to define two DAGs in one .py file: dag_backfill, with a daily interval, a start date in the past and an end date of datetime.now(), and dag_ongoing, with an hourly interval and a start date of datetime.now(), which takes over when dag_backfill finishes. However, having two DAGs in one file is discouraged here:

    We do support more than one DAG definition per python file, but it is not recommended as we would like better isolation between DAGs from a fault and deployment perspective...

  2. Two .py files that import the same python functions that make up the pipeline. I worry about keeping the separate files consistent in this approach.

  3. Only one DAG with an hourly interval that checks whether the run date is more than 1 day in the past and, if so, only runs at midnight for those dates. I feel that is inelegant, though, as it would obscure the schedule the backfilling will run on, at least from the GUI homepage.
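
For reference, the check described in option 3 could look roughly like the sketch below. It is plain Python with no Airflow dependency; the function name `should_run` and the one-day `recent_window` are assumptions made for illustration. In a real DAG it could be used as the callable of a `ShortCircuitOperator`, which skips downstream tasks when the callable returns a falsy value.

```python
from datetime import datetime, timedelta, timezone


def should_run(logical_date: datetime, now: datetime,
               recent_window: timedelta = timedelta(days=1)) -> bool:
    """Decide whether an hourly run for logical_date should do real work."""
    if now - logical_date <= recent_window:
        # Ongoing run: every hourly slot is processed.
        return True
    # Backfill run: keep only the midnight slot of each past day.
    return (logical_date.hour, logical_date.minute) == (0, 0)


# Hypothetical "current time" used only to demonstrate the check.
now = datetime(2019, 9, 17, 10, 0, tzinfo=timezone.utc)
print(should_run(datetime(2019, 9, 17, 9, 0, tzinfo=timezone.utc), now))  # True
print(should_run(datetime(2019, 9, 10, 5, 0, tzinfo=timezone.utc), now))  # False
print(should_run(datetime(2019, 9, 10, 0, 0, tzinfo=timezone.utc), now))  # True
```

Note that the hourly DAG would still create 24 runs for every backfilled day, 23 of which only skip, which is the schedule-obscuring effect mentioned above.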

Is there a common pattern for this or known best practice?

Sibbie answered 17/9, 2019 at 10:27 Comment(1)
I've just experienced the same issue, and I'm convinced that your solution number 1 is the best given the options provided by Airflow so far. Personally, I prefer generating several DAGs from the same Python file, even if it's discouraged, as it is more compatible with software engineering best practices like avoiding duplicate code. - Crossway

Of your three options, either 2 or 3 is completely acceptable.

In our ETL schedule, we have delta extracts that run daily and refreshes that run on the weekend. We use two separate DAGs for this, on two separate schedules. They call the same code, just with different parameters, so the underlying code is always consistent. Just make sure you give the DAGs similar names, like dag_import_xyz_hourly and dag_import_xyz_daily.
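
A minimal sketch of what "same code, different parameters" might look like, assuming a hypothetical shared module that both DAG files import. The names `INTERVALS`, `dag_id_for` and `import_xyz` are made up for illustration, and the function body only computes the window it would process:

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a schedule name to the width of one run's window.
INTERVALS = {"hourly": timedelta(hours=1), "daily": timedelta(days=1)}


def dag_id_for(granularity: str) -> str:
    """Build matching DAG names so related DAGs sort together in the UI."""
    return f"dag_import_xyz_{granularity}"


def import_xyz(window_start: datetime, granularity: str) -> tuple:
    """Shared pipeline step; returns the time window this run covers."""
    window_end = window_start + INTERVALS[granularity]
    # The real extract/load logic would go here and would be identical
    # for both DAGs; only the window width differs.
    return window_start, window_end


print(dag_id_for("hourly"))
print(import_xyz(datetime(2022, 2, 26), "daily"))
```

Each DAG file would then pass its own granularity to the shared function, for example through the `op_kwargs` argument of a `PythonOperator`.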

Hackathorn answered 26/2, 2022 at 22:20 Comment(0)

I recommend using option 1, except set the transition time explicitly; don't use now(). Airflow reparses DAG files frequently, so a value computed with now() changes on every parse and will not be static the way you want it to be.
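
As a rough sketch of that idea, the file below pins the handover to a literal date. The `CUTOVER` value, the history start date and the helper `build_dags` are assumptions for illustration. It also assumes that `end_date` is the last logical date Airflow will schedule, and it uses the `schedule_interval` argument name from the Airflow versions current when the question was asked (newer releases prefer `schedule`).

```python
from datetime import datetime, timedelta

# Hypothetical fixed cutover. It is a literal, so every parse sees the same value.
CUTOVER = datetime(2019, 9, 17)

DAG_SPECS = {
    "dag_backfill": {
        "start_date": datetime(2019, 1, 1),  # hypothetical start of history
        # Assumption: end_date is inclusive, so the last daily run starts one
        # day before CUTOVER and covers the day that ends at CUTOVER.
        "end_date": CUTOVER - timedelta(days=1),
        "schedule_interval": "@daily",
    },
    "dag_ongoing": {
        "start_date": CUTOVER,
        "schedule_interval": "@hourly",
    },
}


def build_dags() -> dict:
    """Create both DAG objects; requires Airflow to be installed."""
    # Imported here so the specs above can be inspected without Airflow.
    from airflow import DAG

    return {dag_id: DAG(dag_id=dag_id, catchup=True, **kwargs)
            for dag_id, kwargs in DAG_SPECS.items()}


# In the actual DAG file, expose the DAGs at module level so the scheduler
# can discover them:
# globals().update(build_dags())
```

With this layout the daily DAG stops just before the cutover and the hourly DAG begins exactly at it, so the two schedules do not overlap.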

I have read many, many files that generate multiple DAGs and it is not a problem, especially when those DAGs are closely conceptually related.

Treasure answered 15/4, 2023 at 3:8 Comment(0)
