How to make a DAG in Apache Airflow run like a simple cron job?
The Airflow scheduler has left me scratching my head for the past few days: it backfills DAG runs even with catchup=False. My timezone-aware DAG has a start date of 13-04-2021 19:30 PST (14-04-2021 02:30 UTC) and the following configuration:

import pendulum
from airflow import DAG

# define DAG and its parameters (default_args is defined elsewhere)
dag = DAG(
    'backup_dag',
    default_args=default_args,
    start_date=pendulum.datetime(2021, 4, 13, 19, 30, tz='US/Pacific'),  # set start_date in US/Pacific (PST) timezone
    description='A data backup pipeline',
    schedule_interval="30 19 * * *",  # 7:30 PM every day
    catchup=False,
    is_paused_upon_creation=False
)

This DAG runs on an edge device that is sometimes on and sometimes off. I want the DAG to run at 19:30 PST (02:30 UTC) only if the edge device is on at that time, and otherwise not run at all. The weird thing is that when I deploy the container with the DAG to the edge device, the DAG automatically starts its first run outside the scheduled time, for an interval that has already passed!

Screenshot from 2021-04-19 16-50-54

What am I missing here? I can't wrap my head around why the scheduler is doing this.

Following is my understanding after reading all the documentation, please do correct me if I'm wrong.

The DAG was picked up by the scheduler at 2021-04-19T11:30:00+00:00. According to the DAG config, it should ideally first run at 2021-04-20T02:30:00+00:00. All times below are in UTC.

  Event                            Time (UTC)
  DAG start_date                   2021-04-14T02:30:00+00:00
  1st run (skip, catchup=False)    2021-04-15T02:30:00+00:00
  2nd run (skip, catchup=False)    2021-04-16T02:30:00+00:00
  3rd run (skip, catchup=False)    2021-04-17T02:30:00+00:00
  4th run (skip, catchup=False)    2021-04-18T02:30:00+00:00
  5th run (skip, catchup=False)    2021-04-19T02:30:00+00:00
  6th run (should execute)         2021-04-20T02:30:00+00:00

So, why is the 5th run taking place for interval 2021-04-18T02:30:00+00:00 to 2021-04-19T02:30:00+00:00 even though the interval has passed?

I want the DAG to only run when its interval has come.

Colligan answered 22/4, 2021 at 11:59 Comment(0)
F
2

This is expected Airflow behavior:

turn catchup off. [...] When turned off, the scheduler creates a DAG run only for the latest interval.

The corresponding example in the Catchup section is similar to yours and explains the behavior in more detail.
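
The arithmetic behind "only for the latest interval" can be sketched in plain Python. This is a simplified model, not Airflow's scheduler code: it assumes a fixed 24-hour period and a scheduler pickup time of 11:30 UTC on 2021-04-19 (taken from the question), and `latest_completed_interval` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def latest_completed_interval(start_date, now, period=timedelta(days=1)):
    """Return (interval_start, interval_end) of the most recent interval
    that has fully elapsed at `now`, or None if no interval has ended yet."""
    if now < start_date + period:
        return None
    elapsed_periods = (now - start_date) // period  # whole periods since start
    interval_end = start_date + elapsed_periods * period
    return interval_end - period, interval_end


# Values from the question (UTC); the pickup time is an assumption.
start = datetime(2021, 4, 14, 2, 30, tzinfo=timezone.utc)
now = datetime(2021, 4, 19, 11, 30, tzinfo=timezone.utc)

interval = latest_completed_interval(start, now)
print(interval)
```

With these inputs the model returns the interval from 2021-04-18T02:30 to 2021-04-19T02:30 UTC, which matches the run the asker sees starting right after deployment.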

A dirty workaround I can think of is to set schedule_interval=None and trigger the DAG from cron using the Airflow CLI.
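
A rough sketch of that workaround, assuming Airflow 2.x (where the CLI command is `airflow dags trigger <dag_id>`) and a DAG already defined with schedule_interval=None. The helper names, the script path, and the crontab line are hypothetical, and cron is assumed to run in the device's local time zone.

```python
import shutil
import subprocess


def build_trigger_command(dag_id):
    """Build the Airflow 2.x CLI call that starts one manual run of dag_id."""
    return ["airflow", "dags", "trigger", dag_id]


def trigger_dag(dag_id):
    """Trigger the DAG if the airflow CLI is on PATH.

    Returns True when the trigger command was issued, False otherwise.
    """
    if shutil.which("airflow") is None:
        print("airflow CLI not found on PATH; nothing triggered")
        return False
    subprocess.run(build_trigger_command(dag_id), check=True)
    return True


# Hypothetical crontab entry (paths are assumptions, adjust to your device):
# 30 19 * * * /usr/bin/python3 /opt/scripts/trigger_backup.py
if __name__ == "__main__":
    trigger_dag("backup_dag")
```

Because cron only fires while the device is running, a missed 19:30 slot is simply skipped instead of being run later.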

Faultfinder answered 22/4, 2021 at 17:28 Comment(4)
Hmmm 🤔 kind of an interesting workaround, but it seems like a big oversight from the Airflow devs - Colligan
@RafayKhan they are working on a solution cwiki.apache.org/confluence/display/AIRFLOW/… - Faultfinder
I tried circumventing this by setting a dynamic start date of yesterday, but it seems that does not fix this either. I'm going to try your method now. - Colligan
This suggestion bypasses the whole benefit of using Airflow as a task orchestrator. Airflow is perfectly capable of handling such cases. Please look at my answer below. - Knew
K
-1

This is the expected behavior. If you set catchup=False, the scheduler skips all previous DAG runs except the very latest one. To handle your scenario in Airflow, use a sensor. Proceed as follows:

  • Set catchup=False.
  • Set depends_on_past=False.
  • Create a TimeSensor as your first task. Your DAG will proceed only after the time sensor's condition is met.

Here is an example of using a TimeSensor: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/example_dags/example_sensors.html
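
The gating idea behind the sensor step can be modelled in plain Python. This is a stand-in for illustration, not the TimeSensor class itself: `time_gate_open` is a hypothetical helper, and the timestamps are assumed local times on the device.

```python
from datetime import datetime, time


def time_gate_open(now_local, target_time):
    """Return True once the local wall-clock time is later than target_time."""
    return now_local.time() > target_time


target = time(19, 30)  # 7:30 PM, matching the DAG's schedule

# Run created early in the day (e.g. right after the device boots): keep waiting.
print(time_gate_open(datetime(2021, 4, 19, 4, 30), target))

# Same day, just after 7:30 PM: downstream tasks may proceed.
print(time_gate_open(datetime(2021, 4, 19, 19, 31), target))
```

In a real DAG the sensor task repeats this kind of check at its poke interval and lets the downstream tasks start once it succeeds.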

Knew answered 20/1 at 7:25 Comment(0)
