The Airflow scheduler has had me scratching my head for the past few days: it backfills DAG runs even though catchup=False is set.
My timezone-aware DAG has a start date of 13-04-2021 19:30 PST (14-04-2021 02:30 UTC) and has the following configuration:
# define DAG and its parameters
import pendulum
from airflow import DAG

# default_args is defined elsewhere in the DAG file (not shown here)
dag = DAG(
    'backup_dag',
    default_args=default_args,
    start_date=pendulum.datetime(2021, 4, 13, 19, 30, tz='US/Pacific'),  # start_date in US/Pacific (PST) timezone
    description='A data backup pipeline',
    schedule_interval="30 19 * * *",  # 7:30 PM every day
    catchup=False,
    is_paused_upon_creation=False,
)
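As a sanity check on the timezone conversion above, here is a minimal standard-library sketch. It assumes US/Pacific is on daylight time (UTC-7) in April 2021, and it uses a fixed offset instead of pendulum so it runs without extra packages. The name PACIFIC_DAYLIGHT is only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US/Pacific observes daylight time (UTC-7) on 2021-04-13.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7))

# The DAG's start_date as wall-clock time on the edge device.
start_local = datetime(2021, 4, 13, 19, 30, tzinfo=PACIFIC_DAYLIGHT)

# The same instant expressed in UTC.
start_utc = start_local.astimezone(timezone.utc)

print(start_utc.isoformat())  # 2021-04-14T02:30:00+00:00
```

So 19:30 local time corresponds to 02:30 UTC on the following day, which is the time used in the rest of this question.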
This DAG runs on an edge device that is sometimes on and sometimes off. I want the DAG to run at 19:30 PST (02:30 UTC) only if the edge device is on at that time, and otherwise not run at all. The weird thing is that when I deploy the container with the DAG to the edge device, the DAG automatically starts its first run outside the scheduled time, for an interval that has already passed!
What am I missing here? I can't wrap my head around why the scheduler is doing this.
Below is my understanding after reading the documentation; please correct me if I'm wrong.
The DAG was picked up by the scheduler at 2021-04-19T11:30:00+00:00. According to the DAG config, it should ideally run next at 2021-04-20T02:30:00+00:00. All times below are in UTC.
Event                              Time (UTC)
DAG start_date                     2021-04-14T02:30:00+00:00
1st run (skip, catchup=False)      2021-04-15T02:30:00+00:00
2nd run (skip, catchup=False)      2021-04-16T02:30:00+00:00
3rd run (skip, catchup=False)      2021-04-17T02:30:00+00:00
4th run (skip, catchup=False)      2021-04-18T02:30:00+00:00
5th run (skip, catchup=False)      2021-04-19T02:30:00+00:00
6th run (should execute)           2021-04-20T02:30:00+00:00
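The arithmetic behind this timeline can be reproduced with a small standard-library sketch. It assumes the scheduler picked the DAG up at 11:30 UTC on 2021-04-19, and it only enumerates the daily 02:30 UTC ticks. The helper daily_ticks is hypothetical and is not Airflow's scheduling code.

```python
from datetime import datetime, timedelta, timezone


def daily_ticks(start, now):
    """Return every daily schedule tick from start up to and including now."""
    ticks = []
    tick = start
    while tick <= now:
        ticks.append(tick)
        tick += timedelta(days=1)
    return ticks


start = datetime(2021, 4, 14, 2, 30, tzinfo=timezone.utc)       # start_date in UTC
picked_up = datetime(2021, 4, 19, 11, 30, tzinfo=timezone.utc)  # assumed pickup time

ticks = daily_ticks(start, picked_up)

# The most recent interval that had fully elapsed when the DAG was picked up.
latest_interval = (ticks[-2], ticks[-1])

# The first tick that was still in the future at pickup time.
next_tick = ticks[-1] + timedelta(days=1)

print("latest completed interval:", latest_interval[0].isoformat(), "to", latest_interval[1].isoformat())
print("next tick after pickup:", next_tick.isoformat())
```

Under these assumptions the latest completed interval is 2021-04-18T02:30 to 2021-04-19T02:30, which is exactly the interval of the run I observe, and the next tick is 2021-04-20T02:30, which is when I expected the first run.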
So why is the 5th run taking place for the interval 2021-04-18T02:30:00+00:00 to 2021-04-19T02:30:00+00:00 even though that interval has already passed?
I want the DAG to run only when its scheduled time arrives.