Airflow setup for high availability
How do you deploy the Apache Airflow (formerly known as Airbnb's Airflow) scheduler in high availability?

I am not asking about the backend DB or RabbitMQ, which should obviously be deployed in a high-availability configuration.

My main focus is the scheduler: does anything special need to be done?

Webworm answered 19/9, 2016 at 11:27 Comment(0)
After a bit of digging I found that it is not safe to run multiple schedulers simultaneously, which means that out of the box, Airflow schedulers are not safe to use in high-availability environments.

The Airflow team is planning to solve this issue by adding a lock mechanism on the DAG data structure, but this is not implemented yet (I checked by running 2 schedulers and saw that they scheduled the same DAG instances, which is not good). This is described here: https://groups.google.com/forum/#!topic/airbnb_airflow/-1wKa3OcwME

I did find a way to work around this high-availability issue by wrapping the schedulers with my own code and using cluster tools for leader election (I personally use Consul for this purpose). This way only the elected master runs the scheduler, and when the master goes down a slave replaces it.
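A minimal sketch of this pattern using Consul's built-in `consul lock` command (the KV prefix name is illustrative, and this fragment assumes a local Consul agent is running, so it is shown untested):

```shell
#!/bin/sh
# Run the scheduler under a Consul leader lock. "consul lock" blocks until
# the lock at the given KV prefix is acquired, then runs the child command;
# only one node cluster-wide holds the lock at a time. If the leader dies,
# its session expires, the lock is released, and a standby node's loop
# acquires it and starts its own scheduler.
while true; do
  consul lock airflow/scheduler-leader "airflow scheduler"
  sleep 5   # brief backoff before trying to re-acquire
done
```

The loop matters: when the elected node loses the lock (or the scheduler exits), it goes back to competing for leadership instead of staying dead.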

Please consider this when you use Airflow in high-availability environments, since out of the box the Airflow scheduler is currently not suitable for this (unless you solve the issue yourself).

Edit - an alternative approach to the master/slave solution is to use a cluster manager/scheduler to make sure that only one Airflow scheduler instance is running at any time. This approach relies on the self-healing abilities of the cluster manager you use. For example, both Mesos and Nomad support this kind of configuration (I personally chose Nomad for its simplicity).
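For the Nomad variant, a job definition along these lines keeps exactly one scheduler task alive and restarts it elsewhere if its node fails (the datacenter name, Docker image tag, and restart timings below are placeholders, not a tested deployment):

```hcl
job "airflow-scheduler" {
  datacenters = ["dc1"]
  type        = "service"

  group "scheduler" {
    count = 1   # exactly one scheduler instance cluster-wide

    restart {
      attempts = 3
      interval = "5m"
      delay    = "30s"
      mode     = "delay"   # keep retrying with a delay instead of failing
    }

    task "scheduler" {
      driver = "docker"
      config {
        image   = "apache/airflow:1.10.15"
        command = "scheduler"
      }
    }
  }
}
```

`count = 1` plus the restart policy is what provides the "only one, but always one" guarantee the answer describes.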

Webworm answered 20/9, 2016 at 13:31 Comment(5)
I'm researching this and came across this via Google. Did you blog about this, or is there any sample code that describes how Nomad and Consul play into all this? We use Consul, but Nomad is still new to us. I want to be able to fire up the scheduler on a different node or the same node should it go down for some reason.Prieto
I didn't blog about it. If you set up Nomad, just specify it as a service, use count=1, and make sure the constraints match a few nodes. That should do the trick.Webworm
You can also use Kubernetes or any other orchestration tool; this is called self-healing.Webworm
@luckytaxi Not HA, but: because we experienced scheduling lag over 5 minutes, we added some regular expressions to ignore some DAGs (so 5 schedulers roughly cover 1/5 of the DAGs each with this setup). To do this, the code which deactivates DAGs had to be skipped, since a scheduler that ignores a DAG might deactivate it. That leaves adding a manual step to call the periodic deactivation from a webserver that sees all the DAGs. Perfect? No - a round robin would be self-balancing, but would need a model-store change. CloudFormation restarts stopped schedulers for us.Peacock
For anyone that finds this thread on Google, this answer is out of date; The scheduler is able to run in a HA capacity as of Airflow 2.0: astronomer.io/blog/airflow-2-schedulerHalfpenny
My personal experience was to follow the best-practice instructions I found: restart the scheduler every 10 runs ( -N 10 ) and use this software when possible:

https://github.com/teamclairvoyant/airflow-scheduler-failover-controller

I also use a DAG which pings a monitoring system, to make sure that the scheduler has not gone away.
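A heartbeat DAG like that might look as follows. This is a sketch, not the answerer's actual code: the monitoring URL and DAG/task names are hypothetical, and the import paths shown are the Airflow 2 ones (this 2017 answer would have used the older `airflow.operators.bash_operator` path):

```python
# Dead man's switch: this DAG only fires while the scheduler is alive.
# If the scheduler dies, the ping stops and the monitoring system
# (configured to expect a ping every few minutes) raises an alert.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="scheduler_heartbeat",
    start_date=datetime(2017, 1, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
) as dag:
    ping = BashOperator(
        task_id="ping_monitoring",
        # -f makes curl exit non-zero on HTTP errors, failing the task
        bash_command="curl -fsS https://monitoring.example.com/ping/airflow-scheduler",
    )
```

The alerting logic lives entirely in the monitoring system; the DAG's only job is to go silent when the scheduler does.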

Lodicule answered 4/6, 2017 at 7:45 Comment(0)
NOTE: This answer only applies to Airflow >= 2.0.

According to the docs, it appears to be as simple as starting a second airflow-scheduler on another node:

The short version is that users of PostgreSQL 10+ or MySQL 8+ are all ready to go – you can start running as many copies of the scheduler as you like – there is no further set up or config options needed.

See this blog post from Astronomer about the work they did to make the scheduler support running in a HA capacity.
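Concretely (a sketch, with an illustrative connection string): every node points at the same metadata database, and you simply start another scheduler process.

```shell
# In airflow.cfg on every scheduler node (same database for all):
#   [database]
#   sql_alchemy_conn = postgresql+psycopg2://airflow:PASSWORD@pg-host:5432/airflow

# On node 1:
airflow scheduler

# On node 2 (identical configuration):
airflow scheduler
```

The schedulers coordinate through row-level locks in the shared database, which is why PostgreSQL 10+ or MySQL 8+ is required.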

Halfpenny answered 5/6, 2023 at 14:51 Comment(0)
In my scenario, I have 2 schedulers (on 2 separate Docker Swarms), with the standby cluster's scheduler turned off (using Docker Swarm service scale=0). I needed to make sure the primary scheduler had fully stopped before I started up the standby scheduler. What I found was that having 2 running schedulers (even for a brief period) occasionally resulted in a DAG being scheduled to run on both clusters, leading to duplicate reports generated from two different cluster zones.
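A failover of this kind might be scripted roughly as follows (untested sketch; the service name `airflow_scheduler` is illustrative, and each `docker` command must run against the right swarm's manager):

```shell
#!/bin/sh
# On the primary swarm: stop the scheduler service.
docker service scale airflow_scheduler=0

# Wait until no scheduler task is still running - this is the gap the
# answer describes: starting the standby before this point can let two
# schedulers overlap and double-schedule DAGs.
while [ -n "$(docker service ps -q -f desired-state=running airflow_scheduler)" ]; do
  sleep 2
done

# On the standby swarm: only now bring its scheduler up.
docker service scale airflow_scheduler=1
```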

Mennonite answered 28/6, 2022 at 12:52 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.