Airflow versus AWS Step Functions for workflows

I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.

I already have software patterns from other projects for Airflow + Batch, but I haven't dealt with the scaling factor of 10K parallel tasks. Airflow is nice since I can see which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. The other option would be to have one task that kicks off the 10K containers and monitors them from there (a rough sketch of this option is at the end of the question).

I have no experience with Step Functions, but I have heard it's AWS's Airflow. There seem to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insight into failing jobs / the same ability to retry tasks as you do with Airflow?
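
For concreteness, a minimal sketch of that second option, assuming plain boto3 against the AWS Batch API (the job queue and job definition names are placeholders; describe_jobs accepts at most 100 job IDs per call):

    import time

    import boto3

    batch = boto3.client("batch")

    def submit_shards(num_shards):
        """Submit one Batch job per data shard and return the job IDs."""
        job_ids = []
        for i in range(num_shards):
            resp = batch.submit_job(
                jobName=f"process-shard-{i}",
                jobQueue="my-job-queue",            # placeholder name
                jobDefinition="my-job-definition",  # placeholder name
                containerOverrides={
                    "environment": [{"name": "SHARD", "value": str(i)}],
                },
            )
            job_ids.append(resp["jobId"])
        return job_ids

    def wait_for_jobs(job_ids, poll_seconds=60):
        """Poll describe_jobs until every job reaches a terminal state."""
        statuses, pending = {}, set(job_ids)
        while pending:
            for job in batch.describe_jobs(jobs=list(pending)[:100])["jobs"]:
                if job["status"] in ("SUCCEEDED", "FAILED"):
                    statuses[job["jobId"]] = job["status"]
                    pending.discard(job["jobId"])
            if pending:
                time.sleep(poll_seconds)
        return statuses

Note that an AWS Batch array job (submit_job with arrayProperties={"size": 10000}) would collapse the 10K submissions into a single call, with each child container reading its index from the AWS_BATCH_JOB_ARRAY_INDEX environment variable.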

Plenitude answered 22/9, 2020 at 19:58 Comment(5)
my anecdotal experience with Airflow says that 10k concurrent tasks would simply choke the scheduler (in fact, 2-3k concurrent tasks would be enough for that); but even well before that you'll start getting annoyed by the relatively slow Flask frontend (which doesn't auto-refresh). I have never explored AWS Step Functions but can give you my 2 cents on Airflow: [1] do NOT create monolithic DAGs (with hundreds of tasks): try keeping DAGs at < 10 tasks. Also do NOT create unnecessary dependencies between tasks: each dependency adds extra work for the schedulerChuckle
[2] design your workflows (tasks / operators) to use Airflow as a pure orchestrator: tasks should delegate the heavy lifting (the actual processing) to external systems (machines other than the ones Airflow and its workers run on). That way, you'll be able to scale your Airflow deployment independently of the variety of tasks it triggers (a minimal sketch of this pattern follows these comments). [3] keep your DAGs (as well as the individual tasks in them) immutableChuckle
the primary reason why I feel Airflow can't run so many concurrent things is that the scheduler essentially works by polling (periodically checking which tasks can be run and then running them)Chuckle
Do check out Netflix's Metaflow, which leverages AWS Step FunctionsChuckle
@Plenitude this looks like misuse of an orchestrator. Splitting processing into subtasks should be the responsibility of the processing framework. For example, Airflow triggers one job, while Spark on Glue or EMR splits the data into tasks under the hood; you should worry only about application logicChamkis
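
A minimal sketch of comment [2]'s pure-orchestrator idea, assuming Airflow 2 and boto3 (all names are made up): the Airflow task only submits the job to AWS Batch and exits, so the heavy lifting never runs on Airflow's own workers.

    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def submit_to_batch():
        # Delegate the actual processing to AWS Batch; this task stays light.
        boto3.client("batch").submit_job(
            jobName="heavy-lifting",
            jobQueue="my-job-queue",            # placeholder
            jobDefinition="my-job-definition",  # placeholder
        )

    with DAG(
        dag_id="orchestrator_only",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(task_id="submit_batch_job", python_callable=submit_to_batch)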

I have worked on both Apache Airflow and AWS Step Functions and here are some insights:

  • Step Functions provides out-of-the-box maintenance. It has the high availability and scalability required for your use case; with Airflow we would have to build that ourselves with auto-scaling/load balancing on servers or containers (Kubernetes).*
  • Both Airflow and Step Functions have user-friendly UIs. While Airflow supports multiple representations of the state machine, Step Functions only displays the state machine as a DAG.
  • As of version 2.0, Airflow's REST API is now stable. AWS Step Functions is likewise supported by a range of production-grade CLIs and SDKs.
  • Airflow has server costs, while Step Functions gives you 4,000 free state transitions per month (free tier) and charges $0.000025 per transition after that. E.g. if you use 10K steps for AWS Batch running once daily, you will be charged $0.25 per day ($7.50 per month). The price of an Airflow server (t2.large EC2, 1-year reserved instance) is $41.98 per month. We would have to use AWS Batch in either case.**
  • AWS Batch can integrate with both Airflow and Step Functions.
  • You can clear and rerun a failed task in Apache Airflow, but in Step Functions you would have to create a custom implementation to handle that. You can, however, handle automated retries with back-offs in the Step Functions definition itself (see the first sketch after this list).
  • For a failed task in Step Functions you get a visual representation of the failed state plus a detailed message when you click it. You can also use the AWS CLI or SDK to get the details (also shown in the first sketch below).
  • Step Functions uses easy-to-read JSON (the Amazon States Language) as its state machine definition, while Airflow DAGs are defined in Python.
  • Step Functions supports async callbacks, i.e. the state machine pauses until an external source notifies it to resume (see the second sketch after this list); Airflow has yet to add this feature.
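
A minimal sketch of the retry-with-back-off and failure-inspection bullets above, assuming boto3; every ARN and resource name below is a placeholder. The Retry block is standard Amazon States Language:

    import json

    import boto3

    # Fragment of a state machine definition with automated retries and
    # exponential back-off on a Batch submission state; it would be passed
    # to create_state_machine as definition=json.dumps(definition).
    definition = {
        "StartAt": "SubmitBatchJob",
        "States": {
            "SubmitBatchJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": "process-shard",
                    "JobQueue": "my-job-queue",            # placeholder
                    "JobDefinition": "my-job-definition",  # placeholder
                },
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,  # waits 30s, 60s, 120s between attempts
                }],
                "End": True,
            }
        },
    }

    # Inspecting a failed execution from the SDK (the CLI equivalents are
    # `aws stepfunctions describe-execution` / `get-execution-history`):
    sfn = boto3.client("stepfunctions")
    execution_arn = "arn:aws:states:us-east-1:123456789012:execution:my-sm:run-1"
    print(sfn.describe_execution(executionArn=execution_arn)["status"])  # e.g. FAILED
    events = sfn.get_execution_history(executionArn=execution_arn)["events"]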
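
And a sketch of the async-callback bullet: a task state whose resource ARN ends in .waitForTaskToken pauses the execution and hands a token to an external system, which later resumes it. The function name and payload below are illustrative:

    import json

    import boto3

    sfn = boto3.client("stepfunctions")

    def on_external_event(task_token, result):
        # Called by the external system (e.g. when a long-running job ends).
        # task_token was injected into the paused state via "$$.Task.Token".
        sfn.send_task_success(taskToken=task_token, output=json.dumps(result))

    # On failure the worker would instead call:
    #   sfn.send_task_failure(taskToken=task_token, error="JobFailed", cause="...")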

Overall, I see more advantages to using AWS Step Functions, but you will have to weigh the maintenance and development costs of both services against your use case.

UPDATES (AWS Managed Workflows for Apache Airflow Service):

  • *With the Amazon Managed Workflows for Apache Airflow (MWAA) service, you can offload the deployment, maintenance, auto-scaling/load balancing and security of your Airflow service to AWS. But consider which version you're willing to settle for, as AWS managed services usually lag behind the latest release (e.g. as of March 08, 2021, the latest open-source Airflow release is 2.0.1, while MWAA offers 1.10.12).
  • **MWAA charges for the environment, additional instances and storage.
Chiu answered 7/10, 2020 at 8:11 Comment(7)
is this still all valid with AWS-managed Airflow...? If not, let's update the answer hereNiki
@NathanBenton thanks for the pointer, will update soon.Chiu
@Chiu If I have 5 SQL queries and want to create 5 DAGs so that I can trigger 5 different SQL operations and also check if they fail so that I can trigger that particular DAG again: can I set an email notification for each DAG? Can I also set a retry for EACH DAG, and can I achieve the same in Step Functions? I remember that if I create one DAG for all the SQL queries and one fails, I have to restart the whole DAG. That's why I was thinking of creating one DAG per SQL query, so that if one fails it retries before moving on to the next one. Please guideSixtasixteen
see the "retries" argument for the case where a DAG's task fails (a minimal sketch follows these comments)Aftershaft
what about monitoring and logging? Do we have the same logging tools for AWS Step Functions as for Airflow?Lebrun
Step Functions state machines are not DAGs. They can have cycles.Undertaking
As of 2024 I do not see a reason to use Airflow. Step Functions is the way to go in an AWS environment.Solo
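
To illustrate the "retries" suggestion (and the per-DAG email question above), a minimal Airflow 2 sketch: these are per-task settings, usually set once via default_args, and failure emails additionally require SMTP to be configured in airflow.cfg.

    from datetime import datetime, timedelta

    from airflow import DAG

    default_args = {
        "retries": 3,                         # re-run a failed task 3 times
        "retry_delay": timedelta(minutes=5),  # wait between attempts
        "email_on_failure": True,
        "email": ["oncall@example.com"],      # placeholder address
    }

    with DAG(
        dag_id="sql_query_1",                 # hypothetical: one DAG per query
        default_args=default_args,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ...  # add the task that runs this particular SQL query here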

I have used both Airflow and Step Functions in my personal and work projects.

  • In general I liked Step Functions, but the fact that you need to schedule executions with EventBridge is super annoying. Actually, I think Airflow could just act as a trigger for the Step Functions here (see the sketch after this list).
  • If Airflow were cheaper to manage, I would always opt for it, because I find managing JSON-based pipelines a hassle whenever I need to detour from the main use case, which somehow always happens to me. This becomes an even more complex issue when you need to have source control.
  • This one is a more subjective assessment, but I find the monitoring capability of Airflow far greater than that of Step Functions.
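
A minimal sketch of the trigger idea in the first bullet, assuming boto3 inside an Airflow PythonOperator callable; the state machine ARN and input are placeholders:

    import json

    import boto3

    def trigger_state_machine():
        # Airflow owns the schedule; this just starts the Step Functions run.
        boto3.client("stepfunctions").start_execution(
            stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:my-pipeline",
            input=json.dumps({"run_date": "2022-07-27"}),
        )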

Also some information about the usage of Airflow vs Step Functions:

[chart: Airflow vs Step Functions usage comparison; image not reproduced]

Gemsbok answered 27/7, 2022 at 13:56 Comment(0)

AWS currently has managed Airflow (MWAA), which is priced per hour, so you don't need a dedicated EC2 instance. On the other hand, Step Functions are AWS Lambdas with an execution time limit of 15 minutes, which makes them not the best candidate for long-running pipelines.

Prance answered 1/2, 2023 at 3:24 Comment(2)
It is possible to use Step Functions for use cases that require more than 15 minutes. Step Functions can also connect to AWS Batch and Glue jobs, which can run longer.Gilemette
No, AWS Step Functions are not AWS Lambdas; these two services have nothing in common.Illuminate
