I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10k containers on AWS Batch to process the divvied data in parallel, aggregates the results, and pushes them back to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factor of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging, but dealing with that many tasks on one Airflow EC2 instance seems like a barrier. The other option would be to have one task that kicks off the 10k containers and monitors them from there.
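For the second option, the rough shape I have in mind is a single AWS Batch array job of size 10k, submitted and polled from one task. A minimal sketch, assuming boto3 (queue and job-definition names are placeholders, not real resources):

    # One Batch array job instead of 10k separate workflow tasks.
    import time

    import boto3

    batch = boto3.client("batch")

    resp = batch.submit_job(
        jobName="process-divvied-data",
        jobQueue="my-batch-queue",              # placeholder
        jobDefinition="my-processing-job-def",  # placeholder
        arrayProperties={"size": 10_000},       # one child job per data chunk
        retryStrategy={"attempts": 2},          # Batch retries failed children itself
    )
    job_id = resp["jobId"]

    # Each child container reads AWS_BATCH_JOB_ARRAY_INDEX to pick its chunk of the
    # divvied data; the submitting task just polls the parent job until it settles.
    while True:
        status = batch.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        if status in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(60)

    print(job_id, status)

The appeal is that Batch tracks and retries the 10k children itself; the open question is whether I lose the per-task visibility I get from Airflow.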
I have no experience with Step Functions, but have heard it's AWS's Airflow. There seem to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insight into failing jobs / ability to retry tasks as you do with Airflow?
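For reference, the Step Functions + Batch patterns I've seen online seem to boil down to something like the sketch below: a Map state fans out over the divvied chunks, and each iteration submits one Batch job with its own retry policy, which is the part that looks analogous to Airflow's per-task retries. Untested on my side; every ARN and name is a placeholder, and I haven't checked the Map state's limits at 10k items (the distributed Map mode may be needed):

    # Hypothetical state machine definition, expressed as a Python dict for readability.
    import json

    definition = {
        "StartAt": "ProcessChunks",
        "States": {
            "ProcessChunks": {
                "Type": "Map",
                "ItemsPath": "$.chunks",   # list of S3 prefixes, one per divvied chunk
                "MaxConcurrency": 100,     # placeholder; 10k items likely needs distributed Map
                "Iterator": {
                    "StartAt": "RunBatchJob",
                    "States": {
                        "RunBatchJob": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::batch:submitJob.sync",
                            "Parameters": {
                                "JobName": "process-chunk",
                                "JobQueue": "arn:aws:batch:us-east-1:111122223333:job-queue/my-queue",
                                "JobDefinition": "arn:aws:batch:us-east-1:111122223333:job-definition/my-def",
                                "ContainerOverrides": {
                                    "Environment": [{"Name": "CHUNK_PREFIX", "Value.$": "$"}]
                                },
                            },
                            "Retry": [
                                {
                                    "ErrorEquals": ["States.TaskFailed"],
                                    "IntervalSeconds": 60,
                                    "MaxAttempts": 2,
                                    "BackoffRate": 2.0,
                                }
                            ],
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }

    print(json.dumps(definition, indent=2))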
That many parallel tasks will likely overwhelm the Airflow scheduler (in fact 2-3k concurrent tasks would be enough for that); but even well before that you'll start getting annoyed at the relatively slow Flask frontend (which doesn't auto-refresh things). Have never explored AWS Step Functions, but can give you my 2 cents on Airflow: [1] do NOT create monolithic DAGs (with hundreds of tasks): try keeping DAGs at < 10 tasks; [2] do NOT create unnecessary dependencies b/w tasks: each dependency adds extra work for the scheduler. – Chuckle
The scheduler essentially works on polling (periodically checking which tasks can be run and then running them). – Chuckle
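To make the "keep DAGs small" advice concrete for the pipeline above, here is a hedged sketch, assuming a recent Airflow 2.x TaskFlow API, where the entire 10k fan-out lives behind a single task that submits a Batch array job. submit_array_job is a hypothetical stand-in for the boto3 submit-and-poll logic sketched earlier, not a real Airflow or boto3 API:

    from datetime import datetime

    from airflow.decorators import dag, task


    def submit_array_job(size: int) -> str:
        # Hypothetical helper: submit a Batch array job of the given size, poll it
        # to completion, and return the Batch job id (see the boto3 sketch above).
        ...


    @dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
    def s3_batch_pipeline():
        @task
        def divvy_input() -> int:
            # pre-process the S3 input and return the number of chunks produced
            ...
            return 10_000

        @task(retries=2)
        def run_fanout(num_chunks: int) -> str:
            # one Airflow task owns the whole fan-out; Batch tracks the 10k children
            return submit_array_job(size=num_chunks)

        @task
        def aggregate_and_push(job_id: str) -> None:
            # post-aggregate the per-chunk outputs and push the result back to S3
            ...

        aggregate_and_push(run_fanout(divvy_input()))


    s3_batch_pipeline()

With this shape the scheduler only ever sees a handful of task instances per run, and a retry on the fan-out task re-submits one array job rather than 10k individual tasks.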