I am working on a project that grabs a set of input data from AWS S3, pre-processes and divvies it up, spins up 10K batch containers to process the divvied data in parallel on AWS Batch, post-aggregates the data, and pushes it to S3.
I already have software patterns from other projects for Airflow + Batch, but have not dealt with the scaling factors of 10k parallel tasks. Airflow is nice since I can look at which tasks failed and retry a task after debugging. But dealing with that many tasks on one Airflow EC2 instance seems like a barrier. Other option would be to have one task that kicks off the 10k containers and monitors it from there.
I have no experience with Step Functions, but have heard it's AWS's Airflow. There looks to be plenty of patterns online for Step Functions + Batch. Does Step Functions seem like a good path to check out for my use case? Do you get the same insights on failing jobs / ability to retry tasks as you do with Airflow?
I have worked on both Apache Airflow and AWS Step Functions and here are some insights:
Overall, I see more advantages of using AWS Step Functions. You will have to consider maintenance cost and development cost for both services as per your use case.
UPDATES (AWS Managed Workflows for Apache Airflow Service):