amazon-web-servicesaws-step-functionsaws-data-pipeline

AWS Data Pipeline vs Step Functions


I am working on a problem where we intend to perform multiple transformations on data using EMR (SparkSQL).

After going through the documentation of AWS Data Pipelines and AWS Step Functions, I am slightly confused as to what is the use-case each tries to solve. I looked around but did not find a authoritative comparison between both. There are multiple resources that show how I can use them both to schedule and trigger Spark jobs on an EMR cluster.

  1. Which one should I use for scheduling and orchestrating my processing EMR jobs?

  2. More generally, in what situation would one be a better choice over the other as far as ETL/data processing is concerned?


Solution

  • Yes, there are many ways to achieve the same thing, and the difference is in the details and in your use case. I am going to even offer yet one more alternative :)

    If you are doing a sequence of transformations and all of them are on an EMR cluster, maybe all you need is either to create the cluster with steps, or submit an API job with several steps. Steps will execute in order on your cluster.

    If you have different sources of data, or you want to handle more complex scenarios, then both AWS Data Pipeline and AWS Step Functions would work. AWS Step Functions is a generic way of implementing workflows, while Data Pipelines is a specialized workflow for working with Data.

    That means that Data Pipeline will be better integrated when it comes to deal with data sources and outputs, and to work directly with tools like S3, EMR, DynamoDB, Redshift, or RDS. So for a pure data pipeline problem, chances are AWS Data Pipeline is a better candidate.

    Having said so, AWS Data Pipeline is not very flexible. If the data source you need is not supported, or if you want to execute some activity which is not integrated, then you need to hack your way around with shell scripts.

    On the other hand, AWS Step Functions are not specialized and have good integration with some AWS Services and with AWS Lambda, meaning you can easily integrate with anything via serverless apis.

    So it really depends on what you need to achieve and the type of workload you have.