pyspark spark-streaming amazon-emr amazon-kinesis aws-data-pipeline

Spark Streaming scheduling best practices


We have a Spark Streaming job that runs every 30 minutes and takes 15 seconds to complete. What are the suggested best practices in this scenario? I am thinking I can schedule AWS Data Pipeline to run every 30 minutes, so that EMR terminates after 15 seconds and is recreated on the next run. Is this the recommended approach?


Solution

  • For a job that takes 15 seconds, running it on EMR is a waste of time and resources: you will likely wait several minutes just for the EMR cluster to bootstrap.

    AWS Data Pipeline or AWS Batch make sense only if you have a long-running job.

    First, make sure you really need Spark, since from what you have described it could be overkill.

    A Lambda function triggered on a CloudWatch Events schedule may be all you need for such a quick job, with no infrastructure to manage.
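To make the Lambda suggestion concrete, here is a minimal sketch of a handler that a CloudWatch Events (EventBridge) scheduled rule could invoke every 30 minutes. The function names `fetch_records` and `process_records` are hypothetical placeholders standing in for whatever the original Spark job reads and computes; they are not part of any AWS API.

```python
def fetch_records():
    # Hypothetical placeholder: pull the batch the Spark job would have read
    # (e.g. from Kinesis or S3 via boto3).
    return [1, 2, 3]

def process_records(records):
    # Hypothetical placeholder for the ~15-second transformation.
    return len(records)

def handler(event, context):
    # A scheduled CloudWatch Events rule invokes this on a 30-minute cadence;
    # the event payload for a scheduled rule carries only trigger metadata,
    # so it is ignored here.
    records = fetch_records()
    result = process_records(records)
    return {"statusCode": 200, "processed": result}
```

The schedule itself is configured on the rule (e.g. a `rate(30 minutes)` expression), not in the code, so there is no cluster to bootstrap or tear down between runs.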