We have a Spark Streaming job that runs every 30 minutes and takes 15 seconds to complete. What are the suggested best practices in this scenario? I am thinking of scheduling AWS Data Pipeline to run every 30 minutes, so that the EMR cluster terminates after 15 seconds and is recreated for the next run. Is this the recommended approach?
For a job that takes 15 seconds, running it on EMR is a waste of time and resources: you will likely wait a few minutes for the EMR cluster to bootstrap before the job even starts.
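To put a rough number on that overhead, here is a small back-of-the-envelope sketch. The 300-second bootstrap time is only an illustrative assumption standing in for "a few minutes"; measure your own cluster to get a real figure.

```python
def useful_fraction(job_seconds, bootstrap_seconds):
    """Share of each cluster lifetime spent on the job itself."""
    return job_seconds / (job_seconds + bootstrap_seconds)


JOB_SECONDS = 15          # from the question
BOOTSTRAP_SECONDS = 300   # assumed value, not a measured one

fraction = useful_fraction(JOB_SECONDS, BOOTSTRAP_SECONDS)
print(f"Useful work per run: {fraction:.1%}")
```

Under that assumption, less than 5% of each cluster's lifetime goes to the actual job; the rest is startup.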
AWS Data Pipeline or AWS Batch makes sense only if you have a long-running job.
First, make sure that you really need Spark, since from what you described it could be overkill.
A Lambda function triggered by a scheduled CloudWatch Events rule might be all you need for such a quick job, with no infrastructure to manage.
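As a minimal sketch of the Lambda route, assuming the job can be rewritten as plain Python: the handler below reads the trigger time from the scheduled event and processes the preceding 30-minute window. `process_window` is a hypothetical placeholder for your actual logic, and the window size is taken from the question. The rule itself would use a schedule expression such as `rate(30 minutes)`.

```python
import json
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=30)  # matches the 30-minute schedule in the question


def process_window(start, end):
    """Hypothetical placeholder: put the real 15-second job here."""
    return {"start": start.isoformat(), "end": end.isoformat()}


def lambda_handler(event, context):
    # Scheduled events carry the trigger time as a UTC string,
    # e.g. "2020-01-01T12:30:00Z".
    fired_at = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    fired_at = fired_at.replace(tzinfo=timezone.utc)
    result = process_window(fired_at - WINDOW, fired_at)
    return {"statusCode": 200, "body": json.dumps(result)}
```

You can exercise the handler locally by calling `lambda_handler({"time": "2020-01-01T12:30:00Z"}, None)` before deploying it.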