Tags: hadoop, airflow, google-cloud-dataproc, google-cloud-composer, oozie-workflow

Workflow scheduling on GCP Dataproc cluster


I have some complex Oozie workflows to migrate from on-prem Hadoop to GCP Dataproc. The workflows consist of shell scripts, Python scripts, Spark-Scala jobs, Sqoop jobs, etc.

I have come across some potential solutions for my workflow-scheduling needs:

  1. Cloud Composer
  2. Dataproc Workflow Templates with Cloud Scheduler
  3. Install Oozie on a Dataproc autoscaling cluster

Please let me know which option would be most efficient in terms of performance, cost, and migration complexity.


Solution

  • All three are reasonable options (though #2, Scheduler + Dataproc, is the clunkiest). A few questions to consider: how often do your workflows run, how tolerant are you of unused VMs, how complex are your Oozie workflows, and how much time are you willing to invest in the migration?
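    For a sense of what #2 involves, here is a rough CLI sketch of wiring a Workflow Template to Cloud Scheduler. The template name, region, jar path, project, and service account below are placeholders, not details from the question:

    ```shell
    # Create a template backed by an ephemeral (managed) cluster
    gcloud dataproc workflow-templates create nightly-etl --region=us-central1
    gcloud dataproc workflow-templates set-managed-cluster nightly-etl \
        --region=us-central1 --cluster-name=etl-cluster --num-workers=2

    # Add a Spark step (repeat add-job for each job in the DAG)
    gcloud dataproc workflow-templates add-job spark \
        --workflow-template=nightly-etl --region=us-central1 \
        --step-id=etl --class=com.example.Etl --jars=gs://my-bucket/etl.jar

    # Trigger the template on a schedule by having Cloud Scheduler
    # call the workflowTemplates.instantiate REST endpoint
    gcloud scheduler jobs create http nightly-etl-trigger \
        --schedule="0 2 * * *" \
        --uri="https://dataproc.googleapis.com/v1/projects/my-project/regions/us-central1/workflowTemplates/nightly-etl:instantiate?alt=json" \
        --http-method=POST \
        --oauth-service-account-email=scheduler-sa@my-project.iam.gserviceaccount.com
    ```

    Note that Scheduler only fires the template; retries, alerting, and cross-workflow dependencies are still on you, which is part of why this option is clunky.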

    Dataproc's workflow templates support branch/join but lack other Oozie features such as failure-handling actions, decision nodes, etc. If you use any of these, I would not even consider a direct migration to Workflow Templates; choose either #3 or the hybrid migration below.
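    To illustrate the branch/join support: in a template, steps with no unmet `prerequisiteStepIds` run in parallel, and a step listing several prerequisites acts as a join. A minimal template body might look like this (step IDs, classes, and GCS URIs are invented for illustration):

    ```yaml
    # Sketch of a WorkflowTemplate jobs list with a branch and a join.
    jobs:
    - stepId: prep
      sparkJob:
        mainClass: com.example.Prep
        jarFileUris: [gs://my-bucket/jobs.jar]
    - stepId: branch-a               # runs in parallel with branch-b once prep finishes
      prerequisiteStepIds: [prep]
      pysparkJob:
        mainPythonFileUri: gs://my-bucket/branch_a.py
    - stepId: branch-b
      prerequisiteStepIds: [prep]
      pysparkJob:
        mainPythonFileUri: gs://my-bucket/branch_b.py
    - stepId: publish                # join: waits for both branches
      prerequisiteStepIds: [branch-a, branch-b]
      sparkJob:
        mainClass: com.example.Publish
        jarFileUris: [gs://my-bucket/jobs.jar]
    ```

    There is no equivalent of Oozie's decision node or error-to transitions here; a failed step simply fails the workflow.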

    A good place to start would be a hybrid migration (assuming your clusters are sparsely used): keep your Oozie workflows, have Composer + Workflow Templates create a cluster with Oozie installed, use an initialization action to stage your Oozie XML files and job jars/artifacts, and add a single Pig `sh` job to the Workflow to trigger Oozie via its CLI.
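    Concretely, the hybrid setup could be sketched as follows. The bucket, init-script name, and property-file path are placeholders, and the init action itself (installing Oozie and copying the workflow XML/jars from GCS) is assumed to exist:

    ```shell
    # Ephemeral cluster whose init action installs Oozie and stages the
    # Oozie workflow XML + job jars from GCS onto the cluster.
    gcloud dataproc workflow-templates create oozie-runner --region=us-central1
    gcloud dataproc workflow-templates set-managed-cluster oozie-runner \
        --region=us-central1 --cluster-name=oozie-cluster \
        --initialization-actions=gs://my-bucket/init/install-oozie.sh

    # A single Pig step whose `sh` command kicks off Oozie via its CLI.
    gcloud dataproc workflow-templates add-job pig \
        --workflow-template=oozie-runner --region=us-central1 \
        --step-id=run-oozie \
        --execute="sh oozie job -oozie http://localhost:11000/oozie -config /tmp/job.properties -run"
    ```

    One caveat: `oozie job ... -run` returns immediately, so the Pig step would need to poll for completion (or block some other way) if you want the Workflow Template to reflect the Oozie workflow's actual outcome before tearing the cluster down.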