Tags: apache-spark, pyspark, google-cloud-dataproc, spark-submit, dataproc

How are Spark jobs submitted in cluster mode?


I know there are ten Google pages' worth of information on this, but all of them tell me to just put --master yarn in the spark-submit command. In cluster mode, how can my local laptop even know what that means? Let's say I have my laptop and a running Dataproc cluster. How can I use spark-submit from my laptop to submit a job to that cluster?


Solution

Most of the documentation on running a Spark application in cluster mode assumes that you are already on the same cluster where YARN/Hadoop are configured (e.g. you are ssh'ed in), in which case most of the time Spark will pick up the appropriate local configs and "just work".

This is the same for Dataproc: if you ssh onto the Dataproc master node, you can just run spark-submit --master yarn. More detailed instructions can be found in the documentation.
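
For example, assuming a cluster named my-cluster whose master node is my-cluster-m in zone us-central1-a (the names, zone, and example jar are placeholders), the flow is roughly:

```sh
# SSH onto the Dataproc master node
gcloud compute ssh my-cluster-m --zone=us-central1-a

# On the master, YARN configs are already in place, so this just works:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/lib/spark/examples/jars/spark-examples.jar 1000
```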

If you are trying to run applications locally on your laptop, this is more difficult. You will need to set up an ssh tunnel to the cluster, and then locally create configuration files that tell Spark how to reach the master via the tunnel.
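
A rough outline of that approach might look like the following; the cluster name, zone, and local paths are placeholders, and this sketch glosses over the extra networking you still need (DNS resolution for cluster hostnames, firewall rules or proxy settings for the various YARN ports):

```sh
# Copy the cluster's Hadoop/YARN client configuration to the laptop
mkdir -p ~/dataproc-conf
gcloud compute ssh my-cluster-m --zone=us-central1-a \
  --command='tar czf - -C /etc/hadoop conf' | tar xzf - -C ~/dataproc-conf

# Open a SOCKS tunnel through the master so cluster hosts can be reached
gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N &

# Point Spark at the copied configs and submit; additional proxy/DNS wiring
# (JVM SOCKS settings or /etc/hosts entries) is typically still required
export HADOOP_CONF_DIR=~/dataproc-conf/conf
spark-submit --master yarn --deploy-mode cluster my_job.py
```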

Alternatively, you can use the Dataproc jobs API to submit jobs to the cluster without having to connect to it directly. The one caveat is that you will have to use properties to tell Spark to run in cluster mode instead of client mode (--properties spark.submit.deployMode=cluster). Note that when submitting jobs via the Dataproc API, the difference between client and cluster mode is much less pressing because in either case the Spark driver will actually run on the cluster (on the master or a worker respectively), not on your local laptop.
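
For instance, submitting a PySpark job through gcloud (the cluster name, region, and job file below are placeholders):

```sh
# Submit via the Dataproc jobs API; no direct connection to YARN is needed
gcloud dataproc jobs submit pyspark my_job.py \
  --cluster=my-cluster \
  --region=us-central1 \
  --properties=spark.submit.deployMode=cluster
```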