I have a running Dataproc cluster, and I want to submit a Spark job directly to YARN with spark-submit from an edge node outside the cluster. Ideally spark-submit should only need access to the YARN ResourceManager address, so we configured firewall rules to allow only that, but the job submission failed because it also needed access to the cluster's HDFS.
Question:
Why does spark-submit need to access HDFS?

Answer:
It has to do with the property spark.yarn.stagingDir. This directory is used by spark-submit to stage jars and config files so that YARN can access them and distribute them to the executors. The default value is the current user's home directory in HDFS, but it can be set to a GCS directory to avoid HDFS, for example:

spark-submit --conf spark.yarn.stagingDir=gs://my-bucket/spark-staging/
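A fuller edge-node invocation might look like the sketch below. The bucket name, jar path, and application class are hypothetical placeholders, and this assumes the edge node has Hadoop config files (via HADOOP_CONF_DIR) that point at the cluster's ResourceManager:

```shell
# Hedged sketch of an edge-node submission. Assumes HADOOP_CONF_DIR points
# at a yarn-site.xml naming the cluster's ResourceManager, and that the
# bucket, class, and jar below are placeholders for your own values.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.stagingDir=gs://my-bucket/spark-staging/ \
  --class com.example.MyApp \
  gs://my-bucket/jars/my-app.jar
```

Keeping the application jar itself on GCS (rather than a local path on the edge node) may also reduce what spark-submit has to upload at submission time, since YARN can fetch it directly through the GCS connector.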