Tags: apache-spark, cluster-computing, apache-spark-standalone

Spark Standalone: how to pass a local .jar file to the cluster


I have a cluster with two workers and one master. To start the master and the workers I use sbin/start-master.sh and sbin/start-slaves.sh on the master's machine. The master UI then shows that the slaves are ALIVE (so everything is OK so far). The issue comes when I want to use spark-submit.

I execute this command in my local machine:

spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster /home/user/example.jar

But the following error pops up: ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /home/user/example.jar

I have been doing some research on Stack Overflow and in Spark's documentation, and it seems I should specify the application-jar argument of the spark-submit command as a "Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes." (as stated in https://spark.apache.org/docs/latest/submitting-applications.html).

My question is: how can I make my .jar globally visible inside the cluster? There is a similar question here, Spark Standalone cluster cannot read the files in local filesystem, but the solutions do not work for me.
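If I understand the documentation correctly, a "globally visible" path would be something like a jar uploaded to HDFS and referenced by an hdfs:// URL. The sketch below is just my reading of the docs; the namenode address and the target directory are placeholders, not my actual setup:

# upload the jar to HDFS so every node can read it
hdfs dfs -mkdir -p /user/spark/jars
hdfs dfs -put /home/user/example.jar /user/spark/jars/

# submit using the HDFS URL instead of a local path
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster hdfs://<namenode-host>:9000/user/spark/jars/example.jar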

Also, am I doing something wrong by initialising the cluster on my master's machine using sbin/start-master.sh but then running spark-submit from my local machine? I initialise the master from the master's terminal because I read to do so in Spark's documentation, but maybe this has something to do with the issue. From Spark's documentation:

Once you’ve set up this file, you can launch or stop your cluster with the following shell scripts, based on Hadoop’s deploy scripts, and available in SPARK_HOME/sbin: [...] Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
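For completeness, this is roughly how I start the cluster on the master's machine; the worker hostnames in conf/slaves are placeholders for my real ones:

# conf/slaves on the master machine: one worker hostname per line
worker1.example.com
worker2.example.com

# then, from SPARK_HOME on the master machine:
sbin/start-master.sh
sbin/start-slaves.sh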

Thank you very much

EDIT: I have copied the .jar file to every worker and it works. But my point is to know whether there is a better way, since this method makes me copy the .jar to each worker every time I build a new jar. (This was one of the answers to the already linked question Spark Standalone cluster cannot read the files in local filesystem.)
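Concretely, the workaround I am using now looks like the sketch below; the worker hostnames and the user account are placeholders:

# copy the freshly built jar to the same path on every worker
for host in worker1.example.com worker2.example.com; do
  scp /home/user/example.jar user@"$host":/home/user/example.jar
done

# now the path exists on all nodes, so the original submit works
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster /home/user/example.jar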


Solution

  • @meisan, your spark-submit command is missing 2 things.

    You have not specified whether you are using Scala or Python, but in a nutshell your command will look something like:

    for python :

    spark-submit --master spark://<master>:7077 --deploy-mode cluster --jars <dependency-jars> <python-file-holding-driver-logic>

    for scala:

    spark-submit --master spark://<master>:7077 --deploy-mode cluster --class <scala-driver-class> --driver-class-path <application-jar> --jars <dependency-jars>

    Also, Spark takes care of sending the required files and jars to the executors when you use the documented flags. If you want to omit the --driver-class-path flag, you can set the environment variable SPARK_CLASSPATH to the path where all your jars are placed.
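    A minimal sketch of that alternative (the jar locations below are placeholder paths; note that some Spark versions report SPARK_CLASSPATH as deprecated in favour of spark.driver.extraClassPath / spark.executor.extraClassPath):

    # set on the submitting machine before calling spark-submit,
    # using standard colon-separated classpath syntax
    export SPARK_CLASSPATH=/home/user/jars/dep1.jar:/home/user/jars/dep2.jar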