Tags: .net, windows, apache-spark, kubernetes, spark-submit

spark-submit to remote Spark


Currently I've deployed Spark to minikube on my local machine. The pod and its containers are up and running, and I've already checked that port 7077 is reachable from the host machine (my local machine).

Now I want to spark-submit from the host machine. To do that, I've downloaded Spark's binaries, moved them to c:\bin\spark-3.2.1-bin-hadoop3.2, and added c:\bin\spark-3.2.1-bin-hadoop3.2\bin to the PATH.

When I run spark-submit as follows...

spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner --master spark.local:7077 microsoft-spark-3-2_2.12-2.1.1.jar dotnet C:\projects\xxx\xxx-dotnet-solution\xx-services/infrastructure/etl-service/Spark/bin/Debug/netcoreapp3.1/xxx.xx.Services.Infraestructure.ETLService.Spark.dll

...I get the following error: org.apache.spark.SparkException: Could not parse Master URL: 'spark.local'.

I'm not sure whether I'm doing something wrong, or whether it simply isn't possible to spark-submit from my local machine to the remote Spark cluster. Is this possible at all?


Solution

  • According to the master URL docs, that parameter accepts either specific keywords such as local or yarn, or a URL with one of the supported schemes: spark://, mesos:// or k8s://. It cannot parse a bare machine or domain name, which is exactly what the Could not parse Master URL error is telling you (see the example commands after the table below).

    In the .NET for Apache Spark tutorial, the command uses the local keyword, not a host name:

    spark-submit ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local ^
    microsoft-spark-3-0_2.12-<version>.jar ^
    dotnet MySparkApp.dll <path-of-input.txt>
    

    From the docs:

    The master URL passed to Spark can be in one of the following formats:

    local
        Run Spark locally with one worker thread (i.e. no parallelism at all).
    local[K]
        Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
    local[K,F]
        Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable).
    local[*]
        Run Spark locally with as many worker threads as logical cores on your machine.
    local[*,F]
        Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.
    local-cluster[N,C,M]
        Local-cluster mode is only for unit tests. It emulates a distributed cluster in a single JVM with N number of workers, C cores per worker and M MiB of memory per worker.
    spark://HOST:PORT
        Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
    spark://HOST1:PORT1,HOST2:PORT2
        Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must have all the master hosts in the high availability cluster set up with ZooKeeper. The port must be whichever each master is configured to use, which is 7077 by default.
    mesos://HOST:PORT
        Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
    yarn
        Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
    k8s://HOST:PORT
        Connect to a Kubernetes cluster in client or cluster mode depending on the value of --deploy-mode. The HOST and PORT refer to the Kubernetes API Server. It connects using TLS by default. In order to force it to use an unsecured connection, you can use k8s://http://HOST:PORT.
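
    So, if the goal is simply to run the job as in the tutorial, pass the local keyword; if the goal is to reach the standalone master you exposed on spark.local:7077, the URL needs the spark:// scheme. (And since the cluster actually runs inside minikube, the Kubernetes-native route would be k8s:// pointed at the API server, as in the last row above.) A sketch of both variants, reusing the jar name from the question; the host name, port and .dll path are placeholders that must match your environment, and the second form assumes spark.local really resolves from the host and that 7077 is the standalone master's port:

    REM Variant 1: run locally, as in the .NET for Apache Spark tutorial
    spark-submit ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master local ^
    microsoft-spark-3-2_2.12-2.1.1.jar ^
    dotnet <path-to-your-app.dll>

    REM Variant 2: target the standalone master (note the spark:// scheme in front of the host)
    spark-submit ^
    --class org.apache.spark.deploy.dotnet.DotnetRunner ^
    --master spark://spark.local:7077 ^
    microsoft-spark-3-2_2.12-2.1.1.jar ^
    dotnet <path-to-your-app.dll>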