windows, apache-spark, spark-connect

How to run spark-connect server on Windows?


As the title says.

I see the Quick Guide and Spark Connect Overview say to run sbin/start-connect-server.sh to start the server. There are no equivalent .bat or .cmd files for Windows for anything in sbin. I see someone translated the scripts to create spark-sbin-windows, but it's 8 years old.

In the documentation for the Spark Standalone setup, I found the following note, which makes me think there is no built-in support for what I'm trying to do and that it has to be done "manually":

Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
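
("By hand" presumably means launching the daemon classes directly through bin\spark-class, whose .cmd counterpart does exist on Windows. For a plain standalone cluster that would look something like the following, one console window each; localhost is just for illustration:)

spark-class org.apache.spark.deploy.master.Master --host localhost
spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077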


I tried to dig through the shell scripts in sbin to replicate the commands manually, but could not. It seems that, deep inside the scripts, start-connect-server.sh ends up running this command:

spark-daemon.sh submit "org.apache.spark.sql.connect.service.SparkConnectServer" 1 --name "Spark Connect server" --packages org.apache.spark:spark-connect_2.12:3.5.4
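
(If I read spark-daemon.sh right, its "submit" mode boils down to a plain bin/spark-submit --class <class> call, and the 1 is just an instance number consumed by the daemon script itself, so the Windows equivalent should be roughly:)

spark-submit --class org.apache.spark.sql.connect.service.SparkConnectServer --name "Spark Connect server" --packages org.apache.spark:spark-connect_2.12:3.5.4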

I downloaded the spark-connect jar from maven central and tried:

C:\spark-3.5.4\bin> spark-class "org.apache.spark.sql.connect.service.SparkConnectServer" 1 --name "Spark Connect server" -cp C:\Users\kash\.m2\repository\org\apache\spark\spark-connect_2.12\3.5.4\spark-connect_2.12-3.5.4.jar
Error: Could not find or load main class org.apache.spark.sql.connect.service.SparkConnectServer

C:\spark-3.5.4\bin> 

spark-connect_2.12-3.5.4.jar does contain the class org.apache.spark.sql.connect.service.SparkConnectServer.
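
One way to double-check that from cmd, using the JDK's jar tool:

jar -tf C:\Users\kash\.m2\repository\org\apache\spark\spark-connect_2.12\3.5.4\spark-connect_2.12-3.5.4.jar | findstr SparkConnectServer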

I also tried many permutations/combinations of spark-submit etc., but no luck.

The environment itself is fine: spark-submit works, and so do pyspark, spark-shell, ...

C:\spark-3.5.4\bin>cat ..\print_my_name.py
from pyspark.sql.session import SparkSession
print('\n\n>>>', SparkSession.Builder().getOrCreate().conf.get('spark.app.name'), '<<<\n\n')

C:\spark-3.5.4\bin>spark-submit ..\print_my_name.py
25/02/25 11:30:34 INFO SparkContext: Running Spark version 3.5.4
... snip ...

>>> print_my_name.py <<<

... snip ...
25/02/25 11:30:37 INFO SparkContext: SparkContext is stopping with exitCode 0.
C:\spark-3.5.4\bin>

Solution

  • Figured it out; here is what worked for me.

    Create the following batch file, start-connect-server.bat:

    set SPARK_HOME=C:\My\workspaces\spark-3.5.4-bin-hadoop3\
    set HADOOP_HOME=C:\My\workspaces\spark-3.5.4-bin-hadoop3\
    rem note: no quotes around values in "set"; cmd would keep them as part of the value
    set JAVA_HOME=C:\ProgFiles\jdk-1.8
    
    set CLASS=org.apache.spark.sql.connect.service.SparkConnectServer
    
    spark-submit --class %CLASS% 1 --name "Spark Connect server"^
      --packages org.apache.spark:spark-connect_2.12:3.5.4,io.delta:delta-core_2.12:2.4.0,org.apache.hadoop:hadoop-aws:3.3.4,uk.co.gresearch.spark:spark-extension_2.12:2.10.0-3.5,com.oracle.database.jdbc:ojdbc8:23.2.0.0^
      --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"^
      --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
    
    rem without delta support:
    rem spark-submit --class %CLASS% 1 --name "Spark Connect server"^
    rem   --packages org.apache.spark:spark-connect_2.12:3.5.4
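    rem The server listens on port 15002 by default; to use a different port, an extra
    rem --conf can presumably be appended to the spark-submit call above, e.g.:
    rem   --conf "spark.connect.grpc.binding.port=15102"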
    

    Then connect to the server from a client at "sc://localhost:15002":

    C:\My\workspaces\spark-3.5.4-bin-hadoop3> cat spark-connect-test.py
    from pyspark.sql.session import SparkSession
    
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    print('\n>>>', spark.conf.get('spark.app.name'), '<<<\n')
    spark.createDataFrame(data=[(i,) for i in range(5)], schema='id: int').show()
    
    C:\My\workspaces\spark-3.5.4-bin-hadoop3> python spark-connect-test.py
    
    >>> Spark Connect server <<<
    
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    +---+
    
    
    C:\My\workspaces\spark-3.5.4-bin-hadoop3>
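
    Side note: the client script above does not need a local Spark installation. On a machine
    without the full distribution, installing the PySpark Connect client package should be
    enough (pin the version to match the server; 3.5.4 is assumed here):

    pip install "pyspark[connect]==3.5.4"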