As the title says.
I see the Quick Guide and Spark Connect Overview say to run sbin/start-connect-server.sh to start the server. There are no equivalent bat or cmd files for anything in sbin on Windows. I did find that someone once translated the scripts into a spark-sbin-windows project, but it's 8 years old.
In the documentation for the Spark Standalone setup, I found the following note, which makes me think there is no built-in support for what I'm trying to do and it has to be done "manually":
Note: The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
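(For what it's worth, "by hand" there seems to mean launching the underlying classes through bin\spark-class; e.g., for a local standalone master and worker, something like the following, with localhost and the default port 7077 as placeholders:)
C:\spark-3.5.4\bin> spark-class org.apache.spark.deploy.master.Master --host localhost --port 7077
C:\spark-3.5.4\bin> spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077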
I tried to dig through the shell scripts in sbin to replicate the commands manually, but could not. It seems like, deep inside the scripts, start-connect-server.sh ends up running this command:
spark-daemon.sh submit "org.apache.spark.sql.connect.service.SparkConnectServer" 1 --name "Spark Connect server" --packages org.apache.spark:spark-connect_2.12:3.5.4
I downloaded the spark-connect jar from Maven Central and tried:
C:\spark-3.5.4\bin> spark-class "org.apache.spark.sql.connect.service.SparkConnectServer" 1 --name "Spark Connect server" -cp C:\Users\kash\.m2\repository\org\apache\spark\spark-connect_2.12\3.5.4\spark-connect_2.12-3.5.4.jar
Error: Could not find or load main class org.apache.spark.sql.connect.service.SparkConnectServer
C:\spark-3.5.4\bin>
spark-connect_2.12-3.5.4.jar does contain the class org.apache.spark.sql.connect.service.SparkConnectServer.
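(One way to confirm that, using the jar tool that ships with the JDK plus Windows findstr:)
C:\spark-3.5.4\bin> jar tf C:\Users\kash\.m2\repository\org\apache\spark\spark-connect_2.12\3.5.4\spark-connect_2.12-3.5.4.jar | findstr SparkConnectServer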
I also tried many permutations/combinations of spark-submit etc., but no luck. The environment itself is fine: spark-submit works, and so do pyspark, spark-shell, ...
C:\spark-3.5.4\bin>cat ..\print_my_name.py
from pyspark.sql.session import SparkSession
print('\n\n>>>', SparkSession.Builder().getOrCreate().conf.get('spark.app.name'), '<<<\n\n')
C:\spark-3.5.4\bin>spark-submit ..\print_my_name.py
25/02/25 11:30:34 INFO SparkContext: Running Spark version 3.5.4
... snip ...
>>> print_my_name.py <<<
... snip ...
25/02/25 11:30:37 INFO SparkContext: SparkContext is stopping with exitCode 0.
C:\spark-3.5.4\bin>
Figured it out. The key is that spark-daemon.sh's submit mode ultimately hands everything off to bin/spark-submit --class, and spark-submit works fine on Windows. Here is what worked for me:
start-connect-server.bat
set SPARK_HOME=C:\My\workspaces\spark-3.5.4-bin-hadoop3\
set HADOOP_HOME=C:\My\workspaces\spark-3.5.4-bin-hadoop3\
set JAVA_HOME="C:\ProgFiles\jdk-1.8"
set CLASS="org.apache.spark.sql.connect.service.SparkConnectServer"
spark-submit --class %CLASS% 1 --name "Spark Connect server" ^
  --packages org.apache.spark:spark-connect_2.12:3.5.4,io.delta:delta-core_2.12:2.4.0,org.apache.hadoop:hadoop-aws:3.3.4,uk.co.gresearch.spark:spark-extension_2.12:2.10.0-3.5,com.oracle.database.jdbc:ojdbc8:23.2.0.0 ^
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" ^
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
rem without delta support:
rem spark-submit --class %CLASS% 1 --name "Spark Connect server" ^
rem   --packages org.apache.spark:spark-connect_2.12:3.5.4
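Running start-connect-server.bat keeps the server in the foreground. By default it listens on port 15002 (I believe the conf to change that is spark.connect.grpc.binding.port); a quick check from another terminal that it is actually up:
C:\My\workspaces\spark-3.5.4-bin-hadoop3> netstat -ano | findstr :15002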
Then connect a client to the remote at "sc://localhost:15002":
C:\My\workspaces\spark-3.5.4-bin-hadoop3> cat spark-connect-test.py
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print('\n>>>', spark.conf.get('spark.app.name'), '<<<\n')
spark.createDataFrame(data=[(i,) for i in range(5)], schema='id: int').show()
C:\My\workspaces\spark-3.5.4-bin-hadoop3> python spark-connect-test.py
>>> Spark Connect server <<<
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
C:\My\workspaces\spark-3.5.4-bin-hadoop3>
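One last note: the Spark Connect Python client needs the gRPC dependencies on top of plain pyspark. If they are missing on the client side, something like this should pull them in (pinning the version to match the server):
pip install "pyspark[connect]==3.5.4"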