I am creating a PySpark application with Apache Sedona. I previously had an issue with my Scala version, which I seem to have resolved (see: Receiving "Scala.MatchError" when running SQL query in a PySpark application with Apache Sedona, possibly caused by incompatible versions), but I am now facing an error when I run spark.sql(sql).show(5) on a query that uses the ST_Intersects function:
Py4JJavaError: An error occurred while calling o112.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 10) (node5 executor 1): java.lang.NoSuchMethodError: org.apache.commons.text.StringSubstitutor.setEnableUndefinedVariableException(Z)Lorg/apache/commons/text/StringSubstitutor;
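Since the failing call happens inside the JVM, it can help to ask the JVM which copy of Commons Text actually got loaded. Here is a minimal sketch using PySpark's internal _jvm handle (it assumes an active SparkSession named spark, and it only inspects the driver, while the failure above happened on an executor):

# Ask the driver JVM which jar StringSubstitutor was loaded from.
# Uses the internal spark._jvm handle; inspects the driver classpath only.
klass = spark._jvm.java.lang.Class.forName(
    "org.apache.commons.text.StringSubstitutor")
code_source = klass.getProtectionDomain().getCodeSource()
print(code_source.getLocation())  # e.g. an old commons-text jar under $SPARK_HOME/jars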
After some searching, I verified that Apache Commons Text 1.10.0 does have a StringSubstitutor.setEnableUndefinedVariableException method, and I even modified my setup code to the following:
.config('spark.jars.packages',
        'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,' +
        'org.datasyslab:geotools-wrapper:1.5.0-28.2,' +
        'org.apache.commons:commons-text:1.10.0'
)
Unfortunately, I am still getting the same error!
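My working theory is that an older commons-text jar shipped under $SPARK_HOME/jars still wins over the copy pulled in by spark.jars.packages. Spark has standard settings for flipping that precedence; this is an untested sketch for my setup, and these flags are known to cause other classpath conflicts:

# Untested idea: prefer user-supplied jars over the copies in $SPARK_HOME/jars.
# Both are standard (if experimental) Spark options.
.config("spark.driver.userClassPathFirst", "true")
.config("spark.executor.userClassPathFirst", "true")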
I have removed org.apache.commons:commons-text:1.10.0 from my config (since it wasn't really helping). I have tried removing the jar files from my $SPARK_HOME/jars folder and keeping them in my .config(), but that results in the error below:
23/10/21 01:42:48 INFO DAGScheduler: ShuffleMapStage 0 (count at NativeMethodAccessorImpl.java:0) failed in 0.143 s due to Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:185)
...
Caused by: java.lang.ClassNotFoundException: org.apache.sedona.core.serde.SedonaKryoRegistrator
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
Keeping the jars in $SPARK_HOME/jars and removing the .config() results in the same error pointing to Commons Text.
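To check which jars actually reach the session in each of these setups, listing the jars the SparkContext registered is useful. A minimal sketch, again assuming an active session named spark and relying on the internal _jsc handle:

# Which jars did this SparkContext actually register?
# Jars resolved via spark.jars.packages should show up here too.
print(spark.sparkContext._jsc.sc().listJars().mkString("\n"))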
Here is my entire config block:
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

spark = (
    SparkSession.builder
    .appName("sedona-test-app")
    # Note: passing conf=SparkConf() alongside a key/value pair makes PySpark
    # ignore the key/value, so the metastore URI is set directly instead.
    .config("hive.metastore.uris", "thrift://node1:9083")
    .config("spark.serializer", KryoSerializer.getName)
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,' +
            'org.datasyslab:geotools-wrapper:1.5.0-28.2')
    .enableHiveSupport()
    .getOrCreate()
)
SedonaRegistrator.registerAll(spark)
sc = spark.sparkContext
sc.setSystemProperty("sedona.global.charset", "utf8")
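For what it's worth, Sedona 1.5 documents SedonaContext as the preferred entry point; it pre-sets the Kryo serializer and registrator for you. A sketch of that variant with the same coordinates (I have not confirmed that it avoids the commons-text error):

from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()
    .appName("sedona-test-app")
    .config("hive.metastore.uris", "thrift://node1:9083")
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,' +
            'org.datasyslab:geotools-wrapper:1.5.0-28.2')
    .enableHiveSupport()
    .getOrCreate()
)
# Registers the ST_* SQL functions (replaces SedonaRegistrator.registerAll).
sedona = SedonaContext.create(config)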
I have also tried switching from Spark 3.5 to Spark 3.4.1 without any success.
I have also tried starting the script with different commands such as:
python <path to file>.py
spark-submit <path to file>.py
spark-submit --jars $SPARK_HOME/jars/sedona-spark-shaded-3.4_2.12-1.5.0.jar,$SPARK_HOME/jars/geotools-wrapper-1.5.0-28.2.jar <path to file>.py
I have also tried using Jupyter notebooks and am still getting the same error. Note: I am using Anaconda.
After a lot of experimentation, I was able to find a setup that worked.
I copied the following files into the $SPARK_HOME/jars directory:
geotools-wrapper-1.5.0-28.2.jar
apache-sedona-1.5.0-bin/sedona-spark-shaded-3.0_2.12-1.5.0.jar
I added the following to my Python code:

.config('spark.jars.packages',
        'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.0-incubating,' +
        'org.datasyslab:geotools-wrapper:1.5.0-28.2'
)
This seems to work despite the inconsistent Sedona versions ... I might try to play around with this a little more if I have more time ...
I have no idea what changed, but I finally got it working with the latest version of Sedona. I removed the jar files from the $SPARK_HOME/jars directory and used the following lines in my Python code:

.config('spark.jars.packages',
        'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.5.0,' +
        'org.datasyslab:geotools-wrapper:1.5.0-28.2'
)