apache-spark, pyspark, apache-sedona

Receiving "NoSuchMethodError" when running SQL query in a PySpark application with Apache Sedona


I am creating a PySpark application that runs spatial SQL queries with Apache Sedona.

I previously had an issue with my Scala version, which I seem to have resolved (see: Receiving "Scala.MatchError" when running SQL query in a PySpark application with Apache Sedona, possibly caused by incompatible versions). Now, however, I get the following error whenever I run spark.sql(sql).show(5) on a query that uses the ST_Intersects function:

Py4JJavaError: An error occurred while calling o112.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 4 times, most recent failure: Lost task 0.3 in stage 8.0 (TID 10) (node5 executor 1): java.lang.NoSuchMethodError: org.apache.commons.text.StringSubstitutor.setEnableUndefinedVariableException(Z)Lorg/apache/commons/text/StringSubstitutor;

After some searching, I verified that Apache Commons Text 1.10.0 does include the StringSubstitutor.setEnableUndefinedVariableException method, so I added it explicitly to my setup code:

.config('spark.jars.packages',
        'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,'
        'org.datasyslab:geotools-wrapper:1.5.0-28.2,'
        'org.apache.commons:commons-text:1.10.0')

Unfortunately, I am still getting the same error!
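One way to check which jar the driver JVM is actually loading the class from (the failing task ran on an executor, whose classpath may still differ) is to ask the JVM directly. A quick diagnostic sketch, assuming the SparkSession is already up:

# Diagnostic sketch: find the jar that actually supplies StringSubstitutor
# on the driver. An older commons-text (or a jar shading one) earlier on
# the classpath would explain the NoSuchMethodError.
klass = spark._jvm.java.lang.Class.forName(
    "org.apache.commons.text.StringSubstitutor")
print(klass.getProtectionDomain().getCodeSource().getLocation())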

UPDATE 1

I have removed org.apache.commons:commons-text:1.10.0 from my config, since it wasn't helping.

I have tried removing the jar files from my $SPARK_HOME/jars folder and relying only on my .config(), but that results in the error below:

23/10/21 01:42:48 INFO DAGScheduler: ShuffleMapStage 0 (count at NativeMethodAccessorImpl.java:0) failed in 0.143 s due to Job aborted due to stage failure: Task serialization failed: org.apache.spark.SparkException: Failed to register classes with Kryo
org.apache.spark.SparkException: Failed to register classes with Kryo
        at org.apache.spark.serializer.KryoSerializer.$anonfun$newKryo$5(KryoSerializer.scala:185)
...
Caused by: java.lang.ClassNotFoundException: org.apache.sedona.core.serde.SedonaKryoRegistrator
        at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)

Keeping the jars in $SPARK_HOME/jars and removing the .config() results in the same commons-text error as before.
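For anyone hitting the same Kryo ClassNotFoundException: the registrator class has to be visible on the executors as well, and pointing spark.jars at explicit local files is one way to make sure the jars get shipped. A sketch of that builder fragment, with placeholder paths:

.config('spark.jars',
        '/path/to/sedona-spark-shaded-3.4_2.12-1.5.0.jar,'
        '/path/to/geotools-wrapper-1.5.0-28.2.jar')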

Here is my entire config block:

from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

spark = (
    SparkSession.builder
        .appName("sedona-test-app")
        # Passing conf=SparkConf() alongside a key/value pair makes PySpark
        # ignore the pair, so the metastore URI is set on its own here.
        .config("hive.metastore.uris", "thrift://node1:9083")
        .config("spark.serializer", KryoSerializer.getName)
        .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
        .config('spark.jars.packages',
                'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,'
                'org.datasyslab:geotools-wrapper:1.5.0-28.2')
        .enableHiveSupport()
        .getOrCreate()
)


SedonaRegistrator.registerAll(spark)
sc = spark.sparkContext
sc.setSystemProperty("sedona.global.charset", "utf8")
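
For reference, Sedona 1.5.0's documentation replaces SedonaRegistrator with a SedonaContext-based setup; a minimal sketch of that variant (same packages as above) looks like this:

from sedona.spark import SedonaContext

# Sketch of the SedonaContext-style setup from the 1.5.0 docs: the builder
# pre-configures the Kryo serializer/registrator, and create() registers
# the ST_* SQL functions on the session.
config = (
    SedonaContext.builder()
        .config('spark.jars.packages',
                'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,'
                'org.datasyslab:geotools-wrapper:1.5.0-28.2')
        .getOrCreate()
)
sedona = SedonaContext.create(config)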

I have also tried switching from Spark 3.5 to Spark 3.4.1 without any success.

I have also tried starting the script with different commands such as:

python <path to file>.py
spark-submit <path to file>.py
spark-submit --jars $SPARK_HOME/jars/sedona-spark-shaded-3.4_2.12-1.5.0.jar,$SPARK_HOME/jars/geotools-wrapper-1.5.0-28.2.jar <path to file>.py
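
For completeness, the equivalent submit using the Maven coordinates (letting Spark resolve them, instead of pointing at jars already in $SPARK_HOME/jars) would be:

spark-submit --packages org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,org.datasyslab:geotools-wrapper:1.5.0-28.2 <path to file>.py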

I have also tried running it from a Jupyter notebook and still get the same error. NOTE: I am using Anaconda.


Solution

  • After a lot of experimentation, I was able to find a setup that worked.

    I copied the following files into the $SPARK_HOME/jars directory:

    geotools-wrapper-1.5.0-28.2.jar 
    apache-sedona-1.5.0-bin/sedona-spark-shaded-3.0_2.12-1.5.0.jar
    

    I then added the following to my Python code:

    .config('spark.jars.packages',
            'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.1.0-incubating,'
            'org.datasyslab:geotools-wrapper:1.5.0-28.2')
    

    This seems to work despite the inconsistent Sedona versions ... I might play around with this a little more if I have more time ...
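
    A quick smoke test to confirm the spatial functions are actually wired up (assuming the session built with the config above):

    # If the jars resolved correctly, this prints "true" instead of raising
    # NoSuchMethodError on the executors.
    spark.sql("SELECT ST_Intersects(ST_Point(0.0, 0.0), ST_Point(0.0, 0.0))").show()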

    Update

    I have no idea what changed, but I finally got it working with the latest version of Sedona. I removed the jar files from the $SPARK_HOME/jars directory and used the following lines in my Python code:

    .config('spark.jars.packages',
            'org.apache.sedona:sedona-spark-shaded-3.0_2.12:1.5.0,'
            'org.datasyslab:geotools-wrapper:1.5.0-28.2')