apache-sparkpysparkapache-sedona

Receiving "Scala.MatchError" when running SQL query in a PySpark application with Apache Sedona, possibly caused by incompatible versions


I am creating a PySpark application to do the following:

Unfortunately, I am getting an error that looks something like this:

Py4JJavaError: An error occurred while calling o1539.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 56 in stage 255.0 failed 4 times, most recent failure: Lost task 56.3 in stage 255.0 (TID 874) (node5 executor 2): scala.MatchError: 40.74207644618617 (of class org.apache.spark.unsafe.types.UTF8String)
...
Caused by: scala.MatchError: 40.74207644618617 (of class org.apache.spark.unsafe.types.UTF8String)
    at org.apache.spark.sql.sedona_sql.expressions.ST_Point.eval(Constructors.scala:315)

After a lot of searching, the best suggestion I could find was that the versions of Spark, PySpark and Sedona that I am using may be incompatible with each other. I have been trying to use the latest versions of all these applications:


Solution

  • So, it turns out that Pyspark 3.5 uses Scala 2.12 (unless you specifically install the Scala 2.13 version) and I was using the Sedona jar file built for scala 2.13.

    So, to clarify, when you download sedona, you will see jar files for both versions of scala. Make sure that you copy the right jar file into your $SPARK_HOME/jars directory. your python code should look something like this when setting up your SparkSession:

    .config('spark.jars.packages',
                        'org.apache.sedona:sedona-spark-shaded-3.4_2.12:1.5.0,' +
                        'org.datasyslab:geotools-wrapper:1.5.0-28.2'
                    )
    

    Unfortunately, I am now getting a different error when running an SQL query with an ST_Intersects ... But, I will ask a separate question since it seems unrelated to the Scala version.