I am trying to run Apache Sedona 1.5.3 for Spark 3.4, on AWS EMR.
After following the instructions, I get this error:
File "/usr/local/lib64/python3.7/site-packages/sedona/sql/dataframe_api.py", line 156, in validated_function
return f(*args, **kwargs)
File "/usr/local/lib64/python3.7/site-packages/sedona/sql/st_constructors.py", line 125, in ST_GeomFromWKT
return _call_constructor_function("ST_GeomFromWKT", args)
File "/usr/local/lib64/python3.7/site-packages/sedona/sql/dataframe_api.py", line 65, in call_sedona_function
jc = jfunc(*args)
TypeError: 'JavaPackage' object is not callable
This generally means that the jar can't be found or loaded, or is the wrong version.
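As a first sanity check (a sketch on my part, not from the original setup; the class name below is the JVM entry point the Sedona 1.5.x Python wrapper is assumed to call into), you can probe the driver JVM through py4j to see whether the shaded jar is visible at all:
# Diagnostic sketch: probe the driver JVM for a Sedona class through py4j.
# If the shaded jar is on the driver classpath this resolves to a JavaClass;
# if not, py4j silently returns a JavaPackage, which is exactly the object
# that later raises "'JavaPackage' object is not callable".
probe = spark.sparkContext._jvm.org.apache.sedona.spark.SedonaContext
print(type(probe))  # JavaClass => jar loaded; JavaPackage => jar missing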
The instructions linked above were tested on EMR 6.9.0 with Spark 3.3.0; I am trying EMR 6.14 in order to get to Spark 3.4. However, the instructions do note:
If you are using Spark 3.4+ and Scala 2.12, please use sedona-spark-shaded-3.4_2.12. Please pay attention to the Spark version postfix and Scala version postfix.
So it seems like Spark 3.4 should be OK. (Failing a response here, I'll try reverting to 3.3.0.)
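To rule out a postfix mismatch, it also seems worth printing the Spark and Scala versions the cluster is actually running (a sketch; the Scala version lookup goes through py4j to scala.util.Properties and is an assumption on my part):
# Sketch: print the runtime Spark and Scala versions so the -3.4_2.12 jar
# postfixes can be checked against the cluster.
print(spark.version)                                                   # expect 3.4.x on EMR 6.14
print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # expect "version 2.12.x"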
In the Spark config, I specified:
"spark.yarn.dist.jars": "/jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar,/jars/geotools-wrapper-1.5.3-28.2.jar",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.kryo.registrator": "org.apache.sedona.core.serde.SedonaKryoRegistrator",
"spark.sql.extensions": "org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
which I think are the correct versions. I may be confused, since the instructions for Spark 3.3.0 use sedona-spark-shaded-3.0_2.12-1.5.3.jar. Perhaps I don't understand what the 3.0/3.4 refer to. But based on this answer from the Sedona developers, I think I am setting this correctly.
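A related check, once the job is running, is whether the SedonaSqlExtensions entry above actually registered anything (a sketch; if the extension class never loaded, Spark keeps going but the list comes back empty):
# Sketch: list Sedona's SQL functions; an empty result suggests the
# spark.sql.extensions classes (and therefore the jar) never loaded.
spark.sql("SHOW FUNCTIONS LIKE 'st_*'").show(5, truncate=False)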
My bootstrap script downloads the jars and pip installs the required packages. At the end, I run ls -lh to make sure they're there. I also wrote them to the user directory to try them with hadoop instead of root as the owner (see below):
-rw-r--r-- 1 root root 29M May 10 15:49 /jars/geotools-wrapper-1.5.3-28.2.jar
-rw-r--r-- 1 root root 21M May 10 15:49 /jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar
-rw-rw-r-- 1 hadoop hadoop 29M May 10 15:49 /home/hadoop/custom/geotools-wrapper-1.5.3-28.2.jar
-rw-rw-r-- 1 hadoop hadoop 21M May 10 15:49 /home/hadoop/custom/sedona-spark-shaded-3.4_2.12-1.5.3.jar
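The same paths can also be checked from inside the running job rather than only in the bootstrap log (a sketch; the executor-side check assumes the bootstrap ran on every node):
# Sketch: verify the jar paths exist on the driver and on the executor nodes.
import os

jar_paths = [
    "/jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar",
    "/jars/geotools-wrapper-1.5.3-28.2.jar",
]

print([(p, os.path.exists(p)) for p in jar_paths])  # driver side

def check_paths(_):
    import os
    yield {p: os.path.exists(p) for p in jar_paths}

# executor side: run the same check inside a couple of tasks
print(spark.sparkContext.parallelize(range(2), 2).mapPartitions(check_paths).collect())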
In the pyspark script, I logged sedona.version to confirm that it is 1.5.3, and spark.sparkContext._conf.getAll() to confirm that it includes
('spark.yarn.dist.jars', 'file:/jars/sedona-spark-shaded-3.4_2.12-1.5.3.jar,file:/jars/geotools-wrapper-1.5.3-28.2.jar')
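For reference, that logging looks roughly like this (a sketch of what is described above):
# Sketch of the version/config logging described above.
import sedona

print(sedona.version)  # expect 1.5.3
for key, value in spark.sparkContext._conf.getAll():
    if "jar" in key or "extensions" in key or "serializer" in key:
        print(key, "=", value)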
Since /jars was written as the root user rather than hadoop, I also loaded the jars into a local directory under /home/hadoop/. In previous versions of Sedona, I had used spark.driver.extraClassPath and spark.executor.extraClassPath to specify the downloaded jars. Finally, I tried just using --jars, even though this post said:
spark.jars property is ignored for EMR on EC2 since it uses Yarn to deploy jars. See SEDONA-183
None of these things worked: they all gave the same 'JavaPackage' object is not callable error.
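One more data point that might help narrow this down (a sketch; listJars() is the underlying Scala SparkContext method, reached here through the py4j gateway) is asking the driver which jars it actually registered:
# Sketch: list the jars the driver's SparkContext has registered (e.g. those
# added via --jars or spark.jars). Jars shipped only through spark.yarn.dist.jars
# may not show up here even when they were distributed.
print(spark.sparkContext._jsc.sc().listJars())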
Local tests with
from sedona.spark import SedonaContext

packages = [
    "org.apache.sedona:sedona-spark-3.4_2.12:1.5.3",
    "org.datasyslab:geotools-wrapper:1.5.3-28.2",
]
conf = (
    SedonaContext.builder()
    .config("spark.jars.packages", ",".join(packages))
    [...]
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)
spark = SedonaContext.create(conf)
all run fine. Note the unshaded jar in that context, as directed:
If you run Sedona in an IDE or a local Jupyter notebook, use the unshaded jar.
Further, in the EMR version I don't create the Spark session myself, so I'm left to assume that the SedonaContext "works out." I removed the SedonaRegistrator.registerAll(spark) call, which is now deprecated and not required in the local version above.
I believe that you need to replace SedonaRegistrator, not simply remove it. So you can just run
from sedona.spark import SedonaContext

sedona_conf = SedonaContext.builder().getOrCreate()
spark = SedonaContext.create(sedona_conf)
test = spark.sql("SELECT ST_Point(double(1.2345), 2.3456)")
test.show()
in your EMR job, just as you did locally.
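If the shaded jar is actually reaching the JVM, that query should show a single row containing POINT (1.2345 2.3456). If it still is not, the failure here typically appears as an unresolved ST_Point function rather than the 'JavaPackage' error, which at least separates a classpath problem from a registration problem.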