I have set up Apache Sedona on an Amazon EMR cluster based on https://sedona.apache.org/1.5.0/setup/emr/ and attached the cluster to JupyterLab in Amazon EMR.
First, I set up the configuration so I can read from a Delta table registered in the AWS Glue Catalog:
%%configure -f
{
    "conf": {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
}
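(For reference, this is equivalent to setting the same conf keys when building the SparkSession directly instead of through a Livy %%configure cell; a minimal sketch, assuming the Delta Lake jars are already on the cluster classpath and using a placeholder app name:)

from pyspark.sql import SparkSession

# Equivalent session-level configuration to the %%configure cell above.
spark = (
    SparkSession.builder
    .appName("delta-glue-example")  # hypothetical app name
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)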
When I run
%%sql
select _time, _coordinate
from my_db.my_delta_table
order by _time desc
limit 5
It gives me the result as expected. My _coordinate column is in WKT string format.
Now I try to run a spatial Spark SQL query from Apache Sedona:
%%sql
select
_time,
ST_Distance(
ST_GeomFromWKT('POINT(37.335480 -121.893028)'),
ST_GeomFromWKT(_coordinate)
) as `Distance to San Jose, CA`
from my_db.my_delta_table
order by _time desc
limit 5
I get this error:
An error was encountered:
[UNRESOLVED_ROUTINE] Cannot resolve function `ST_Distance` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].; line 3 pos 4
Traceback (most recent call last):
File "/mnt1/yarn/usercache/livy/appcache/application_1699328410941_0006/container_1699328410941_0006_01_000001/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
File "/mnt1/yarn/usercache/livy/appcache/application_1699328410941_0006/container_1699328410941_0006_01_000001/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/mnt1/yarn/usercache/livy/appcache/application_1699328410941_0006/container_1699328410941_0006_01_000001/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `ST_Distance` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].; line 3 pos 4
I think I somehow need to register the Apache Sedona SQL functions. How do I register them? Thanks!
Essentially, you need to put the following configuration in your notebook cell, appending the Sedona SQL and Viz extensions to spark.sql.extensions alongside the Delta extension:
%%configure -f
{
    "conf": {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension,org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
}
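After re-running the %%configure -f cell (the -f flag forces the Livy session to restart with the new configuration), the Sedona functions should resolve. As a quick sanity check, you could run something like this in a PySpark cell (just an illustrative query, using the notebook's existing spark session):

# Planar distance between two WKT points; if the Sedona SQL extension is
# registered, this returns 5.0 instead of the UNRESOLVED_ROUTINE error.
spark.sql(
    "SELECT ST_Distance(ST_GeomFromWKT('POINT (0 0)'), ST_GeomFromWKT('POINT (3 4)')) AS d"
).show()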