Tags: apache-spark, apache-spark-sql, jupyter-lab, apache-sedona

How to register Apache Sedona SQL functions in Amazon EMR JupyterLab?


I have set up Apache Sedona on an Amazon EMR cluster by following https://sedona.apache.org/1.5.0/setup/emr/

I attached the EMR cluster to JupyterLab in Amazon EMR.

First, I set up the config so that I can read from a Delta table registered in the AWS Glue Catalog:

%%configure -f
{
  "conf": {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
  }
}

When I run

%%sql
select _time, _coordinate
from my_db.my_delta_table
order by _time desc
limit 5

It returns the result as expected (screenshot not included).

My _coordinate column is a string in WKT format.

Now I try to run a spatial Spark SQL query using Apache Sedona:

%%sql
select
    _time,
    ST_Distance(
        ST_GeomFromWKT('POINT(37.335480 -121.893028)'),
        ST_GeomFromWKT(_coordinate)
    ) as `Distance to San Jose, CA`
from my_db.my_delta_table
order by _time desc
limit 5

I get this error:

An error was encountered:
[UNRESOLVED_ROUTINE] Cannot resolve function `ST_Distance` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].; line 3 pos 4
Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1699328410941_0006/container_1699328410941_0006_01_000001/pyspark.zip/pyspark/sql/session.py", line 1440, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
  File "/mnt1/yarn/usercache/livy/appcache/application_1699328410941_0006/container_1699328410941_0006_01_000001/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1323, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt1/yarn/usercache/livy/appcache/application_1699328410941_0006/container_1699328410941_0006_01_000001/pyspark.zip/pyspark/errors/exceptions/captured.py", line 175, in deco
    raise converted from None
pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `ST_Distance` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].; line 3 pos 4

I think I need to register the Apache Sedona SQL functions somehow. How do I register them? Thanks!


Solution

  • Add the Sedona extensions to spark.sql.extensions, next to the Delta extension, as a comma-separated list. Put the following in a notebook cell:

    %%configure -f
    {
      "conf": {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension,org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
      }
    }
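
The configuration in the question listed only the Delta extension, so the Sedona `ST_*` functions were never registered with the session. Since `spark.sql.extensions` takes a comma-separated list of class names, the fix is to append to the existing value rather than replace it. A minimal Python sketch of that merge follows. The helper names `merge_extensions` and `build_configure_body` are hypothetical and only for illustration; the class names are the ones from the configuration above.

```python
import json

# Extension class names taken from the configuration above.
DELTA_EXT = "io.delta.sql.DeltaSparkSessionExtension"
SEDONA_EXTS = [
    "org.apache.sedona.viz.sql.SedonaVizExtensions",
    "org.apache.sedona.sql.SedonaSqlExtensions",
]


def merge_extensions(existing, extra):
    """Append `extra` class names to a comma-separated `existing` value.

    Hypothetical helper: keeps the original order and skips duplicates.
    """
    merged = [name.strip() for name in existing.split(",") if name.strip()]
    for name in extra:
        if name not in merged:
            merged.append(name)
    return ",".join(merged)


def build_configure_body(existing_conf, extra_extensions):
    """Return the JSON body for a `%%configure -f` cell with extensions merged in."""
    conf = dict(existing_conf)  # copy so the caller's dict is not modified
    conf["spark.sql.extensions"] = merge_extensions(
        conf.get("spark.sql.extensions", ""), extra_extensions
    )
    return json.dumps({"conf": conf}, indent=2)


# The Delta-only configuration from the question.
original_conf = {
    "spark.sql.extensions": DELTA_EXT,
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}

# Paste the printed JSON under a `%%configure -f` line in a notebook cell.
print(build_configure_body(original_conf, SEDONA_EXTS))
```

The printed JSON should match the cell shown in the solution. After the session restarts with this configuration, the query from the question should be able to resolve `ST_Distance` and `ST_GeomFromWKT`, assuming the Sedona jars are already on the cluster as described in the EMR setup guide.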