apache-spark azure-databricks geohashing geomesa

GeoMesa Spark can't use geohash


I am using GeoMesa Spark on a Databricks cluster, following this sample notebook: GeoMesa - NYC Taxis. I had no problem importing and using UDF functions such as st_makePoint and st_intersects. However, when I try to use st_geoHash to create a column of geohashes for the points, I get this error:

NoClassDefFoundError: Could not initialize class org.locationtech.geomesa.spark.jts.util.GeoHash$.

The cluster has geomesa-spark-jts_2.11:3.2.1 and scala-logging_2.11:3.8.0 installed, which are the two libraries specified by the notebook (though with a different GeoMesa version: 2.3.2 in the notebook versus 3.2.1 on my cluster). I am new to GeoMesa and the Databricks platform. I wonder if I am missing some dependency needed for the GeoHash class to work.
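
For reference, this is roughly what I am running (the DataFrame and column names are just illustrative, not the notebook's):

    import spark.implicits._
    import org.locationtech.geomesa.spark.jts._

    spark.withJTS  // registers the JTS geometry types and st_* functions (this part works)

    val withPoints = taxiDF
      .withColumn("pickupPoint", st_makePoint($"pickup_longitude", $"pickup_latitude"))  // fine

    // this is the step that throws the NoClassDefFoundError
    val withHash = withPoints
      .selectExpr("*", "st_geoHash(pickupPoint, 25) AS pickup_geohash_25")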


Solution

  • (Updated 18 Oct) I am one of the original contributors to this notebook, along with Derek Yeager, who was the primary author. Complex frameworks like GeoMesa can require more special attention, since our Maven support in the cluster library UI is built for streamlined library installs (more here). The notebook was originally built for Spark 2.4 (DBR 6.x) against a fat jar of GeoMesa that we generated at the time (late 2019); that jar shaded some dependency conflicts with DBR. Instead of the fat jar, you could use Databricks Container Services, which can be useful for deploying more complex frameworks on our clusters. I should also mention that DBR 7+ / Spark 3+ is Scala 2.12 only, so you would not expect Scala 2.11 artifacts to work on those runtimes.

    CCRi (the backers of GeoMesa) has produced a Databricks-friendly build. A shaded fat jar for GeoMesa (currently version 3.3.0) is available at the Maven coordinates org.locationtech.geomesa:geomesa-gt-spark-runtime_2.12:3.3.0, built for Spark runtimes such as Databricks. Since it is shaded, you need to add Maven exclusions for it to install cleanly: enter jline:*,org.geotools:* (without quotes) in the Databricks library UI. I have been able to execute the notebook you referenced (with some small changes) on DBR 9.1 LTS (Spark 3.1).
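
    As a quick check that the shaded runtime jar is wired up, the JTS registration works the same as it does with geomesa-spark-jts; a minimal sketch, assuming the library installed cleanly with the exclusions above:

    import org.locationtech.geomesa.spark.jts._

    // adds the JTS geometry types and the st_* SQL functions (including st_geoHash) to the session
    spark.withJTS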

    1. One change from the initial notebook: you no longer need to separately add com.typesafe.scala-logging:scala-logging_2.11:3.8.0.
    2. Another: you can no longer change these Spark configs inside the session, so at minimum comment them out (if you still want them, add them to the cluster config instead; see the config sketch after the join code below):
    // spark.conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // spark.conf.set("spark.kryo.registrator", classOf[GeoMesaSparkKryoRegistrator].getName) 
    // spark.conf.set("spark.kryoserializer.buffer.max","500m")
    
    3. Comment out the skew hint and let Spark 3.x AQE handle the skew:
    // short-circuit on the geohash equi-join, then apply the geospatial predicate only where necessary
    
    val joined = trips.as("L")
        // removed - let Spark 3.x AQE handle the skew instead
        //.hint("skew", "pickup_geohash_25", pickup_skews).as("L")
        .join(
          neighborhoodsDF.as("R"),
          ( $"L.pickup_geohash_25" === $"R.geohash" ) &&      // coarse filter on matching geohash cells
          ( st_contains($"R.polygon", $"L.pickupPoint") )     // exact point-in-polygon test
        )
        .drop("geohashes")
        .drop("geohash")
        .drop("polygon")
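
    As mentioned in item 2, if you still want Kryo serialization you can move those settings into the cluster's Spark config rather than the session. A sketch of what the cluster config entries might look like, assuming the registrator class from geomesa-spark-core is on the classpath via the runtime jar (worth double-checking the class name against the jar you install):

    spark.serializer org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator org.locationtech.geomesa.spark.GeoMesaSparkKryoRegistrator
    spark.kryoserializer.buffer.max 500m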

    I have access to all of the data from within the Databricks environment where we produced the notebook, so I am assuming you are somehow reconstituting the data on your side if you are attempting to execute your copy of the notebook.