I've been testing GeoMesa with simple spatial queries and comparing it with PostGIS. For example, this SQL query runs in 30 seconds in PostGIS:
with series as (
  select generate_series(0, 5000) as i
),
points as (
  select ST_Point(i, i*2) as geom from series
)
select st_distance(a.geom, b.geom) from points as a, points as b
Now, the following GeoMesa version takes 5 minutes (using -Xmx10g):
import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._
import org.locationtech.jts.geom._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config("spark.sql.crossJoin.enabled", "true")
      .config("spark.executor.memory", "12g")
      .config("spark.driver.memory", "12g")
      .config("spark.cores.max", "4")
      .master("local")
      .appName("Geomesa")
      .getOrCreate()
    spark.withJTS
    import spark.implicits._

    val x = 0 until 5000
    val y = for (i <- x) yield i * 2
    val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
    val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val points2 = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val all_points = for {
      i <- points
      j <- points2
    } yield (i, j)

    val df = all_points.toDF("point", "point2")
    val df2 = df.withColumn("dist", st_distance($"point", $"point2"))
    df2.show()
  }
}
I'd have expected similar or better performance from GeoMesa. What can be done to tune a query like this?
FIRST EDIT
As Emilio suggests, this is not really a query but a computation, and it could have been written without Spark. The code below runs in less than two seconds:
import org.locationtech.jts.geom._

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val x = 0 until 5000
    val y = for (i <- x) yield i * 2
    val coords = for ((i, n) <- x.zipWithIndex) yield (i, y(n))
    val points = for (i <- coords) yield new GeometryFactory().createPoint(new Coordinate(i._1, i._2))
    val distances = for {
      i <- points
      j <- points
    } yield i.distance(j)
    println(distances.slice(0, 30))
  }
}
GeoMesa is not going to be as fast as PostGIS for small amounts of data. GeoMesa is designed for distributed, NoSQL databases. If your dataset fits in PostGIS, you should probably just use PostGIS. Once you start hitting the limits of PostGIS, you should consider using GeoMesa. GeoMesa does offer integration with arbitrary GeoTools data stores (including PostGIS), which can make some of the GeoMesa Spark and command-line features available to PostGIS.
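As a rough sketch of what that integration can look like, the snippet below reads a PostGIS table through the GeoMesa Spark SQL data source. The connection parameters are the standard GeoTools JDBC/PostGIS keys, and the "geotools" flag that selects GeoMesa's GeoTools RDD provider, the feature type name, and the credentials are all placeholders; check them against the GeoMesa documentation for your version:

import org.apache.spark.sql.SparkSession
import org.locationtech.geomesa.spark.jts._

object PostgisViaGeoMesa {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("geomesa-postgis").getOrCreate()
    spark.withJTS // register the st_* functions, as in the snippet above

    // Hypothetical connection parameters: standard GeoTools JDBC/PostGIS keys.
    // The "geotools" flag selecting GeoMesa's GeoTools provider is an assumption.
    val dsParams = Map(
      "geotools" -> "true",
      "dbtype"   -> "postgis",
      "host"     -> "localhost",
      "port"     -> "5432",
      "database" -> "mydb",
      "user"     -> "myuser",
      "passwd"   -> "mypassword"
    )

    val df = spark.read
      .format("geomesa")                      // GeoMesa Spark SQL data source
      .options(dsParams)
      .option("geomesa.feature", "my_points") // hypothetical feature type / table name
      .load()

    df.show(10)
  }
}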
For your particular snippet, I suspect that most of the time is spent spinning up an RDD and running through the loops. There isn't really a 'query', as you are just running a pair-wise calculation. If you are querying data stored in a table, then GeoMesa has a chance to optimize the scan. However, GeoMesa isn't a SQL database and doesn't have any native support for joins. Generally the join is done in memory by Spark, although there are some things you can do to speed it up (e.g. a broadcast join or RDD partitioning; see the sketch below). If you want to do complex spatial joins, you might want to check out GeoSpark and/or Magellan, which specialize in spatial Spark operations.
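To illustrate the broadcast join idea: this isn't GeoMesa-specific, it just combines plain Spark with the geomesa-spark-jts functions from the question. Because the 5000-point side is small, it can be shipped to every task so the pair-wise st_distance runs as a local loop rather than a shuffle join. A minimal sketch, assuming the JTS encoders from spark.withJTS are in scope:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast
import org.locationtech.geomesa.spark.jts._
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}

object BroadcastDistance {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[4]")
      .appName("broadcast-distance")
      .getOrCreate()
    spark.withJTS
    import spark.implicits._

    // build the 5000 points once with a shared GeometryFactory
    val gf = new GeometryFactory()
    val points = (0 until 5000).map(i => gf.createPoint(new Coordinate(i, i * 2)))

    val left  = points.toDF("point")
    val right = points.toDF("point2")

    // broadcast() hints Spark to send the small right-hand side to every task,
    // so the cross join needs no shuffle
    val joined = left.crossJoin(broadcast(right))
      .withColumn("dist", st_distance($"point", $"point2"))

    joined.show(10)
  }
}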