scala apache-spark geospark

Why is Apache Sedona not reading this Shapefile properly?


I'm using Apache Spark v3.0.1 and Apache Sedona v1.1.1, and I'm trying to read a Shapefile into a SpatialRDD. I first tried the example provided by the Sedona library (more specifically, the code inside the testShapefileConstructor method), and it worked. However, when I try to read another Shapefile, the metadata is loaded correctly but the actual data is missing: calling count on the SpatialRDD returns 0.

The shapefile I'm using is available here. It's the map of a Brazilian state. I also tried data from other states, so I guess there's something wrong with those files.

This is the code I used. I'm aware that the contents of the shapefile reside in a folder with .shp, .shx, .dbf and .prj files, so the variable path points to that folder.

import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader
import org.apache.sedona.sql.utils.{Adapter, SedonaSQLRegistrator}
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.{SparkSession, DataFrame, Encoder}

object Main {

  def main(args: Array[String]): Unit = {
    // Local Spark session with Kryo serialization and Sedona's registrator
    val spark = SparkSession.builder
      .config("spark.master", "local[*]")
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
      .appName("test")
      .getOrCreate()

    SedonaSQLRegistrator.registerAll(spark)
    SedonaVizRegistrator.registerAll(spark)

    // Folder containing the .shp, .shx, .dbf and .prj files
    val path = "/path/to/shapefile/folder"
    val spatialRDD = ShapefileReader.readToGeometryRDD(spark.sparkContext, path)
    println(spatialRDD.fieldNames)            // attribute names read from the .dbf
    println(spatialRDD.rawSpatialRDD.count()) // number of geometries read
    val rawSpatialDf = Adapter.toDf(spatialRDD, spark)
    rawSpatialDf.show()
    rawSpatialDf.printSchema()
  }
}

Output:

[ID, CD_GEOCODM, NM_MUNICIP]
0
+--------+---+----------+----------+
|geometry| ID|CD_GEOCODM|NM_MUNICIP|
+--------+---+----------+----------+
+--------+---+----------+----------+
root
 |-- geometry: geometry (nullable = true)
 |-- ID: string (nullable = true)
 |-- CD_GEOCODM: string (nullable = true)
 |-- NM_MUNICIP: string (nullable = true)

I tried changing the character encoding, as pointed out here, but the results were the same after these attempts:

System.setProperty("sedona.global.charset", "utf8")

and

System.setProperty("sedona.global.charset", "iso-8859-1")

So I still have no idea why this file fails to be read. What could be the problem?


Solution

  • Sedona currently supports only the Shapefile types Point, PolyLine, Polygon, and MultiPoint (i.e., type codes 1, 3, 5, and 8), according to https://github.com/apache/incubator-sedona/blob/master/core/src/main/java/org/apache/sedona/core/formatMapper/shapefileParser/parseUtils/shp/ShapeType.java

    Your data might use a different type, because the Shapefile specification defines more types than these: https://en.wikipedia.org/wiki/Shapefile
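One way to check which type a file uses is to read it from the .shp header. The following is a minimal sketch, not part of Sedona: the object and method names (ShapeTypeCheck, shapeTypeFromHeader, readShapeType) are hypothetical. It relies on the header layout in the ESRI Shapefile specification, where the file code 9994 is a big-endian integer at byte 0 and the shape type is a little-endian integer at byte 32. The set of supported codes is taken from the list above and may differ in other Sedona versions.

```scala
import java.io.{DataInputStream, FileInputStream}
import java.nio.{ByteBuffer, ByteOrder}

// Hypothetical helper for inspecting a .shp header; not a Sedona API.
object ShapeTypeCheck {

  // Shape type codes defined by the ESRI Shapefile specification.
  val shapeTypeNames: Map[Int, String] = Map(
    0 -> "Null", 1 -> "Point", 3 -> "PolyLine", 5 -> "Polygon", 8 -> "MultiPoint",
    11 -> "PointZ", 13 -> "PolyLineZ", 15 -> "PolygonZ", 18 -> "MultiPointZ",
    21 -> "PointM", 23 -> "PolyLineM", 25 -> "PolygonM", 28 -> "MultiPointM",
    31 -> "MultiPatch"
  )

  // Assumption: the codes listed in the answer above (1, 3, 5, 8).
  val supportedBySedona: Set[Int] = Set(1, 3, 5, 8)

  /** Extracts the shape type code from the bytes of the main file header. */
  def shapeTypeFromHeader(header: Array[Byte]): Int = {
    require(header.length >= 36, "header is too short")
    val buf = ByteBuffer.wrap(header)
    // Bytes 0-3: file code, big-endian, always 9994 for a .shp file.
    val fileCode = buf.order(ByteOrder.BIG_ENDIAN).getInt(0)
    require(fileCode == 9994, s"not a .shp file (file code $fileCode)")
    // Bytes 32-35: shape type, little-endian.
    buf.order(ByteOrder.LITTLE_ENDIAN).getInt(32)
  }

  /** Reads the 100-byte header of the given .shp file and returns its shape type code. */
  def readShapeType(shpPath: String): Int = {
    val in = new DataInputStream(new FileInputStream(shpPath))
    try {
      val header = new Array[Byte](100)
      in.readFully(header)
      shapeTypeFromHeader(header)
    } finally in.close()
  }
}
```

Calling ShapeTypeCheck.readShapeType on the .shp file inside the folder and looking the result up in shapeTypeNames shows which type the file declares. If the code is not in supportedBySedona, that would be consistent with the empty result described in the question.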