Our organisation has recently been using Databricks for ETL and dataset development. However, I have found the libraries/capabilities for raster datasets very limiting. There are a few raster/Spark libraries around (for example GeoTrellis, RasterFrames and Apache Sedona), but they are not very mature.
I have therefore been exploring alternative ways to efficiently work with raster data on the Databricks platform, which leverages Spark / Delta tables / Parquet files.
One idea I had was to dump the raster data to simple x, y, value columns and load them as tables. Provided my other datasets are at the same resolution (I will pre-process them so that they are), I should then be able to use simple SQL queries for masking, addition and subtraction, as well as more complex user-defined functions.
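To make that concrete, the sort of query I have in mind would look something like this (just a sketch; `flood_depth` and `terrain` are made-up names for two rasters dumped to x, y, value and stored as Delta tables):

```python
# Sketch only: flood_depth and terrain are hypothetical x, y, value tables
# (pre-processed to the same grid/resolution) registered as Delta tables.
masked_sum = spark.sql("""
    SELECT a.x,
           a.y,
           a.value + b.value AS combined_value
    FROM   flood_depth a
    JOIN   terrain     b
      ON   a.x = b.x
     AND   a.y = b.y
    WHERE  a.value > 0            -- simple mask on the first raster
""")

masked_sum.write.format("delta").mode("overwrite").saveAsTable("flood_plus_terrain")
```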
Step one, I thought, would be to dump my raster to points as a CSV, which I could then load into a Delta table. But after 12 hours running on my Databricks cluster (128 GB memory, 16 cores), a 3 GB raster had still not finished (I was using the gdal2xyz command below).
Does anyone have a quicker way to dump a raster to CSV? Or, even better, directly to Parquet format?
```
python gdal2xyz.py -band 1 -skipnodata "AR_FLRF_UD_Q1500_RD_02.tif" "AR_FLRF_UD_Q1500_RD_02.csv"
```
Maybe I could tile the raster, dump each tile to CSV using parallel processing, and then bind the CSV files together (something like the sketch below), but it seems a bit laborious.
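This is roughly what I was picturing, an untested sketch that skips CSV entirely by using rasterio and pyarrow to write one Parquet file per tile (the tile loop could later be parallelised; it assumes a north-up raster with no rotation in its geotransform):

```python
# Untested sketch: read the raster in tiles with rasterio and write each tile
# straight to Parquet as x, y, value rows. Assumes rasterio, numpy and pyarrow
# are installed; paths and nodata handling would need adjusting.
import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import rasterio
from rasterio.windows import Window

SRC = "AR_FLRF_UD_Q1500_RD_02.tif"
OUT_DIR = "raster_parquet"      # one Parquet file per tile lands here
TILE = 2048                     # tile size in pixels

os.makedirs(OUT_DIR, exist_ok=True)

with rasterio.open(SRC) as src:
    nodata = src.nodata
    for row_off in range(0, src.height, TILE):
        for col_off in range(0, src.width, TILE):
            win = Window(col_off, row_off,
                         min(TILE, src.width - col_off),
                         min(TILE, src.height - row_off))
            data = src.read(1, window=win)

            # Pixel-centre coordinates from the window's affine transform
            # (assumes no rotation, i.e. a north-up raster).
            t = src.window_transform(win)
            xs = t.c + (np.arange(data.shape[1]) + 0.5) * t.a
            ys = t.f + (np.arange(data.shape[0]) + 0.5) * t.e
            xx, yy = np.meshgrid(xs, ys)

            x, y, val = xx.ravel(), yy.ravel(), data.ravel()
            if nodata is not None:
                keep = val != nodata       # drop nodata cells, like -skipnodata
                x, y, val = x[keep], y[keep], val[keep]

            table = pa.table({"x": x, "y": y, "value": val})
            pq.write_table(table, f"{OUT_DIR}/tile_{row_off}_{col_off}.parquet")
```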
You can use Sedona to easily load GeoTiffs into a DataFrame and save the DataFrame in Parquet format. See here: https://sedona.apache.org/latest-snapshot/api/sql/Raster-loader/
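Roughly like this (untested; the exact API varies between Sedona releases, so check the docs linked above, and the Sedona jars/Python package must be installed on the cluster). Older 1.x releases expose a "geotiff" data source that flattens each GeoTiff into plain columns, which can then be written straight to Parquet:

```python
# Rough sketch for a Sedona 1.x "geotiff" data source; field names may differ
# in your release, so verify against the Raster-loader docs.
from sedona.register import SedonaRegistrator

SedonaRegistrator.registerAll(spark)

df = (spark.read.format("geotiff")
      .option("dropInvalid", True)
      .load("dbfs:/path/to/rasters"))

flat = df.selectExpr(
    "image.origin AS origin",
    "image.geometry AS geometry",   # raster extent as WKT
    "image.height AS height",
    "image.width AS width",
    "image.nBands AS bands",
    "image.data AS data",           # pixel values as a flat array
)

flat.write.mode("overwrite").parquet("dbfs:/path/to/raster_parquet")
```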