apache-spark, google-cloud-platform, databricks, google-cloud-dataproc, gcp-databricks

Does Dataproc support Delta Lake format?


Is the Databricks Delta format available on Google Cloud Dataproc?

For AWS and Azure it is clear that this is the case, but after researching online I am not sure it is for GCP, and the Databricks docs are not much clearer on this point.

I am assuming Google feels its own offerings are sufficient, e.g. Google Cloud Storage, but does it support the mutability Delta needs? This https://docs.gcp.databricks.com/getting-started/overview.html provides too little context.


Solution

  • Delta Lake format is supported on Dataproc. You can use it like any other data format, such as Parquet or ORC. The following is an example from this article.

    # Copyright 2022 Google LLC.
    # SPDX-License-Identifier: Apache-2.0
    import sys

    from pyspark.sql import SparkSession
    from delta import *


    def main():
        # First job argument: GCS path of the Delta table, e.g. gs://my-bucket/delta-table
        table_path = sys.argv[1]
        print("Starting job: GCS Bucket: ", table_path)

        # Enable Delta Lake by registering its SQL extension and catalog with the session
        spark = SparkSession \
            .builder \
            .appName("DeltaTest") \
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
            .getOrCreate()

        # Write 500 rows in Delta format, then read the table back and display it
        data = spark.range(0, 500)
        data.write.format("delta").mode("append").save(table_path)

        df = spark.read \
            .format("delta") \
            .load(table_path)
        df.show()

        spark.stop()


    if __name__ == "__main__":
        main()
    

    You also need to add the dependency when submitting the job with --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0".
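
    For example, a full job submission could look like the following sketch; the cluster name, region, script name, and GCS path are placeholders, and the delta-core version should match the Spark and Scala versions of your Dataproc image.

    gcloud dataproc jobs submit pyspark delta_test.py \
        --cluster=my-cluster \
        --region=us-central1 \
        --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0" \
        -- gs://my-bucket/delta-table

    The argument after the `--` separator is passed to the script as sys.argv[1], i.e. the GCS path where the Delta table is written and read back.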