databricks delta-lake

Databricks Lakehouse Medallion Architecture Python Sample


Does anyone know of a Python sample of the medallion architecture? Something like this one in SQL: https://www.databricks.com/notebooks/delta-lake-cdf.html


Solution

  • In the simplest case it's just a chain of Spark's .readStream -> some transformations -> .writeStream (it's also possible to do it in a non-streaming fashion, but then you spend more time tracking what has changed, etc.). With plain Spark plus Databricks Auto Loader it looks like this:

    # bronze: ingest raw files with Auto Loader
    raw_df = spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "json") \
      .load(input_data)
    # pass availableNow=True to .trigger() if you want to mimic batch-like processing
    raw_df.writeStream.format("delta") \
      .option("checkpointLocation", bronze_checkpoint) \
      .trigger(...) \
      .start(bronze_path)
    # silver: read the bronze table and clean/transform it
    bronze_df = spark.readStream.format("delta").load(bronze_path)
    silver_df = bronze_df.filter(...)
    silver_df.writeStream.format("delta") \
      .option("checkpointLocation", silver_checkpoint) \
      .trigger(...) \
      .start(silver_path)
    # gold: read the silver table and aggregate it
    silver_df = spark.readStream.format("delta").load(silver_path)
    gold_df = silver_df.groupBy(...).agg(...)
    gold_df.writeStream.format("delta") \
      .option("checkpointLocation", gold_checkpoint) \
      .trigger(...) \
      .start(gold_path)
    

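    To make the shape of each layer concrete, here is a toy, Spark-free sketch of the same read -> transform -> write flow in plain Python. The record fields ("country", "amount") are made up purely for illustration; in a real pipeline each function would read from and write to a Delta table instead of passing lists around:

    ```python
    import json

    # Hypothetical raw input, as it might land from a JSON source
    raw_files = [
        '{"country": "US", "amount": 10}',
        '{"country": "DE", "amount": null}',  # bad record, dropped at silver
        '{"country": "US", "amount": 5}',
    ]

    def bronze(raw):
        # bronze: land the data as-is, only parse it
        return [json.loads(line) for line in raw]

    def silver(rows):
        # silver: clean/validate, e.g. drop rows with missing amounts
        return [r for r in rows if r["amount"] is not None]

    def gold(rows):
        # gold: aggregate for consumption, e.g. totals per country
        totals = {}
        for r in rows:
            totals[r["country"]] = totals.get(r["country"], 0) + r["amount"]
        return totals

    print(gold(silver(bronze(raw_files))))  # {'US': 15}
    ```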
    But it becomes much simpler if you use Delta Live Tables - then you concentrate just on the transformations, not on how to write the data, etc. Something like this:

    import dlt

    @dlt.table
    def bronze():
      # ingest raw files with Auto Loader
      return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load(input_data)
    
    @dlt.table
    def silver():
      # clean/transform the bronze table
      bronze = dlt.read_stream("bronze")
      return bronze.filter(...)
    
    @dlt.table
    def gold():
      # aggregate the silver table
      silver = dlt.read_stream("silver")
      return silver.groupBy(...).agg(...)