databricks delta-lake

Databricks Lakehouse Medallion Architecture Python Sample


Does anyone know of a Python sample of the medallion architecture? Something like this one in SQL: https://www.databricks.com/notebooks/delta-lake-cdf.html


Solution

  • In the simplest case it's just a chain of Spark's .readStream -> some transformations -> .writeStream (it's also possible to do it in a non-streaming fashion, but then you spend more time tracking what has changed, etc.). With plain Spark plus Databricks Auto Loader it looks like this:

    # bronze: ingest raw files with Auto Loader
    raw_df = spark.readStream.format("cloudFiles") \
      .option("cloudFiles.format", "json") \
      .load(input_data)
    # pass availableNow=True to .trigger() if you want to mimic batch-like processing
    raw_df.writeStream.format("delta") \
      .option("checkpointLocation", bronze_checkpoint) \
      .trigger(...) \
      .start(bronze_path)
    # silver: read the bronze table and clean/transform it
    bronze_df = spark.readStream.format("delta").load(bronze_path)
    silver_df = bronze_df.filter(...)
    silver_df.writeStream.format("delta") \
      .option("checkpointLocation", silver_checkpoint) \
      .trigger(...) \
      .start(silver_path)
    # gold: read the silver table and aggregate it
    silver_df = spark.readStream.format("delta").load(silver_path)
    gold_df = silver_df.groupBy(...).agg(...)
    gold_df.writeStream.format("delta") \
      .option("checkpointLocation", gold_checkpoint) \
      .trigger(...) \
      .start(gold_path)
    

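    To make the shape of each layer concrete, here is a toy, Spark-free sketch of the same read -> transform -> write flow in plain Python. The record fields ("country", "amount") are made up purely for illustration; in a real pipeline each function would read from and write to a Delta table instead of passing lists around:

    ```python
    import json

    # Hypothetical raw input, as it might land from a JSON source
    raw_files = [
        '{"country": "US", "amount": 10}',
        '{"country": "DE", "amount": null}',  # bad record, dropped at silver
        '{"country": "US", "amount": 5}',
    ]

    def bronze(raw):
        # bronze: land the data as-is, only parse it
        return [json.loads(line) for line in raw]

    def silver(rows):
        # silver: clean/validate, e.g. drop rows with missing amounts
        return [r for r in rows if r["amount"] is not None]

    def gold(rows):
        # gold: aggregate for consumption, e.g. totals per country
        totals = {}
        for r in rows:
            totals[r["country"]] = totals.get(r["country"], 0) + r["amount"]
        return totals

    print(gold(silver(bronze(raw_files))))  # {'US': 15}
    ```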
    But it becomes much simpler if you use Delta Live Tables - then you concentrate just on the transformations, not on how to write the data, etc. Something like this:

    import dlt

    @dlt.table
    def bronze():
      # ingest raw files with Auto Loader
      return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load(input_data)
    
    @dlt.table
    def silver():
      # clean/transform the bronze table
      bronze = dlt.read_stream("bronze")
      return bronze.filter(...)
    
    @dlt.table
    def gold():
      # aggregate the silver table
      silver = dlt.read_stream("silver")
      return silver.groupBy(...).agg(...)