Does anyone know of a Python sample of the medallion architecture? Something like this SQL one: https://www.databricks.com/notebooks/delta-lake-cdf.html
In the simplest case it's just a chain of Spark's .readStream
-> some transformations -> .writeStream at each layer.
(You can do it in a non-streaming fashion as well, but then you spend more time tracking what has changed yourself, etc.) With plain Spark structured streaming + Databricks Autoloader it looks like this:
# bronze: ingest raw files with Autoloader
raw_df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load(input_data)
# use .trigger(availableNow=True) if you want to mimic batch-like processing
raw_df.writeStream.format("delta") \
    .option("checkpointLocation", bronze_checkpoint) \
    .trigger(...) \
    .start(bronze_path)
# silver
bronze_df = spark.readStream.format("delta").load(bronze_path)
# derive silver_df from bronze_df: cleaning, filtering, etc.
silver_df = bronze_df.filter(...)
silver_df.writeStream.format("delta") \
    .option("checkpointLocation", silver_checkpoint) \
    .trigger(...) \
    .start(silver_path)
# gold
silver_df = spark.readStream.format("delta").load(silver_path)
gold_df = silver_df.groupBy(...).agg(...)  # streaming aggregation
gold_df.writeStream.format("delta") \
    .option("checkpointLocation", gold_checkpoint) \
    .start(gold_path)
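The reason streaming wins over batch here is the checkpoint: each layer only ever processes records it hasn't seen before. Just to make the idea concrete, here's a toy plain-Python sketch of what a checkpoint buys you (all names are mine, nothing to do with Spark's actual implementation):

```python
# Toy sketch of incremental processing with a checkpoint: each run
# handles only records that arrived since the saved offset, which is
# roughly what Spark's checkpointLocation does for you.

def run_increment(source, state):
    """Process records added since the last run; update the checkpoint."""
    offset = state.get("offset", 0)
    new_records = source[offset:]   # only the unseen rows
    state["offset"] = len(source)   # persist the new checkpoint
    return new_records

source = [1, 2, 3]
state = {}                          # stands in for the checkpoint location
first = run_increment(source, state)    # all three rows
source += [4, 5]
second = run_increment(source, state)   # only the two new rows
```

Without the checkpoint (i.e. in batch mode), that offset bookkeeping is exactly the "tracking what has changed" you'd have to build yourself.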
But it becomes much simpler if you're using Delta Live Tables - then you concentrate just on the transformations, not on how to write the data, checkpoints, etc. Something like this:
@dlt.table
def bronze():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .load(input_data)

@dlt.table
def silver():
    bronze = dlt.read_stream("bronze")
    return bronze.filter(...)

@dlt.table
def gold():
    silver = dlt.read_stream("silver")
    return silver.groupBy(...).agg(...)