Tags: databricks, batch-processing, databricks-sql, delta-live-tables, incremental-load

Delta Live Tables for Batch Incremental Processing


Is it possible to use Delta Live Tables to perform incremental batch processing?

Now, I believe that this code will always load all of the data available in the directory when the pipeline is run:

CREATE LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM parquet.`/databricks-datasets/samples/lending_club/parquet/`

But if we do the following instead:

CREATE LIVE TABLE lendingclub_raw
COMMENT "The raw loan risk dataset, ingested from /databricks-datasets."
TBLPROPERTIES ("quality" = "bronze")
AS SELECT * FROM cloud_files("/databricks-datasets/samples/lending_club/parquet/", "parquet")

Will it load only the incremental data each time it runs, if the pipeline is run in triggered mode?

I know that you can achieve batch incremental processing with Auto Loader outside of DLT by using the trigger modes .trigger(once=True) or .trigger(availableNow=True) and running the job on a schedule.

Since you cannot explicitly define a trigger in DLT, how will this work?


Solution

  • You need to define your table as a streaming live table, so it will process only the data that has arrived since the last invocation (see the sketch after the quoted docs below). From the docs:

    A streaming live table or view processes data that has been added only since the last pipeline update.

    This can then be combined with triggered execution, which behaves similarly to Trigger.AvailableNow (see the settings fragment below). From the docs:

    Triggered pipelines update each table with whatever data is currently available and then stop the cluster running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and starts by computing those that read from external sources. Tables within the pipeline are updated after their dependent data sources have been updated.
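
    Putting the two together, the question's second snippet could be written as below. This is a minimal sketch assuming the same table name and source path as in the question; the STREAMING keyword plus cloud_files() is what makes the ingestion incremental:

    CREATE OR REFRESH STREAMING LIVE TABLE lendingclub_raw
    COMMENT "The raw loan risk dataset, ingested incrementally from /databricks-datasets."
    TBLPROPERTIES ("quality" = "bronze")
    -- cloud_files() (Auto Loader) requires a streaming table and tracks which files
    -- have already been processed, so each update reads only newly arrived files
    AS SELECT * FROM cloud_files("/databricks-datasets/samples/lending_club/parquet/", "parquet")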
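
    Triggered execution is the default mode; in the pipeline's JSON settings it corresponds to "continuous": false. A minimal fragment, with a hypothetical pipeline name:

    {
      "name": "lendingclub_pipeline",
      "continuous": false
    }

    Running this triggered pipeline on a schedule then gives the same batch-incremental behavior that .trigger(availableNow=True) provides in plain Structured Streaming.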