apache-spark pyspark delta-lake incremental-load

Difference between checkpoints and change streams


I am trying to understand the difference between two common strategies for incremental data loading.

What is the difference between a streaming checkpoint vs a change stream in Databricks Delta Lake?

Thanks.


Solution

    1. Checkpoints save state and progress across micro-batches in Spark Structured Streaming, whatever the incremental feed is: the Change Data Feed (CDF) or an append-only Delta table read via spark.readStream. They cover both normal processing (when no error occurs) and restarts after a failure. You never need to track where you last processed from; Spark records that automatically in the checkpoint location (see the first sketch after this list).

    2. A change stream is the Change Data Feed (CDF, Delta Lake's CDC mechanism) emitted by a Delta table, which can be processed via spark.read or spark.readStream. It sits in the same plane as, say, a Kafka feed (see the second sketch after this list).
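A minimal sketch of point 1, a checkpointed stream over an append-only Delta table. The paths and app name here are hypothetical placeholders, not anything from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# Read an append-only Delta table as a stream (hypothetical path).
stream = (
    spark.readStream
         .format("delta")
         .load("/data/events_source")
)

# Write it out with a checkpoint location; Spark records the stream's
# progress (offsets, state) there across micro-batches and restarts.
query = (
    stream.writeStream
          .format("delta")
          .option("checkpointLocation", "/chk/events_source")
          .outputMode("append")
          .start("/data/events_target")
)
```

If the job dies and is restarted with the same checkpointLocation, it resumes from the last committed micro-batch; no offsets have to be tracked by hand.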
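And a minimal sketch of point 2, reading the change feed of a Delta table. This assumes CDF has already been enabled on the table (delta.enableChangeDataFeed = true); the table name and starting version are hypothetical:

```python
# Batch read of all changes since a given table version.
changes = (
    spark.read
         .format("delta")
         .option("readChangeFeed", "true")
         .option("startingVersion", 5)
         .table("mydb.orders")
)

# Every row carries metadata columns describing the change:
# _change_type ('insert', 'update_preimage', 'update_postimage', 'delete'),
# _commit_version, and _commit_timestamp.
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```

The same feed can be consumed incrementally with spark.readStream plus a checkpointLocation, which is exactly where the two concepts meet.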

    That's all.