I am trying to understand the difference between the two most common strategies for incremental data loading.
What is the difference between a streaming checkpoint and a change stream in Databricks Delta Lake?
Thanks.
Checkpoints
are for saving state and progress across micro-batches in Spark Structured Streaming, whether the source is an incremental feed such as the Change Data Feed (CDF) or an append-only Delta table read via `spark.readStream`. They cover both normal processing and restarts after a failure: you need not track where you last processed from, as that is automatic.
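For example, here is a minimal sketch of a checkpointed streaming read and write; the table names and the checkpoint path are placeholders:

```python
# A minimal sketch, assuming an existing Delta table named "source_table"
# and a writable checkpoint path; all names here are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stream from an append-only Delta table.
stream = spark.readStream.format("delta").table("source_table")

# The checkpointLocation stores offsets and state across micro-batches,
# so a restarted query resumes automatically from where it left off.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/source_table")
    .toTable("target_table")
)
```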
Change stream
is the Change Data Feed (CDF), Delta Lake's change data capture (CDC) feed for Delta tables. It can be processed in batch via `spark.read` or incrementally via `spark.readStream`, much like consuming a Kafka topic.
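Here is a minimal sketch of consuming the feed both ways, assuming CDF is enabled on the table (`delta.enableChangeDataFeed = true`); the table name and starting version are placeholders:

```python
# A minimal sketch, assuming "source_table" was created with
# delta.enableChangeDataFeed = true; name and version are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch read: every row-level change since table version 1, with
# _change_type, _commit_version and _commit_timestamp columns added.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("source_table")
)

# The same feed can be consumed incrementally with spark.readStream,
# at which point a checkpoint (above) tracks progress through it.
cdf_stream = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("source_table")
)
```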
That's all.