databricksdelta-lakescddelta-live-tables

Azure Databricks Delta Live Table stored as SCD 2 is creating new records when no data changes


I have a streaming pipeline that ingests json files from a data lake. These files are dumped there periodically. Mostly the files contain duplicate data, but there are occasional changes. I am trying to process these files into a data warehouse using the STORED AS SCD 2 option of the Delta Live tables pipeline. Everything looks to be working fine, but I am getting duplicate rows in the scd table - even though no data is changing.

In the pipeline cluster configuration, I have added pipelines.enableTrackHistory True pipeline config

Here is the SQL for the SCD table:

CREATE OR REFRESH STREAMING LIVE TABLE currStudents_SCD;

APPLY CHANGES INTO
  live.currStudents_SCD
FROM
  stream(live.currStudents_ingest)
KEYS
  (id)
SEQUENCE BY
  file_modification_time
STORED AS
  SCD TYPE 2
TRACK HISTORY ON * EXCEPT (file_modification_time)
;

Any ideas why I get this and how I can stop these duplicates?

I can't find any information in the documentation to help with this. Except possibly, the vague suggestion that, in SCD 2, there will be new records when no columns change. But the Track History documentation seems to indicate that only changes to monitored columns will result in a new record....


@Shivam Shukla below is correct. However, for those who would like to address this issue, Databricks were able to help me out. I have now added a new stage between the ingestion of the stream and the SCD2 table. This stage uses the option:

df.dropDuplicates([col for col in df.columns if col != "file_modification_time"])

option to ensure that only changes are past to the SCD2 tracking, without including my file_modification_time column in the assessment of duplicates


Solution

  • This is a expected behaviour of databricks delta live table. If you use SCD 2 pattern irrespective of the changes the history of the records are maintained. Track history helps you with specifying which columns to consider for tracking updates. If you don't specify any columns all history is reported. If tracking columns are provided this will basically only create history if there are changes in this columns else it will simply overwrite the table like in SCD 1 pattern.