I'm attempting to implement SCD2 (Slowly Changing Dimension, Type 2) in Azure Data Factory, but I'm encountering issues with the update mechanism. Instead of updating rows, my process inserts all rows from the source data into the destination, leading to duplicates. This example only covers the update step; I scaled it down for simplicity, and I am using the AdventureWorksLT2019 database. Here's a breakdown of the data flow:
Data Retrieval: read the new rows from the source and the existing rows from the destination.
Hash Key Generation for New Data: derive a hash key over New_ProductID, New_Name, New_ProductNumber, New_Color, and New_StandardCost (see the expression sketch after this list).
Existence Check: keep only the source rows that already exist in the destination.
Data Lookup: match destination rows where Existing_ProductID == New_ProductID AND Existing_Hashkey != HashKey, i.e. the same business key but changed attributes.
Row Alteration: mark every remaining row for update (condition: true()).
Column Selection and Renaming: drop the NEW_ prefix from the source dataset columns; for the hash key, retain the existing one from the destination dataset.
Data Sink (Inline - Delta Dataset Type): write updates to the destination, keyed on Hashkey.
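For context, the hash key in the derived column step is built roughly like this (a sketch in the Data Flow expression language; the 256-bit SHA-2 variant is just the one I reached for, any stable hash function would work the same way):
sha2(256, New_ProductID, New_Name, New_ProductNumber, New_Color, New_StandardCost)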
Despite following the above steps, I'm still getting duplicate rows in the destination. I've tried various adjustments, such as changing the list of key columns and filtering before the sink operation (e.g., only inserting rows with ID = 707). However, these attempts haven't resolved the issue.
I have no idea how to fix this. Any tips or help is appreciated!
Okay. So, it turns out I was asking for a solution for apples when my problem was pears. It took me quite some time to find the root of the problem.
I was reading my data via Azure Synapse, and I assumed I should read it with the .parquet format. But I wasn't writing the data as plain parquet to an Azure storage account Gen2 destination; I was writing it via the DELTA format. So I wasn't actually producing regular parquet output, and reading it through the .parquet format was wrong. I needed to use the delta file format to read the data correctly. Delta format - Microsoft
-- In a Synapse serverless SQL pool: a file format that honors the Delta transaction log
CREATE EXTERNAL FILE FORMAT DeltaFormat
WITH
(
    FORMAT_TYPE = DELTA
);
So when reading the data via the correct format, I got the correct view.
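For completeness, this is roughly how that file format gets used; the table name, columns, and data source below are placeholders rather than my actual setup:
-- Hypothetical external table over the Delta folder (all names are illustrative)
CREATE EXTERNAL TABLE dbo.ProductScd2
(
    ProductID INT,
    Name NVARCHAR(100),
    ProductNumber NVARCHAR(25),
    Color NVARCHAR(15),
    StandardCost DECIMAL(19, 4),
    Hashkey VARCHAR(64)
)
WITH
(
    LOCATION = '/product-scd2/',   -- the folder that contains _delta_log
    DATA_SOURCE = MyDataLake,      -- assumed existing external data source
    FILE_FORMAT = DeltaFormat
);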
The reason reading through the .parquet format was giving me "duplicates" is that when Delta writes, it creates new parquet files and marks the old ones as no longer active in its transaction log. The parquet reader ignores that metadata, while the delta reader takes it into account. So even though the Delta format stores its data in parquet files, they are not interchangeable with regular parquet files.
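You can see the difference directly with OPENROWSET in a serverless SQL pool; the storage URL below is a placeholder:
-- Reads every parquet file under the folder, including superseded ones: "duplicates"
SELECT * FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/<container>/product-scd2/**',
    FORMAT = 'PARQUET'
) AS r;

-- Honors the _delta_log and returns only the current version of each row
SELECT * FROM OPENROWSET(
    BULK 'https://<account>.dfs.core.windows.net/<container>/product-scd2/',
    FORMAT = 'DELTA'
) AS r;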