I have a scenario in Azure Data Factory where I need to compare a daily CSV file in Blob Storage with a table in an SQL Database. The goal is to copy rows from the CSV to the SQL table only if they don't already exist in the table.
Any advice or alternative approaches would be greatly appreciated.
Here is a snapshot of my dataflow:
When you import the schemas of both the source and sink datasets, the data flow compares the imported data types as well, which can cause mismatches. I tried it without importing any schema on the source and sink datasets, and it worked for the given scenario.
Clear the schema from both source and sink datasets.
This is sample source data, where some rows already exist in the target table:
ID,Name,mytime,Amount,Fahrenheit,age,role,mydate
8,Laddu,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
9,MS,2024-04-27 12:00:00,24.267,10.10,26,No job,05-30-24
10,ABD,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
11,Starc,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
12,KP,2024-04-26 11:45:00,12.24,95.04,24,Pirate,2024-05-30
13,Rabada,2024-04-26 11:45:00,12.24,95.04,24,Bowler,2023-06-29
7,Rakesh,2024-04-25 10:30:00,123.451,97.16,23,Engineer,2021-12-16
In both sources of the dataflow, set the projection to empty.
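With the projection cleared, the source script has no output() block. Here is a minimal sketch in data flow script, with illustrative stream names (csvSource for the CSV source, sqlSource for the SQL table source):

```
source(allowSchemaDrift: true,
    validateSchema: false) ~> csvSource
source(allowSchemaDrift: true,
    validateSchema: false) ~> sqlSource
```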
With an empty projection, the columns will not be identified by data flow debug, so use byName('<column_name>') to reference columns in the derived column transformation (see the derived column sketch below).
Here, for the exists transformation, I created an extra column, Id_temp, in both sources, like below.
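A sketch of the derived column step for both sources, assuming ID is the key column from the sample data and casting it to an integer for a reliable comparison (the stream names continue from the source sketch above; adjust the cast to your actual key type):

```
csvSource derive(Id_temp = toInteger(byName('ID'))) ~> csvWithKey
sqlSource derive(Id_temp = toInteger(byName('ID'))) ~> sqlWithKey
```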
Then, use these new columns in the exists transformation.
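In data flow script, the exists transformation with negate turned on lets through only the CSV rows whose Id_temp has no match in the SQL stream. A sketch using the stream names from above:

```
csvWithKey, sqlWithKey exists(csvWithKey@Id_temp == sqlWithKey@Id_temp,
    negate: true,
    broadcast: 'auto') ~> NewRowsOnly
```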
Next, use a select transformation with rule-based mapping to remove the extra column.
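For the rule-based mapping, a match condition such as name != 'Id_temp' maps every column except the helper one. A sketch of the resulting select script:

```
NewRowsOnly select(mapColumn(
        each(match(name != 'Id_temp'))
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> DropTempKey
```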
Add a sink to this and execute the data flow from the pipeline.
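Finally, a sketch of the sink in data flow script; the exact property list depends on the sink type, and these flags (insert-only, no schema validation) are assumptions for an Azure SQL table sink:

```
DropTempKey sink(allowSchemaDrift: true,
    validateSchema: false,
    deletable: false,
    insertable: true,
    updateable: false,
    upsertable: false,
    format: 'table',
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> sqlSink
```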
It will give the expected results, copying only the rows that do not already exist in the target table.