azure azure-sql-database azure-data-factory

Azure Data Factory: How to use Exists when the source (Blob CSV) and target (SQL DB) are massive datasets with differing data types (~370 columns, ~7M rows)

I have a scenario in Azure Data Factory where I need to compare a daily CSV file in Blob Storage with a table in an SQL Database. The goal is to copy rows from the CSV to the SQL table only if they don't already exist in the table.

Problem Details:

Data Volume: Both the CSV file and SQL table are massive, with around 370 columns and 7 million rows.
Data Types: The dataset includes various data types such as strings, doubles, and timestamps.

Approach Tried:

I followed a similar use case from this Stack Overflow post: Azure Data factory - insert a row in the azure sql database only if it doesn't exist.
I used a Data Flow with an 'Exists' transformation, leveraging a derived column with SHA2 hash for comparison.

Issues Encountered:

Data Type Mismatch: When creating the source in my Data Flow, the data types from the CSV often do not match the data types in the SQL table.
Incorrect Matching: Due to the data type mismatch, the derived columns in the Data Flow do not match correctly, resulting in all rows from the CSV being treated as non-matching.

Questions:

Data Type Handling: How can I ensure that the data types in the Data Flow match those in the SQL table to prevent mismatches?
Optimization for Large Datasets: Are there best practices or optimizations in ADF to handle such large datasets more efficiently during comparison?

Any advice or alternative approaches would be greatly appreciated.

Here is a snapshot of my dataflow:

Solution

When you import the schemas in both the source and sink datasets, the data flow compares the columns using the imported data types, and those types differ between the CSV and the SQL table.
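
To illustrate why typed values can fail to match, here is a minimal Python sketch. It is plain Python, not the Data Flow engine, and it assumes the hash is computed over the text form of each value. The helper name `sha256_of` is hypothetical. The value `10.10` comes from the Fahrenheit column of the sample data in this answer.

```python
import hashlib


def sha256_of(value) -> str:
    """Hash the text form of a value (hypothetical helper for illustration)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


csv_text = "10.10"             # the value exactly as written in the CSV file
typed_value = float(csv_text)  # the same value after being read as a double

# The double prints as "10.1", not "10.10", so the two hashes differ.
hashes_match = sha256_of(csv_text) == sha256_of(typed_value)

# When both sides stay as untyped text, the hashes are identical.
hashes_match_as_text = sha256_of(csv_text) == sha256_of("10.10")

print(hashes_match, hashes_match_as_text)
```

The same effect can appear with timestamps and dates, where each side may format the value differently once it has a type.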

I tried it without importing any schema in the source and sink datasets, and it worked for this scenario.

Clear the schema from both source and sink datasets.

This is sample source data in which some rows already exist in the target table.

ID,Name,mytime,Amount,Fahrenheit,age,role,mydate
8,Laddu,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
9,MS,2024-04-27 12:00:00,24.267,10.10,26,No job,05-30-24
10,ABD,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
11,Starc,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
12,KP,2024-04-26 11:45:00,12.24,95.04,24,Pirate,2024-05-30
13,Rabada,2024-04-26 11:45:00,12.24,95.04,24,Bowler,2023-06-29
7,Rakesh,2024-04-25 10:30:00,123.451,97.16,23,Engineer,2021-12-16

In both sources of the dataflow, set the projection to empty.

With an empty projection, the data flow debug will not identify the columns. So, use byName(<column_name>) to reference columns in the derived column.

Here, for the columns used in the exists transformation, I created an extra column, Id_temp, in both sources.

Then, use these new columns in the exists transformation.
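
As a rough analogy for what this comparison does, the following Python sketch keeps only the source rows whose key is missing from the target. It is not Data Flow code. The list `existing_ids_from_sql` is a hypothetical stand-in for the target table (the actual target contents are not shown here), and the helper names are assumptions. Both sides get a string-typed `Id_temp`, so the integer 7 from the database and the text "7" from the CSV compare as equal.

```python
import csv
import io

# Two rows taken from the sample source data above.
SOURCE_CSV = """ID,Name,mytime,Amount,Fahrenheit,age,role,mydate
9,MS,2024-04-27 12:00:00,24.267,10.10,26,No job,05-30-24
7,Rakesh,2024-04-25 10:30:00,123.451,97.16,23,Engineer,2021-12-16
"""

# Hypothetical: IDs already present in the target table, returned as integers.
existing_ids_from_sql = [7]


def with_temp_key(row: dict, key_column: str = "ID") -> dict:
    """Add a string-typed Id_temp column, mirroring the derived column."""
    return {**row, "Id_temp": str(row[key_column]).strip()}


def rows_not_in_target(source_rows, target_rows):
    """Keep source rows whose Id_temp has no match in the target."""
    target_keys = {row["Id_temp"] for row in target_rows}
    return [row for row in source_rows if row["Id_temp"] not in target_keys]


source_rows = [with_temp_key(r) for r in csv.DictReader(io.StringIO(SOURCE_CSV))]
target_rows = [with_temp_key({"ID": i}) for i in existing_ids_from_sql]

new_rows = rows_not_in_target(source_rows, target_rows)
new_ids = [row["ID"] for row in new_rows]
print(new_ids)
```

Building the set of target keys once keeps the lookup cheap per source row, which matters at the row counts mentioned in the question.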

Next, use rule-based mapping in a select transformation to remove the extra column.
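
The idea behind this step, keeping every column whose name passes a rule instead of listing roughly 370 columns by hand, can be sketched in Python as follows. This is an analogy only; `drop_columns` is a hypothetical helper and the row is a shortened version of the sample data.

```python
def drop_columns(row: dict, should_keep) -> dict:
    """Keep only the columns whose name satisfies the rule."""
    return {name: value for name, value in row.items() if should_keep(name)}


# A shortened row as it might look after the exists step.
row_after_exists = {"ID": "9", "Name": "MS", "Id_temp": "9"}

# Rule: pass through every column except the helper key.
sink_row = drop_columns(row_after_exists, lambda name: name != "Id_temp")
print(sink_row)
```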

Add a sink after this and execute the data flow from a pipeline.

It gives the expected results: only the rows that do not already exist in the target table are inserted.
