azure, azure-sql-database, azure-data-factory

Azure Data Factory: How to use Exists when the source (Blob CSV) and target (SQL DB) are massive datasets (different data types, ~370 cols, ~7M rows)


I have a scenario in Azure Data Factory where I need to compare a daily CSV file in Blob Storage with a table in an SQL Database. The goal is to copy rows from the CSV to the SQL table only if they don't already exist in the table.

Problem Details:

Approach Tried:

Issues Encountered:

Questions:

Any advice or alternative approaches would be greatly appreciated.

Here is a snapshot of my dataflow:

[Image: dataflow snapshot]


Solution

  • When you import the schemas for both the source and sink datasets, the data flow compares the columns using the imported data types.

    I tried it without importing any schema from the source and sink datasets, and it worked for the given scenario.

    Clear the schema from both source and sink datasets.

    This is sample source data in which some rows already exist in the target table.

    ID,Name,mytime,Amount,Fahrenheit,age,role,mydate
    8,Laddu,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
    9,MS,2024-04-27 12:00:00,24.267,10.10,26,No job,05-30-24
    10,ABD,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
    11,Starc,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
    12,KP,2024-04-26 11:45:00,12.24,95.04,24,Pirate,2024-05-30
    13,Rabada,2024-04-26 11:45:00,12.24,95.04,24,Bowler,2023-06-29
    7,Rakesh,2024-04-25 10:30:00,123.451,97.16,23,Engineer,2021-12-16
    

    In both sources of the dataflow, set the projection to empty.

    With an empty projection, data flow debug will not detect the columns. So, reference each column with byName('<column_name>') when you use it in a derived column transformation.

    For the exists transformation, I created an extra column, Id_temp, in both sources using a derived column.

    Then, use these new columns in the exists transformation.

    Next, use a select transformation with rule-based mapping to remove the extra column.

    Add a sink after this and run the data flow from a pipeline.

    It should give the expected result: only the source rows that do not already exist in the target table are copied.
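For readers who want to check the logic outside Data Factory, the steps above can be sketched in plain Python. This is only an illustration of the idea, not how the data flow is implemented: the target rows, the choice of which IDs already exist, and the helper names `add_temp_key` and `rows_not_in_target` are assumptions made for the example. It also assumes the comparison is done on string keys and that the exists transformation is used in its "doesn't exist" form.

```python
import csv
import io

# A few rows of the sample source CSV from the answer above.
SOURCE_CSV = """ID,Name,mytime,Amount,Fahrenheit,age,role,mydate
8,Laddu,2024-04-26 11:45:00,12.24,95.04,24,Pirate,02-16-00
9,MS,2024-04-27 12:00:00,24.267,10.10,26,No job,05-30-24
7,Rakesh,2024-04-25 10:30:00,123.451,97.16,23,Engineer,2021-12-16
"""

# Hypothetical target rows with typed values, as a SQL table would return them.
TARGET_ROWS = [
    {"ID": 7, "Name": "Rakesh"},
    {"ID": 8, "Name": "Laddu"},
]

KEY = "ID"
TEMP_KEY = "Id_temp"


def add_temp_key(row):
    """Derived column step: copy the key into Id_temp as a string."""
    out = dict(row)
    out[TEMP_KEY] = str(row[KEY]).strip()
    return out


def rows_not_in_target(source_rows, target_rows):
    """Exists step ("doesn't exist" form): keep source rows with no match."""
    target_keys = {add_temp_key(r)[TEMP_KEY] for r in target_rows}
    kept = []
    for row in map(add_temp_key, source_rows):
        if row[TEMP_KEY] not in target_keys:
            # Select step: drop the helper column before writing to the sink.
            kept.append({k: v for k, v in row.items() if k != TEMP_KEY})
    return kept


# csv.DictReader yields every field as a string, like an untyped source.
source_rows = list(csv.DictReader(io.StringIO(SOURCE_CSV)))
new_rows = rows_not_in_target(source_rows, TARGET_ROWS)
print([r["ID"] for r in new_rows])  # prints ['9']
```

Comparing the raw values would not match here, because the CSV side holds the string "8" while the table side holds the integer 8. Casting both sides to the same type in a helper column avoids that. The in-memory set is only for illustration; for ~7M rows the comparison should stay inside the data flow or the database.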