sql-server, azure-data-factory, etl, azure-synapse, data-transfer

Delete records in the sink table when they are removed from the source table while syncing incrementally between two different environments


I’ve created a pipeline that synchronizes data from SQL Server A to SQL Server B. Both run on-premises, but on different servers. Since some of the tables I was syncing contain quite a lot of data, I wanted to sync incrementally.

This works fine when data is added or updated in the source table. It becomes an issue when data is removed from the source, because my pipeline doesn’t recognize delete actions and therefore only inserts or updates records in my sink table.

How can I solve this, and what is the most practical way of doing so?

Below is a screenshot and an explanation of the activities in my pipeline:

From right to left:

Stored procedure: Inserts a record into a log table. This log table contains the TableName of the table that got synced and the LastSyncRunTimeSpan for that specific table. The TableName is hard-coded and the LastSyncRunTimeSpan is taken from the output of the last "extract last update" lookup (on the source table).

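A minimal sketch of what such a log table and stored procedure could look like; the object names (dbo.SyncLog, dbo.usp_LogSyncRun) and column types are assumptions, not the actual objects from the pipeline:

```sql
-- Hypothetical sync-log table and stored procedure; names are assumptions.
CREATE TABLE dbo.SyncLog
(
    TableName           sysname      NOT NULL,
    LastSyncRunTimeSpan datetime2(3) NOT NULL
);
GO

CREATE PROCEDURE dbo.usp_LogSyncRun
    @TableName           sysname,
    @LastSyncRunTimeSpan datetime2(3)
AS
BEGIN
    SET NOCOUNT ON;
    -- One row per completed sync run of a table
    INSERT INTO dbo.SyncLog (TableName, LastSyncRunTimeSpan)
    VALUES (@TableName, @LastSyncRunTimeSpan);
END;
```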

Copy data: Copies the data that changed in the source since the last pipeline run.

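The source query of such a Copy Data activity could look roughly like this; the table name, the LastModified watermark column, and both parameter names are assumptions fed from the two lookups:

```sql
-- Hypothetical incremental source query for the Copy Data activity.
-- @LastSyncRunTimeSpan comes from the log-table lookup and @CurrentRunTime
-- from the source-table lookup; all names here are assumed.
SELECT *
FROM dbo.Customer
WHERE LastModified >  @LastSyncRunTimeSpan
  AND LastModified <= @CurrentRunTime;
```

Bounding the query on both sides avoids missing rows that are modified while the copy itself is running.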

Lookup (on source table): Retrieves the last changed record from the data in my sink table.

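That lookup can be as simple as a MAX over the change-tracking column; the table and column names below are assumptions:

```sql
-- Hypothetical lookup query: latest modification timestamp in the table.
SELECT MAX(LastModified) AS CurrentRunTime
FROM dbo.Customer;
```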

Lookup (on the sink's log table): Retrieves the last sync run of this specific table.
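A sketch of that log lookup, again assuming the hypothetical dbo.SyncLog table from above:

```sql
-- Hypothetical lookup against the sync log: when was this table last synced?
SELECT TOP (1) LastSyncRunTimeSpan
FROM dbo.SyncLog
WHERE TableName = 'dbo.Customer'
ORDER BY LastSyncRunTimeSpan DESC;
```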

Before I configured my pipeline incrementally, I truncated the destination table before the new records were added. Since inserting all the records was quite time-consuming, I changed it to this incremental flow.

I saw some examples where records that were not found in the source table were deleted from the sink table by comparing the two. But this is not possible here, since the tables are on different servers and I want to minimize the CPU and RAM usage on the sink server.

Besides that, I want to avoid having to create a staging table that gets truncated on every sync. Apart from the cost, it is mainly quite time-consuming, since most of the tables I'm syncing contain 2,500,000+ records.


Solution

  • We had this exact same problem when copying data to our Enterprise Data Warehouse. Our best solution was not to compare Server A to Server B, but instead to create a trigger on Server A that logs, at a minimum, the primary key of each record being deleted.

    We then use this log of deletes to target what to delete from Server B. If your Server A follows most DB best practices, it shouldn't be a huge performance hit, and it is quite easy to generate the code and triggers systematically using the sys tables.
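    A minimal sketch of such a delete-logging trigger; the table, key, and column names are assumptions and would be generated per source table:

    ```sql
    -- Hypothetical delete-log table and trigger on Server A; names assumed.
    CREATE TABLE dbo.DeleteLog
    (
        TableName    sysname      NOT NULL,
        DeletedKey   int          NOT NULL,  -- primary key of the deleted row
        DeletedAtUtc datetime2(3) NOT NULL
            CONSTRAINT DF_DeleteLog_DeletedAtUtc DEFAULT (sysutcdatetime())
    );
    GO

    CREATE TRIGGER dbo.trg_Customer_LogDelete
    ON dbo.Customer
    AFTER DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- The 'deleted' pseudo-table holds all rows removed by the statement
        INSERT INTO dbo.DeleteLog (TableName, DeletedKey)
        SELECT 'dbo.Customer', d.CustomerId
        FROM deleted AS d;
    END;
    ```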

    Here is a diagram to illustrate the process:

    Diagram

    The benefit here is that you don't have to pull every key from Server A and compare it with Server B to find what exists in B but not in A; instead, it's all driven by Server A in what should be a highly efficient manner.
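    Applying the logged deletes on Server B then becomes a single keyed delete. Assuming the logged keys since the last sync were copied into a small dbo.DeleteLog table on the sink (much cheaper than staging a full multi-million-row table), a sketch could look like:

    ```sql
    -- Hypothetical cleanup on Server B using the delete log from Server A.
    DELETE c
    FROM dbo.Customer AS c
    INNER JOIN dbo.DeleteLog AS d
        ON d.DeletedKey = c.CustomerId
    WHERE d.TableName = 'dbo.Customer';
    ```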

    Hope this helps!