I am using Synapse pipelines and a mapping data flow to process multiple daily files residing in ADLS, which represent incremental inserts and updates for a given primary key column. Each daily physical file has ONLY one instance of any given primary key value: keys/rows are unique within a daily file, but the same key value can appear in multiple files, one for each day on which attributes related to that key changed over time. All rows flow to the Upsert condition, as shown in the screenshot.
The sink is a Synapse table, where primary keys can only be specified with the non-enforced primary key syntax, as seen below.
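For reference, a minimal sketch of what such a sink table definition could look like in a dedicated SQL pool; the table and column names here are hypothetical, not the actual schema:

```sql
-- Hypothetical sink table in a Synapse dedicated SQL pool.
-- The pool only accepts primary keys declared NONCLUSTERED and NOT ENFORCED,
-- so the engine itself will not reject duplicate key values.
CREATE TABLE dbo.CustomerCurrent
(
    CustomerId   INT PRIMARY KEY NONCLUSTERED NOT ENFORCED,
    FirstName    NVARCHAR(100),
    LastName     NVARCHAR(100),
    UpdatedDate  DATETIME2
);
```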
Best practice with mapping data flows is to avoid placing the data flow inside a ForEach activity to process each file individually, because that spins up a new cluster per file, which is slow and expensive. Instead, I have configured the mapping data flow source to use a wildcard path so all files are processed at once, with a sort by file name to ensure they are ordered correctly, all within a single data flow (avoiding a ForEach activity per file).
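For illustration, a rough data flow script sketch of that configuration (wildcard source, sort by file name, alter-row upsert, and an upsert-enabled sink). The stream, column, key, and path names are placeholders, and the file name is assumed to be surfaced through the source's "column to store file name" option so it can be sorted on:

```
source(output(
        CustomerId as integer,
        FirstName as string,
        LastName as string,
        UpdatedDate as timestamp
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:['daily/incremental_*.csv'],
    rowUrlColumn: 'SourceFileName') ~> DailyFiles
DailyFiles sort(asc(SourceFileName, true)) ~> SortByFileName
SortByFileName alterRow(upsertIf(true())) ~> MarkAsUpsert
MarkAsUpsert sink(allowSchemaDrift: true,
    validateSchema: false,
    deletable: false,
    insertable: false,
    updateable: false,
    upsertable: true,
    keys:['CustomerId']) ~> SynapseSink
```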
Under this configuration, a single data flow reading multiple daily files will, by design, see the same key value on multiple rows. When the empty target table is first loaded from all the daily files, we end up with multiple rows for a single key value, instead of a single INSERT for the first occurrence and UPDATEs for the remaining occurrences (essentially, no UPDATEs ever happen).
The only way I can avoid duplicate rows for a key value is to process each file individually, executing the mapping data flow once per file inside a ForEach activity. Does anyone have an approach that avoids duplicates while processing all files within a single mapping data flow, without a ForEach activity per file?
AFAIK, there is no other way than using a ForEach loop to process the files one by one.
When we use a wildcard, the source takes all matching files in one go, so the same key values arrive from different files, like below.
Using an alter row condition will let you upsert rows correctly only if you have a single file; because you are using multiple files, it will create duplicate records like this, as described in the answer by Leon Yue to this similar question.
As the scenario explains, you have the same key values in multiple files, and you want to avoid them being duplicated. To achieve this, you have to iterate over each file and perform the data flow operations on that file individually, so that duplicates do not get upserted.
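As a rough sketch of that pattern: the pipeline lists the files (for example with a Get Metadata activity returning Child Items), loops over them with a sequential ForEach whose items expression is `@activity('GetDailyFiles').output.childItems`, and runs the data flow once per file, passing `@item().name` into a data flow string parameter. The data flow source then reads only that one file. All activity, stream, column, and parameter names below are placeholders:

```
parameters{
    fileName as string
}
source(output(
        CustomerId as integer,
        FirstName as string,
        LastName as string,
        UpdatedDate as timestamp
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    wildcardPaths:[($fileName)]) ~> SingleDailyFile
SingleDailyFile alterRow(upsertIf(true())) ~> MarkAsUpsert
MarkAsUpsert sink(allowSchemaDrift: true,
    validateSchema: false,
    deletable: false,
    insertable: false,
    updateable: false,
    upsertable: true,
    keys:['CustomerId']) ~> SynapseSink
```

Because each run of the data flow sees a given key at most once, the upsert inserts it on the first file and updates it on later files, at the cost of one cluster spin-up per iteration.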