ssis

Identifying and Redirecting Duplicate Column Values in SSIS 2019


To be clear this is not a requirement to identify and redirect or dedupe rows in the data flow task leveraging the Sort Transformation. Instead, the requirement is to identify the duplicate values in a column then handle affected rows differently. Using the attached mock-up table, ProductID 1002 will quality as duplicates, therefore rows 3 and 4 should flow down a different path. Likewise, ProductCode A01 resulting in rows 2 and 6 being redirected.

Any help is appreciated. Thanks!

Mock-up Table

I tried splitting the dataset then performing joins but that didn't work as intended.


Solution

  • If this is a large dataset, the sorts will cause performance issues. However, if you must do this in SSIS instead of the database, this is what it could look like:

    enter image description here

    1. Get your source data, whatever that is. Some sources will allow you to specify a source column that it is sorted by, but we'll assume we have no control over that.
    2. Sort by ProductID
    3. Add a multicast to rejoin the raw rows by product ID later
    4. Aggregate by ProductID and select (*) as Count All
    5. Add a conditional split with 1 condition: (DT_I4)[Count all] > 1
    6. Add two more sorts by ProductID for each path
    7. Use a merge join to bring the original rows back again. See image below for configuration example:

    enter image description here