Tags: azure-sql-database, etl, azure-data-factory, primary-key, azure-mapping-data-flow

Azure Data Factory - duplicate primary key values in source Excel to be inserted into SQL database


I am working on a pipeline that uses Excel as the source. The data has a primary key column, let's say Id, whose values repeat multiple times in the Excel file.

Now, when I insert it into a SQL database, it fails with the error:

java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object 'dbo.xyz'. The duplicate key value is XXXX.

How can I handle this scenario using mapping data flows in ADF?

I am already using a mapping data flow here to handle the other transformations.

Example of such data, coming from the Excel source:

ID  Name          PhoneNo
1   John Doe      11110000
1   John Doe      88881111
2   Harry Potter  88999000
2   Harry Potter  00001112
3   abc xyz       77771111

I need to save the top 1 row per ID, with ID and Name (and there are more columns), in one table, and save PhoneNo and ID in another table.


Solution

  • You can use the aggregate transformation to remove duplicate values from the source.

    Source:

    Add a sample Excel source that contains duplicate values in the ID and Name columns.

    Aggregate transformation:

    Under the group by property, add the columns that identify duplicate rows (here, ID and Name).

    Under the aggregates property, add the aggregate column. Here, first() returns the first value of the Phone column from each group of duplicate rows.

    Expression: first(Phone)

    Aggregate output:

    One row remains for each distinct ID and Name combination, carrying the first Phone value of that group.

    Sink1:

    Connect the aggregate output to sink1 and pass the ID and Name columns to the first table.

    Sink2:

    Add another sink after the aggregate transformation to pass the ID and Phone columns to a different table.

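To make the effect of the steps above concrete, here is a minimal Python sketch of the same logic outside ADF: group by ID and Name, keep the first Phone value per group, then split the result into two outputs. This is not data flow script and not what ADF runs internally. The helper name `aggregate_first` and the variables `sink1` and `sink2` are hypothetical, and the rows are the sample data from the question. "First" here means first in input order; in a real data flow, which row counts as first may depend on how the data is ordered, so add a sort upstream if that matters.

```python
# Sample rows from the question. Phone is kept as a string so that
# leading zeros (e.g. "00001112") are preserved.
rows = [
    {"ID": 1, "Name": "John Doe", "Phone": "11110000"},
    {"ID": 1, "Name": "John Doe", "Phone": "88881111"},
    {"ID": 2, "Name": "Harry Potter", "Phone": "88999000"},
    {"ID": 2, "Name": "Harry Potter", "Phone": "00001112"},
    {"ID": 3, "Name": "abc xyz", "Phone": "77771111"},
]


def aggregate_first(rows, group_by, first_col):
    """Hypothetical helper: one row per group, keeping the first value of first_col."""
    groups = {}  # dicts preserve insertion order, so groups stay in input order
    for row in rows:
        key = tuple(row[col] for col in group_by)
        if key not in groups:  # only the first row seen for a key is kept
            groups[key] = row[first_col]
    result = []
    for key, value in groups.items():
        record = dict(zip(group_by, key))
        record[first_col] = value
        result.append(record)
    return result


# Equivalent of the aggregate step: group by ID and Name, first(Phone).
deduped = aggregate_first(rows, group_by=["ID", "Name"], first_col="Phone")

# Equivalent of the two sinks: each one receives a subset of the columns.
sink1 = [{"ID": r["ID"], "Name": r["Name"]} for r in deduped]
sink2 = [{"ID": r["ID"], "Phone": r["Phone"]} for r in deduped]

for record in deduped:
    print(record)
```

Because each ID appears only once in `deduped`, inserting `sink1` into a table keyed on ID no longer triggers the primary key violation. Note that `sink2` built this way holds only one phone number per ID.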