I am working on a pipeline where I am using Excel as the source. The data has a primary key, let's say Id, which repeats multiple times in the Excel file.
Now, when I insert it into a SQL database, it fails with the error:
java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object 'dbo.xyz'. The duplicate key value is XXXX.
How can I handle this scenario using mapping data flows in ADF?
I am already using a mapping data flow here to handle the other transformations.
Example of such data coming from the Excel source:
ID  Name          PhoneNo
1   John Doe      11110000
1   John Doe      88881111
2   Harry Potter  88999000
2   Harry Potter  00001112
3   abc xyz       77771111
I need to save only the first occurrence of ID and Name (and there are more columns) in one table, while ID and PhoneNo will be saved in another.
You can use the Aggregate transformation to remove duplicate rows from the source.
Source:
Add the sample Excel source with duplicate values in the ID and Name columns.
Aggregate transformation:
Under the Group by property, add the list of columns that identify the duplicate rows (here, ID and Name).
Under the Aggregates property, add the aggregate column. Here, we take the first value of the PhoneNo column from each group of duplicate rows.
Expression: first(PhoneNo)
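Outside ADF, the same group-by/first logic can be sketched in pandas for clarity. This is just an illustration of what the Aggregate transformation does, using the sample data from the question; it is not part of the data flow itself.

```python
import pandas as pd

# Sample data mirroring the Excel source with duplicate IDs
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "Name": ["John Doe", "John Doe", "Harry Potter", "Harry Potter", "abc xyz"],
    "PhoneNo": ["11110000", "88881111", "88999000", "00001112", "77771111"],
})

# Equivalent of the Aggregate transformation:
# group by ID and Name, take the first PhoneNo from each duplicate group
deduped = df.groupby(["ID", "Name"], as_index=False).agg(
    PhoneNo=("PhoneNo", "first")
)
print(deduped)
```

After this step each ID appears exactly once, so the insert into the table with the primary key no longer violates the constraint.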
Aggregate output:
Sink1:
Add the aggregate output to Sink1 to write the ID and Name columns to one table.
Sink2:
Add another sink after the aggregate transformation to write ID and PhoneNo to a different table.
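The two-sink split can be sketched the same way: both sinks read the deduplicated output and each keeps only the columns its target table needs. The frame below stands in for the aggregate output; the table names in the comments are illustrative.

```python
import pandas as pd

# Stand-in for the deduplicated aggregate output
deduped = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["John Doe", "Harry Potter", "abc xyz"],
    "PhoneNo": ["11110000", "88999000", "77771111"],
})

# Sink1: ID and Name go to the first table (e.g. dbo.Persons)
names_table = deduped[["ID", "Name"]]

# Sink2: ID and PhoneNo go to the second table (e.g. dbo.Phones)
phones_table = deduped[["ID", "PhoneNo"]]
```

In the data flow, each sink's mapping selects the corresponding column subset, so no extra Select transformation is strictly required.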