azure, azure-data-factory

Remove duplicate lines in Azure Data Factory


I have a datastream with ~15 columns. I want to delete every row in my dataset that is an exact duplicate of another row and keep only the unique ones. If even one value in a row is different, I want to keep that row. Unfortunately, the columns are dynamic, so I don't know all the column names in advance.

Example data:

matnr,vendor,sparepart,text,value,warehouse
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valveXX, 2, spain

After transformation:

matnr,vendor,sparepart,text,value,warehouse
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valveXX, 2, spain

What would be the fastest way to do this? I found plenty of solutions, but they only check for duplicates on one specific column (e.g. matnr).


Solution

  • You can use the sha2(256,columns()) expression in the Aggregate transformation, as suggested by @ray, to achieve your requirement.

    Add an Aggregate transformation after your source. In the Group by section, create a new column with any name and use the expression sha2(256,columns()), as shown below. Here, columns() returns the values of all columns in the row as an array, so the hash identifies the entire row contents.


    In the Aggregates section, use a column pattern with the matching condition true() and the expression first($$), so that only the first row of each group is kept.

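    In data flow script terms (the Script view behind the designer), the Aggregate transformation above looks roughly like the sketch below, following the documented "distinct rows" pattern. The stream name DistinctRows and the column name new_column are placeholders:

        // group on a hash of all column values, keep the first row of each group
        aggregate(groupBy(new_column = sha2(256, columns())),
            each(match(true()), $$ = first($$))) ~> DistinctRows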

    This will return the distinct rows irrespective of the structure of the source data, but it also adds the extra column that was used for grouping.


    To remove it, use Rule-based mapping in a Select transformation with the condition name != 'new_column' (where new_column is whatever name you gave the grouping column).

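    As a script sketch (the exact generated syntax may differ slightly), the Select transformation with that rule-based mapping looks like this, with DistinctRows as the incoming stream and new_column as the hash column name:

        // pass through every column except the grouping hash column
        DistinctRows select(mapColumn(
                each(match(name != 'new_column'))
            ),
            skipDuplicateMapInputs: true,
            skipDuplicateMapOutputs: true) ~> RemoveHashColumn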

    Now, it will give the expected results.


    Whatever the source columns are, running the data flow from a pipeline will give the desired result.
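    For reference, the whole flow chained together in data flow script would look something like the sketch below; source1 and sink1 stand in for your actual source and sink, and the sink settings shown are assumptions rather than required values:

        // dedupe on a hash of all column values
        source1 aggregate(groupBy(new_column = sha2(256, columns())),
            each(match(true()), $$ = first($$))) ~> DistinctRows
        // drop the helper hash column
        DistinctRows select(mapColumn(
                each(match(name != 'new_column'))
            ),
            skipDuplicateMapInputs: true,
            skipDuplicateMapOutputs: true) ~> RemoveHashColumn
        // write the distinct rows out
        RemoveHashColumn sink(allowSchemaDrift: true,
            validateSchema: false,
            skipDuplicateMapInputs: true,
            skipDuplicateMapOutputs: true) ~> sink1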