I have a data stream with ~15 columns. I want to delete every row in my dataset that is an exact duplicate of another row and keep only the unique ones. If even one value in a row is different, I want to keep that row. Unfortunately, the columns are dynamic, so I don't know all of the column names in advance.
Example data:
matnr,vendor,sparepart,text,value,warehouse
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valveXX, 2, spain
After transformation:
matnr,vendor,sparepart,text,value,warehouse
12345, mercedes, xlm12, valve, 2, spain
12345, mercedes, xlm12, valveXX, 2, spain
What would be the fastest way to do this? I found plenty of solutions, but they only check for duplicates on one specific column (e.g. matnr).
You can use the sha2(256, columns()) expression in the Aggregate transformation, as suggested by @ray, to achieve your requirement.
Take an Aggregate transformation after your source, and in the Group by section add a column with any name and give it the expression sha2(256, columns()). Here, columns() returns the values of all columns of the current row as an array, so the hash acts as a fingerprint of the whole row and identical rows land in the same group.
In the Aggregates section, use a column pattern with the condition true() (which matches every column) and take only the first row of every group with first($$). A script sketch of this step is shown below.
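In data flow script terms, the two settings above amount to roughly the following. The stream names source1 and DistinctRows and the hash column name mycols are illustrative placeholders, not names from your flow:

```
source1 aggregate(groupBy(mycols = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows
```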
This will give the distinct rows irrespective of the structure of the source data, but it will add the extra column that was used for grouping. To remove it, use rule-based mapping in a Select transformation and give the condition name != 'new_column', where new_column is whatever name you gave the hash column in the Group by section.
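Scripted, that Select transformation might look roughly like this, again assuming the placeholder names mycols, DistinctRows, and RemoveHashColumn:

```
DistinctRows select(mapColumn(
        each(match(name != 'mycols'))
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> RemoveHashColumn
```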
Now it will give the expected results. Whatever the source columns are, running the data flow from a pipeline will produce the deduplicated output; a sketch of the complete data flow script is below.
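For reference, a minimal end-to-end version of the data flow script could look like the following. Schema drift is enabled on the source and sink so the flow keeps working when the column set changes; all transformation and column names here are assumptions, not taken from your setup:

```
source(allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 aggregate(groupBy(mycols = sha2(256, columns())),
    each(match(true()), $$ = first($$))) ~> DistinctRows
DistinctRows select(mapColumn(
        each(match(name != 'mycols'))
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> RemoveHashColumn
RemoveHashColumn sink(allowSchemaDrift: true,
    validateSchema: false) ~> sink1
```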