I have a table in Kusto where I'm trying to deduplicate. Here is the table structure:
| column1 | column2 | column3 | column4 |
|---|---|---|---|
| 1234 | business | orange | paypal |
| 1234 | business | orange | |
| 3456 | business | apple | |
| 4845 | business | banana | venmo |
I'm struggling to deduplicate this table so in case of duplicates per combination of column1 , column2 , column3, only non-null values of the column4 survives. please note, in some cases, like column1 == 3456, there is a single records per column1, column2, and column3 where the column4 is in fact null and those records must survive.
tbl
| distinct column1 , column2 , column3, column4
| sort by column4 desc nulls last , column1 , column2 , column3
| extend priority_ = row_number(1, prev(column1) != column1 and prev(column2) != column2 and prev(column3) != column3 )
| where priority_ == 1
Because I don't know how to deprioritize the null values of column4, this returns incorrect results. Can you help me figure out how to correctly survive the non-null values in case there are duplicates?
Check if this works for you:
datatable(column1:string, column2:string, column3:string, column4:string)
[
"1234", "business", "orange", "paypal",
"1234", "business", "orange", "",
"3456", "business", "apple", "",
"4845", "business", "banana", "venmo"
]
| summarize make_set(column4) by column1, column2, column3
// check whether there's only a single element for {column1,column2,column3} combination
| extend single_element = array_length(set_column4) == 1
| mv-expand colummn4 = set_column4 to typeof(string)
// include only non empty or combinations with only a single column4 value (even if it's empty)
| where single_element or isnotempty(colummn4)
| project-away set_column4, single_element
| column1 | column2 | column3 | colummn4 |
|---|---|---|---|
| 1234 | business | orange | paypal |
| 3456 | business | apple | |
| 4845 | business | banana | venmo |