azurecloudkqlazure-monitor

How to remove duplicates from dynamic column through KQL


we are trying to setup alerts for application(container) logs through kQL Queries in Azure cloud. we are running following query to track error and remove duplicates.

    let ContainerIdList = KubePodInventory
    | where ContainerName contains "custom-app-logger"
    | where Namespace has "devns"
    | where ClusterId =~ '/subscriptions/12345-xyzl-4e12-bc3a-5c7859365342636/resourcegroups/rg-aks-dev-dev/providers/Microsoft.ContainerService/managedClusters/aksdevcluster'
    | distinct ContainerID;
     ContainerLog 
    | where ContainerID in (ContainerIdList)
    | where LogEntry !has "SRV1174"
    | where LogEntry has "| E |" or LogEntry has "| F |"
    | where LogEntry !contains "the I/O interface definition of project" 
    | where LogEntry !contains "the I/O interface definition of cuc" 
    | project LogEntrySource, LogEntry, TimeGenerated 
    | order by TimeGenerated desc
    | top 1000 by LogEntry
    | render table
    | distinct LogEntry

however its not removing the duplicates. if you see the log output below except first and last one, all others are same only with different timestamp in it. please find the log output screenshot below.

[![enter image description here][1]][1]

could you please help me how can we remove duplicates with actual error, if it is same error with different time stamp in it, need to pick the latest one.


Solution

  • I have reproduced in my environment and got expected results as below:

    Firstly, taken a simple table like below:

    LogEntry
    
    0|2023-07-23 11:40:50|b|abc
    0|2023-07-23 11:42:50|b|abc
    0|2023-07-23 12:40:50|c|xyz
    0|2023-07-23 14:40:50|d|bnm
    

    enter image description here

    Then I have used below query to remove duplicates:

    let x = datatable(LogEntry: string)
    [
        "0|2023-07-23 11:40:50|b|abc",
        "0|2023-07-23 11:42:50|b|abc",
        "0|2023-07-23 12:40:50|c|xyz",
        "0|2023-07-23 14:40:50|d|bnm"
    ]
    | extend SplitLog = split(LogEntry, "|")
    | project C1 = SplitLog[0], cc=SplitLog[1],C2 = todatetime(SplitLog[1]), C3 = SplitLog[2], UV = SplitLog[3];
    x
    | summarize arg_max(C2,*) by tostring(UV)
    | project-away UV
    | project Log = strcat(tostring(C1), "|", cc, "|", C3, "|", UV1)
    

    enter image description here

    Output:

    enter image description here

    Fiddle.

    Split the LogEntry column by pipes(|) and create new columns. Then summarize it by UniqueValues(UV) and take the maximum of Timestamp(C2) and remove Unique Value column(which you will get extra column) then concat it back with Pipe symbol like I have done. Try to follow above example to get desired result as I have got.

    EDIT:

    ContainerIdList
        | extend SplitLog = split(LogEntry, "|")
        | project C1 = SplitLog[0], cc=SplitLog[1],C2 = todatetime(SplitLog[1]), C3 = SplitLog[2], UV = SplitLog[3]
        | summarize arg_max(C2,*) by tostring(UV)
        | project-away UV
        | project Log = strcat(tostring(C1), "|", cc, "|", C3, "|", UV1