Tags: kql, azure-data-explorer, kusto-explorer

How to do a KQL time window join at millisecond granularity?


I am attempting to join the processName from the first table onto the event log output:

let processStartLog = datatable(startTime:datetime, processName:string)
[
    datetime('2024-08-15T01:41:23.034Z'),'D',
    datetime('2024-08-13T08:22:51.688Z'),'C',
    datetime('2024-08-12T10:48:40.124Z'),'B',
    datetime('2024-08-10T00:00:12.526Z'),'A',
];
let eventLog = datatable(timestamp:datetime, eventName:string)
[
    datetime('2024-08-14T09:30:19.136Z'),'foo', // occurs during processName 'C'
    datetime('2024-08-13T12:23:01.976Z'),'bar', // occurs during processName 'C'
    datetime('2024-08-10T01:11:41.124Z'),'xyz', // occurs during processName 'A'
];

The desired output would look something like this:

let desiredOutput = datatable(timestamp:datetime, eventName:string, processName:string)
[
    datetime('2024-08-14T09:30:19.136Z'),'foo','C',
    datetime('2024-08-13T12:23:01.976Z'),'bar','C',
    datetime('2024-08-10T01:11:41.124Z'),'xyz','A',
];

For an individual event timestamp, I can get the process name quite easily with something like this:

processStartLog
| where startTime <= datetime('2024-08-14T09:30:19.136Z') // timestamp of 'foo' event
| top 1 by startTime desc

Unfortunately, I cannot reference the event log's timestamp from inside a subquery:

eventLog
| extend processName = toscalar(
    processStartLog
    | where startTime <= $cannotAccessEventLog.timestamp
    | top 1 by startTime desc
    | project processName)

This limitation of toscalar() is documented at: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/toscalar-function.

I am aware of the "time window join" strategy documented at: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/join-time-window (sketched below). The problem for me is that both tables have many rows, they span long time frames, and events and process starts can land at millisecond granularity. For the strategy to work I would have to use a 1 millisecond window granularity, which would create over 86 million rows (one per millisecond: 24 * 60 * 60 * 1000 = 86,400,000) over the course of just one day for me to join against.
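For reference, this is roughly what that documented pattern would look like here. It assumes a hypothetical upper bound maxGap on how long after its start a process can still produce events, which is exactly the bound I do not have:

let maxGap = 1d;                // hypothetical bound on the event-to-process-start gap
let lookupBin = maxGap / 2.0;
eventLog
| extend timeKey = range(bin(timestamp - maxGap, lookupBin),
                         bin(timestamp, lookupBin),
                         lookupBin)
| mv-expand timeKey to typeof(datetime)        // ~3 candidate buckets per event
| join kind=inner (
    processStartLog
    | extend timeKey = bin(startTime, lookupBin)
) on timeKey
| where timestamp >= startTime and timestamp - startTime <= maxGap
| summarize arg_max(startTime, processName) by timestamp, eventName // latest start wins
| project timestamp, eventName, processName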

Is there a strategy I can use for my scenario, or am I out of luck with KQL?


Solution

  • Here is a simple solution:

    eventLog
    | extend key = 1
    | join kind=inner (processStartLog | extend key = 1) on key  // cross join via a constant key
    | where timestamp >= startTime                               // only process starts at or before the event
    | extend td = timestamp - startTime                          // gap to each candidate start
    | summarize arg_min(td, processName) by timestamp, eventName // smallest gap = latest start wins
    | project-away td
    

    The query performance may still need work, since the constant key produces a full cross join of the two tables. One option is to replace the constant key 1 with a coarser time bucket (e.g. the calendar day), as sketched below.
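
    Here is a sketch of that idea, under the assumption that each process runs until the next one starts (and the most recent process until now()): expand every process interval to the calendar days it covers, then join on the day instead of the constant. The expansion stays small as long as no single process spans a very large number of days.

    let processIntervals =
        processStartLog
        | sort by startTime asc
        | extend endTime = coalesce(next(startTime), now())  // assume a process ends when the next one starts
        | extend day = range(startofday(startTime), startofday(endTime), 1d)
        | mv-expand day to typeof(datetime);                 // one row per day the process covers
    eventLog
    | extend day = startofday(timestamp)
    | join kind=inner (processIntervals) on day
    | where timestamp >= startTime and timestamp < endTime   // exact interval check
    | project timestamp, eventName, processName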