time-series kql azure-data-explorer kusto-explorer

KQL - Break down timespan of how long an item is in a specific state by day

I am trying to calculate how long an item has been active in a specific state with KQL. Currently I am using window functions together with partitioning. This works well when I need to calculate the total "time-in-state" over a period, but the issue arises when trying to break it down per day (or any arbitrary granularity).

Input:

let inputData=datatable(id:string, status: string, timestamp: datetime) [ 
    "id1","P",datetime(2024-03-12T05:30:15),
    "id1","F",datetime(2024-03-14T10:10:00),
    "id2","P",datetime(2024-03-12T05:30:15)
];
let startDate=datetime(2024-03-12T00:00:00);
let endDate=datetime(2024-03-15T00:00:00);

Applying the following query yields the total time each id has spent in each state:

inputData
| partition hint.strategy=native by id
    (
        order by timestamp asc
        | extend tsDiff = min_of(endDate, next(timestamp)) - timestamp
        | extend pTime = iif(status == "P", tsDiff, timespan(0))
        | extend fTime = iif(status == "F", tsDiff, timespan(0))
    )
    | summarize totalPTime=sum(pTime), totalFTime=sum(fTime) by id

Results:
id  totalPTime  totalFTime
id1 2.04:39:45  13:50:00
id2 2.18:29:45  00:00:00

Now I'm stumped on how to break this down on a day-to-day basis. Trying to use for example make-series gives:

inputData
| partition hint.strategy=native by id
    (
        order by timestamp asc
        | extend tsDiff = min_of(endDate, next(timestamp)) - timestamp
        | extend pTime = iif(status == "P", tsDiff, timespan(0))
        | extend fTime = iif(status == "F", tsDiff, timespan(0))
    )
    | make-series 
        totalP=sum(pTime),
        totalF=sum(fTime) on timestamp from startDate to endDate step 1d by id

Results:
id  totalP  totalF  timestamp
id1 ["2.04:39:45","00:00:00","00:00:00"]    ["00:00:00","00:00:00","13:50:00"]  ["2024-03-12","2024-03-13","2024-03-14"]
id2 ["2.18:29:45","00:00:00","00:00:00"]    ["00:00:00","00:00:00","00:00:00"]  ["2024-03-12","2024-03-13","2024-03-14"]

What I would need the results to be is at a maximum 1 day and the overflow onto the next day if that makes sense. E.g:

id  totalP  totalF  timestamp
id1 ["1.00:00:00","1.00:00:00","04:39:45"]  ["00:00:00","00:00:00","13:50:00"]  ["2024-03-12","2024-03-13","2024-03-14"]
id2 ["1.00:00:00","1.00:00:00","18:29:45"]  ["00:00:00","00:00:00","00:00:00"]  ["2024-03-12","2024-03-13","2024-03-14"]

Am I on the right track here or should I utilize some other functionality for this purpose? Any help is greatly appreciated!

Solution

I just wanted to say thank you for such an interesting question, it's occupied my mind almost all weekend. At first glance it seemed fairly logical but the more I thought about it and tested some code the more difficult it got! I'm sure you were in the exact same position...

I'm confident someone else will come and give you a much more eloquent and computationally efficient solution, but I do believe you are on the right path with what you were doing. The thing that seemed to be missing is that like most query language KQL isn't good at showing things which aren't there. So my personal preference is to always start with something we know. For your question I took the time ranges you had already available.

There's a good amount of partition here (as you were using), range, scan and dynamic variables.

There are some caveats to this solution, namely in its current format it's only accurate to the minute, we may not need calculate the final status and I'm inferring the status is only ever P or F.

I know I'm not calculating the time in the same way as your query, but hopefully you may be able to tweak it as needed.

let inputData=datatable(id:string, status:string, timestamp: datetime) [ 
    "id1","P",datetime(2024-03-12T05:30:15),
    "id1","F",datetime(2024-03-14T10:10:00),
    "id2","P",datetime(2024-03-12T05:30:15)
];
let units = 1m; //eg 1 minute calculating
let startDate = floor(datetime(2024-03-12T00:00:00), units);
let endDate = floor(datetime(2024-03-15T03:00:00), units);
range timestamp from startDate to endDate step units
| extend JoinHere = 1
| join kind=leftouter (inputData | summarize by id, JoinHere = 1) on JoinHere //Create an entry for every possible timestamp and id
| join kind=leftouter (inputData
| summarize arg_min(timestamp, status) by id //Get the first change
| extend status = iif(status == 'P', 'F', 'P') //Assume prior to this it was on the other status
| extend timestamp = startDate //Start it at the beginning of our window
) on timestamp, id
| join kind=leftouter (inputData
| summarize arg_max(timestamp, status) by id //Get the last change
| extend timestamp = endDate //End it at the end of the window
) on timestamp, id
| order by timestamp asc
| join kind=leftouter (inputData | extend timestamp = floor(timestamp, units)) on timestamp, id //Mix in our known changes
| extend status = coalesce(status, status1, status2) //From all the joins have a single value for status
| project timestamp, id, status
| partition hint.strategy=native by id
    (
    sort by timestamp asc
    | scan declare (status:string) with
       (
       step Step1: true => status = iif(isempty(status), Step1.status, status);
       )
    )
| partition hint.strategy=native by id
   (
   summarize TimeUnits = array_length(make_set(timestamp)) by floor(timestamp, 1d), status, id
   | extend pTime = iif(status == "P", TimeUnits, 0) //Unit in this case is 1 minute as above
   | extend fTime = iif(status == "F", TimeUnits, 0)
   )
| make-series 
  totalP=sum(pTime),
  totalF=sum(fTime) on timestamp from startDate to endDate step 1d by id

id	totalP	totalF	timestamp
id1	[1110,1440,610,0]	[330,0,830,181]	[""2024-03-12T00:00:00.0000000Z","2024-03-13T00:00:00.0000000Z","2024-03-14T00:00:00.0000000Z","2024-03-15T00:00:00.0000000Z""]
id2	[1110,1440,1440,181]	[330,0,0,0]	[""2024-03-12T00:00:00.0000000Z","2024-03-13T00:00:00.0000000Z","2024-03-14T00:00:00.0000000Z","2024-03-15T00:00:00.0000000Z""]