I am attempting to implement an Azure Alert that triggers when our Availability SLI drops below a threshold, say 99.9%. For context, our Availability SLI is calculated as 100 - (number of requests with a 5xx status code / total requests * 100) for the calendar month so far. (Yes, it's really just an error rate.)
I have a Kusto query that calculates a running average over 5-minute intervals for the current month. It assumes the rest of the month will be at 100%, so the result can be read as a representation of our remaining error budget. Here is the query:
let resolution = 5m;
let monthStart = startofmonth(now());
let monthEnd = endofmonth(now());
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS"
    and OperationName == "ApplicationGatewayAccess"
    and TimeGenerated >= monthStart
    and TimeGenerated <= monthEnd
// Count total and 5xx requests per 5-minute bin
| summarize
    TotalRequests = count(),
    ErrorRequests = countif(httpStatus_d > 499)
    by bin(TimeGenerated, resolution)
| sort by TimeGenerated asc
| serialize Period = row_number()
// Bins remaining in the month are assumed to be 100% available
| extend periodsLeft = round((monthEnd - TimeGenerated) / resolution)
| extend periodsTotal = Period + periodsLeft
| extend AvailabilityRateInPeriod = 100 - (todouble(ErrorRequests) / TotalRequests * 100)
| serialize RunningPeriodSum = row_cumsum(AvailabilityRateInPeriod)
| extend AvailabilityRateRunning = (RunningPeriodSum + (100 * periodsLeft)) / periodsTotal
| project TimeGenerated, AvailabilityRateRunning
(As an aside, if there is a better way to do this, I'm all ears. I'm very new to KQL.)
This works well as a standalone query to come up with a number or historical chart.
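To make the arithmetic concrete, here is a minimal Python sketch of the running calculation the query is meant to perform. It is an illustration only, not a replacement for the KQL: the function name `running_availability` and the sample numbers are made up, and it assumes every 5-minute bin contains at least one request (an empty bin would divide by zero here, and would simply be missing from the `summarize` output in the query).

```python
def running_availability(periods, periods_total):
    """Running month availability, assuming all remaining bins are 100%.

    periods: list of (total_requests, error_requests), one per bin so far,
             in time order. Hypothetical input shape for illustration.
    periods_total: number of bins in the whole month.
    """
    results = []
    running_sum = 0.0
    for period, (total, errors) in enumerate(periods, start=1):
        # Availability of this single bin, in percent
        rate_in_period = 100.0 - (errors / total * 100.0)
        running_sum += rate_in_period
        # Bins not yet observed are credited with 100% availability
        periods_left = periods_total - period
        results.append((running_sum + 100.0 * periods_left) / periods_total)
    return results


# Three observed bins out of a (tiny, made-up) four-bin month;
# the second bin has a 1% error rate.
sample = [(1000, 0), (1000, 10), (1000, 0)]
print([round(v, 2) for v in running_availability(sample, periods_total=4)])
# prints [100.0, 99.75, 99.75]
```

Note that this averages per-bin rates, as the query does, so a quiet bin with a few errors weighs as much as a busy one. Dividing total errors by total requests for the month would weight by traffic instead.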
However, when I try to use it with an Alert, the Alert appears to have a maximum look-back period of 2 days. Through experimentation I have found that this limits the data "sent" to the query: even though I specify the full time range I want (month-to-date), only the last two days of data are available to the query, so the result doesn't represent the full month to date.
Can I somehow store the running intermediate values somewhere and alert based on that instead? Or is there a better approach I'm not considering?
I ended up creating a Logic App that queried the logs and took action based on that. It was much more extensible and let me query exactly what I needed.
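As a rough illustration (not the actual Logic App definition), the decision step after such a query returns could look like the Python sketch below. The function name `availability_breached`, the idea of passing in month-to-date request counts, and the default threshold of 99.9 are assumptions for the example.

```python
def availability_breached(total_requests, error_requests, threshold=99.9):
    """Return True when month-to-date availability is below the threshold.

    Hypothetical helper: the counts are assumed to come from a
    month-to-date log query run on a schedule.
    """
    if total_requests == 0:
        # No traffic yet this month, so there is nothing to alert on
        return False
    availability = 100.0 - (error_requests / total_requests * 100.0)
    return availability < threshold


print(availability_breached(1_000_000, 1_500))  # 99.85% availability, prints True
print(availability_breached(1_000_000, 500))    # 99.95% availability, prints False
```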