I'm running Bosun to alert against an Elasticsearch data set.
The scenario is that there are a number of cron jobs that do various things. If they execute successfully, they log a success message. If they die or fail to run for whatever reason and don't log the success message, we need to know about it.
My question is how to get a 0 result if no record is found, rather than null. Here's the basic query:
nv(sum(escount(esls("logs"), "context.taskname", esand(esgte("context.elapsed_time", 0), esor(esquery("context.taskname", "Task1 or Task2 or Task3 or Task4"))), "360m", "360m", "")), 0)
If a given task has run in the interval specified, the query should return a non-zero value for the number of success messages the task has logged.
This works, but I want the alert to fire ONLY if the task hasn't run. The problem is that if Task1 hasn't run and logged a completion message, it's just dropped from the final grouping rather than returning a 0 count.
Is there a way to ensure that each task in the esor returns something, even if it's a zero value?
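In your situation there are 3 aspects to monitor: whether a job logged success messages, whether it logged error messages, and whether it stopped logging altogether.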
Elastic doesn't matter in this case, so I have simulated the responses with the series function:
alert zero_example {
    # success log messages
    $successful = sum(merge(series("job=task1", 0, 1), series("job=task2", 0, 1)))
    # error log messages
    $error = sum(merge(series("job=task1", 0, 0), series("job=task3", 0, 1)))
    # warn if no successful message or there is a non-zero number of error messages.
    # nv makes it so if there are no error messages, it will be treated as zero
    warn = nv($successful == 0, 0) || nv($error != 0, 0)
    # the final case is that a job hasn't logged. As long as the alert saw it in the
    # first place, then Bosun will treat it as "unknown" when the result set disappears
    # from the result
}
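Applied to the query from the question, the same idea might look roughly like the sketch below. The alert name `cron_success_missing` is made up for illustration, there is no error-message query because the question doesn't show one, and I haven't run this against a real Elasticsearch index, so treat it as a starting point rather than a tested config.

```
alert cron_success_missing {
    # success messages per task over the last 6 hours (query copied from the question)
    $successful = sum(escount(esls("logs"), "context.taskname", esand(esgte("context.elapsed_time", 0), esor(esquery("context.taskname", "Task1 or Task2 or Task3 or Task4"))), "360m", "360m", ""))
    # warn only covers a task that is still present in the result with a zero count
    warn = $successful == 0
    # a task that drops out of the result entirely is not caught by warn; as noted
    # above, once Bosun has seen it, the missing group shows up as "unknown"
}
```

With this shape you don't force a 0 for a missing task. You rely on the unknown state to tell you that a task that used to report has gone quiet.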