Tags: prometheus, grafana, grafana-alerts

"Smart" Grafana alerts - or - Compound Conditional of Unrelated Targets in Prometheus


Summary

I want to alert using Grafana 8+ when a specific PromQL result is greater than 0 and when another, unrelated PromQL result is 1.

Background

We have devices that are turned on and off each day, and I want to know when they're down during their operating window. I'm using probe_success as the PromQL query to tell when a device is down, and a custom Prometheus app to tell when devices are powered on or off. Example PromQL: powerStatus{job="powerMonitor", section="1", zone="2"} == bool 1

Attempts

I've looked into plain Alertmanager/PromQL but haven't found a query that fires only when I want it to, that is, one that takes the power status of the section and zone into account. I've tried everything I could extract from the Grafana 8 alerting documentation, including Classic Condition expressions, but every solution I come up with relies on evaluating one Classic Condition inside another, which isn't allowed. I thought this would be a common use case, but I'm not finding blog posts about it.

Hunch

I have a feeling that there is an idiomatic way to accomplish this that I'm not seeing, sort of like when going from imperative to declarative programming and wanting to loop through sets of data :).

Can you help?


Solution

  • What you need is the on() vector matching modifier. It works much like a JOIN in relational databases.

    The examples below are based on the following metrics:

    foo{bar="one", baz="two", instance="localhost:9999", job="test"}      1
    bar{spam="three", eggs="four", instance="localhost:9999", job="test"} 0
    

    Match metrics by name, disregard all labels

    Gives you ALL foo series with the value of 1 when ANY of the bar series is 0:

    foo == 1 and on() bar == 0
    

    The result will be:

    foo{bar="one", baz="two", instance="localhost:9999", job="test"} 1
    # I have only one foo series in this example, but if you had more of them
    # with the value of 1, you would see all of them here.
    

    As you may have noticed, the result contains only the foo metric. If you wish to see bar instead, you need to place it first:

    # Show all bar series with the value of 0 when ANY foo is 1
    bar == 0 and on() foo == 1
    

    Please note that "disregard all labels" is literal: it includes the "job" and "instance" labels. This means that if, for example, foo == 1 on one instance and bar == 0 on another, the alert would still trigger. To overcome this, you might want to...

    Disregard only some labels

    You can give a comma-separated list of label names to on() so that series are matched only when they agree on those labels:

    foo == 1 and on(instance) bar == 0
    

    The above will trigger an alert for ALL foo == 1 when ANY bar is 0 on the same instance.
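
    Applied to the metrics from the question, a minimal sketch might look like the following. It assumes that the probe_success series carry the same section and zone labels as powerStatus, which the question does not confirm; if they do not, those labels would have to be added first, for example through relabeling. Note that the power check is written without the bool modifier: == bool returns 0 or 1 for every series instead of filtering, so the right-hand side would never be empty and the and would always match.

    ```promql
    # Assumption: probe_success has section and zone labels matching powerStatus.
    # Returns every probe that is down while its section/zone is powered on.
    probe_success == 0 and on(section, zone) powerStatus{job="powerMonitor"} == 1
    ```

    The result contains only the failing probe_success series, each with the value 0. If a numeric result is easier to alert on, wrapping the whole expression in count by (section, zone) (...) should give the number of down devices per powered-on zone.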

    Other ways

    There is another matching modifier, ignoring(), which works the opposite way. It makes the left and the right sides of the expression match on ALL labels except the ones you list.
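
    With the example metrics above, a sketch of this form could be written as follows. It lists every label that differs between the two series, which leaves instance and job to match on, so for these two series it should behave like on(instance, job).

    ```promql
    # Ignore the labels that differ between foo and bar;
    # the remaining labels (instance, job) must be equal on both sides.
    foo == 1 and ignoring(bar, baz, spam, eggs) bar == 0
    ```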

    Further info in the PromQL documentation: https://prometheus.io/docs/prometheus/latest/querying/operators/#vector-matching