Tags: prometheus, telegraf, prometheus-alertmanager, telegraf-inputs-plugin

Prometheus - Creating an alert in case of HTTP errors using the Telegraf http_response plugin


I am using Telegraf and Prometheus to monitor my local services, for example OpenHab and my Grafana instance.

The http_response plugin might produce the following results:

http_response_http_response_code{host="master-pi",instance="192.168.2.15:9126",job="telegraf-master-pi",method="GET",result="success",result_type="success",server="http://www.grafana.local",status_code="200"}    200
http_response_http_response_code{host="master-pi",instance="192.168.2.15:9126",job="telegraf-master-pi",method="GET",result="success",result_type="success",server="http://www.grafana.local",status_code="502"}    502
http_response_http_response_code{host="master-pi",instance="192.168.2.15:9126",job="telegraf-master-pi",method="GET",result="success",result_type="success",server="http://www.thuis.local/start/index",status_code="200"} 200

Now I want an alert that notifies me whenever the non-200 status_code count of the last 30 minutes is higher than the 200 status_code count.

I started off simple:

alert: service_down_external
expr: http_response_http_response_code{status_code!~"200|302"}
for: 35m
labels:
  severity: high

This works fine, but it doesn't work for the services I monitor not every 10 seconds but only every 5 to 30 minutes (because I want to reduce the load on some APIs).
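
(A possible workaround for such sparse scrape intervals would be to wrap the selector in max_over_time, so the lookback window bridges the gaps between scrapes. The rule below is only a sketch of that idea, with an assumed alert name and the for clause dropped; it is not one of the attempts described here.)

alert: service_down_external_sparse
# max_over_time keeps the newest sample in the 30m window visible, so the
# expression still returns data even if the target is scraped only every 5 to 30 minutes
expr: max_over_time(http_response_http_response_code{status_code!~"200|302"}[30m]) > 0
labels:
  severity: high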

So I figured, let's try it another way:

expr: count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m]) > on(job, instance, method, server) count_over_time(http_response_http_response_code{status_code=~"200|302"}[30m])

This seemed promising, but unfortunately it won't work if there are no 200/302 responses at all; in that case, "no data" is returned.
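
(Another option for the missing 200/302 side would be a set operation instead of a comparison: keep the error series only when no matching success series exists in the window at all. This is just a sketch of that alternative, not one of the attempts described here:)

# fires when there were non-200/302 responses and no 200/302 responses at all in the last 30 minutes
count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m])
  unless on(job, instance, method, server)
count_over_time(http_response_http_response_code{status_code=~"200|302"}[30m])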

So I thought, let's just divide it by the total amount:

count_over_time(http_response_http_response_code{status_code!~"200|302"}[300m]) > on(job, instance, method, server) count_over_time(http_response_http_response_code[300m])

But, that results in:

Error executing query: found duplicate series for the match group {instance="192.168.2.15:9126", job="telegraf-master-pi", method="GET", server="http://www.grafana.local/series"} on the right hand-side of the operation: [{host="master-pi", instance="192.168.2.15:9126", job="telegraf-master-pi", method="GET", result="success", result_type="success", server="http://www.grafana.local/series", status_code="502"}, {host="master-pi", instance="192.168.2.15:9126", job="telegraf-master-pi", method="GET", result="success", result_type="success", server="http://www.grafana.local/series", status_code="200"}];many-to-many matching not allowed: matching labels must be unique on one side

I also tried using ignoring:

count_over_time(http_response_http_response_code{status_code!~"200|302"}[30m]) > ignoring(status_code) count_over_time(http_response_http_response_code[30m])

The same error occurs.

Is there some other way to alert me whenever the http response returns only 5xx errors in the last 30 minutes?


Solution

  • Six months later, after another attempt at solving this, I finally came up with a query that gives me the expected results:

    count_over_time(http_response_result_code{result!~"success"}[2h]) / on(job, instance, method, server, type) group_left() sum by(job, instance, method, server, type) (count_over_time(http_response_result_code[2h])) >= 0.5
    

    The sum by part solves the "found duplicate series for the match group" error, as it sums all duplicate series (for example, all results with "response_string_mismatch" and "success").

    The group_left keeps the labels of the left-hand side of the query, so I can still use the result_type label in my alert. The right-hand side only contains the five labels mentioned in the sum by.

    Finally, the query gives me the ratio of calls that were not successful in the last 2 hours, exactly what I needed.
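
    For reference, this is roughly how the expression could be wrapped into an alerting rule; the alert name, for duration, and severity label below are assumptions, not part of the original answer:

    alert: http_calls_mostly_failing
    # fires when at least half of the calls in the last 2 hours were not successful
    expr: |
      count_over_time(http_response_result_code{result!~"success"}[2h])
        / on(job, instance, method, server, type) group_left()
      sum by(job, instance, method, server, type) (count_over_time(http_response_result_code[2h]))
      >= 0.5
    for: 5m
    labels:
      severity: high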