I'm having issues with Prometheus alerting rules. I have various cAdvisor-specific alerts set up, for example:
- alert: ContainerCpuUsage
  expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    title: 'Container CPU usage (instance {{ $labels.instance }})'
    description: 'Container CPU usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}'
When the condition is met, I can see the alert in the "Alerts" tab in Prometheus. However, some labels are missing, which prevents Alertmanager from sending a notification via Slack. To be specific, I attach a custom "env" label to each target:
{
  "targets": [
    "localhost:8080"
  ],
  "labels": {
    "job": "cadvisor",
    "env": "production",
    "__metrics_path__": "/metrics"
  }
}
But when an alert based on cAdvisor metrics is firing, the only labels are alertname, instance and severity: no job label, no env label. All the other alerts from other exporters (e.g. node-exporter) work just fine and the label is present.
This is due to the sum function that you use; it gathers all the matching time series and adds them together, grouping BY (instance, name). If you run the same query in Prometheus, you will see that sum leaves only the grouping labels:
{instance="foo", name="bar"} 135.38819037447163
Other aggregation operators like avg, max, min, etc., work in the same fashion. To bring the label back, simply add env to the grouping list: by (instance, name, env).
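As a sketch, the rule from the question with env added to the grouping list might look like the following. Everything except the by clause is copied from the question; adding job in the same way is an assumption based on the question saying that label is missing too.

```yaml
- alert: ContainerCpuUsage
  # env is added to the grouping list so that sum() keeps it on the result.
  # If the job label is needed as well, it can be listed the same way (assumption):
  #   by (instance, name, env, job)
  expr: (sum(rate(container_cpu_usage_seconds_total[3m])) by (instance, name, env) * 100) > 80
  for: 2m
  labels:
    severity: warning
  annotations:
    title: 'Container CPU usage (instance {{ $labels.instance }})'
    description: 'Container CPU usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}'
```

With this change the firing alert should carry alertname, instance, name, env and severity, so a route or receiver that matches on env has something to match.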