I have configured alert-rules.yml as follows:
groups:
  - name: alert.rules
    rules:
      - alert: HostOutOfMemory
        expr: ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) < 25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of memory (instance {{ $labels.instance }})"
          description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HostOutOfDiskSpace
        expr: (sum(node_filesystem_free_bytes) / sum(node_filesystem_size_bytes) * 100) < 30
        for: 1s
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Disk is almost full (< 30% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
My Alertmanager config looks like this:
route:
  receiver: 'teams'
  group_wait: 30s
  group_interval: 5m
receivers:
  - name: 'teams'
    webhook_configs:
      - url: "http://prom2teams:8089"
        send_resolved: true
I am pushing these notifications to MS Teams through prom2teams. The notifications are displayed in Teams as follows:
Note that for the "Host out of memory" alert it says "In host: node-exporter:9100", while for the "Host out of disk space" alert it says "In host: unknown". Why is that?
Because the query
(sum(node_filesystem_free_bytes) / sum(node_filesystem_size_bytes) * 100) < 30
doesn't return the instance label. In fact, it doesn't return any labels: sum without a by clause aggregates all series into a single one and drops their labels.
If you simply want to preserve the instance label, you can use sum by instead of sum:
(sum by (instance) (node_filesystem_free_bytes) / sum by (instance) (node_filesystem_size_bytes) * 100) < 30
But I would argue that an alert without aggregation is much more reasonable: it gives detailed information about what caused the alert, and it fires whenever any single device drops below the threshold (unlike the current rule, which fires only when the combined free space across all devices drops below it).
node_filesystem_free_bytes / node_filesystem_size_bytes * 100 < 30
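For illustration, here is a sketch of how the HostOutOfDiskSpace rule from the question might look with that unaggregated expression. The for, severity and threshold values are copied from the question. Mentioning the mountpoint and device labels in the annotations is my own addition, on the assumption that your node_exporter exposes them on the filesystem metrics (it normally does), so adjust it to whatever labels you actually see.

```yaml
groups:
  - name: alert.rules
    rules:
      - alert: HostOutOfDiskSpace
        # No sum(): each filesystem series is evaluated on its own,
        # so instance, device, mountpoint, etc. survive into the alert.
        expr: node_filesystem_free_bytes / node_filesystem_size_bytes * 100 < 30
        for: 1s
        labels:
          severity: warning
        annotations:
          # Assumption: the mountpoint and device labels are present on these series.
          summary: "Host out of disk space (instance {{ $labels.instance }}, mountpoint {{ $labels.mountpoint }})"
          description: "Disk is almost full (< 30% left)\n VALUE = {{ $value }}\n DEVICE = {{ $labels.device }}\n LABELS: {{ $labels }}"
```

With this rule you get one alert per filesystem below the threshold rather than one per host, and since the instance label is kept, prom2teams should no longer fall back to "unknown" for the host.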