Tags: prometheus, grafana, prometheus-alertmanager, prometheus-node-exporter

Getting "In host: Unknown" in Prometheus alert


I have configured alert-rules.yml as follows:

groups: 
- name: alert.rules 
  rules: 
  - alert: HostOutOfMemory 
    expr: ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) < 25
    for: 5m 
    labels: 
      severity: warning 
    annotations: 
      summary: "Host out of memory (instance {{ $labels.instance }})" 
      description: "Node memory is filling up (< 25% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}" 

  - alert: HostOutOfDiskSpace 
    expr: (sum(node_filesystem_free_bytes) / sum(node_filesystem_size_bytes) * 100) < 30
    for: 1s 
    labels: 
      severity: warning 
    annotations: 
      summary: "Host out of disk space (instance {{ $labels.instance }})" 
      description: "Disk is almost full (< 30% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}" 

My Alertmanager config looks like this:

route:
  receiver: 'teams'
  group_wait: 30s
  group_interval: 5m

receivers:
  - name: 'teams'
    webhook_configs:
      - url: "http://prom2teams:8089"
        send_resolved: true

I am pushing these notifications to MS Teams through prom2teams. The notifications are displayed in Teams as follows:

[Two screenshots of the Teams notifications, one per alert]

Note that for the "Host out of memory" alert it says "In host: node-exporter:9100", while for the "Host out of disk space" alert it says "In host: unknown". Why is that?


Solution

  • Because the query

    (sum(node_filesystem_free_bytes) / sum(node_filesystem_size_bytes) * 100) < 30
    

    doesn't return the instance label. In fact, it doesn't return any labels, because sum() without a by clause aggregates all of them away.

    If you simply want to preserve the instance label, you can use sum by instead of sum:

    (sum by (instance) (node_filesystem_free_bytes) / sum by (instance)(node_filesystem_size_bytes) * 100) < 30
    

    But I would argue that an alert without aggregation is much more reasonable: it provides detailed information about what caused the alert, and it fires as soon as any single filesystem drops below the threshold (unlike the current rule, which fires only when the total free space summed over all filesystems drops below it).

    node_filesystem_free_bytes / node_filesystem_size_bytes * 100 < 30
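
    As a rough sketch, the HostOutOfDiskSpace rule from the question could be rewritten around the non-aggregated expression as shown here. The rule name, threshold and for duration are taken from the question. Using the mountpoint label in the summary is an assumption for illustration; it relies on the mountpoint label that node_exporter normally attaches to filesystem metrics.

    groups:
    - name: alert.rules
      rules:
      - alert: HostOutOfDiskSpace
        # No sum(): each series keeps its own labels (instance, device, mountpoint, ...),
        # so {{ $labels.instance }} is populated and one alert fires per filesystem.
        expr: (node_filesystem_free_bytes / node_filesystem_size_bytes * 100) < 30
        for: 1s
        labels:
          severity: warning
        annotations:
          # Assumption: the mountpoint label exists on these series.
          summary: "Host out of disk space (instance {{ $labels.instance }}, mountpoint {{ $labels.mountpoint }})"
          description: "Disk is almost full (< 30% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

    With this form, the "In host" field in the Teams notification should be filled from the instance label, just as it is for the HostOutOfMemory alert.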