Tags: prometheus, alert, libvirt, prometheus-node-exporter, conntrack

How to alert on anomalous network traffic jumps with Prometheus?


We want to detect whether a VM in our IaaS infrastructure is under a DDoS attack.

We have several symptoms and metrics to work with, such as node_nf_conntrack_entries, node_network_receive_packets_total, and libvirt_domain_interface_stats_receive_packets_total.

We want to avoid the false positives that come from a fixed trigger point such as "traffic > n, then alert".

rate(libvirt_domain_interface_stats_receive_packets_total{host="x"}[5m])

rate(node_network_receive_packets_total{instance="y1"}[5m])

sum(node_nf_conntrack_entries_limit - node_nf_conntrack_entries) by (instance) < 1000


Solution

  • You can compare the average packet rate over the last 5 minutes with the average packet rate over the preceding 5 minutes (the same 5-minute window shifted back by 5 minutes). If the rate has grown by more than 10x, then alert:

    (
      rate(node_network_receive_packets_total[5m])
        /
      rate(node_network_receive_packets_total[5m] offset 5m)
    ) > 10
    

    See the Prometheus docs for the offset modifier.

    This query can produce false alerts, though. For example, if the network traffic was close to zero and then grew by more than 10x, it may still be negligible in absolute terms. This can be solved by filtering out low traffic. For example, the following query alerts only if the average per-second packet rate over the last 5 minutes also exceeds 1000:

    ((
      rate(node_network_receive_packets_total[5m])
        /
      rate(node_network_receive_packets_total[5m] offset 5m)
    ) > 10)
      and
    (
      rate(node_network_receive_packets_total[5m]) > 1000
    )
    

    This query can miss a slowly growing DoS attack, where the network traffic grows by less than 10x per 5 minutes. This can be fixed by tuning the offset value, or by adding an absolute maximum packet rate above which the query alerts unconditionally. For example, the following query also alerts whenever the average packet rate over the last minute exceeds 100K packets per second:

    (
      ((
        rate(node_network_receive_packets_total[5m])
          /
        rate(node_network_receive_packets_total[5m] offset 5m)
      ) > 10)
        and
      (
        rate(node_network_receive_packets_total[5m]) > 1000
      )
    )
      or
    (
      rate(node_network_receive_packets_total[1m]) > 100000
    )
    

    See the Prometheus docs for the "and" and "or" operators.
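
To sanity-check the thresholds before deploying the alert, the three conditions of the final query can be evaluated by hand on sample numbers. The following is a minimal Python sketch, not anything Prometheus provides: the helper name should_alert and its parameters are hypothetical, the per-second rates are assumed to be already computed (as rate() would return them), and the thresholds are the ones used in the queries above.

```python
import math

# Thresholds taken from the queries above.
JUMP_RATIO = 10      # current 5m rate must exceed 10x the previous 5m rate
MIN_RATE = 1000      # floor: ignore jumps that stay below 1000 packets/s
MAX_RATE = 100_000   # ceiling: always alert above 100K packets/s (1m window)


def should_alert(rate_5m_now: float, rate_5m_prev: float, rate_1m_now: float) -> bool:
    """Hypothetical helper mirroring (ratio > 10 and rate_5m > 1000) or rate_1m > 100000."""
    # PromQL uses float division: x/0 is +Inf for x > 0, and 0/0 is NaN.
    if rate_5m_prev > 0:
        ratio = rate_5m_now / rate_5m_prev
    elif rate_5m_now > 0:
        ratio = math.inf
    else:
        ratio = math.nan  # NaN > 10 is False, so the jump branch stays silent
    jump = ratio > JUMP_RATIO and rate_5m_now > MIN_RATE
    flood = rate_1m_now > MAX_RATE
    return jump or flood


# 25x jump, but only 50 packets/s in absolute terms: suppressed by the floor.
print(should_alert(50, 2, 60))                 # False
# About 13x jump and well above the floor: alert.
print(should_alert(20_000, 1_500, 22_000))     # True
# Only 2x growth per window, but over the absolute ceiling: alert.
print(should_alert(120_000, 60_000, 130_000))  # True
# Ordinary fluctuation: no alert.
print(should_alert(5_000, 4_000, 5_200))       # False
```

The second and third cases show why both branches are needed: the ratio check catches sudden spikes, while the absolute ceiling catches traffic that ramps up too slowly to ever exceed 10x between adjacent windows.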