[SOLVED] How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts monitored by Telegraf.

What is the approach when you want to deviate from the above setup for a single host, i.e. alert on Y% for a specific server instead of the global X%?

I'm happy that a distinct TICKscript could be created for the custom values, but how do I go about excluding that host from the original 'global' one?

This is a simple scenario, but the approach needs to work for 10,000 hosts, of which hundreds will be exceptions. It also has to cover tens to hundreds of global alert definitions.

I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.

Solution

As said in the comments, you can use the sideload node to achieve that.

Say you want to ensure that your InfluxDB servers are not overloaded. By default you allow 100 measurements. On one server, which happens to receive a massive number of data points, you want to lower the limit to 10 (a value the _internal database easily exceeds, but good enough for this example).

Take the following excerpt from a TICKscript:

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements',100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")

Assume the server with the exception is named influxdb, and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looks as follows:

maxNumMeasurements: 10

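With this setup, a critical alert is triggered when value (that is, numMeasurements) exceeds 10 on the host whose hostname tag equals influxdb, or exceeds 100 on any other host. Hosts without a matching file fall back to the default of 100 given in .field().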
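Since the question mentions hundreds of exceptions, it may help to know that .order() accepts several path templates. They are checked in the order given, and the first file that defines the key is used. The following is only a sketch continuing the excerpt above (it reuses the data variable from there). The hostgroup tag, the groups/ directory and the group file name are assumptions made up for illustration and are not part of the original example.

```
// Sketch only: the `hostgroup` tag and the `groups/` directory are hypothetical.
// Lookup order: per-host file first, then per-group file, then the default
// value passed to .field().
var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml', 'groups/group-{{.hostgroup}}.yaml')
        .field('maxNumMeasurements', 100)

// A single global alert definition; the threshold is taken from each point.
var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
```

With a layout like this, a hypothetical groups/group-db.yaml containing maxNumMeasurements: 50 would apply to every host tagged with that group, and a per-host file would only be needed where a host deviates from its group. The global alert definition itself stays untouched.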

There is an example in the documentation that handles scheduled downtimes using sideload.

Furthermore, I have created an example, available on GitHub, that uses docker-compose.

Note that the example has a caveat: the alert flaps because a second database is generated dynamically. It should still be sufficient to show how to approach the problem.