
check_mk "State and count of processes" rule threshold values


I'm trying to turn a Nagios-NRPE check into a Check_MK one. The first one is:

check_procs -w 10 -c 15 -C crond

My attempt uses the "State and count of processes" rule, but it always raises a critical alert. The parameters of my rule are (extracted from the rules.mk configuration file):

'process': 'crond'
'okmax':   10
'okmin':    1
'warnmax': 15
'warnmin': 11

As the WATO config screen says nothing about critical thresholds, I guessed that values outside the thresholds above raise a critical alert.

My problem is: when this rule is active, a critical alert is raised even when the number of processes found is inside the OK range.

The Status detail of the alert is

CRIT - 7 processes (ok from 1 to 15)CRIT 1620.6 MB virtual, 28.2 MB resident, 2.7% CPU

I cannot understand this behaviour, so I suspect that I either misunderstand the Check_MK threshold parameters or am missing something.

Can you help me?

Thanks in advance.


Solution

  • As I suspected in the last paragraph of my question, I misunderstood the Check_MK threshold parameters.

    These are the relevant lines of Python code in ~/share/check_mk/checks/ps:

    state = 0
    if count > params["warnmax"] or count < params["warnmin"]:
        state = 2
        infotext += " (ok from %d to %d)(!!)" % (params["okmin"], params["okmax"])
    elif count > params["okmax"] or count < params["okmin"]:
        state = 1
        infotext += " (ok from %d to %d)(!)" % (params["okmin"], params["okmax"])
    

    So any count lower than warnmin raises a critical alert, even when it lies inside the ok range, because the critical test runs first. To prevent this, the warn interval must include the ok one. In my example, the warnmin value should be lowered to match okmin:

    'process': 'crond'
    'okmax':   10
    'okmin':    1
    'warnmax': 15
    'warnmin':  1
    

    In mathematical terms, the ok interval must be a subinterval of the warn interval.

    I wrongly assumed that these intervals should not overlap, but they actually must.
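
To see the effect of both parameter sets side by side, here is a minimal sketch that reproduces only the threshold comparison from the excerpt above. It assumes that excerpt is the whole decision logic; `ps_state` is a hypothetical helper name used for illustration, not a function from Check_MK.

```python
def ps_state(count, params):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT) using the same comparisons
    as the excerpt from checks/ps. Hypothetical helper, not Check_MK API."""
    # The critical test runs first: anything outside [warnmin, warnmax] is CRIT.
    if count > params["warnmax"] or count < params["warnmin"]:
        return 2
    # Only counts inside the warn interval reach this test.
    if count > params["okmax"] or count < params["okmin"]:
        return 1
    return 0


# Parameters from the question (disjoint intervals) and from the fix (nested).
original = {"okmin": 1, "okmax": 10, "warnmin": 11, "warnmax": 15}
fixed = {"okmin": 1, "okmax": 10, "warnmin": 1, "warnmax": 15}

for count in (7, 12, 16):
    print(count, ps_state(count, original), ps_state(count, fixed))
```

With the original parameters, 7 processes fall below warnmin (11) and are reported as critical. With the nested intervals, the same count is OK, 12 is a warning and 16 is critical.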