I have monit configured to check that my IRCd and its services are running. Recently the instance that runs all of this restarted, and monit did not bring them back up.
Monit itself was configured to start on boot.
[root@ip-172-31-21-162 ec2-user]# chkconfig --list monit
monit 0:off 1:off 2:on 3:on 4:on 5:on 6:off
The control file:
[root@ip-172-31-21-162 ec2-user]# cat /etc/monit.conf
set httpd port 2812
    allow 127.0.0.1
set daemon 60
include /etc/monit.d/*
check process ircd with pidfile /home/ec2-user/inspircd/run/pid
    start program = "/usr/bin/perl /home/ec2-user/inspircd/run/inspircd start"
        as uid "ec2-user" and gid "ec2-user"
        with timeout 30 seconds
check process services with pidfile /home/ec2-user/anope/run/data/services.pid
    depends on ircd
    start program = "/bin/sh /home/ec2-user/anope/run/bin/anoperc start"
        as uid "ec2-user" and gid "ec2-user"
        with timeout 30 seconds
The syntax looks correct according to the documentation:
<START | STOP | RESTART> [PROGRAM] = "program"
[[AS] UID <number | string>]
[[AS] GID <number | string>]
[[WITH] TIMEOUT <number> SECOND(S)]
Running a syntax check on it says the same:
[ec2-user@ip-172-31-29-142 ~]$ sudo monit -t
Control file syntax OK
The logs, however, claim that no start or stop method is defined for these monitored processes:
[UTC May 14 04:39:51] error : 'ircd' process is not running
[UTC May 14 04:39:51] error : monit: Start or stop method not defined -- process ircd
[UTC May 14 04:39:51] error : 'services' process is not running
[UTC May 14 04:39:51] error : monit: Start or stop method not defined -- process services
Starting the processes manually through monit works, for some reason:
[root@ip-172-31-21-162 ec2-user]# monit start ircd
[root@ip-172-31-21-162 ec2-user]# monit status
The Monit daemon 5.2.5 uptime: 7h 14m
Process 'ircd'
status running
monitoring status monitored
pid 26483
parent pid 1
uptime 3m
...
data collected Sat May 14 02:49:57 2016
Process 'services'
status running
monitoring status monitored
pid 26488
parent pid 1
uptime 3m
...
data collected Sat May 14 02:49:57 2016
This is rather odd. When I stop the checked processes and restart monit with debug logging enabled, it does report the start programs:
Process Name = ircd
Pid file = /home/ec2-user/inspircd/run/pid
Monitoring mode = active
Start program = '/home/ec2-user/inspircd/run/inspircd start' as uid 500 as gid 500 timeout 30 second(s)
Existence = if does not exist 1 times within 1 cycle(s) then restart else if succeeded 1 times within 1 cycle(s) then alert
Pid = if changed 1 times within 1 cycle(s) then alert
Ppid = if changed 1 times within 1 cycle(s) then alert
Process Name = services
Pid file = /home/ec2-user/anope/run/data/services.pid
Monitoring mode = active
Start program = '/home/ec2-user/anope/run/bin/anoperc start' as uid 500 as gid 500 timeout 30 second(s)
Existence = if does not exist 1 times within 1 cycle(s) then restart else if succeeded 1 times within 1 cycle(s) then alert
Depends on Service = ircd
Pid = if changed 1 times within 1 cycle(s) then alert
Ppid = if changed 1 times within 1 cycle(s) then alert
Any idea what in Glob's name is going on here?
According to the documented behavior of monit, a stop method must also be defined before monit can start a process that is not running:
In active mode (the default), Monit will pro-actively monitor a service and in case of problems raise alerts and/or restart the service.
-- Monit docs; service methods
The action which is performed by Monit when process is not running was always "restart", but since there was no standalone "restart program" (until Monit 5.7), stop+start sequence was used.
-- Monit issues; restart instead of start when a process is not running
Therefore, the solution is to add a stop program line to each checked process in the control file. If you are running Monit 5.7 or later, you could alternatively define a restart program.
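As a minimal sketch of that fix, the ircd check from the question might look like the following. The stop command is an assumption: it presumes the inspircd wrapper script accepts a stop argument the same way it accepts start, so verify that against your installation. The services check would need an equivalent stop program line for anoperc.

```
check process ircd with pidfile /home/ec2-user/inspircd/run/pid
    start program = "/usr/bin/perl /home/ec2-user/inspircd/run/inspircd start"
        as uid "ec2-user" and gid "ec2-user"
        with timeout 30 seconds
    # Assumed stop command. Monit needs it because a process that is
    # not running triggers a restart, which is stop followed by start.
    stop program = "/usr/bin/perl /home/ec2-user/inspircd/run/inspircd stop"
        as uid "ec2-user" and gid "ec2-user"
        with timeout 30 seconds
```

After editing the control file, run monit -t to re-check the syntax, then monit reload so the daemon picks up the change.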