I'm trying to migrate control of a group of resque workers from upstart to systemd. Under upstart we had two control scripts: one that defined a single worker, and a second that invoked the first multiple times, so we could start or stop multiple workers with a single upstart command. We're trying to implement the same ability with systemd.
I've tried using a single systemd unit per worker, so to manage 6 workers we use 6 separate systemd unit files, one per worker. We then use a bash script to trigger:
systemctl start|stop|restart worker-1.service &
systemctl start|stop|restart worker-2.service &
...
The problem is that when we send the kill signal via systemctl, it appears to kill the parent resque process immediately, which causes any forked child workers to die as well rather than finishing their current job first. Under upstart we had exactly the behavior we want: the parent process would stop accepting new jobs (stop forking), the child worker process was allowed to stay alive while it worked on its job, and once the job completed the child worker process exited on its own.
Under systemd, the workers all die immediately and jobs are terminated mid-stream before they can complete.
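For reference, this is resque's own documented signal handling that we are relying on: QUIT tells the parent worker to stop forking and wait for the current child to finish, while TERM/INT kill the child immediately. A quick way to confirm the workers behave correctly outside of systemd is to signal the parent by hand (the grep pattern below is just a placeholder, adjust it to your worker proclines):
# find the parent worker PID (pattern is a placeholder; match it to your resque proclines)
PARENT_PID=$(pgrep -f 'resque' | head -n 1)
# ask the parent to stop forking and finish the current job (resque's graceful shutdown signal)
kill -QUIT "$PARENT_PID"
# any forked child should keep running until its job finishes, then exit on its own
ps --ppid "$PARENT_PID" -o pid,stat,cmd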
Our systemd unit file looks like this:
[Unit]
Description=Controls a single Resque worker process: worker-1
After=redis.service
[Service]
Restart=on-failure
RestartSec=10
StartLimitInterval=400
StartLimitBurst=5
KillSignal=SIGQUIT
User=www-data
WorkingDirectory=/app/working/dir
Type=simple
ExecStart=/usr/bin/bundle exec rake production resque:work QUEUE=a,b,c,d,e,f
ExecStop=/bin/kill -QUIT $MAINPID
[Install]
WantedBy=multi-user.target
I've tried changing Type=simple to Type=forking, but the process does not stay up: it tries to start, but because the parent process only forks when there is a job available, the process dies when no job is waiting and the unit fails to stay up. With Type=simple the processes work as expected, but, as described above, we cannot control them gracefully the way we could with upstart.
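One way to see why the children die under Type=simple is to look at what systemd tracks and signals for the unit; by default KillMode=control-group sends the kill signal to every process in the unit's cgroup, not just the parent. Something like the following (the unit name is ours, output will vary) shows the main PID, the kill settings, and the full process tree including forked children:
# show what systemd will signal on "stop" for this unit
systemctl show -p MainPID -p KillMode -p KillSignal -p SendSIGKILL resque-worker-1.service
# show every process in the unit's control group (parent plus any forked children)
systemctl status resque-worker-1.service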
Our bash script looks like this:
systemctl $COMMAND resque-worker-1.service &
where there is one such line for each worker service; $COMMAND is simply an argument passed to the script (start|stop|restart).
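For completeness, a minimal sketch of what that wrapper script can look like; the script name, worker count, and the way the count is hard-coded here are assumptions, adjust them to your setup:
#!/bin/bash
# usage: ./resque-workers.sh start|stop|restart
COMMAND="$1"   # start, stop, or restart
WORKERS=6      # number of worker units (assumption)

for i in $(seq 1 "$WORKERS"); do
  # fire each systemctl call in the background so all workers are handled in parallel
  systemctl "$COMMAND" "resque-worker-${i}.service" &
done

# wait for all background systemctl calls to finish
wait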
The previous upstart scripts used:
start on runlevel [2345]
stop on runlevel [06]
kill signal QUIT
I think I solved this myself. By removing this directive:
ExecStop=/bin/kill -QUIT $MAINPID
When I issue a systemctl stop resque-worker-n.service now, it gracefully waits until the job is completed before killing the worker.
I noticed, though, that certain jobs would still quit instantly, so I added this directive:
KillMode=process
But then I noticed that when restarting workers later, the "completed" jobs were considered queueable by resque and so were queued again incorrectly. So I added this directive:
SendSIGKILL=no
And now the behavior seems to be identical to what we previously had under upstart.
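Putting it together, the [Service] section that reproduces the upstart behavior for us looks roughly like this (only the stop/kill-related directives changed from the unit above, and this is what works in our setup rather than a general recommendation):
[Service]
Type=simple
Restart=on-failure
RestartSec=10
StartLimitInterval=400
StartLimitBurst=5
User=www-data
WorkingDirectory=/app/working/dir
ExecStart=/usr/bin/bundle exec rake production resque:work QUEUE=a,b,c,d,e,f
# send QUIT so the parent stops forking and waits for the current job
KillSignal=SIGQUIT
# only signal the main (parent) process, not the whole control group,
# so a forked child can finish its job
KillMode=process
# never follow up with SIGKILL, which would terminate a job mid-stream
SendSIGKILL=no
# note: no ExecStop= line; the default stop logic with the settings above is enough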