I have a god/resque setup that spans a few worker servers. Every so often, the workers get jammed up by long-polling connections and won't time out correctly. We have tried coding around it, but whatever the underlying cause, the keep-alive packets being sent down the wire prevent us from timing the connection out easily.
I would like certain workers (which I already have segmented out in their own watch blocks) to not be allowed to run for longer than a certain amount of time. In pseudocode, I am looking for a watch condition like the following (i.e., restart that worker if it takes longer than 60 seconds to complete the task):
# Pseudocode: :process_timer is the condition I wish existed
w.transition(:up, :restart) do |on|
  on.condition(:process_timer) do |c|
    c.greater_than = 60.seconds
  end
end
Any thoughts or pointers on how to accomplish this would be greatly appreciated.
As it turns out, there is an example of how to do this in some sample resque files. It's not exactly what I was looking for, since it doesn't add an on.condition(:foo), but it is a viable solution:
# This will ride alongside god and kill any rogue stale worker
# processes. Their sacrifice is for the greater good.
WORKER_TIMEOUT = 60 * 10 # 10 minutes

Thread.new do
  loop do
    begin
      # The [r] in the pattern keeps grep from matching its own process
      `ps -e -o pid,command | grep [r]esque`.split("\n").each do |line|
        parts = line.split(' ')
        # Only consider process titles that end in "at <epoch seconds>"
        next if parts[-2] != "at"
        started = parts[-1].to_i
        elapsed = Time.now - Time.at(started)
        if elapsed >= WORKER_TIMEOUT
          # USR1 asks the resque worker to kill its current child job
          ::Process.kill('USR1', parts[0].to_i)
        end
      end
    rescue
      # don't die because of stupid exceptions
      nil
    end
    # Sleep so we don't run too frequently
    sleep 30
  end
end
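To make the staleness check easier to reason about (and to try out without real workers), the parsing step can be pulled into a small pure function. This is only a sketch: `stale_worker_pids` is a hypothetical helper name, and the sample `ps` lines are made up on the assumption that a busy worker's process title ends in "at <epoch seconds>", which is what the script above relies on.

```ruby
# Hypothetical helper (not part of god or resque): returns the PIDs of
# processes whose title ends in "at <epoch seconds>" and whose start time
# is at least `timeout` seconds before `now`.
def stale_worker_pids(ps_output, now: Time.now, timeout: 600)
  ps_output.split("\n").each_with_object([]) do |line, pids|
    parts = line.split(' ')
    # Skip lines that don't carry a start timestamp
    next if parts.length < 3 || parts[-2] != "at"
    started = Integer(parts[-1], exception: false)
    next if started.nil?
    elapsed = now - Time.at(started)
    pids << parts[0].to_i if elapsed >= timeout
  end
end

# Made-up sample output in the assumed "pid command" format
sample = <<~PS
  101 resque-1.0: Forked 202 at 1000
  103 resque-1.0: Waiting for critical
  105 resque-1.0: Forked 206 at 1500
PS

stale = stale_worker_pids(sample, now: Time.at(1700), timeout: 600)
p stale # => [101]
```

Inside the monitoring thread, the result could then be fed to `::Process.kill('USR1', pid)` for each returned PID, keeping the signalling separate from the parsing.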