ruby resque god

How do I write a Resque condition that says "if a process is running for longer than n seconds, kill it"?


I have a god/resque setup that spans a few worker servers. Every so often, the workers get jammed up by long-polling connections and won't time out correctly. We have tried coding around it, but regardless of why that doesn't work, the keep-alive packets being sent down the wire keep us from timing the connections out easily.

I would like certain workers (which I already have segmented out in their own watch blocks) to not be allowed to run for longer than a certain amount of time. In pseudocode, I am looking for a watch condition like the following (i.e. restart that worker if it takes longer than 60 seconds to complete the task):

w.transition(:up, :restart) do |on|
  # :process_timer is the hypothetical condition I am looking for
  on.condition(:process_timer) do |c|
    c.greater_than = 60.seconds
  end
end

Any thoughts or pointers on how to accomplish this would be greatly appreciated.


Solution

  • As it turns out, there is an example of how to do this in some sample resque files. It's not exactly what I was looking for since it doesn't add an on.condition(:foo), but it is a viable solution:

    # This will ride alongside god and kill any rogue stale worker
    # processes. Their sacrifice is for the greater good.
    
    WORKER_TIMEOUT = 60 * 10 # 10 minutes
    
    Thread.new do
      loop do
        begin
          `ps -e -o pid,command | grep [r]esque`.split("\n").each do |line|
            parts   = line.split(' ')
            # Skip processes whose title does not end in "at <epoch seconds>"
            next if parts[-2] != "at"
            started = parts[-1].to_i
            elapsed = Time.now - Time.at(started)
    
            if elapsed >= WORKER_TIMEOUT
              # USR1 tells a resque worker to kill its forked child immediately
              ::Process.kill('USR1', parts[0].to_i)
            end
          end
        rescue
          # don't die because of stupid exceptions
          nil
        end
    
        # Sleep so we don't run too frequently
        sleep 30
      end
    end
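
The timeout decision in that script can be pulled out into a small function so it can be checked without running `ps` or sending signals. The following is only a sketch: `stale_worker_pids` and its `now:` and `timeout:` parameters are hypothetical names, not part of resque or god, and it assumes (as the `parts[-2] != "at"` check above implies) that the process title of a busy worker ends in `at <epoch seconds>`.

```ruby
# Hypothetical helper (not part of resque or god).
# Given raw `ps -e -o pid,command` output, return the PIDs whose process
# title ends in "at <epoch seconds>" and whose timestamp is at least
# `timeout` seconds older than `now`.
def stale_worker_pids(ps_output, now: Time.now, timeout: 60 * 10)
  pids = ps_output.each_line.map do |line|
    parts = line.split(' ')

    # Need at least: pid, "at", timestamp
    next if parts.length < 3
    next if parts[-2] != "at"

    # Ignore lines whose last field is not a plain integer timestamp
    next unless parts[-1] =~ /\A\d+\z/

    elapsed = now - Time.at(parts[-1].to_i)
    parts[0].to_i if elapsed >= timeout
  end

  # `next` and a false `if` both yield nil; drop those entries
  pids.compact
end
```

In the monitoring thread, each PID returned by this function would be the one passed to `::Process.kill('USR1', pid)`. Keeping the parsing separate from the kill makes it possible to try different timeouts against captured `ps` output before letting the script signal real workers.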