Tags: dask, pbs

Dask: are workers restarted if the job running them is killed (e.g. due to timeout)


I'm running Dask on a PBS cluster. My tasks are downloads that take an indeterminate amount of time due to fluctuations in server load. I've set up jobs with reasonably large walltimes (e.g. 4 hours), each of which should be able to encompass many individual download tasks. However, I have tens of thousands of downloads, so the jobs will time out before all the downloads finish.

Two questions:

  1. When launching jobs with PBSCluster.scale(n), when jobs time out, are new ones automatically launched to take their place?
  2. When a job dies (e.g. due to timeout), are the tasks that were running in that job restarted in another job, or are they lost?

Thanks!


Solution

  • When launching jobs with PBSCluster.scale(n), when jobs time out, are new ones automatically launched to take their place?

    No, but you could use adaptive scaling instead, which keeps the job count within bounds and submits replacements as jobs die:

    cluster.adapt(minimum_jobs=n, maximum_jobs=n)
    
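    For reference, here is a fuller sketch of the adaptive setup. The queue name, resource values, and job count below are placeholder assumptions, not values from the question:

    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    # Each PBS job runs one worker; walltime bounds how long that job lives.
    cluster = PBSCluster(
        queue="workq",        # placeholder queue name
        cores=4,              # placeholder resources
        memory="8GB",
        walltime="04:00:00",
    )

    n = 20  # however many concurrent jobs you want

    # Hold the cluster at n jobs: when a job hits its walltime and dies,
    # the adaptive controller submits a replacement to stay at the minimum.
    cluster.adapt(minimum_jobs=n, maximum_jobs=n)

    client = Client(cluster)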

  • When a job dies (e.g. due to timeout), are the tasks that were running in that job restarted in another job, or are they lost?

    They are restarted. However, beware that if the same task needs to be restarted several times, Dask will stop trusting it and mark it as failed.
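
    The number of permitted restarts is governed by the scheduler's allowed-failures setting (it defaults to a small number, 3 in recent versions of distributed). If your workers are expected to die regularly at walltime, you may want to raise it; a minimal sketch, which must run before the cluster (and its scheduler) is created:

    import dask

    # Let a task survive more worker deaths before the scheduler marks it as erred.
    dask.config.set({"distributed.scheduler.allowed-failures": 10})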