I'm running Dask on a PBS cluster. My tasks are downloads that take an indeterminate amount of time due to fluctuations in server load. I've set up jobs with reasonably large walltimes (e.g. 4 hours), which should be able to encompass many individual downloads. However, I have tens of thousands of downloads, so the jobs will time out before all the downloads finish.
Two questions:

1. When launching jobs with PBSCluster.scale(n), when jobs time out, are new ones automatically launched to take their place?
2. When a job dies (e.g. due to timeout), are the tasks that were running on that job restarted on another job, or are they lost?

Thanks!
When launching jobs with PBSCluster.scale(n), when jobs time out, are new ones automatically launched to take their place?
No, but you could try using adapt instead:
cluster.adapt(minimum_jobs=n, maximum_jobs=n)
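For context, a minimal sketch of how this could look with dask-jobqueue (the cores, memory, walltime, and n values below are placeholders for whatever your PBS site and workload need):

from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each PBS job runs one worker and is killed by PBS after its 4-hour walltime
cluster = PBSCluster(cores=1, memory="4GB", walltime="04:00:00")

# Keep the number of jobs pinned between n and n: unlike a one-off scale(n),
# the adaptive scaler submits replacement jobs as old ones hit their walltime
n = 10
cluster.adapt(minimum_jobs=n, maximum_jobs=n)

client = Client(cluster)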
When a job dies (e.g. due to timeout), are the tasks that were running on that job restarted on another job, or are they lost?
They are restarted. However, beware that if the same task needs to be restarted too many times, Dask will stop trusting it and just mark it as failed.
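The number of retries the scheduler tolerates is controlled by the distributed.scheduler.allowed-failures setting, so if walltime kills are expected you may want to raise it. A minimal sketch (the value 10 is just an example):

import dask

# Set this where the scheduler runs, before creating the cluster, so it is picked up.
# Each task may then be rescheduled up to 10 times before being marked as erred.
dask.config.set({"distributed.scheduler.allowed-failures": 10})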