
Is there a replacement for the `initial_workers` option (cluster.yaml) in Ray Tune?


Briefly, my use case: suppose I want to spin up a cluster with 10 workers on AWS. In the past I always used `initial_workers: 10`, `min_workers: 0`, `max_workers: 10` in cluster.yaml to bring the cluster up to full capacity right away and then rely on the autoscaler's idle-time-based downscaling. Toward the end of a job, when almost all trials have terminated and the full capacity is no longer needed, nodes are removed automatically. Now that the `initial_workers` option is gone (#12444), it is not clear to me how to get the same downscaling behavior.

I experimented with requesting resources programmatically (`ray.autoscaler.sdk.request_resources`) before and after `tune.run`, but this seems to be equivalent to setting the `min_workers` field, and I can only downscale the cluster after all jobs have terminated. I also tried setting `upscaling_speed`, but for some reason upscaling is very slow and seems to add only one node at a time (I am requesting GPUs). There is also always only one pending task, which I do not really understand yet (unfortunately I do not have the time to investigate this fully :().
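For reference, the workaround described above can be sketched roughly as follows. This is only an illustration under assumptions: `gpu_bundles` and `run_with_full_cluster` are hypothetical helper names, one GPU per worker is assumed, and clearing the request by passing an empty bundle list relies on a later `request_resources` call overriding the earlier one.

```python
def gpu_bundles(num_workers, gpus_per_worker=1):
    """Build one resource bundle per worker node the autoscaler should hold.

    Hypothetical helper; one GPU per worker is an assumption.
    """
    return [{"GPU": gpus_per_worker} for _ in range(num_workers)]


def run_with_full_cluster(trainable, num_workers=10, **tune_kwargs):
    """Hypothetical wrapper: scale up first, run Tune, then drop the request."""
    # Imported here so gpu_bundles() stays usable without Ray installed.
    import ray
    from ray import tune
    from ray.autoscaler.sdk import request_resources

    ray.init(address="auto")  # connect to the already running cluster

    # Ask the autoscaler to provision capacity for all workers up front.
    request_resources(bundles=gpu_bundles(num_workers))
    try:
        return tune.run(trainable, **tune_kwargs)
    finally:
        # Replace the earlier request with an empty one so that idle
        # nodes become eligible for the normal idle-timeout downscaling.
        request_resources(bundles=[])
```

The drawback is visible in the structure: the request stays in force until `tune.run` returns, so nodes that go idle while the last trials finish are kept alive, much like a raised `min_workers`.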

Currently I am using the programmatic approach described above, which works fine, but I end up with a lot of idle resources at the end of the job that run for hours before I can downscale.

It would be great if someone could point me in the right direction.

Thx


Solution

  • With Ray version 1.3.0, the autoscaler issues I observed seem to be resolved, and the cluster now scales with the pending trials as expected (using AWS EC2 g4dn instances). So there is no need for the `initial_workers` option anymore.