I am interested in running Ray on AWS Batch multi-node. As far as I can tell, this pattern hasn't been tried with Ray before, so there's no documentation on it. Still, I'd really like to try it, since Ray can be installed on-premise as well.
I stood up the AWS Batch multi-node gang-scheduled cluster and ran the following commands:
# On the head node:
import subprocess
subprocess.Popen(f"ray start --head --node-ip-address {current.parallel.main_ip} --port {master_port} --block", shell=True).wait()
# On each worker node:
import ray
node_ip_address = ray._private.services.get_node_ip_address()
subprocess.Popen(f"ray start --node-ip-address {node_ip_address} --address {current.parallel.main_ip}:{master_port} --block", shell=True).wait()
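For reference, here is a minimal sketch of how the head and worker command lines could be assembled by one helper, so both node types share the same port and head address. The function name build_ray_start_command and the example IP addresses and port are hypothetical; only the ray start flags are taken from the commands I actually ran.

```python
import shlex


def build_ray_start_command(node_ip: str, head_ip: str, port: int, is_head: bool) -> str:
    """Build the `ray start` command line for a head or a worker node.

    Hypothetical helper: it only mirrors the flags used in the commands above.
    """
    args = ["ray", "start"]
    if is_head:
        # The head node advertises its own IP and listens on `port`.
        args += ["--head", "--node-ip-address", head_ip, "--port", str(port)]
    else:
        # A worker advertises its own IP and joins the head at head_ip:port.
        args += ["--node-ip-address", node_ip, "--address", f"{head_ip}:{port}"]
    # Keep the process in the foreground, as in the original commands.
    args.append("--block")
    return shlex.join(args)


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(build_ray_start_command("10.0.0.1", "10.0.0.1", 6379, is_head=True))
    print(build_ray_start_command("10.0.0.2", "10.0.0.1", 6379, is_head=False))
```

The returned string can be passed to subprocess.Popen(..., shell=True) exactly as in the commands above.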
The head node seems to be working, but the worker nodes are not syncing with the head node.
I get the following output in stderr:
[2023-07-28 09:25:55,500 I 427 427] global_state_accessor.cc:356: This node has an IP address of 10.14.52.21, but we cannot find a local Raylet with the same address. This can happen when you connect to the Ray cluster with a different IP address or when connecting to a container.
Any insight on how I can get Ray working on AWS Batch multi-node would be much appreciated!
Seemed to be an issue with pydantic. I downgraded the pydantic version to 1.10.12 and after that, it seemed to work like a charm.
Also, for AWS Batch, the worker nodes need to be kept alive. So there needs to be a heartbeat: each worker node pings the head node to check whether the job is complete and, if it is not, calls time.sleep before checking again.
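A minimal sketch of such a heartbeat loop, using only the standard library. The names head_is_reachable and keep_alive are hypothetical, and treating "the head node's port no longer accepts connections" as "the job is complete" is an assumption on my part, not something Ray or AWS Batch prescribes; any other completion check can be passed in instead.

```python
import socket
import time
from typing import Callable


def head_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat the head as gone.
        return False


def keep_alive(
    job_is_complete: Callable[[], bool],
    poll_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until job_is_complete() returns True.

    Sleeps for poll_interval seconds between checks and returns the
    number of times it slept.
    """
    sleeps = 0
    while not job_is_complete():
        sleep(poll_interval)
        sleeps += 1
    return sleeps
```

On a worker node this could be called as keep_alive(lambda: not head_is_reachable(head_ip, port)), where head_ip and port are the same values used when starting the head node. In that case the worker's ray start would have to run without blocking the script, so that the loop gets a chance to execute.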