I'm learning to use RLlib. I've been running it in my debugger on an example script, and it works, but for some reason I get an error message about the monitoring service failing. This is the traceback:
  File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 600, in <module>
    monitor = Monitor(
  File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 205, in __init__
    logger.exception(
  File "/usr/lib/python3.10/logging/__init__.py", line 1512, in exception
    self.error(msg, *args, exc_info=exc_info, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 70, in error
  File "/usr/lib/python3.10/logging/__init__.py", line 1911, in _LogErrorReplacement
    msg,
  File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py", line 199, in __init__
    prometheus_client.start_http_server(
  File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/prometheus_client/exposition.py", line 168, in start_wsgi_server
    TmpServer.address_family, addr = _get_best_family(addr, port)
  File "/home/ramrachum/.venvs/ray_env/lib/python3.10/site-packages/prometheus_client/exposition.py", line 157, in _get_best_family
    infos = socket.getaddrinfo(address, port)
  File "/usr/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -5] No address associated with hostname
I'm trying to understand why this bug is happening and how I can fix it. The hostname it's trying to use is '', which sounds like something that shouldn't work. Working my way up the traceback, I see that in ray/autoscaler/_private/monitor.py line 201, there's this logic:
addr="127.0.0.1" if head_node_ip == "127.0.0.1" else "",
Since in my case head_node_ip is equal to '192.168.1.116', the else clause is used and an empty address is passed on to getaddrinfo.
I'm not sure what the logic of this code is. Can getaddrinfo even work with an empty string? How does this service work for people normally? How do I make it not fail?
This is a known bug with prometheus-client==0.14.
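To illustrate why the empty string matters, here is a minimal sketch using only the standard library. The failing call in the traceback is socket.getaddrinfo('', port); whether that resolves depends on the platform's resolver, which is why it can work on one machine and raise gaierror on another. Passing None together with AI_PASSIVE is the documented way to ask for the wildcard ("all interfaces") address. The helper names resolve_bind_address and try_empty_host are hypothetical, made up for this example; they are not part of prometheus_client or Ray, and this is not necessarily how the library resolved the issue.

```python
import socket


def resolve_bind_address(addr, port):
    """Return (family, host) suitable for binding a listening socket.

    Hypothetical helper: an empty addr is normalized to None so that
    getaddrinfo returns the wildcard address instead of trying to
    resolve '' as a hostname.
    """
    host = addr or None  # assumption: '' means "listen on all interfaces"
    infos = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM, flags=socket.AI_PASSIVE
    )
    family, _, _, _, sockaddr = infos[0]
    return family, sockaddr[0]


def try_empty_host(port):
    """Call getaddrinfo('', port) as in the traceback.

    Returns the gaierror if the platform rejects the empty hostname,
    or None if the lookup happens to succeed.
    """
    try:
        socket.getaddrinfo("", port)
    except socket.gaierror as exc:
        return exc
    return None


if __name__ == "__main__":
    print("empty host lookup:", try_empty_host(8000))
    print("wildcard bind:", resolve_bind_address("", 8000))
    print("loopback bind:", resolve_bind_address("127.0.0.1", 8000))
```

If the first line prints a gaierror on your machine, you are seeing the same failure as in the traceback. In that case, installing a prometheus-client release other than the affected one should make the monitor start normally.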