python, ray

In Ray, do worker node system logs get streamed back to the head node by default, or do they only stay on the worker node?


I noticed that system logs on a worker node become unavailable when the node is gone, so I suspect the logs only stay on the worker node. Is that correct? If so, how do the worker node's logs become available on the Ray dashboard?

The Ray log persistence doc at https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/logging.html#putting-everything-together only shows the configuration for the head node. Does the same configuration need to be applied to worker nodes too?


Solution

  • Your two questions are closely related, so parts of this answer will apply to both.

    Is that correct? If so, how do the worker node's logs become available on the Ray dashboard?

    Yes, you are correct. By default, Ray writes logs to files in the directory /tmp/ray/session_*/logs on each Ray node’s file system, including application logs and system logs. This means that if a node becomes unavailable, its logs may become unavailable as well. Ray does not provide a native storage solution for log data, so users need to manage the log lifecycle themselves, typically with open-source log processing tools such as Vector, Fluent Bit, Fluentd, Filebeat, or Promtail.
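    For a quick check of what actually lives on a node, here is a minimal sketch (standard library only, no Ray API) that lists the per-node log files. It assumes the default temp directory; `session_latest` is a symlink Ray keeps pointing at the current session, so adjust the path if you started Ray with a custom --temp-dir.

    ```python
    import glob
    import os

    # Ray maintains a "session_latest" symlink to the current session directory.
    log_dir = "/tmp/ray/session_latest/logs"

    # List every log file Ray has written on this node (raylet, worker
    # stdout/stderr, dashboard agent, gcs_server on the head node, etc.).
    for path in sorted(glob.glob(os.path.join(log_dir, "*"))):
        print(path)
    ```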

    As for the Ray dashboard, it lists the Ray logs in your Cluster, organized by node and log file name. Many log links on the other pages point to this view and filter the list so the relevant logs appear. However, if you execute the Driver directly on the Head Node of the Ray Cluster (without using the Job API) or run it with Ray Client, the Driver logs are not accessible from the Dashboard. In that case, you need to check the terminal or Jupyter Notebook output to view the Driver logs.
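    If you want the Driver logs to be captured by the cluster and show up in the Dashboard, submitting through the Job API is the usual route. A small sketch, assuming the dashboard is reachable on the head node at port 8265 and that `my_script.py` is a placeholder entrypoint:

    ```python
    from ray.job_submission import JobSubmissionClient

    # Point the client at the dashboard address exposed by the head node.
    client = JobSubmissionClient("http://127.0.0.1:8265")

    # Submit the driver as a job; its stdout/stderr are then captured by the
    # cluster and visible in the dashboard's Jobs view.
    job_id = client.submit_job(
        entrypoint="python my_script.py",   # placeholder script name
        runtime_env={"working_dir": "./"},  # ship local files along with the job
    )

    # The same driver logs can also be fetched back through the head node.
    print(client.get_job_logs(job_id))
    ```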

    If a worker node becomes unreachable, the autoscaler will add a new node to the cluster, and any pending or new tasks will be scheduled on the new node as needed. If the dead node comes back, it rejoins the cluster and sends a heartbeat to the GCS, and the GCS informs all other Raylets that a new node is available.
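    You can watch that membership change from any driver. A small sketch, assuming an already-running cluster, that prints which nodes the GCS currently considers alive:

    ```python
    import ray

    ray.init(address="auto")  # connect to the existing cluster

    # ray.nodes() returns one entry per node the GCS has seen; "Alive" flips to
    # False when a node is lost, which is also when its local log files stop
    # being reachable through the dashboard.
    for node in ray.nodes():
        state = "alive" if node["Alive"] else "dead"
        print(node["NodeManagerAddress"], state)
    ```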

    Does the same configuration need to be applied to worker nodes, too?

    Again, that's correct: by default, Ray writes logs to files in the directory /tmp/ray/session_*/logs on each Ray node’s file system, including application logs and system logs, and this applies to worker nodes just as it does to the head node. If a node terminates unexpectedly, you lose access to its logs unless you have set up log persistence.

    The logs are viewable in the Ray Dashboard while the node is alive, but to persist Ray logs beyond the lifetime of a pod or node, you need to process and export them to external storage or a log management system. Concretely, that means ingesting the log files on each node of your Ray Cluster as sources, parsing and transforming them, and then shipping the transformed logs to your log storage or management system. Open-source log processing tools such as Vector, Fluent Bit, Fluentd, Filebeat, and Promtail can do this, and the configuration needs to run on every node, worker nodes included, not only on the head node.
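    Conceptually, each of those tools runs the same ingest/transform/ship loop on every node. Purely as a toy illustration of that loop (not a substitute for a real log shipper; the destination path is a placeholder for whatever persistent storage you mount):

    ```python
    import glob
    import os
    import shutil
    import socket

    src = "/tmp/ray/session_latest/logs"
    # Placeholder destination; a real setup ships to S3, Loki, Elasticsearch, etc.
    dst = os.path.join("/persistent/ray-logs", socket.gethostname())
    os.makedirs(dst, exist_ok=True)

    # "Ingest" the node-local Ray logs and "ship" a copy to durable storage so
    # they survive the pod or node going away.
    for path in glob.glob(os.path.join(src, "*")):
        if os.path.isfile(path):
            shutil.copy2(path, dst)
    ```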