I have a program that crashes for unknown reasons on a cluster. I suspect it may be related to one or more specific nodes. Is there a command to see which nodes (by node ID) a completed job ran on? I would like to check whether the job always runs on the same nodes.
The sacct command can be used to query the accounting database:
sacct --start=2024-10-01 --format jobid,state,nodelist
With the --format option, you can specify the columns that you want to see. The --start option lets you look at past jobs (by default, sacct only shows jobs from the current day).
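To check whether the crashes cluster on particular nodes, the sacct output can be tallied per node list. The following is a minimal sketch, not a definitive recipe: count_failed_nodes is a hypothetical helper name, the set of states treated as failures is an assumption you may want to adjust, and the start date is only an example. It expects pipe-delimited lines of the form JobID|State|NodeList, which is what sacct prints with --parsable2 and --noheader.

```shell
# Hypothetical helper: count how often each node list appears among
# unsuccessful jobs. Reads "JobID|State|NodeList" lines on stdin.
# The list of states below is an assumption; extend it as needed.
count_failed_nodes() {
    awk -F'|' '
        $2 ~ /^(FAILED|NODE_FAIL|TIMEOUT|OUT_OF_MEMORY)/ { count[$3]++ }
        END { for (n in count) print count[n], n }
    ' | sort -rn
}

# Usage on a real cluster (commented out so the snippet runs anywhere):
# sacct --start=2024-10-01 --allocations --noheader --parsable2 \
#       --format=jobid,state,nodelist | count_failed_nodes
```

Each output line is a count followed by a node list, most frequent first. Note that for multi-node jobs the NodeList column is a compressed expression such as node[01-03], so the tally is per node list rather than per individual node; scontrol show hostnames can expand such an expression if a per-node count is needed.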