time slurm sacct

questions on time usage reported by SLURM


I have problems understanding the time usage report below:

1) why the times for job step 1 & 2 do not add up to the batch line?

2) what is the relationship between each column, especially for TotalCPU and CPUTime?

3) for time usage of the job, which one is best to report?

$ sacct -o JOBID,AllocCPUs,AveCPU,reqcpus,systemcpu,usercpu,totalcpu,cputime,cputimeraw -j 649176
       JobID  AllocCPUS     AveCPU  ReqCPUS  SystemCPU    UserCPU   TotalCPU    CPUTime CPUTimeRAW 
------------ ---------- ---------- -------- ---------- ---------- ---------- ---------- ---------- 
649176               24                  24  00:02.047  01:06.896  01:08.943   00:23:36       1416 
649176.batch         24   00:00:00       24  00:00.027  00:00.014  00:00.041   00:23:36       1416 
649176.0             24   00:00:00       24  00:00.813  00:24.886  00:25.699   00:08:48        528 
649176.1             24   00:00:18       24  00:01.207  00:41.996  00:43.203   00:14:24        864 

Solution

  • 1) why the times for job step 1 & 2 do not add up to the batch line?

    The SystemCPU, UserCPU and TotalCPU values reported for .batch cover only the time spent running the commands in the batch file itself, not the processes it spawns [1]. CPUTime and CPUTimeRAW, by contrast, are derived from elapsed time and the number of CPUs, so they cover the whole run; the .batch value therefore roughly matches the sum of the job-step lines (528 + 864 = 1392 s against 1416 s).

    2) what is the relationship between each column, especially for TotalCPU and CPUTime?

    TotalCPU is the sum of UserCPU and SystemCPU over all CPUs, while CPUTime is the elapsed time multiplied by the number of allocated CPUs. The difference between the two is the time the CPUs spent doing nothing (neither in user mode nor in kernel mode), most of the time waiting for I/O [2].
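    As a rough check of these relationships, the sketch below re-derives the columns from the values in the question. The helper names (to_seconds, check_row) are made up for illustration, and the parser only handles the [DD-][HH:]MM:SS[.mmm] forms that appear in this output.

```python
def to_seconds(text):
    """Convert an sacct time string such as '01:06.896' or '00:23:36' to seconds."""
    days = 0
    if "-" in text:  # optional leading 'DD-' part
        day_part, text = text.split("-", 1)
        days = int(day_part)
    seconds = 0.0
    for part in text.split(":"):  # works for MM:SS.mmm and HH:MM:SS alike
        seconds = seconds * 60 + float(part)
    return days * 86400 + seconds


# (SystemCPU, UserCPU, TotalCPU, CPUTime, CPUTimeRAW) copied from the question
rows = {
    "649176":       ("00:02.047", "01:06.896", "01:08.943", "00:23:36", 1416),
    "649176.batch": ("00:00.027", "00:00.014", "00:00.041", "00:23:36", 1416),
    "649176.0":     ("00:00.813", "00:24.886", "00:25.699", "00:08:48", 528),
    "649176.1":     ("00:01.207", "00:41.996", "00:43.203", "00:14:24", 864),
}


def check_row(row):
    """True if TotalCPU == SystemCPU + UserCPU and CPUTime matches CPUTimeRAW."""
    system, user, total, cputime, raw = row
    cpu_sum_ok = abs(to_seconds(system) + to_seconds(user) - to_seconds(total)) < 1e-6
    raw_ok = to_seconds(cputime) == raw
    return cpu_sum_ok and raw_ok


for job_id, row in rows.items():
    print(job_id, "consistent" if check_row(row) else "inconsistent")

# TotalCPU of the job line is the sum of TotalCPU over .batch and the two steps
steps_total = sum(to_seconds(rows[k][2]) for k in ("649176.batch", "649176.0", "649176.1"))
print("sum of step TotalCPU:", round(steps_total, 3), "s")
```

    In this output the measured CPU columns (SystemCPU, UserCPU, TotalCPU) do add up across the .batch and step lines, whereas CPUTime is only a product of elapsed time and CPU count.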

    3) for time usage of the job, which one is best to report?

    It depends on what you want to show. Elapsed (which you did not show here) gives the "time to solution". CPUTimeRAW is what is often accounted and paid for. The difference between CPUTime and TotalCPU gives information about the I/O overhead.
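    To make the three options concrete, here is a small sketch that derives them from the job line in the question. It assumes CPUTime = Elapsed x AllocCPUS, so the elapsed time is inferred rather than read from sacct; the function name summarize and the dictionary keys are illustrative only.

```python
ALLOC_CPUS = 24       # AllocCPUS of job 649176
CPU_TIME_RAW = 1416   # CPUTimeRAW in seconds
TOTAL_CPU = 68.943    # TotalCPU 01:08.943 in seconds


def summarize(alloc_cpus, cpu_time_raw, total_cpu):
    """Derive candidate figures to report from sacct's accounting columns."""
    return {
        # wall-clock "time to solution", inferred from CPUTime = Elapsed * AllocCPUS
        "elapsed_s": cpu_time_raw / alloc_cpus,
        # what is typically billed: allocated core time, in core-hours
        "core_hours": cpu_time_raw / 3600,
        # fraction of the allocated core time actually spent computing
        "cpu_efficiency": total_cpu / cpu_time_raw,
        # allocated core-seconds with the CPUs neither in user nor kernel mode
        "idle_core_s": cpu_time_raw - total_cpu,
    }


summary = summarize(ALLOC_CPUS, CPU_TIME_RAW, TOTAL_CPU)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

    For this job the inferred elapsed time is 59 s and less than 5% of the allocated core time was spent on the CPU, which suggests that most of the 24 allocated cores were idle or waiting.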

    [1] From the man page

    SystemCPU The amount of system CPU time used by the job or job step. The format of the output is identical to that of the Elapsed field.

    NOTE: SystemCPU provides a measure of the task’s parent process and does not include CPU time of child processes.

    [2] https://en.wikipedia.org/wiki/CPU_time