slurm sacct

sacct reports different results for the same job


I run sacct with the -j switch for a specific job ID. Depending on the other command-line switches, two completely different results are reported for the same job. Here are three examples; the second one shows a different result from the other two.

attar@lh> sacct -a -s CA,CD,F,NF,PR,TO  -S 2020-07-26T00:00:00 -E 2020-07-27T23:59:59  --format=JobId,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus -j 1401
       JobID      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS
------------ ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ----------
1401         CANCELLED+  UNLIMITED 2020-07-26T20:45:31 2020-07-27T08:36:10   11:50:39                              1          2
1401.batch    COMPLETED            2020-07-26T20:45:31 2020-07-27T08:36:17   11:50:46    103856K    619812K        1          2

attar@lh> sacct -a -s CA,CD,F,NF,PR,TO  -S 2020-07-26T00:00:00 -E 2020-07-26T23:59:59  --format=JobId,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus -j 1401
       JobID      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS
------------ ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ----------
1401          NODE_FAIL  UNLIMITED 2020-06-15T09:38:38 2020-07-26T00:17:26 40-14:38:48                              1          2

attar@lh> sacct -a -s CA,CD,F,NF,PR,TO    --format=JobId,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus -j 1401
       JobID      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS
------------ ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ----------
1401         CANCELLED+  UNLIMITED 2020-07-26T20:45:31 2020-07-27T08:36:10   11:50:39                              1          2
1401.batch    COMPLETED            2020-07-26T20:45:31 2020-07-27T08:36:17   11:50:46    103856K    619812K        1          2

Why are the start and end times different for the same job? One result reports a run time of about 11 hours and the other about 40 days!

Any insight is highly appreciated!


Solution

  • This typically happens when two different jobs share the same JobID. The sacct documentation says:

    If Slurm job ids are reset, some job numbers will probably appear more than once in the accounting log file but refer to different jobs. Such jobs can be distinguished by the "submit" time stamp in the data records.

    Try running the sacct command with the --duplicates option, which returns all records that match the selection criteria.
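
As a rough sketch of that suggestion, the commands below list every accounting record stored under job ID 1401 together with its submit time, then count how many distinct submit times appear. The helper name count_distinct_submits is made up for illustration and is not part of Slurm. It assumes pipe-delimited input, as produced by --parsable2 --noheader, with JobID and Submit as the first two fields.

```shell
# Hypothetical helper (not part of Slurm): reads "JobID|Submit|..." lines on
# stdin and prints how many distinct submit times occur among top-level job
# records, i.e. records whose JobID has no ".step" suffix such as ".batch".
count_distinct_submits() {
  awk -F'|' '
    $1 !~ /\./ { seen[$2] = 1 }              # skip step records like 1401.batch
    END { n = 0; for (s in seen) n++; print n }
  '
}

# Only run the queries where sacct is installed (a Slurm cluster with accounting).
if command -v sacct >/dev/null 2>&1; then
  # Human-readable view: one row per record, distinguishable by Submit.
  sacct --duplicates -j 1401 --format=JobID,Submit,Start,End,Elapsed,State

  # Machine-readable view piped into the helper.
  sacct --duplicates -j 1401 --parsable2 --noheader \
        --format=JobID,Submit,Start,End,State | count_distinct_submits
fi
```

A count greater than 1 would suggest that the job ID has been reused, in which case the Submit column shows which record belongs to which job.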