monitoring, benchmarking, slurm, hpc, sacct

Can the Slurm job statistics (from seff and sacct) be trusted?


I'm currently benchmarking a few tools on a Slurm-managed HPC cluster. Since I notably want to evaluate the resources used by each tool, I tried to use the Slurm seff and sacct commands, although I'm not sure it's a good idea. The results look pretty strange, even though the tools are launched with the same Slurm resource allocation.

For example, seff returns results like these:

    CPU Efficiency: 25.00% of 00:00:20 core-walltime
    Job Wall-clock time: 00:00:10
    Memory Utilized: 132.00 KB
    Memory Efficiency: 0.00% of 4.00 GB

The Memory Utilized is the same (132 KB) for several tools, even though some of them clearly use more than that. The Memory Efficiency of 0.00% is pretty weird too.

The same thing happens with sacct: some fields stay at 0 or at a small value when they should be larger. That is the case, for example, of MaxDiskWrite and MaxDiskRead, which remain at 0 for tools that read and write files. Note that with sacct I'm looking at the step resources (jobId.X), not the whole job; here, the step is basically the command that launches and runs the tool.
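
For reference, I'm reading the per-step lines of a query along these lines (the job ID is a placeholder and the field list is only an illustration):

    sacct -j 12345678 --format=JobID,Elapsed,MaxRSS,MaxDiskRead,MaxDiskWrite,State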

So, here are my questions: am I missing something when looking at these data, notably in the way Slurm works? And if not, can these data be trusted?

Thanks in advance.


Solution

  • Monitoring of job step usage is done at a regular interval, not continuously. The default interval is 30 seconds; you can check the actual value with scontrol show config | grep JobAcctGatherFrequency.

    With a job wall-clock time of 00:00:10 and a sampling period of 30 seconds, the job is sampled at most once, so the numbers are pretty much meaningless.

    You can try increasing the frequency at which usage is polled for your job with the --acctg-freq parameter of sbatch, as in the sketch below.
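
    For instance, a minimal batch script that raises the task sampling rate could look like the following (the 5-second interval, the job name, and my_tool with its arguments are placeholders; whether a given interval is honored can depend on the cluster configuration, so check with your administrators):

        #!/bin/bash
        #SBATCH --job-name=bench
        #SBATCH --mem=4G
        #SBATCH --acctg-freq=task=5   # sample task usage every 5 seconds instead of the default 30

        # Run the tool as its own job step so sacct/seff report it as jobId.0
        srun my_tool input.dat output.dat

    Even with a shorter sampling interval, a step that only runs for a few seconds is still measured from very few samples, so the caveat above still applies.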