How to monitor resource usage during a Slurm job?

I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system. I'd like to plot CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat, and I was thinking of including these commands in my submission script, e.g. something along the lines of

#!/bin/bash
#SBATCH <options>

# Running the actual job in background
srun my_program input.in output.out &

# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
    #update job status
    JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
    if [ "$JobStatus" == "RUNNING" ]; then
        if [ $FIRST -eq 0 ]; then
            sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
            FIRST=1
        else
            sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
        fi
        sleep $STIME
    elif [ "$JobStatus" == "PENDING" ]; then
        sleep $STIME
    else
        sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
        JobStatus="COMPLETED"
        break
    fi
done

However, I'm not really convinced of this solution:

sstat unfortunately doesn't show how many cpus are used at the moment (only average)
MaxRSS is also not helpful if I try to record memory usage over time
there still seems to be some error (script doesn't stop after job finishes)

Does anyone have an idea how to do that properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.

Solution

Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, even disk/network I/O for some technologies) into an HDF5 file. The file contains a time series for each tracked measure, and you can choose the time resolution.

You can activate it with

#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>

See the Slurm documentation on job profiling for details.

To check that this plugin is installed, run

scontrol show config | grep AcctGatherProfileType

It should output AcctGatherProfileType = acct_gather_profile/hdf5.
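If you want to make that check from a script, a minimal sketch could look like the following. The function name hdf5_profiling_enabled is hypothetical; it only inspects text in the format printed by scontrol show config and does not talk to Slurm itself.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the text on stdin (expected to be the
# output of `scontrol show config`) names the HDF5 profiling plugin.
hdf5_profiling_enabled() {
    grep -Eq '^AcctGatherProfileType[[:space:]]*=[[:space:]]*acct_gather_profile/hdf5'
}
```

On a login node it would be used as: scontrol show config | hdf5_profiling_enabled && echo "HDF5 profiling available"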

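The files are created in the folder referred to by the ProfileHDF5Dir Slurm configuration parameter (set in acct_gather.conf). One file is written per node and job step; the sh5util tool that ships with Slurm can merge them into a single file per job.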

As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:

pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt

This will give you CPU and memory usage per process.
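To turn such samples into something you can plot, each one needs a timestamp and usually a per-node total. The following is a rough sketch, not a definitive tool: summarize_sample is a hypothetical helper, and it assumes the column order of the ps command above (pid, state, cputime, %cpu, rssize, command) with the "node: " prefix that pdsh puts in front of every line.

```shell
#!/bin/bash
# Hypothetical helper: condense one pdsh/ps sample into one line per node.
# Assumed input format (pdsh prefix + the ps columns used above):
#   node01: 12345 R 00:01:02 98.7 204800 my_program
# Output: "<timestamp> <node> <summed %CPU> <summed RSS in kilobytes>"
summarize_sample() {
    local stamp="$1"    # timestamp for this sample, e.g. "$(date +%s)"
    awk -v t="$stamp" '
        $2 ~ /^[0-9]+$/ {               # data rows only; skips the ps header
            node = $1
            sub(/:$/, "", node)         # drop the colon pdsh appends
            cpu[node] += $5
            rss[node] += $6
        }
        END {
            for (n in cpu) printf "%s %s %.1f %d\n", t, n, cpu[n], rss[n]
        }
    ' | sort
}
```

Inside the monitoring loop, the pdsh command would then be piped through it, e.g. pdsh -j $SLURM_JOB_ID ps ... | summarize_sample "$(date +%s)" >> usage.txt. Note that the totals cover all of your processes on the node, including the ps call itself, not only those of the job.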

As a final note, your job never terminates because of a circular wait: the job ends only when the batch script (and therefore the while loop) ends, but the loop exits only once the job is reported as finished. As long as the script is running, the job is still running, so the condition "$JobStatus" == "COMPLETED" will never be observed from within the script. By the time the job is marked as completed, the script has already been terminated.
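One way to avoid this is to watch the background process itself rather than the job state. The sketch below is only an illustration under a few assumptions: monitor_while_running and take_sample are hypothetical names, and the sstat call is used only if sstat and SLURM_JOB_ID are available (otherwise just the timestamp is logged).

```shell
#!/bin/bash
# Hypothetical sampler: prints one line of usage data, or nothing when
# not running inside a Slurm job. Swap in pdsh/ps here if preferred.
take_sample() {
    if command -v sstat >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
        sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j "${SLURM_JOB_ID}"
    fi
}

# Hypothetical wrapper: usage is
#   monitor_while_running <logfile> <interval_seconds> <command...>
monitor_while_running() {
    local logfile="$1" interval="$2"
    shift 2
    "$@" &                          # start the real work in the background
    local pid=$!
    # kill -0 sends no signal; it only tests whether the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        printf '%s|%s\n' "$(date +%s)" "$(take_sample)" >> "$logfile"
        sleep "$interval"
    done
    wait "$pid"                     # return the exit status of the command
}
```

In the submission script this would be called as monitor_while_running usage.txt 15 srun my_program input.in output.out. The loop ends as soon as srun exits, so the batch script, and with it the job, can finish normally.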