Tags: slurm, lsf

How can I get detailed job run info from SLURM (e.g. like that produced for "standard output" by LSF)?


When submitting with bsub under LSF, the -o option produced a lot of detail, such as when the job started and ended and how much memory and CPU time it used. With SLURM, all I get is the same standard output I'd see from running the script on its own, outside any batch system.

For example, given this Perl 6 script:

warn  "standard error stream";
say  "standard output stream";

Submitted thus:

sbatch -o test.o%j -e test.e%j -J test_warn --wrap 'perl6 test.p6'

This resulted in the file test.o34380:

Testing standard output

and the file test.e34380:

Testing standard Error  in block <unit> at test.p6:2

With LSF, I'd get all kinds of details in the standard output file, something like:
Sender: LSF System <lsfadmin@my_node>
Subject: Job 347511: <test> Done

Job <test> was submitted from host <my_cluster> by user <username> in cluster <my_cluster_act>.
Job was executed on host(s) <my_node>, in queue <normal>, as user <username> in cluster <my_cluster_act>.
</home/username> was used as the home directory.
</path/to/working/directory> was used as the working directory.
Started at Mon Mar 16 13:10:23 2015
Results reported at Mon Mar 16 13:10:29 2015

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
perl6 test.p6

------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :    0.19 sec.
    Max Memory :    0.10 MB
    Max Swap   :    0.10 MB

    Max Processes  :         2
    Max Threads    :         3

The output (if any) follows:

Testing standard output

PS:

Read file <test.e_347511> for stderr output of this job.

Update:

Adding one or more -v flags to sbatch gives more preliminary information, but it doesn't change the job's standard output.
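
For example, the submission from the question with extra verbosity (this just adds -vv; everything else is unchanged):

sbatch -vv -o test.o%j -e test.e%j -J test_warn --wrap 'perl6 test.p6'

The extra messages describe what sbatch itself does at submission time, not what the job did.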

Update 2:

Use seff JOBID (where JOBID is the actual job number) for the desired info. Just be aware that the data is only collected once a minute, so seff might report a peak memory usage of 2.2 GB even though your job was killed for using more than the 4 GB of memory you requested.
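
seff summarizes SLURM's accounting records, so if accounting is enabled on your cluster you can also pull the raw numbers yourself with sacct once the job has finished. A hedged example, using the job ID from above and standard sacct format fields:

sacct -j 34380 --format=JobID,Elapsed,TotalCPU,MaxRSS,State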


Solution

  • At the end of each job script, I insert

    sstat -j $SLURM_JOB_ID.batch --format=JobID,MaxVMSize

    to add RAM usage to the standard output.
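
    For context, a minimal sketch of what this might look like inside a submission script (the #SBATCH options mirror the flags from the question, and perl6 test.p6 stands in for the real workload):

    #!/bin/bash
    #SBATCH -J test_warn
    #SBATCH -o test.o%j
    #SBATCH -e test.e%j

    # The actual work of the job
    perl6 test.p6

    # Append peak memory of the batch step to this job's standard output;
    # MaxVMSize and MaxRSS are standard sstat format fields.
    sstat -j "${SLURM_JOB_ID}.batch" --format=JobID,MaxVMSize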