How can I get detailed job run info from SLURM (e.g. like that produced for "standard output" by LSF)?
When using bsub with LSF, the -o option gave a lot of details such as when the job started and ended and how much memory and CPU time the job took. With SLURM, all I get is the same standard output that I'd get from running a script without LSF.

For example, given this Perl 6 script:

say  "Testing standard output";
warn "Testing standard Error";

Submitted thus:

sbatch -o test.o%j -e test.e%j -J test_warn --wrap 'perl6 test.p6'

Resulted in the file test.o34380:

Testing standard output

and the file test.e34380:

Testing standard Error  in block <unit> at test.p6:2

With LSF, I'd get all kinds of details in the standard output file, something like:
Sender: LSF System <lsfadmin@my_node>
Subject: Job 347511: <test> Done

Job <test> was submitted from host <my_cluster> by user <username> in cluster <my_cluster_act>.
Job was executed on host(s) <my_node>, in queue <normal>, as user <username> in cluster <my_cluster_act>.
</home/username> was used as the home directory.
</path/to/working/directory> was used as the working directory.
Started at Mon Mar 16 13:10:23 2015
Results reported at Mon Mar 16 13:10:29 2015

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
perl6 test.p6

------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :    0.19 sec.
    Max Memory :    0.10 MB
    Max Swap   :    0.10 MB

    Max Processes  :         2
    Max Threads    :         3

The output (if any) follows:

standard output stream

PS:

Read file <test.e_347511> for stderr output of this job.

Update:

Adding one or more -v flags to sbatch prints more preliminary information, but doesn't change the standard output.

Update 2:

Use seff JOBID for the desired info (where JOBID is the actual number). Just be aware that it collects data once a minute, so it might say that your max memory usage was 2.2GB, even though your job was killed due to using more than the 4GB of memory you requested.
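
A minimal sketch of calling seff, guarded so it also runs on a machine without SLURM; the job id 34380 is the one from the example above, and the commented report fields are only an approximation (exact wording varies between seff versions):

```shell
# seff needs only the numeric job id of a finished job.
if command -v seff >/dev/null 2>&1; then
    seff 34380
else
    echo "seff not installed here"
fi

# Typical lines in a seff report (approximate; wording varies by version):
#   State: COMPLETED (exit code 0)
#   CPU Utilized: 00:00:01
#   CPU Efficiency: 12.50% of 00:00:08 core-walltime
#   Memory Utilized: 2.20 GB
#   Memory Efficiency: 55.00% of 4.00 GB
```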

Umbilicate answered 28/4, 2015 at 20:8 Comment(1)
Assuming your SLURM version is new enough, just use seff. – Umbilicate
R
8

At the end of each job script, I insert

sstat -j $SLURM_JOB_ID.batch --format=JobID,MaxVMSize

to add RAM usage to the standard output.
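
In context, that means making the sstat call the last line of the batch script. A minimal sketch, where the payload echo is a placeholder for the real work, and the call is guarded so the script also runs outside a SLURM allocation (where SLURM_JOB_ID is unset and sstat does not exist):

```shell
#!/bin/bash
#SBATCH -J test_warn
#SBATCH -o test.o%j
#SBATCH -e test.e%j

echo "job payload runs here"   # placeholder for the real work, e.g. perl6 test.p6

# Last step: report this job's memory usage into the standard output file.
if [ -n "$SLURM_JOB_ID" ] && command -v sstat >/dev/null 2>&1; then
    mem_report=$(sstat -j "$SLURM_JOB_ID.batch" --format=JobID,MaxRSS,MaxVMSize)
else
    mem_report="(no SLURM allocation: sstat skipped)"
fi
echo "$mem_report"
```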

Reaper answered 23/5, 2017 at 19:13 Comment(2)
What does the .batch appended to the job id represent? I have also noticed that there can be a .extern; any idea what that is as well? – Herwick
@Herwick see https://mcmap.net/q/580561/-slurm-sacct-shows-39-batch-39-and-39-extern-39-job-names – Umbilicate
U
17

UPDATED ANSWER:

Years after my original answer, a friend pointed out seff to me, which is by far the best way to get this info:

seff JOBID

Just be aware that memory consumption is not continuously monitored, so if your job is killed for using too much memory, it really did exceed what you requested, even if seff reports a lower peak.

ORIGINAL ANSWER:
For recent jobs, try

sacct -l

Look under the "Job Accounting Fields" section of the documentation for descriptions of each of the three dozen or so columns in the output.

To get just the job ID, maximum RAM used, maximum virtual memory size, start time, end time, CPU time in seconds, and the list of nodes the job ran on, use the command below. By default it only shows jobs run today (see the --starttime and --endtime options to query other days):

sacct --format=jobid,MaxRSS,MaxVMSize,start,end,CPUTimeRAW,NodeList

This will give you output like:

       JobID  MaxRSS  MaxVMSize               Start                 End CPUTimeRAW NodeList
------------ ------- ---------- ------------------- ------------------- ---------- --------
36511                           2015-04-29T11:34:37 2015-04-29T11:34:37          0  c50b-20
36511.batch     660K    181988K 2015-04-29T11:34:37 2015-04-29T11:34:37          0  c50b-20
36514                           2015-04-29T12:18:46 2015-04-29T12:18:46          0  c50b-20
36514.batch     656K    181988K 2015-04-29T12:18:46 2015-04-29T12:18:46          0  c50b-20
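
MaxRSS and MaxVMSize come back with K/M/G suffixes; if you want them in a single unit, a small awk filter works. A sketch over a hard-coded copy of the sample above (in real use, pipe `sacct --noheader --format=jobid,MaxRSS,MaxVMSize` straight into awk instead):

```shell
# Normalize sacct's MaxRSS column (K/M/G suffixes) to megabytes.
sacct_sample='36511.batch     660K    181988K
36514.batch     656K    181988K'

rss_mb=$(echo "$sacct_sample" | awk '{
    v = $2
    unit = substr(v, length(v), 1)
    num  = substr(v, 1, length(v) - 1) + 0
    if      (unit == "K") mb = num / 1024
    else if (unit == "M") mb = num
    else if (unit == "G") mb = num * 1024
    printf "%s MaxRSS=%.2f MB\n", $1, mb
}')
echo "$rss_mb"
```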

Use --state COMPLETED to check previously completed jobs. When querying any state other than RUNNING, you must supply a start or end time:

sacct --starttime 08/01/15 --state COMPLETED --format=jobid,MaxRSS,MaxVMSize,start,end,CPUTimeRAW,NodeList,ReqCPUS,ReqMem,Elapsed,Timelimit
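
Since such queries need an explicit time window, it can be handy to compute the start date on the fly. A sketch assuming GNU date (for the -d option), with the sacct call guarded so the snippet also runs off-cluster:

```shell
# Compute a start date 7 days back, in the MM/DD/YY form sacct accepts.
start=$(date -d '7 days ago' +%m/%d/%y)
echo "querying jobs completed since $start"

if command -v sacct >/dev/null 2>&1; then
    sacct --starttime "$start" --state COMPLETED \
          --format=jobid,MaxRSS,MaxVMSize,start,end,CPUTimeRAW,NodeList
else
    echo "(sacct not available here)"
fi
```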

You can also get the working directory of a job using scontrol:

scontrol show job 36514

Which will give you output like:

JobId=36514 JobName=sbatch
UserId=username(123456) GroupId=my_group(678)
......
WorkDir=/path/to/work/dir

However, by default, scontrol can only access that information for about five minutes after the job finishes, after which it is purged from memory.
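
Since scontrol prints KEY=VALUE pairs, the working directory can be scraped out with sed. A sketch over a captured sample mirroring the output above (in real use, capture `info=$(scontrol show job 36514)` while the record is still in memory):

```shell
# Sample of `scontrol show job` output; real output indents most lines.
info='JobId=36514 JobName=sbatch
   UserId=username(123456) GroupId=my_group(678)
   WorkDir=/path/to/work/dir'

# Strip leading whitespace and the WorkDir= prefix; print only matching lines.
workdir=$(printf '%s\n' "$info" | sed -n 's/^[[:space:]]*WorkDir=//p')
echo "$workdir"    # /path/to/work/dir
```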

Umbilicate answered 29/4, 2015 at 18:1 Comment(5)
Take into account that the information provided by scontrol disappears once the MinJobAge interval configured in slurm.conf has passed (default 300 seconds). Take a look at the elasticsearch plugin github.com/asanchez1987/jobcomp-elasticsearch (not yet in a stable SLURM release, but already merged into the master branch), which stores almost all the information provided by scontrol and some of what sacct shows, except performance data. Using this plugin lets you query the workdir of a past job as well as the job script. – Acicular
@CarlesFenoy Thanks! I updated my answer to indicate how brief the opportunity to capture that information may be. – Umbilicate
You can "always" use sacct, as it queries the accounting database, while scontrol queries the slurmctld memory. – Acicular
@CarlesFenoy I assume you mean "always" as in until that particular job ID gets recycled (after hitting MaxJobId, the next job ID starts over at FirstJobId)? – Umbilicate
The accounting database can store multiple jobs with the same job ID, I think. Jobs can be purged from the database periodically or manually, but they are not purged automatically. – Acicular
