FAQ Batch Jobs

From HPC Wiki

Common Problems and Pitfalls of Batch Jobs

(explained using Slurm as the example)

Why are there certain mandatory resource requirements?

The batch scheduler needs to know some minimal properties of a job to decide which nodes it should be started on.

If, for example, you did not specify --mem-per-cpu= or --mem= (memory per node), a task requiring a very large amount of main memory might be scheduled to a node with too little RAM and would thus crash.

To put it another way: the scheduler has to play a kind of “multidimensional Tetris” with the resource requirements of all user jobs. Along at least the dimensions runtime, memory size and number of CPU cores, it packs your jobs into the cluster as efficiently and with as few gaps as possible. (In the background, many more parameters are used.)

These three properties of a job are thus the bare minimum to give the scheduler something to schedule with.
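As a minimal sketch, the three mandatory properties could be stated at the top of a job script as follows. All values are placeholders chosen for illustration only, and your site may require additional options.

```shell
#!/bin/bash
# Placeholder values, adjust them to what your program really needs.
#SBATCH --time=01:00:00        # runtime limit (hh:mm:ss)
#SBATCH --mem-per-cpu=2000     # main memory per CPU core, in MB by default
#SBATCH --ntasks=1             # number of tasks (processes)
#SBATCH --cpus-per-task=4      # CPU cores per task

# The commands of the job follow below the #SBATCH lines.
```

The closer these requests are to the real needs of the job, the easier it is for the scheduler to find a fitting gap.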

After submission of my job, it seems to start but exits immediately, without creating any output or error. What's wrong?

Check whether all directories mentioned in your job script actually exist and are writable by you. In particular, the directory specified with

#SBATCH -e /path/to/error/directory/%j.err

for the STDERR of your jobs needs to exist beforehand and must be writable by you. Slurm ends the job immediately if it is unable to write, e.g., the error file (due to a missing target directory).

Because this is a “chicken and egg” problem, a construct inside the job script like

#SBATCH -e /path/to/error/directory/%j.err
mkdir -p   /path/to/error/directory/

cannot work either, since for Slurm, the “mkdir” command is already part of the job. Thus, any potential output of “mkdir” (STDOUT or STDERR) would have to be written to a directory which does not yet exist at the beginning of the job.
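One way around this is to create and check the directory on the login node before submitting. The following is only a sketch: the helper name ensure_log_dir, the demo path and the job script name myjob.sh are assumptions made for illustration.

```shell
#!/bin/bash
# Run this on the login node BEFORE sbatch, not inside the job script.
# ensure_log_dir is a hypothetical helper: it creates the directory if
# needed and succeeds only if it then exists and is writable.
ensure_log_dir() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Hypothetical location; use the directory named in your "#SBATCH -e" line.
log_dir="${TMPDIR:-/tmp}/slurm_logs_demo"

if ensure_log_dir "$log_dir"; then
    echo "log directory ready: $log_dir"
    # sbatch myjob.sh    # submit only once the directory exists
else
    echo "cannot create or write to $log_dir" >&2
fi
```

The sbatch call is commented out so that the sketch can be tried without submitting anything.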

Sometimes, my job runs successfully, sometimes it does not. Why is that?

Make sure the relevant modules are loaded in your job script.

While you can load those modules when logging in on the login node (your batch job inherits them), this is not reliable: it makes your jobs depend on whatever modules happen to be loaded in your login session.

We thus recommend beginning each job script with

module purge
module load <each and every relevant module>
myScientificProgram …

to have exactly those modules loaded which are needed, and not more.

This also makes sure your job is reproducible later on, independently of what modules were loaded in your login session at submit time.

I get "srun: Job step creation temporarily disabled", have no results, and my job seems to have idled until it timed out. Why?

This is usually caused by "nested calls" to either srun or mpirun within the same job. The second or "inner" instance of srun/mpirun tries to allocate the same resources that the "outer" one has already allocated, and thus cannot complete.

srun myScientificProgram …

Check whether myScientificProgram is in fact an MPI-capable binary. If so, the above syntax is correct.

But if myScientificProgram turns out to be a script that itself calls srun or mpirun, then remove the srun in front of myScientificProgram and run it directly.
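A rough way to check this is sketched below. The helper name wraps_launcher and the demo file are assumptions for illustration. The check is only a heuristic: it also matches srun or mpirun inside comments, and it does not follow further scripts called from the wrapper.

```shell
#!/bin/bash
# wraps_launcher is a hypothetical helper: it succeeds if the given file
# is a script (starts with "#!") that mentions srun or mpirun as a word.
wraps_launcher() {
    local prog="$1"
    # Scripts start with the two characters "#!"; compiled binaries do not.
    [ "$(head -c 2 "$prog")" = "#!" ] || return 1
    grep -Eqw 'srun|mpirun' "$prog"
}

# Demonstration with a throwaway wrapper script (./solver is made up).
demo_script="$(mktemp)"
printf '#!/bin/bash\nmpirun ./solver\n' > "$demo_script"

if wraps_launcher "$demo_script"; then
    advice="run it directly, without a leading srun"
else
    advice="starting it with srun is plausible"
fi
echo "$advice"
rm -f "$demo_script"
```

For a real job, pass the full path of myScientificProgram to the helper instead of the demo file.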

My jobs are reported as “COMPLETED”, even though my scientific program in fact failed miserably. Why is that?

There is no magic by which the scheduler could know the really important part of your job script. The only way for Slurm to detect success or failure is the exit code of your job script, not the real success or failure of any program or command within it.

The exit code of well-written programs is zero in case everything went well, and >0 if an error has occurred.

Imagine the following job script:

myScientificProgram …

Here, the last command executed is in fact your scientific program, so the whole job script exits with the exit code of “myScientificProgram” as desired. Thus, Slurm will assign COMPLETED if “myScientificProgram” has had an exit code of 0, and will assign FAILED if not.

If you issue just one simple command after “myScientificProgram”, this will overwrite the exit code of “myScientificProgram” with its own:

myScientificProgram …
cp resultfile $HOME/jobresults/

Now, the exit code of “cp” will be the whole job's exit code, since “cp” is the last command of the job script. If the “cp” command succeeds, Slurm will assign COMPLETED even though “myScientificProgram” might have failed: the success of “cp” masks the failure of “myScientificProgram”.
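The masking effect can be reproduced outside of Slurm. In this sketch, the shell builtins false and true are stand-ins (an assumption for illustration) for a failing “myScientificProgram” and a succeeding “cp”, and a subshell plays the role of the job script.

```shell
#!/bin/bash
# The subshell stands in for the job script.
(
    false   # stand-in for a failing myScientificProgram (exit code 1)
    true    # stand-in for a succeeding cp (exit code 0)
)
masked_status=$?

# The exit code of the last command wins, so the failure is hidden.
echo "exit code seen from outside: $masked_status"   # prints 0
```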

To avoid that, save the exit code of your important program before executing any additional commands:

myScientificProgram …
EXITCODE=$?
cp resultfile $HOME/jobresults/
/any/other/job/closure/cleanup/commands …
exit $EXITCODE

Immediately after myScientificProgram has run, its exit code is saved to $EXITCODE, and as its very last line, your job script sets its own exit code to that of the real payload. That way, Slurm picks up the exit code of “myScientificProgram”, not just that of whichever command happens to be the last one in your job script, and will set COMPLETED or FAILED appropriately.
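The effect of saving and re-using the exit code can again be tried without Slurm. As before, false and true are merely stand-ins for a failing payload and for succeeding follow-up commands, and the subshell plays the role of the job script.

```shell
#!/bin/bash
# The subshell stands in for the job script, so "exit" ends only the subshell.
(
    false               # stand-in for a failing myScientificProgram
    EXITCODE=$?         # save the payload's exit code immediately
    true                # stand-in for cp and cleanup commands that succeed
    exit "$EXITCODE"    # hand the payload's exit code back as the last line
)
job_status=$?

echo "exit code seen from outside: $job_status"   # prints 1
```

The saving line has to come directly after the payload, because $? always refers to the most recently executed command.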