SLURM

From HPC Wiki
Latest revision as of 09:16, 5 September 2024

General

SLURM is a workload manager / job scheduler. For an overview of what a scheduler does, see the General section of the Scheduler article or the Scheduling Basics article.



#SBATCH Usage

If you are writing a jobscript for a SLURM batch system, the magic cookie is "#SBATCH". To use it, start a new line in your script with "#SBATCH". Following that, you can put one of the parameters shown below, where the word written in <...> should be replaced with a value.

Basic settings:

Parameter            Function
--job-name=<name>    job name
--output=<path>      path to the file where the job (error) output is written to

Requesting resources:

Parameter            Function
--time=<runlimit>    runtime limit in the format hours:min:sec; once the time specified is up, the job will be killed by the scheduler
--mem=<memlimit>     job memory request per node, usually an integer followed by a suffix for the unit (e. g. --mem=1G for 1 GB)

Parallel programming (read more in the Parallel Programming article):

Settings for OpenMP:

Parameter                               Function
--nodes=1                               start a parallel job for a shared-memory system on only one node
--cpus-per-task=<num_threads>           number of threads to execute OpenMP application with
--ntasks-per-core=<num_hyperthreads>    number of hyperthreads per core; i. e. any value greater than 1 will turn on hyperthreading (the possible maximum depends on your CPU)
--ntasks-per-node=1                     for OpenMP, use one task per node only

Settings for MPI:

Parameter                        Function
--nodes=<num_nodes>              start a parallel job for a distributed-memory system on several nodes
--cpus-per-task=1                for MPI, use one CPU per task
--ntasks-per-core=1              disable hyperthreading
--ntasks-per-node=<num_procs>    number of processes per node (the possible maximum depends on your nodes)


Email notifications:

Parameter                      Function
--mail-type=<type>             type can be one of BEGIN, END, FAIL, REQUEUE or ALL (where a mail will be sent each time the status of your process changes)
--mail-user=<email_address>    email address to send notifications to

A more complete list of sbatch settings can be found in the official SBATCH documentation (https://slurm.schedmd.com/sbatch.html).

OpenMP/Multithreading vs. MPI

There are several ways to request a certain number of CPU cores for your program (# in the following examples), but note the following distinction:

--ntasks=# / -n #
requests # CPU cores for MPI ranks (distinct processes) → these can be distributed over several compute nodes!


--cpus-per-task=# / -c #
requests # CPU cores for multithreaded applications (e.g. OpenMP) → these will always be allocated inside one single compute node, never across several nodes!


For a plain MPI application, use --ntasks=#. This uses Distributed Memory (across nodes) and requires MPI.
For a plain OpenMP/multithreaded application, use --ntasks=1 --cpus-per-task=#. This uses Shared Memory (inside a single node).
For a hybrid application, use --ntasks=<no of nodes> plus --cpus-per-task=<no of cores per node>. This uses both Shared Memory and Distributed Memory and requires MPI.

The SBATCH option --ntasks-per-core=# is only suitable for compute nodes having HyperThreading enabled in hardware/BIOS, which is not always the case.

All numbers above are subject to your own scaling tests! If your OpenMP application does not scale up well enough to the number of cores physically available in a compute node, slice your data into smaller chunks and use smaller jobs with --cpus-per-task=<optimum of your scaling test>.

If, for example, your program scales best up to 24 CPU cores (while your typical compute node has 96), submit 4 jobs with --cpus-per-task=24, preferably without #SBATCH --exclusive, so that all four can fit onto the same node.
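
The arithmetic behind this example can be sketched in a few lines of shell. The function name jobs_per_node is a hypothetical helper, not a Slurm command, and the numbers 96 and 24 are only the illustrative values from the text; substitute the results of your own scaling test.

```shell
#!/bin/bash
# jobs_per_node CORES_PER_NODE CORES_PER_JOB
# Hypothetical helper (not part of Slurm): prints how many jobs of the
# given width fit side by side on one node, using integer division.
jobs_per_node() {
    echo $(( $1 / $2 ))
}

# Values from the example: 96-core node, best scaling at 24 cores.
jobs_per_node 96 24
```

With these values the script prints 4, i.e. four jobs with --cpus-per-task=24 can share one node, provided they are not submitted with --exclusive.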

Job Submission

This command submits the job you defined in your jobscript to the batch system:

$ sbatch jobscript.sh

Just like any other incoming job, your job will first be queued. Then, the scheduler decides when your job will be run. The more resources your job requires, the longer it may have to wait before it starts.

You can check the current status of your submitted jobs and their job ids with the following shell command. A job can either be pending (PD, waiting for free nodes to run on) or running (R, the jobscript is currently being executed). This command will also print the time (hours:min:sec) that your job has been running for.

$ squeue -u <user_id>

Add the parameter --start to the squeue command to report the expected start time and the resources to be allocated for pending jobs. Note that this start time is not guaranteed; it may change due to higher-priority jobs or job backfilling.


In case you submitted a job by accident or realised that your job might not be running correctly, you can always remove it from the queue, or terminate it if it is already running, by typing:

$ scancel <job_id>

Furthermore, information about current and past jobs can be accessed via:

$ sacct

More detailed information is available in the Slurm documentation of this command (https://slurm.schedmd.com/sacct.html).
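
As a sketch of a typical query, the call below restricts sacct to one job and selects a few columns. The wrapper name job_summary is a hypothetical convenience function; the choice of columns (JobID, JobName, State, Elapsed, MaxRSS) is only an example, and which fields are actually populated depends on how accounting is configured on your cluster.

```shell
#!/bin/bash
# job_summary JOBID
# Hypothetical wrapper: show state, elapsed time and peak memory of one job.
job_summary() {
    sacct -j "$1" --format=JobID,JobName,State,Elapsed,MaxRSS
}

# Usage on a cluster (replace <job_id> with a real job id):
#   job_summary <job_id>
```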

Array and Chain Jobs

Arrays are the best way to submit many similar jobs. In short: whenever you are tempted to write a shell loop around sbatch like

 for i in {1..1000} ; do
     sbatch myJobScript ${i}.jpg
 done

do not do it - instead, use Slurm's job array feature.

High-Throughput computing

The above example of an image analysis over 1000 JPG files (named 1.jpg, 2.jpg, 3.jpg, ...) can be written as a job array with

 #SBATCH --array=1-1000
 #SBATCH ...
 myJPGAnalyzer --input=${SLURM_ARRAY_TASK_ID}.jpg > ${SLURM_ARRAY_TASK_ID}.out

Slurm will create 1 job with 1000 elements (subjobs = array tasks). These array tasks are

  • independent of each other
  • scheduled in any free time slot on any free compute node
  • run in parallel, as many at a time as free nodes and time slots allow
  • handled with less than a tenth of the Slurm-internal effort that distinct single jobs (those created by the above shell loop) would require

All array tasks share the same Slurm job id, followed by an underscore and their array index:

23477687_1
23477687_2
...
23477687_1000
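
If the input files are not numbered consecutively, a common pattern is to let the array index select one line from a list of file names. The sketch below assumes a hypothetical text file filelist.txt with one input file per line; pick_input is an illustrative helper name, and myJPGAnalyzer is the placeholder program from the example above.

```shell
#!/bin/bash
# pick_input LISTFILE INDEX
# Hypothetical helper: print line number INDEX of LISTFILE.
pick_input() {
    sed -n "${2}p" "$1"
}

# Inside a job script submitted with e.g. "#SBATCH --array=1-3":
#   input=$(pick_input filelist.txt "$SLURM_ARRAY_TASK_ID")
#   myJPGAnalyzer --input="$input" > "${SLURM_ARRAY_TASK_ID}.out"
```

The array range then has to match the number of lines in the list file.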
Time series

Job arrays can also be used to create series of jobs, starting one after another.

Imagine a cluster with very few nodes allowing jobs up to 7 days runtime, and a lot of nodes dedicated to jobs <24h. If you submit a job with --time=7-00:00:00, this job would have to wait in "pending" for a long time.

If you have to run a very large, long-running simulation with a program capable of checkpoint/restart (CPR), you can create a time series of 24h jobs:

 #SBATCH --array=1-19%1
 #SBATCH --time=1-00:00:00
 #SBATCH ...
 mySim ... -statefile=Simulation19d.state

With "%#", you can restrict the number of array tasks which Slurm runs in parallel. Our %1 here thus creates a "one after another" suite of follow-up array tasks.

Each array task will

  • run for one day, continuously saving the state of the simulation in "Simulation19d.state"
  • be killed by Slurm as soon as it exceeds its 24-hour time limit
  • be followed by the next array task, which picks up right where its predecessor left off (by reading in "Simulation19d.state")

That way, you run your 19-day simulation in 19 single-day chunks, using the many more compute nodes available in the 24h queue!


#SBATCH --array=1-4:2%1

This creates an array job with 2 subjobs (indices 1 and 3, i.e. the range 1..4 with a step of 2), of which only one may be executed at a time, in a random order.
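
To see which indices such a specification produces, the range can be expanded locally with seq. This is only an illustration of the first-last:step notation; array_indices is a hypothetical helper, and the %1 limit affects how many tasks run at once, not which indices exist.

```shell
#!/bin/bash
# array_indices FIRST LAST [STEP]
# Hypothetical helper: list the task indices that
# "--array=FIRST-LAST:STEP" would create (STEP defaults to 1).
array_indices() {
    seq "$1" "${3:-1}" "$2"
}

# The specification from the example, 1-4:2, yields indices 1 and 3.
array_indices 1 4 2
```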

An explicit order can be enforced either by submitting each subjob at the end of its predecessor (which may lengthen the time spent pending) or by using the dependencies feature, which results in a ...

chain job with dependencies

 #SBATCH --dependency=<type>

The available conditions for chain jobs are

Condition             Function
after:<jobID>         job can start once job <jobID> has started execution
afterany:<jobID>      job can start once job <jobID> has terminated
afterok:<jobID>       job can start once job <jobID> has terminated successfully
afternotok:<jobID>    job can start once job <jobID> has terminated upon failure
singleton             job can start once any previous job with identical name and user has terminated
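
A chain can also be built at submission time by passing the job id of each step to the next sbatch call. The following sketch assumes that every step uses the same job script and that afterok is the desired condition; submit_chain is a hypothetical helper name. It relies on sbatch's --parsable option, which prints only the job id (optionally followed by a semicolon and the cluster name).

```shell
#!/bin/bash
# submit_chain SCRIPT COUNT
# Hypothetical helper: submit SCRIPT COUNT times so that each submission
# starts only after the previous one has finished successfully (afterok).
submit_chain() {
    script=$1
    count=$2
    prev=""
    i=1
    while [ "$i" -le "$count" ]; do
        if [ -z "$prev" ]; then
            # first step: no dependency
            jobid=$(sbatch --parsable "$script") || return 1
        else
            # later steps wait for the previously submitted job
            jobid=$(sbatch --parsable --dependency="afterok:${prev}" "$script") || return 1
        fi
        # --parsable may print "<jobid>;<cluster>"; keep the job id only
        prev=${jobid%%;*}
        echo "step $i submitted as job $prev"
        i=$((i + 1))
    done
}

# Usage on a cluster, e.g. three chained steps of the same script:
#   submit_chain jobscript.sh 3
```

All steps are queued immediately, so later steps can accumulate waiting time while their predecessors are still running.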

Jobscript Examples

This serial job will run a given executable, in this case "myapp.exe".

#!/bin/bash

### Job name
#SBATCH --job-name=MYJOB

### File for the output
#SBATCH --output=MYJOB_OUTPUT

### Time your job needs to execute, e. g. 15 min 30 sec
#SBATCH --time=00:15:30

### Memory your job needs per node, e. g. 1 GB
#SBATCH --mem=1G

### The last part consists of regular shell commands:
### Change to working directory
cd /home/usr/workingdirectory

### Execute your application
myapp.exe

To run a parallel job on a cluster that is managed by SLURM, you have to launch the executable through SLURM. To do so, use the command "srun <my_executable>" in your jobscript.

This OpenMP job will start the parallel program "myapp.exe" with 24 threads.

#!/bin/bash

### Job name
#SBATCH --job-name=OMPJOB

### File for the output
#SBATCH --output=OMPJOB_OUTPUT

### Time your job needs to execute, e. g. 30 min
#SBATCH --time=00:30:00

### Memory your job needs per node, e. g. 500 MB
#SBATCH --mem=500M

### Number of threads to use, e. g. 24
#SBATCH --cpus-per-task=24

### The last part consists of regular shell commands:
### Set the number of threads in your cluster environment to the value specified above
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

### Change to working directory
cd /home/usr/workingdirectory

### Run your parallel application
srun myapp.exe

This MPI job will start the parallel program "myapp.exe" with 12 processes.

#!/bin/bash

### Job name
#SBATCH --job-name=MPIJOB

### File for the output
#SBATCH --output=MPIJOB_OUTPUT

### Time your job needs to execute, e. g. 50 min
#SBATCH --time=00:50:00

### Memory your job needs per node, e. g. 250 MB
#SBATCH --mem=250M

### Use more than one node for parallel jobs on distributed-memory systems, e. g. 2
#SBATCH --nodes=2

### Number of CPUs per task (for distributed-memory parallelisation, use 1)
#SBATCH --cpus-per-task=1

### Disable hyperthreading by setting the tasks per core to 1
#SBATCH --ntasks-per-core=1

### Number of processes per node, e. g. 6 (6 processes on 2 nodes = 12 processes in total)
#SBATCH --ntasks-per-node=6

### The last part consists of regular shell commands:
### Set the number of threads in your cluster environment to 1, as specified above
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

### Change to working directory
cd /home/usr/workingdirectory

### Run your parallel application
srun myapp.exe

More elaborate SLURM job scripts are available in the articles on running a hybrid MPI+OpenMP program in a batch job and on running multiple shared-memory / OpenMP programs at a time in one batch job.

Site specific notes

RRZE

  • --output= should not be used on RRZE's clusters; the submit filter already sets suitable defaults automatically
  • --mem=<memlimit> must not be used on RRZE's clusters
  • the first line of the job script should be #!/bin/bash -l, otherwise module commands won't work in the job script
  • to have a clean environment in job scripts, it is recommended to add #SBATCH --export=NONE and unset SLURM_EXPORT_ENV to the job script. Otherwise, the job will inherit some settings from the submitting shell.
  • access to the parallel file system has to be specified by #SBATCH --constraint=parfs or the command line shortcut -C parfs
  • access to hardware performance counters (e.g. to be able to use likwid-perfctr) has to be requested by #SBATCH --constraint=hwperf or the command line shortcut -C hwperf. Only request that feature if you really want to access the hardware performance counters, as the feature interferes with the automatic system monitoring.
  • multiple features have to be requested in a single --constraint= statement, listing all required features separated by an ampersand, e.g. hwperf&parfs
  • for Intel MPI, RRZE recommends the usage of mpirun instead of srun; if srun is used, the additional command line argument --mpi=pmi2 is required. The command line option -ppn of mpirun only works if you export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off beforehand.
  • for squeue the option -u user does not have any effect as you always only see your own jobs
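
Put together, the RRZE notes above lead to a job script header along the following lines. This is a sketch assembled from the bullet points, not an official RRZE template; both constraints are requested here only to show the ampersand syntax, and the time limit is an arbitrary placeholder.

```shell
#!/bin/bash -l
# Login shell (-l) so that "module" commands work inside the job.

# Do not inherit the environment of the submitting shell.
#SBATCH --export=NONE
# Placeholder runtime limit; adjust to your job.
#SBATCH --time=00:30:00
# Request both features in ONE statement, separated by an ampersand.
# Only ask for hwperf if you really need hardware performance counters.
#SBATCH --constraint=hwperf&parfs

unset SLURM_EXPORT_ENV

# No --output= and no --mem= lines, as described in the notes above.
# The commands that load modules and start the application follow here.
```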

RWTH

  • --mem=<memlimit> must not be used on RWTH's clusters
  • the OMP_NUM_THREADS environment variable must not be set/overwritten on RWTH's clusters in OpenMP and Hybrid jobs; this variable is set by the system automatically.
  • access to hardware performance counters in order to use likwid-perfctr or Intel VTune is available using the --hwctr=likwid or --hwctr=vtune parameter, respectively.
  • in order to start an MPI or Hybrid application, please use $MPIEXEC $FLAGS_MPI_BATCH ./a.out instead of the srun command; these environment variables are set by the module system according to the MPI vendor in use.
  • the shebang of the batch script must be #!/usr/bin/zsh (otherwise the modules are not accessible)

References

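Advanced SLURM jobscript examples: https://doku.lrz.de/display/PUBLIC/Example+parallel+job+scripts+on+the+Linux-Cluster/

Detailed guide to more advanced scripts: https://docs.nersc.gov/jobs/examples/

SBATCH documentation: https://slurm.schedmd.com/sbatch.html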