HPC Wiki - User contributions [en] (MediaWiki 1.35.9, feed retrieved 2024-03-29)
HPC Wiki (2019-09-17T09:06:23Z)<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>Welcome to the [[HPC_Wiki:About| HPC Wiki]], the source for site-independent High Performance Computing information.<br />
<br />
<-- On the left-hand side, the different target groups are listed with their respective material.<br />
<br />
== Target Groups ==<br />
- '''[[:Category:Basics| Basics]]'''<br />
<br />
- '''[[:Category:HPC-User| HPC-User]]'''<br />
<br />
- '''[[:Category:HPC-Developer| HPC-Developer]]'''<br />
== Categories ==<br />
<br />
[[Getting_Started]] is a basic guide for first-time users. It covers a wide range of topics, from access and login to system-independent concepts of Unix systems to data transfers. While this gives an overview, all articles in the Basics section are written with inexperienced users in mind, to explain concepts in an easy-to-understand way.<br />
<br />
Similar articles for the User and Developer sections are planned, but not yet finished.<br />
<br />
Look into the [[FAQs]], and see [[How-to-Contribute]] for tips and instructions on contributing to this wiki.<br />
<br />
== Overview ==<br />
General: [[How-to-Contribute]]<br />
<br />
<br />
Basics/HPC-User: [[make]], [[cmake]], [[ssh_keys]], [[compiler]], [[Modules]], [[vim]], [[screen/tmux]], [[ssh]], [[python/pip]], [[scp]], [[rsync]], [[git]], [[shell]], [[chmod]], [[tar]], [[sh-file]], [[NUMA]]<br />
<br />
<br />
HPC-Dev: [[Load_Balancing]], [[Performance Engineering]], [[correctness checking]]<br />
<br />
HPC-Programs: [[Measurement-tools]], [[Likwid]], [[Vampir]], [[ScoreP]], [[MUST]]<br />
<br />
<br />
HPC-Pages:<br />
[[Software]], [[Access]], [[Site-specific_documentation]], [[measurement-tools]], [[likwid]]</div>

Measurement tools (2019-09-17T08:58:10Z)
<hr />
<div>[[Category:HPC-Developer]]<br />
== Hardware Performance Counter Measurement Tools ==<br />
=== Low Level ===<br />
* Perf: The main interface in the Linux kernel and a corresponding user-space tool to measure hardware counters<br />
* PAPI (Performance API): A generic API for applications to measure different aspects of the system. For hardware performance counters it uses the perf backend; plugins for GPUs and other components exist as well<br />
* PCM (Performance Counter Monitor): A higher level tool and API that provides common metrics like memory bandwidth and NUMA traffic. The API also provides access to any hardware counter event<br />
* PMU-Tools: A set of Python scripts that use the perf backend<br />
* [[Likwid|LIKWID]]: Command line applications and API to measure hardware events which can use perf as backend but also provides other backends to be independent of the kernel version<br />
<br />
=== High Level ===<br />
* [[ARMPerfReports|ARM Performance Reports]]: A tool that provides a simple one-page HTML report highlighting processor, memory, communication and I/O issues and offering advice on how to improve performance.<br />
* [[Vampir|Vampir]]: A scalable framework for performance analysis using PAPI as backend<br />
* [[Tau|TAU]]: Utilities to sample or instrument code for hardware counters and other metrics<br />
* [[HPCToolkit|HPC-Toolkit]]: Toolkit to sample timers and hardware performance counters for serial and parallel applications<br />
* [[Intel Advisor]]: A vectorization and threading optimization tool<br />
* [[Intel_VTune|Intel VTune]]: A performance profiling tool to analyse algorithms and hardware usage for serial and parallel applications<br />
* [[Scalasca]]: A performance optimisation tool for runtime behaviour measurement and analysis of parallel programs <br />
* [[Score-P]]: A Scalable Performance Measurement Infrastructure for Parallel Codes<br />
* [[Intel Trace Collector/Analyzer]]: Powerful tools that acquire/display information on the communication behavior of MPI programs<br />
* [[Oracle Sampling Collector and Performance Analyzer]]: Pair of tools that can collect and analyze performance data for serial or parallel applications<br />
<br />
=== Links and Further Information ===<br />
* [https://www.vi-hps.org Virtual Institute - High Productivity Supercomputing (VI-HPS) ]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1547Multiple Program Runs in one Slurm Job2019-03-22T11:10:25Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple shared-memory / [[OpenMP]] programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time using the [[SLURM|Slurm]] batch system.<br />
<br />
__TOC__<br />
<br />
== Problem Statement ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory ([[OpenMP]]) programs does not always keep pace with the increasing core count. In such a case a program cannot profit from such a high number of cores, and resources may be wasted.<br />
Additional complexity stems from the [[NUMA]] architecture of modern compute nodes, which has to be taken into account for performance reasons.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing compute nodes between multiple jobs. <br />
But because of hardware characteristics (sharing hardware resources like caches, paths to memory) these jobs may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses one node exclusively.<br />
These program runs will still have an impact on each other, but they are under the control of a single user, and when applied repeatedly the total runtime will be more predictable. <br />
In such a case the input data, the execution environment and the batch job script have to be adjusted properly.<br />
<br />
<br />
== Example 1 ==<br />
Two runs of the Gaussian chemistry code are started within one Slurm job in the following example.<br />
The target machine has two Intel Skylake processors with 24 cores each and provides 192 GB of main memory and some 400 GB of SSD for fast file IO.<br />
Each program run uses 24 threads such that both runs together occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file IO does not interfere.<br />
Both programs are started asynchronously, and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture of a modern computer in an optimal way, the command '''g09''' to launch the Gaussian package is started under the control of the '''numactl''' command - see the explanation in the NUMA Aspects section below. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
### launch 2 program runs<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
### wait for the termination of both program runs<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by '''%nprocshared''' and the amount of main memory for the working array by '''%mem'''. If the (fast) file system for scratch files has limitations, the '''maxdisk''' parameter also has to be set accordingly.<br />
<br />
<syntaxhighlight lang="text"><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=200GB<br />
</syntaxhighlight><br />
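The '''%mem''' value can be derived from the node's total memory and the number of concurrent runs. A minimal sizing sketch, assuming a headroom factor of roughly 20% for the operating system and the program's own overhead (this factor is an assumption, not an official Gaussian recommendation):<br />

```python
# Rough per-run memory sizing for N concurrent Gaussian runs on one node.
# The headroom factor is an assumption (leave room for the OS and the
# program's own overhead); it is not an official Gaussian recommendation.
def per_run_mem_mb(node_mem_gb, n_runs, headroom=0.8):
    """Memory in MB to request per run via %mem."""
    return int(node_mem_gb * 1024 * headroom / n_runs)

print(per_run_mem_mb(192, 2))  # 2 concurrent runs on a 192 GB node
print(per_run_mem_mb(192, 4))  # 4 concurrent runs
```

With these assumptions, 2 runs on the 192 GB node get about 78 GB each, in line with the 70000 MB chosen above.<br />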
<br />
== NUMA Aspects ==<br />
<br />
As modern multi-core compute nodes typically have a [[NUMA]] architecture, it is profitable to carefully place threads of a program close to their data.<br />
In the given example with two 24 core processors per compute node, each processor has direct access to half of the main memory whereas access to the distant half of the memory takes more time, i.e. the compute node has 2 NUMA domains.<br />
<br />
The command<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
provides information about the NUMA characteristic of the machine ( when [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering] is deactivated ):<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 2 nodes (0-1)<br />
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23<br />
node 0 size: 195270 MB<br />
node 0 free: 135391 MB<br />
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47<br />
node 1 size: 196608 MB<br />
node 1 free: 143410 MB<br />
node distances:<br />
node 0 1 <br />
0: 10 21 <br />
1: 21 10 <br />
</syntaxhighlight><br />
<br />
For each NUMA domain, the numbers of its 24 cores are listed, along with the size of the attached portion of main memory.<br />
It is a bit unfortunate that the term "node" is used here for a NUMA domain (versus: compute node = one computer in a compute cluster).<br />
<br />
Also, the relative costs of cores within one NUMA domain accessing memory of another NUMA domain are given in a matrix of node distances.<br />
For example, it costs 10 (abstract timing units) if core 2 accesses memory of its own NUMA domain 0, while it costs 21 if the same core accesses memory of NUMA domain 1.<br />
<br />
In fact, the machine used for the experiments here has a BIOS setting turned on which is called [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering]. <br />
This setting splits each 24-core processor chip into halves with 12 cores each, each half controlling one quarter of the compute node's main memory, as if the compute node employed 4 chips with 12 cores each.<br />
<br />
But still those two halves of each chip are rather close to each other, as the command<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
reveals:<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
Here, as the matrix of node distances depicts, NUMA domains 0 and 1 are very close to each other, as are NUMA domains 2 and 3.<br />
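Which domains to pair can also be read off mechanically from the distance matrix. A small parsing sketch, using the 4-domain '''numactl -H''' output quoted above:<br />

```python
# Parse the "node distances" matrix from `numactl -H` output and report,
# for each NUMA domain, its cheapest partner domain. The sample text is
# the 4-domain output shown above (Sub-NUMA Clustering enabled).
sample = """\
node distances:
node   0   1   2   3
  0:  10  11  21  21
  1:  11  10  21  21
  2:  21  21  10  11
  3:  21  21  11  10
"""

def parse_distances(text):
    """Return {domain: [distances]} from the 'node distances' table."""
    dist = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if sep and head.strip().isdigit():
            dist[int(head)] = [int(x) for x in rest.split()]
    return dist

def nearest(dist, node):
    """Cheapest other domain to pair with `node`."""
    return min((d for d in dist if d != node), key=lambda d: dist[node][d])

d = parse_distances(sample)
print(d)             # distance rows per domain
print(nearest(d, 0)) # cheapest partner of domain 0
```

For this matrix the cheapest partner of domain 0 is domain 1, and of domain 2 is domain 3, matching the pairing chosen in Example 1.<br />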
<br />
As a consequence, when launching two program runs with 24 threads each in the above example, the first run is bound to NUMA domain 0 and 1 and the second run is bound to NUMA domain 2 and 3.<br />
Binding means assigning the threads to the corresponding cores plus allocating memory that is touched by these threads on the corresponding memory area.<br />
<br />
In order to start a program called '''program.exe''' under the control of numactl, the syntax <br />
<syntaxhighlight lang="bash">numactl --cpubind=... --membind=... -- program.exe </syntaxhighlight> is used.<br />
When substituting '''program.exe''' by '''numactl -show''', it can be checked whether the placement of the threads works as desired:<br />
<br />
<syntaxhighlight lang="text"><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
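For pure [[OpenMP]] programs there is an alternative to numactl: the standard OpenMP affinity environment variables. A hypothetical sketch ('''program.exe''' is a placeholder); note that, unlike '''--membind''', this binds the threads only, while memory placement then follows from the first-touch policy:<br />

```shell
# Bind OpenMP threads via the runtime instead of numactl.
# Unlike numactl --membind, this controls thread placement only;
# memory ends up near the threads via the first-touch policy.
export OMP_NUM_THREADS=24
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # pack threads onto adjacent cores
# ./program.exe              # placeholder for the OpenMP binary
```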
<br />
<br />
<br />
== Example 2 ==<br />
<br />
Using the same setup as in Example 1, it is actually profitable to launch 4 program runs at a time in a single batch job. Here, each program run is bound to a single NUMA domain:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl -show<br />
numactl --cpubind=1 --membind=1 -- numactl -show<br />
numactl --cpubind=2 --membind=2 -- numactl -show<br />
numactl --cpubind=3 --membind=3 -- numactl -show<br />
<br />
### launch 4 program runs<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
### wait for the termination of all 4 program runs<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:<br />
<br />
<syntaxhighlight lang="text"><br />
%nprocshared=12<br />
%mem=35000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
<br />
<br />
== Timing Experiments ==<br />
<br />
For the timing measurements, a single small input data set was used. As a consequence, all program runs had about the same execution time - which is of course optimal for the given scenario.<br />
<br />
Running a single program exclusively on the node, it took approximately<br />
250 seconds with 12 threads, <br />
220 seconds with 24 threads, and<br />
180 seconds with 48 threads.<br />
<br />
When launching 2 program runs at a time with 24 threads each, both took about 285 seconds and <br />
when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.<br />
<br />
Thus, for optimal throughput it is most profitable in this comparison to launch 4 programs at a time: completing 4 program runs would take 570 seconds when running in pairs with 24 threads, and 720 seconds when running 4 times one after another with 48 threads.<br />
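The arithmetic behind this throughput comparison, using the measured times above, can be sketched as:<br />

```python
# Wall time to complete 4 runs under each strategy, using the measured
# times above: 48 threads -> 180 s per run (runs execute one after another),
# 24 threads -> 285 s per pair of 2, 12 threads -> 515 s for all 4 at once.
import math

def time_for_runs(n_runs, concurrency, t_batch):
    """Total wall time: batches of `concurrency` runs, each taking t_batch."""
    return math.ceil(n_runs / concurrency) * t_batch

print(time_for_runs(4, 1, 180))  # 4 sequential 48-thread runs -> 720 s
print(time_for_runs(4, 2, 285))  # 2 pairs of 24-thread runs   -> 570 s
print(time_for_runs(4, 4, 515))  # all 4 as 12-thread runs     -> 515 s
```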
<br />
<br />
== Links and more Information ==<br />
<br />
[https://linux.die.net/man/8/numactl numactl(8) - Linux man page]<br />
<br />
[https://gaussian.com/techsupport/ Gaussian Technical Information]</div>
<br />
Thus, for optimal throughput it is most profitable to launch 4 programs at a time in this comparison, as 4 program runs would take 570 seconds when running in pairs with 24 threads and 720 seconds when running 4 times in single mode with 48 threads.<br />
<br />
<br />
== Links and more Information ==<br />
[https://linux.die.net/man/8/numactl numactl(8) - Linux man page]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1545Multiple Program Runs in one Slurm Job2019-03-21T15:14:46Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple shared-memory / [[OpenMP]] programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time using the [[SLURM|Slurm]] batch system.<br />
<br />
__TOC__<br />
<br />
== Problem Description ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory ([[OpenMP]]) programs does not always keep pace. In such a case a program cannot profit from such a high number of cores, and resources may be wasted.<br />
Additional complexity stems from the [[NUMA]] architecture of modern compute nodes.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow multiple jobs to share compute nodes. <br />
But because these jobs share hardware resources (like caches and paths to memory), they may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses one node exclusively.<br />
These program runs will still have an impact on each other, but this impact is under the control of a single user, and when applied repeatedly the total runtime will be more predictable. <br />
In such a case the input data, the execution environment and the batch job script have to be adjusted properly.<br />
<br />
<br />
== Example 1 ==<br />
In the following example, two runs of the Gaussian chemistry code are started within one Slurm job.<br />
The target machine has two processors with 24 cores each and provides 192 GB of main memory plus more than 400 GB of SSD storage for fast file I/O.<br />
Each program run uses 24 threads, such that both runs together occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file I/O does not interfere.<br />
Both programs are started asynchronously, and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture of a modern computer in an optimal way, the command '''g09''' to launch the Gaussian package is started under the control of the '''numactl''' command - see the explanation of this aspect in the next section. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl --show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl --show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
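A detail of the wait pattern above: '''wait''' with a single PID returns the exit status of that process, so the two statuses can also be collected individually to detect a failed run. A minimal sketch of this pattern, with '''true'''/'''false''' as placeholder commands standing in for the real numactl/g09 launches:<br />

```shell
# Two placeholder background jobs standing in for the two g09 launches
true &
pid1=$!
false &
pid2=$!

# 'wait PID' returns the exit status of that particular job
wait $pid1; rc1=$?
wait $pid2; rc2=$?

echo "run1=$rc1 run2=$rc2"   # prints run1=0 run2=1
```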
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by '''%nprocshared''' and the amount of main memory for the working array by '''%mem'''. If the (fast) file system for scratch files has limitations, the '''maxdisk''' parameter also has to be set accordingly.<br />
<br />
<syntaxhighlight lang="slurm"><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=200GB<br />
</syntaxhighlight><br />
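The '''%mem''' value per run follows from the job's memory request: with '''--mem=180G''' shared by two concurrent runs, about 90 GB are nominally available per run, and choosing roughly 70000 MB leaves headroom for Gaussian's own overhead and the operating system. A small sketch of this arithmetic (the 78% safety factor is an assumption for illustration, not a Gaussian requirement):<br />

```shell
JOB_MEM_GB=180   # total memory requested by the Slurm job (--mem=180G)
NRUNS=2          # concurrent program runs in the job

# per-run share in MB, reduced by a ~78% safety factor for overhead
PER_RUN_MB=$(( JOB_MEM_GB * 1024 / NRUNS * 78 / 100 ))
echo "%mem=${PER_RUN_MB}MB"   # prints %mem=71884MB
```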
<br />
== NUMA Aspects ==<br />
<br />
As modern multi-core compute nodes typically have a [[NUMA]] architecture, it pays off to carefully place the threads of a program close to their data.<br />
In the given example with two 24-core processors per compute node, each processor has direct access to half of the main memory, whereas access to the distant half of the memory takes more time; the compute node thus has 2 NUMA domains.<br />
<br />
The command<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
provides information about the NUMA characteristics of the machine (when [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering] is deactivated):<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 2 nodes (0-1)<br />
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23<br />
node 0 size: 195270 MB<br />
node 0 free: 135391 MB<br />
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47<br />
node 1 size: 196608 MB<br />
node 1 free: 143410 MB<br />
node distances:<br />
node 0 1 <br />
0: 10 21 <br />
1: 21 10 <br />
</syntaxhighlight><br />
<br />
The numbers of the 24 cores in each NUMA domain are listed, together with the size of the attached main memory portion.<br />
It is a bit unfortunate that the term "node" is used here for a NUMA domain (versus: compute node = one computer in a compute cluster).<br />
<br />
Also, the relative costs of cores within one NUMA domain accessing memory of another NUMA domain are given in a matrix of node distances.<br />
For example, it costs 10 (abstract timing units) if core 2 accesses memory of its own NUMA domain 0, while it costs 21 if the same core accesses memory of NUMA domain 1.<br />
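The distance matrix can also be inspected programmatically, for example to verify in a job script which domains are near or far. A sketch that parses output in the format shown above with awk, fed here from a shell string rather than a live numactl call:<br />

```shell
# Sample 'node distances' section in the format printed by numactl -H
distances='node   0   1
  0:  10  21
  1:  21  10'

# field 1 is the source node label ("0:"), fields 2.. are the distances
dist01=$(printf '%s\n' "$distances" | awk '$1 == "0:" { print $3 }')
echo "distance 0->1: $dist01"   # prints distance 0->1: 21
```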
<br />
In fact, the machine used for the experiments here has a BIOS setting turned on which is called [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering]. <br />
This setting splits each 24-core processor chip into two halves with 12 cores each, each controlling one quarter of the compute node's main memory, as if the compute node employed 4 chips with 12 cores each.<br />
<br />
But those two halves of each chip are still rather close to each other, as the command<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
reveals:<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
Here, as the matrix of the node distances depicts, NUMA domains 0 and 1 are very close to each other, as are NUMA domains 2 and 3.<br />
<br />
As a consequence, when launching two program runs with 24 threads each in the above example, the first run is bound to NUMA domain 0 and 1 and the second run is bound to NUMA domain 2 and 3.<br />
Binding means assigning the threads to the corresponding cores and allocating the memory touched by these threads in the corresponding memory area.<br />
<br />
In order to start a program called '''program.exe''' under the control of numactl, the syntax <br />
<syntaxhighlight lang="slurm">numactl --cpubind=... --membind=... -- program.exe </syntaxhighlight> is used.<br />
When substituting '''program.exe''' by '''numactl --show''', it can be checked whether the placement of the threads works as desired:<br />
<br />
<syntaxhighlight lang="slurm"><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl --show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
numactl --cpubind=2,3 --membind=2,3 -- numactl --show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
== Example 2 ==<br />
<br />
Using the same settings as in Example 1, it is actually more profitable to launch 4 program runs at a time in a single batch job. Here, each program run is bound to a single NUMA domain:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl --show<br />
numactl --cpubind=1 --membind=1 -- numactl --show<br />
numactl --cpubind=2 --membind=2 -- numactl --show<br />
numactl --cpubind=3 --membind=3 -- numactl --show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:<br />
<br />
<syntaxhighlight lang="slurm"><br />
%nprocshared=12<br />
%mem=35000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
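The four nearly identical launch blocks in the script above invite a loop. A sketch of this structure, using eval to dereference $INP1..$INP4 and $OUT1..$OUT4 by index, with an echo as a placeholder for the real numactl/g09 invocation so the sketch stands alone:<br />

```shell
# stand-ins for the variables defined in the script above
INP1=small1.inp12; INP2=small2.inp12; INP3=small3.inp12; INP4=small4.inp12
OUT1=run1; OUT2=run2; OUT3=run3; OUT4=run4

pids=""
for n in 1 2 3 4; do
  eval "inp=\$INP$n; out=\$OUT$n"
  # real script: numactl --cpubind=$((n-1)) --membind=$((n-1)) -- timex g09 < ../../$inp > g09.out
  echo "domain $((n-1)): $inp -> $out" &
  pids="$pids $!"
done

wait $pids   # wait for all four background jobs
```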
<br />
<br />
== Timing Experiments ==<br />
<br />
For the timing measurements a single small input data set was used. As a consequence, all program runs had about the same execution time - which is of course optimal for the given scenario.<br />
<br />
Running a single program exclusively on the node, the run took approximately<br />
250 seconds with 12 threads, <br />
220 seconds with 24 threads and<br />
180 seconds with 48 threads.<br />
<br />
When launching 2 program runs at a time with 24 threads each, both took about 285 seconds and <br />
when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.<br />
<br />
For optimal throughput it is thus most profitable in this comparison to launch 4 programs at a time: completing 4 program runs would take 570 seconds when running them in pairs with 24 threads each, and 720 seconds when running them one after another with 48 threads each.<br />
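The throughput comparison behind this conclusion is plain arithmetic over the measured times, sketched here with the numbers from above:<br />

```shell
# total wall time to complete 4 program runs under each strategy
SINGLE=$(( 4 * 180 ))   # four exclusive runs, one after another, 48 threads each
PAIRS=$((  2 * 285 ))   # two batches of two concurrent runs, 24 threads each
QUAD=515                # one batch of four concurrent runs, 12 threads each
echo "single=${SINGLE}s pairs=${PAIRS}s quad=${QUAD}s"   # prints single=720s pairs=570s quad=515s
```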
<br />
<br />
== Links and more Information ==<br />
[https://linux.die.net/man/8/numactl numactl(8) - Linux man page]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1544Multiple Program Runs in one Slurm Job2019-03-21T14:28:51Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple shared-memory / OpenMP programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time using the [[SLURM|Slurm]] batch system.<br />
<br />
__TOC__<br />
<br />
== Problem Position ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore in many cases there are two (or sometimes) more such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory- (OpenMP-) programs does not always keep track. In such a case a program cannot profit form such a high number of cores and thus resources may be waisted.<br />
Additional complexity stems from the [[NUMA]] architecure of modern compute nodes.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing compute nodes between multiple jobs. <br />
But because of hardware characteristics (sharing hardware resources like caches, paths to memory) these jobs may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses one node exclusively.<br />
These program runs will still have an impact on each other, but it is more under control of a single user and when applied repeatedly the total runtime be more predictable. <br />
In such a case input data, execution environment and the batch job script has to be adjusted properly.<br />
<br />
<br />
== Example 1 ==<br />
Two runs of the Gaussian chemistry code are started within one Slurm job in the following example.<br />
The target machine has two processors with 24 cores each and provides 192 GB of main memory and above 400 GB of SSD for fast file IO.<br />
Each program run uses 24 threads such that both runs together occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file IO does not interfer.<br />
Both programs are started asynchonously and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture of a modern computer in an optimal way, the command '''g09''' to launch the Gaussain package is started under the control of the '''numactl''' command - see explanation in the next paragraph for this aspect. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numaclt -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specificed by '''%nprocshared''' and the amount of main memory for the working array by '''%mem'''. If the (fast) file system for scratch files has limitations also the '''maxdisk''' parameter has to be set accordingly.<br />
<br />
<syntaxhighlight lang="slurm"><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=200GB<br />
</syntaxhighlight><br />
<br />
== NUMA Aspects ==<br />
<br />
As modern multi-core compute nodes typically have a [[NUMA]] architecture, it is profitable to carefully place threads of a program close to their data.<br />
In the given example with two 24 core processors per compute node, each processor has direct access to half of the main memory whereas access to the distant half of the memory takes more time, the compute node has 2 NUMA domains.<br />
<br />
The command<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
provides information about the NUMA characteristic of the machine ( when [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering] is deactivated ):<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 2 nodes (0-1)<br />
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23<br />
node 0 size: 195270 MB<br />
node 0 free: 135391 MB<br />
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47<br />
node 1 size: 196608 MB<br />
node 1 free: 143410 MB<br />
node distances:<br />
node 0 1 <br />
0: 10 21 <br />
1: 21 10 <br />
</syntaxhighlight><br />
<br />
The numbers of 24 cores of each NUMA domain are listed and the size of the attached main memory portion.<br />
It is a bit unfortunate that here the term "node" is used for NUMA domain (versus: compute node = one computer in a compute cluster).<br />
<br />
Also, the relative costs of cores within one NUMA domain accessing memory of another NUMA domain are given in a matrix of node distances.<br />
For example it costs 10 (abstract timing units) if core 2 accesses memory of its own NUMA domain 0, while it costs 21 if the same core accesses memory of the memory domain 1.<br />
<br />
In fact, the machine which has been used for the experiments here has a BIOS setting turned on which is called [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering]. <br />
This setting splits each 24 core processor chip into halves with 12 cores each, reigning over one quarter of the compute node's main memory, as if the compute node would employ 4 chips with 12 cores each.<br />
<br />
But still those two halves of each chip are rather close to each other, as the command<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
reveils:<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
Here, as the matrix of the node distances depicts, NUMA domains 0 and 1 are very close to each other as are NUMA domains 2 und 3.<br />
<br />
As a consequence, when launching two program runs with 24 threads each in the above example, the first run is bound to NUMA domain 0 and 1 and the second run is bound to NUMA domain 2 and 3.<br />
Binding means assigning the threads to the corresponding cores plus allocating memory that is touched by these threads on the corresponding memory area.<br />
<br />
In Order to start a program called '''program.exe''' under the control of numactl, the syntax <br />
<syntaxhighlight lang="slurm">numactl --cpubind=... --membind=... -- program.exe </syntaxhighlight> is used.<br />
When substituting '''program.exe''' by '''numactl -show''', it can be checked if the placement of the threads works as desired:<br />
<br />
<syntaxhighlight lang="slurm"><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
da026566@cluster-hpc:~/hpc/benchmarks/Gaussian/Raabe$ ssh ncm0800 numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
== Example 2 ==<br />
<br />
Using the same setting than Example 1, it is actually profitable to launch 4 program runs at a time in a single batch job. Here, each program run is bound to a single NUMA domain:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numaclt -H<br />
numactl --cpubind=0 --membind=0 -- numactl -show<br />
numactl --cpubind=1 --membind=1 -- numactl -show<br />
numactl --cpubind=2 --membind=2 -- numactl -show<br />
numactl --cpubind=3 --membind=3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:<br />
<br />
<syntaxhighlight lang="slurm"><br />
%nprocshared=12<br />
%mem=35000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
<br />
<br />
== Timing Experiments ==<br />
<br />
For timing measurements a single small input data set was used. As a consequence all programs runs had about the same execution time - which is of course optimal for the given scenario.<br />
<br />
Running a single program exclusively the program took approximately<br />
250 seconds with 12 threads, <br />
220 seconds with 24 threads,<br />
180 seconds with 48 threads<br />
<br />
When launching 2 program runs at a time with 24 threads each, both took about 285 seconds and <br />
when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.<br />
<br />
For optimal throughput it is most profitable to launch 4 programs at a time in this comparison, as 4 program runs would take 570 seconds when running in pairs with 24 threads and 720 seconds when running 4 times in single mode with 48 threads.<br />
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1543Multiple Program Runs in one Slurm Job2019-03-21T14:25:48Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple shared-memory / OpenMP programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time using the [[SLURM|Slurm]] batch system.<br />
<br />
__TOC__<br />
<br />
== Problem Statement ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory (OpenMP) programs does not always keep pace. In such a case a program cannot profit from such a high number of cores and thus resources may be wasted.<br />
Additional complexity stems from the NUMA architecture of modern compute nodes.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing compute nodes between multiple jobs. <br />
But because these jobs share hardware resources (such as caches and paths to memory), they may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses one node exclusively.<br />
These program runs will still have an impact on each other, but this is more under the control of a single user, and when applied repeatedly the total runtime will be more predictable. <br />
In such a case the input data, the execution environment and the batch job script have to be adjusted properly.<br />
<br />
<br />
== Example 1 ==<br />
Two runs of the Gaussian chemistry code are started within one Slurm job in the following example.<br />
The target machine has two processors with 24 cores each and provides 192 GB of main memory and over 400 GB of SSD storage for fast file IO.<br />
Each program run uses 24 threads such that both runs together occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file IO does not interfere.<br />
Both programs are started asynchronously and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture of a modern computer in an optimal way, the command '''g09''' to launch the Gaussian package is started under the control of the '''numactl''' command - see the explanation of this aspect in the next paragraph. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by '''%nprocshared''' and the amount of main memory for the working array by '''%mem'''. If the (fast) file system for scratch files has limitations, the '''maxdisk''' parameter also has to be set accordingly.<br />
<br />
<syntaxhighlight><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=200GB<br />
</syntaxhighlight><br />
<br />
== NUMA Aspects ==<br />
<br />
As modern multi-core compute nodes typically have a [[NUMA]] architecture, it is profitable to carefully place threads of a program close to their data.<br />
In the given example with two 24-core processors per compute node, each processor has direct access to half of the main memory, whereas access to the distant half of the memory takes more time; the compute node thus has 2 NUMA domains.<br />
<br />
The command<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
provides information about the NUMA characteristics of the machine (when [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering] is deactivated):<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 2 nodes (0-1)<br />
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23<br />
node 0 size: 195270 MB<br />
node 0 free: 135391 MB<br />
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47<br />
node 1 size: 196608 MB<br />
node 1 free: 143410 MB<br />
node distances:<br />
node 0 1 <br />
0: 10 21 <br />
1: 21 10 <br />
</syntaxhighlight><br />
<br />
The numbers of the 24 cores of each NUMA domain are listed, as well as the size of the attached main memory portion.<br />
It is a bit unfortunate that the term "node" is used here for a NUMA domain (versus: compute node = one computer in a compute cluster).<br />
<br />
Also, the relative costs of cores within one NUMA domain accessing memory of another NUMA domain are given in a matrix of node distances.<br />
For example, it costs 10 (abstract timing units) if core 2 accesses memory of its own NUMA domain 0, while it costs 21 if the same core accesses memory of NUMA domain 1.<br />
<br />
In fact, the machine used for the experiments here has a BIOS setting turned on called [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering]. <br />
This setting splits each 24-core processor chip into halves with 12 cores each, each half controlling one quarter of the compute node's main memory, as if the compute node employed 4 chips with 12 cores each.<br />
<br />
But still those two halves of each chip are rather close to each other, as the command<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
reveals:<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
Here, as the matrix of node distances depicts, NUMA domains 0 and 1 are very close to each other, as are NUMA domains 2 and 3.<br />
<br />
As a consequence, when launching two program runs with 24 threads each in the above example, the first run is bound to NUMA domain 0 and 1 and the second run is bound to NUMA domain 2 and 3.<br />
Binding means assigning the threads to the corresponding cores plus allocating memory that is touched by these threads on the corresponding memory area.<br />
<br />
In order to start a program called '''program.exe''' under the control of numactl, the syntax <br />
<syntaxhighlight lang="bash">numactl --cpubind=... --membind=... -- program.exe </syntaxhighlight> is used.<br />
When substituting '''program.exe''' by '''numactl -show''', it can be checked whether the placement of the threads works as desired:<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
== Example 2 ==<br />
<br />
Using the same setup as in Example 1, it is actually more profitable to launch 4 program runs at a time in a single batch job. Here, each program run is bound to a single NUMA domain:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl -show<br />
numactl --cpubind=1 --membind=1 -- numactl -show<br />
numactl --cpubind=2 --membind=2 -- numactl -show<br />
numactl --cpubind=3 --membind=3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:<br />
<br />
<syntaxhighlight><br />
%nprocshared=12<br />
%mem=35000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
<br />
<br />
== Timing Experiments ==<br />
<br />
For the timing measurements a single small input data set was used. As a consequence all program runs had about the same execution time - which is of course optimal for the given scenario.<br />
<br />
Running a single program exclusively, it took approximately<br />
250 seconds with 12 threads, <br />
220 seconds with 24 threads, and<br />
180 seconds with 48 threads.<br />
<br />
When launching 2 program runs at a time with 24 threads each, both took about 285 seconds and <br />
when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.<br />
<br />
For optimal throughput it is most profitable in this comparison to launch 4 programs at a time: completing 4 program runs takes about 515 seconds when all 4 run concurrently with 12 threads each, 2 × 285 = 570 seconds when running in two pairs with 24 threads each, and 4 × 180 = 720 seconds when running one after another with 48 threads.<br />
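This throughput comparison can be verified with a quick back-of-the-envelope calculation, using the measured times from above (in seconds):<br />

```shell
# total wall-clock time to complete 4 program runs on one 48-core node
four_concurrent=515            # all 4 runs at once, 12 threads each
two_batches=$((2 * 285))       # two batches of 2 concurrent runs, 24 threads each
sequential=$((4 * 180))        # 4 runs one after another, 48 threads each

echo "$four_concurrent $two_batches $sequential"   # -> 515 570 720
```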
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=SLURM&diff=1542SLURM2019-03-21T14:25:24Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>== General ==<br />
<br />
SLURM is a workload manager / job [[scheduler]]. To get an overview of the functionality of a scheduler, go [[Scheduler#General|here]] or to the [[Scheduling_Basics|Scheduling Basics]].<br />
<br />
<br />
__TOC__<br />
<br />
<br />
== #SBATCH Usage ==<br />
<br />
If you are writing a [[jobscript]] for a SLURM batch system, the magic cookie is "#SBATCH". To use it, start a new line in your script with "#SBATCH". Following that, you can put one of the parameters shown below, where the word written in <...> should be replaced with a value.<br />
<br />
Basic settings:<br />
{| class="wikitable" style="width: 40%;"<br />
| Parameter || Function<br />
|-<br />
| --job-name=<name> || job name<br />
|-<br />
| --output=<path> || path to the file where the job (error) output is written to<br />
|}<br />
<br />
Requesting resources:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --time=<runlimit> || runtime limit in the format hours:min:sec; once the time specified is up, the job will be killed by the [[scheduler]]<br />
|-<br />
| --mem=<memlimit> || job memory request per node, usually an integer followed by a prefix for the unit (e. g. --mem=1G for 1 GB)<br />
|}<br />
<br />
Parallel programming (read more [[Parallel_Programming|here]]):<br />
<br />
Settings for OpenMP:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=1 || start a parallel job for a shared-memory system on only one node<br />
|-<br />
| --cpus-per-task=<num_threads> || number of threads to execute OpenMP application with<br />
|-<br />
| --ntasks-per-core=<num_hyperthreads> || number of hyperthreads per core; i. e. any value greater than 1 will turn on hyperthreading (the possible maximum depends on your CPU)<br />
|-<br />
| --ntasks-per-node=1 || for OpenMP, use one task per node only<br />
|}<br />
<br />
Settings for MPI:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=<num_nodes> || start a parallel job for a distributed-memory system on several nodes<br />
|-<br />
| --cpus-per-task=1 || for MPI, use one task per CPU<br />
|-<br />
| --ntasks-per-core=1 || disable hyperthreading<br />
|-<br />
| --ntasks-per-node=<num_procs> || number of processes per node (the possible maximum depends on your nodes)<br />
|}<br />
<br />
Email notifications:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --mail-type=<type> || type can be one of BEGIN, END, FAIL, REQUEUE or ALL (where a mail will be sent each time the status of your process changes)<br />
|-<br />
| --mail-user=<email_address> || email address to send notifications to<br />
|}<br />
<br />
== Job Submission ==<br />
<br />
This command submits the job you defined in your [[Jobscript|jobscript]] to the batch system:<br />
<br />
$ sbatch jobscript.sh<br />
<br />
Just like any other incoming job, your job will first be queued. Then, the scheduler decides when your job will be run. The more resources your job requires, the longer it may be waiting to execute.<br />
<br />
You can check the current status of your submitted jobs and their job ids with the following shell command. A job can either be pending <code>PD</code> (waiting for free nodes to run on) or running <code>R</code> (the jobscript is currently being executed). This command will also print the time (hours:min:sec) that your job has been running for.<br />
<br />
$ squeue -u <user_id><br />
<br />
In case you submitted a job by accident or realised that your job might not be running correctly, you can always remove it from the queue or terminate it while running by typing:<br />
<br />
$ scancel <job_id><br />
<br />
Furthermore, information about current and past jobs can be accessed via:<br />
$ sacct<br />
with more detailed information at the [https://slurm.schedmd.com/sacct.html Slurm documentation of this command]<br />
<br />
== Array and Chain Jobs ==<br />
<br />
<syntaxhighlight lang="zsh"><br />
<br />
#SBATCH --array=1-4:2%1<br />
<br />
</syntaxhighlight><br />
<br />
This creates an array job with '''two''' subjobs (numbered 1 and 3: the range 1-4 with a step of 2) of which only '''one''' may be executed at a time, in no guaranteed order. An explicit order can be forced by either submitting each subjob after the end of the one before (which may prolong pending) or by using the dependency feature, which results in a chain job.<br />
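Inside the jobscript, each subjob can identify its own piece of work via the environment variable <code>SLURM_ARRAY_TASK_ID</code>, which Slurm sets to the subjob index. A minimal sketch (the input file naming is hypothetical; the variable is defaulted to 1 so the script can also be tried outside of Slurm):<br />

```shell
#!/bin/bash
### array job with subjobs 1 and 3 (range 1-4, step 2), at most one running at a time
#SBATCH --array=1-4:2%1

### Slurm sets SLURM_ARRAY_TASK_ID to the index of the current subjob
TASK_ID=${SLURM_ARRAY_TASK_ID:-1}
echo "processing input file input_${TASK_ID}.dat"
```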
<br />
<syntaxhighlight lang="zsh"><br />
<br />
#SBATCH --dependency=<type><br />
<br />
</syntaxhighlight><br />
<br />
The available conditions for chain jobs are <br />
<br />
{| class="wikitable" style="width: 60%;"<br />
| Condition || Function<br />
|-<br />
| after:<jobID> || job can start once job <jobID> has started execution<br />
|-<br />
| afterany:<jobID> || job can start once job <jobID> has terminated<br />
|-<br />
| afterok:<jobID> || job can start once job <jobID> has terminated successfully<br />
|-<br />
| afternotok:<jobID> || job can start once job <jobID> has terminated upon failure<br />
|-<br />
| singleton || job can start once any previous job with identical name and user has terminated<br />
|}<br />
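A chain of two jobs can then be submitted from the shell, for example as follows (the jobscript names <code>step1.sh</code> and <code>step2.sh</code> are placeholders; <code>--parsable</code> makes <code>sbatch</code> print only the job ID, which requires a Slurm installation to run):<br />

```shell
# submit the first job and capture its job ID
jobid=$(sbatch --parsable step1.sh)

# the second job may only start once the first one has terminated successfully
sbatch --dependency=afterok:${jobid} step2.sh
```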
<br />
== Jobscript Examples ==<br />
<br />
This serial job will run a given executable, in this case "myapp.exe".<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MYJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MYJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 15 min 30 sec<br />
#SBATCH --time=00:15:30<br />
<br />
### Memory your job needs per node, e. g. 1 GB<br />
#SBATCH --mem=1G<br />
<br />
### The last part consists of regular shell commands:<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Execute your application<br />
myapp.exe<br />
</syntaxhighlight><br />
<br />
To run a parallel job on a cluster that is managed by SLURM, you have to make that explicit by launching your application with "srun <my_executable>" in your jobscript.<br />
<br />
This OpenMP job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 24 threads.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=OMPJOB<br />
<br />
### File for the output<br />
#SBATCH --output=OMPJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 30 min<br />
#SBATCH --time=00:30:00<br />
<br />
### Memory your job needs per node, e. g. 500 MB<br />
#SBATCH --mem=500M<br />
<br />
### Use one node for parallel jobs on shared-memory systems<br />
#SBATCH --nodes=1<br />
<br />
### Number of threads to use, e. g. 24<br />
#SBATCH --cpus-per-task=24<br />
<br />
### Number of hyperthreads per core<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Tasks per node (for shared-memory parallelisation, use 1)<br />
#SBATCH --ntasks-per-node=1<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to the value specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
This MPI job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 12 processes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MPIJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MPIJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 50 min<br />
#SBATCH --time=00:50:00<br />
<br />
### Memory your job needs per node, e. g. 250 MB<br />
#SBATCH --mem=250M<br />
<br />
### Use more than one node for parallel jobs on distributed-memory systems, e. g. 2<br />
#SBATCH --nodes=2<br />
<br />
### Number of CPUS per task (for distributed-memory parallelisation, use 1)<br />
#SBATCH --cpus-per-task=1<br />
<br />
### Disable hyperthreading by setting the tasks per core to 1<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Number of processes per node, e. g. 6 (6 processes on 2 nodes = 12 processes in total)<br />
#SBATCH --ntasks-per-node=6<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to 1, as specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
Please find more elaborate SLURM job scripts for <br />
[[hybrid slurm job|running a hybrid MPI+OpenMP program in a batch job]] and for<br />
[[multiple runs in one slurm job|running multiple shared-memory / OpenMP programs at a time in one batch job]].<br />
<br />
<br />
<br />
== Site specific notes ==<br />
<br />
=== RRZE ===<br />
<br />
* <code>--output=</code> ''should not'' be used on RRZE's clusters; the submit filter already sets suitable defaults automatically<br />
* <code>--mem=<memlimit></code> '''must not''' be used on RRZE's clusters<br />
* the first line of the job script ''should be'' <code>#!/bin/bash -l</code> otherwise <code>module</code> commands won't work in the job script<br />
* to have a clean environment in job scripts, it is recommended to add <code>#SBATCH --export=NONE</code> '''and''' <code>unset SLURM_EXPORT_ENV</code> to the job script. Otherwise, the job will inherit some settings from the submitting shell.<br />
* access to the parallel file system has to be specified by <code>#SBATCH --constraint=parfs</code> or the command line shortcut <code>-C parfs</code><br />
* access to hardware performance counters (e.g. to be able to use <code>likwid-perfctr</code>) has to be requested by <code>#SBATCH --constraint=hwperf</code> or the command line shortcut <code>-C hwperf</code>. Only request that feature if you really want to access the hardware performance counters, as the feature interferes with the automatic system monitoring.<br />
* multiple features have to be requested in a single <code>--constraint=</code> statement, listing all required features separated by ampersand, e.g. <code>hwperf&parfs</code><br />
* for Intel MPI, RRZE recommends the usage of <code>mpirun</code> instead of <code>srun</code>; if <code>srun</code> shall be used, the additional command line argument <code>--mpi=pmi2</code> is required. The command line option <code>-ppn</code> of <code>mpirun</code> only works if you <code>export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off</code> before.<br />
* for <code>squeue</code> the option <code>-u user</code> does not have any effect as you always only see your own jobs<br />
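Putting these RRZE-specific recommendations together, a jobscript header might start like the following sketch (job name, constraints and modules are placeholders to be adjusted to your job):<br />

```shell
#!/bin/bash -l
#SBATCH --job-name=example
### start from a clean environment instead of inheriting the submitting shell's
#SBATCH --export=NONE
### request parallel file system AND hardware counters in a single constraint
#SBATCH --constraint=hwperf&parfs

### undo Slurm's environment export inside the job script as well
unset SLURM_EXPORT_ENV
module load <your_modules>
```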
<br />
== References ==<br />
<br />
[https://doku.lrz.de/display/PUBLIC/Example+parallel+job+scripts+on+the+Linux-Cluster/ Advanced SLURM jobscript examples]<br />
<br />
[http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/ Detailed guide to more advanced scripts]<br />
<br />
[https://slurm.schedmd.com/sbatch.html SBATCH documentation]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1536Multiple Program Runs in one Slurm Job2019-03-20T19:34:51Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple OpenMP programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time using the [[SLURM|Slurm]] batch system.<br />
<br />
__TOC__<br />
<br />
== Problem Position ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore in many cases there are two (or sometimes) more such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory- (OpenMP-) programs does not always keep track. In such a case a program cannot profit form such a high number of cores and thus resources may be waisted.<br />
Additional complexity stems from the NUMA architecure of modern compute nodes.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing compute nodes between multiple jobs. <br />
But because of hardware characteristics (sharing hardware resources like caches, paths to memory) these jobs may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses one node exclusively.<br />
These program runs will still have an impact on each other, but it is more under control of a single user and when applied repeatedly the total runtime be more predictable. <br />
In such a case input data, execution environment and the batch job script has to be adjusted properly.<br />
<br />
<br />
== Example 1 ==<br />
Two runs of the Gaussian chemistry code are started within one Slurm job in the following example.<br />
The target machine has two processors with 24 cores each and provides 192 GB of main memory and above 400 GB of SSD for fast file IO.<br />
Each program run uses 24 threads such that both runs together occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file IO does not interfer.<br />
Both programs are started asynchonously and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture of a modern computer in an optimal way, the command '''g09''' to launch the Gaussain package is started under the control of the '''numactl''' command - see explanation in the next paragraph for this aspect. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numaclt -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specificed by '''%nprocshared''' and the amount of main memory for the working array by '''%mem'''. If the (fast) file system for scratch files has limitations also the '''maxdisk''' parameter has to be set accordingly.<br />
<br />
<syntaxhighlight><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=200GB<br />
</syntaxhighlight><br />
<br />
== NUMA Aspects ==<br />
<br />
As modern multi-core compute nodes typically have a [[NUMA]] architecture, it is profitable to carefully place threads of a program close to their data.<br />
In the given example with two 24 core processors per compute node, each processor has direct access to half of the main memory whereas access to the distant half of the memory takes more time, the compute node has 2 NUMA domains.<br />
<br />
The command<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
provides information about the NUMA characteristic of the machine ( when [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering] is deactivated ):<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 2 nodes (0-1)<br />
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23<br />
node 0 size: 195270 MB<br />
node 0 free: 135391 MB<br />
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47<br />
node 1 size: 196608 MB<br />
node 1 free: 143410 MB<br />
node distances:<br />
node 0 1 <br />
0: 10 21 <br />
1: 21 10 <br />
</syntaxhighlight><br />
<br />
The numbers of the 24 cores of each NUMA domain are listed, together with the size and the free amount of the attached main memory portion.<br />
It is a bit unfortunate that here the term "node" is used for NUMA domain (versus: compute node = one computer in a compute cluster).<br />
<br />
Also, the relative costs of cores within one NUMA domain accessing memory of another NUMA domain are given in a matrix of node distances.<br />
For example it costs 10 (abstract timing units) if core 2 accesses memory of its own NUMA domain 0, while it costs 21 if the same core accesses memory of the memory domain 1.<br />
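In other words, the distance matrix predicts that a remote access is about twice as expensive as a local one on this machine:<br />

```shell
# Relative memory-access penalty from the node-distance matrix above.
local_cost=10    # core accessing its own NUMA domain
remote_cost=21   # core accessing the other NUMA domain

echo "remote access costs $(( remote_cost * 100 / local_cost ))% of a local access"
```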
<br />
In fact, the machine used for the experiments here has a BIOS setting enabled called [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering]. <br />
This setting splits each 24-core processor chip into halves with 12 cores each, each half controlling one quarter of the compute node's main memory, as if the compute node employed 4 chips with 12 cores each.<br />
<br />
But still those two halves of each chip are rather close to each other, as the command<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
reveals:<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
Here, as the matrix of the node distances depicts, NUMA domains 0 and 1 are very close to each other, as are NUMA domains 2 and 3.<br />
<br />
As a consequence, when launching two program runs with 24 threads each in the above example, the first run is bound to NUMA domain 0 and 1 and the second run is bound to NUMA domain 2 and 3.<br />
Binding means assigning the threads to the corresponding cores plus allocating memory that is touched by these threads on the corresponding memory area.<br />
<br />
In order to start a program called '''program.exe''' under the control of numactl, the syntax <br />
<syntaxhighlight>numactl --cpubind=... --membind=... -- program.exe </syntaxhighlight> is used.<br />
When substituting '''program.exe''' by '''numactl -show''', it can be checked if the placement of the threads works as desired:<br />
<br />
<syntaxhighlight><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
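Such a check can also be built into the job script itself, so that the achieved binding is recorded in the job log. A defensive sketch, which falls back to a message where '''numactl''' happens not to be installed (e.g. on some login nodes):<br />

```shell
# Record the effective CPU binding of the current shell in the job log.
if command -v numactl >/dev/null 2>&1; then
  binding=$(numactl -show | grep '^cpubind:')
else
  binding="numactl not available on this host"
fi
echo "$binding"
```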
<br />
<br />
<br />
== Example 2 ==<br />
<br />
Using the same setting as in Example 1, it is actually profitable to launch 4 program runs at a time in a single batch job. Here, each program run is bound to a single NUMA domain:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl -show<br />
numactl --cpubind=1 --membind=1 -- numactl -show<br />
numactl --cpubind=2 --membind=2 -- numactl -show<br />
numactl --cpubind=3 --membind=3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:<br />
<br />
<syntaxhighlight><br />
%nprocshared=12<br />
%mem=35000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
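The repetitive per-run setup in Example 2 (one output directory, one scratch directory, and one bound launch per NUMA domain) can also be written as a loop. The following sketch mimics only the structure: it uses temporary directories and a no-op in place of the real '''numactl ... g09''' launch, and '''$SLURM_JOB_ID''' is set to a placeholder since Slurm provides it only inside a job:<br />

```shell
WDIR=$(mktemp -d)            # placeholder for the real working directory
SLURM_JOB_ID=12345           # set by Slurm inside a real batch job
GAUSS_SCRDIR=$(mktemp -d)    # placeholder scratch root

count=0
pids=""
for domain in 0 1 2 3; do
  run=run$(( domain + 1 ))
  mkdir -p "$WDIR/$SLURM_JOB_ID/$run" "$GAUSS_SCRDIR/$SLURM_JOB_ID/$run"
  # Real job: ( cd $WDIR/$SLURM_JOB_ID/$run; \
  #            numactl --cpubind=$domain --membind=$domain -- timex g09 ... ) &
  ( : ) &                    # no-op stand-in for the bound program run
  pids="$pids $!"
  count=$(( count + 1 ))
done

wait $pids
echo "launched $count bound runs"
```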
<br />
<br />
== Timing Experiments ==<br />
<br />
For timing measurements a single small input data set was used. As a consequence all program runs had about the same execution time - which is of course optimal for the given scenario.<br />
<br />
Running a single program exclusively, the program took approximately<br />
250 seconds with 12 threads,<br />
220 seconds with 24 threads,<br />
180 seconds with 48 threads.<br />
<br />
When launching 2 program runs at a time with 24 threads each, both took about 285 seconds and <br />
when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.<br />
<br />
For optimal throughput it is most profitable to launch 4 programs at a time in this comparison, as 4 program runs would take 570 seconds when running in pairs with 24 threads and 720 seconds when running 4 times in single mode with 48 threads.<br />
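The comparison boils down to simple arithmetic on the measured times (values taken from the experiments above):<br />

```shell
single48=180   # seconds per run with 48 threads, running alone
pair24=285     # seconds per run with 2 x 24 threads at a time
quad12=515     # seconds per run with 4 x 12 threads at a time

# Wall time to finish 4 program runs under each scheme:
echo "4 x (1 run, 48 threads):  $(( 4 * single48 )) s"
echo "2 x (2 runs, 24 threads): $(( 2 * pair24 )) s"
echo "1 x (4 runs, 12 threads): $(( quad12 )) s"
```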
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>
Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1534Multiple Program Runs in one Slurm Job2019-03-20T19:00:42Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple OpenMP programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time.<br />
<br />
__TOC__<br />
<br />
== Problem Position ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory (OpenMP) programs does not always keep pace. In such a case a program cannot profit from such a high number of cores and thus resources may be wasted.<br />
Additional complexity stems from the NUMA architecture of modern compute nodes.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing nodes between multiple jobs. <br />
But because of hardware characteristics (sharing hardware resources like caches, paths to memory) these jobs may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses a node exclusively.<br />
These program runs will still have an impact on each other, but this is more under the control of a single user, and when applied repeatedly the total runtime will be more predictable. <br />
In such a case input data, execution environment and the batch job script have to be adjusted properly.<br />
<br />
<br />
== Example 1 ==<br />
Two runs of the Gaussian chemistry code are started within one Slurm job in the following example.<br />
The target machine has two processors with 24 cores each and provides 192 GB of main memory and above 400 GB of SSD for fast file IO.<br />
Each program run uses 24 threads such that both runs occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file IO does not interfere.<br />
Both programs are started asynchronously and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture in an optimal way, the numactl command is used - see explanation below. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by %nprocshared and the amount of main memory for the working array by %mem. If the (fast) file system for scratch files has limitations, the maxdisk parameter also has to be set accordingly.<br />
<br />
<syntaxhighlight><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=200GB<br />
</syntaxhighlight><br />
<br />
== NUMA Aspects ==<br />
<br />
As modern multi-core compute nodes typically have a [[NUMA]] architecture, it is profitable to carefully place threads of a program close to their data.<br />
In the given example with two 24 core processors per compute node, each processor has direct access to half of the main memory whereas access to the distant half of the memory takes more time, the compute node has 2 NUMA domains.<br />
<br />
The command<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
provides information about the NUMA characteristic of the machine ( when [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering] is deactivated ):<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 2 nodes (0-1)<br />
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23<br />
node 0 size: 195270 MB<br />
node 0 free: 135391 MB<br />
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47<br />
node 1 size: 196608 MB<br />
node 1 free: 143410 MB<br />
node distances:<br />
node 0 1 <br />
0: 10 21 <br />
1: 21 10 <br />
</syntaxhighlight><br />
<br />
The numbers of the 24 cores of each NUMA domain are listed, together with the size of the attached main memory portion.<br />
It is a bit unfortunate that here the term "node" is used for NUMA domain (versus: compute node = one computer in a compute cluster).<br />
<br />
Also, the relative costs of cores within one NUMA domain accessing memory of another NUMA domain are given in a matrix of node distances.<br />
For example it costs 10 (abstract timing units) if core 2 accesses memory of its own NUMA domain 0, while it costs 21 if the same core accesses memory of the memory domain 1.<br />
<br />
In fact, the machine used for the experiments here has a BIOS setting enabled called [https://software.intel.com/en-us/articles/intel-xeon-processor-scalable-family-technical-overview Sub-NUMA Clustering]. <br />
This setting splits each 24-core processor chip into halves with 12 cores each, each half controlling one quarter of the compute node's main memory, as if the compute node employed 4 chips with 12 cores each.<br />
<br />
But still those two halves of each chip are rather close to each other, as the command<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
reveals:<br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
<syntaxhighlight><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
<br />
== Example 2 ==<br />
<br />
Using the same setting as in Example 1, it is actually profitable to launch 4 program runs at a time in a single batch job:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl -show<br />
numactl --cpubind=1 --membind=1 -- numactl -show<br />
numactl --cpubind=2 --membind=2 -- numactl -show<br />
numactl --cpubind=3 --membind=3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:<br />
<br />
<syntaxhighlight><br />
%nprocshared=12<br />
%mem=35000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
<br />
== Timing Experiments ==<br />
<br />
For the timing measurements a single small input data set was used. As a consequence, all program runs had about the same execution time - which is of course optimal for the given scenario.<br />
<br />
Running a single program exclusively, it took approximately<br />
250 seconds with 12 threads,<br />
220 seconds with 24 threads, and<br />
180 seconds with 48 threads.<br />
<br />
When launching 2 program runs at a time with 24 threads each, both took about 285 seconds, and<br />
when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.<br />
<br />
For optimal throughput it is thus most profitable to launch 4 programs at a time: 4 program runs finish in about 515 seconds this way, compared to 570 seconds when running them in pairs with 24 threads each and 720 seconds when running all 4 one after another with 48 threads each.<br />
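<br />
The comparison can be reproduced with simple shell arithmetic; the times are the measured values quoted above, so this is just the throughput calculation, not a new measurement.<br />

```shell
# Total time to complete 4 program runs under each strategy,
# using the measured per-run times (in seconds) quoted above.
t48=180   # one exclusive run with 48 threads
t24=285   # per-run time when 2 runs share the node (24 threads each)
t12=515   # per-run time when 4 runs share the node (12 threads each)

echo "4 sequential 48-thread runs:   $(( 4 * t48 )) s"   # 720 s
echo "2 batches of 2x24-thread runs: $(( 2 * t24 )) s"   # 570 s
echo "1 batch of 4x12-thread runs:   $(( 1 * t12 )) s"   # 515 s
```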
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1533Multiple Program Runs in one Slurm Job2019-03-20T18:30:33Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple OpenMP programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time.<br />
<br />
__TOC__<br />
<br />
== Problem Statement ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory (OpenMP) programs does not always keep pace. In such a case a program cannot profit from such a high number of cores, and thus resources may be wasted.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing nodes between multiple jobs. <br />
But because of hardware characteristics (sharing hardware resources like caches, paths to memory) these jobs may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses a node exclusively.<br />
These program runs will still have an impact on each other, but the effect is under the control of a single user, and when applied repeatedly the total runtime will be more predictable.<br />
In such a case the input data, the execution environment, and the batch job script have to be adjusted properly.<br />
<br />
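The launch pattern used in the job scripts below can be sketched as follows. This is a minimal sketch: sleep stands in for the real program, and the run directories are placeholders.<br />

```shell
# Minimal launch-and-wait pattern: start each program in its own
# directory in the background, remember its PID, and wait for all PIDs.
mkdir -p run1 run2
( cd run1 && sleep 1 > out.log 2>&1 ) &
pid1=$!
( cd run2 && sleep 1 > out.log 2>&1 ) &
pid2=$!
wait $pid1 $pid2
echo "both runs finished"
```

Because each run works in its own directory, the runs do not overwrite each other's output files.<br />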
<br />
== Example 1 ==<br />
Two runs of the Gaussian chemistry code are started within one Slurm job in the following example.<br />
The target machine has two processors with 24 cores each and provides 192 GB of main memory as well as over 400 GB of SSD storage for fast file IO.<br />
Each program run uses 24 threads such that both runs occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file IO does not interfere.<br />
Both programs are started asynchronously, and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture in an optimal way, the numactl command is used - see explanation below. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by %nprocshared and the amount of main memory for the working array by %mem. If the (fast) file system for scratch files has limitations, the maxdisk parameter also has to be set accordingly.<br />
<br />
<syntaxhighlight><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=200GB<br />
</syntaxhighlight><br />
<br />
== NUMA Aspects ==<br />
<br />
With the usual NUMA architecture of such multi-core nodes it is important to place the individual program runs carefully - e.g. one program run per NUMA node.<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
<syntaxhighlight><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
ssh ncm0800 numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
<br />
== Example 2 ==<br />
<br />
Using the same setting as in Example 1, it is actually profitable to launch 4 program runs at a time in a single batch job:<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl -show<br />
numactl --cpubind=1 --membind=1 -- numactl -show<br />
numactl --cpubind=2 --membind=2 -- numactl -show<br />
numactl --cpubind=3 --membind=3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:<br />
<br />
<syntaxhighlight><br />
%nprocshared=12<br />
%mem=35000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
<br />
== Timing Experiments ==<br />
<br />
For the timing measurements a single small input data set was used. As a consequence, all program runs had about the same execution time - which is of course optimal for the given scenario.<br />
<br />
Running a single program exclusively, it took approximately<br />
250 seconds with 12 threads,<br />
220 seconds with 24 threads, and<br />
180 seconds with 48 threads.<br />
<br />
When launching 2 program runs at a time with 24 threads each, both took about 285 seconds, and<br />
when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.<br />
<br />
For optimal throughput it is thus most profitable to launch 4 programs at a time: 4 program runs finish in about 515 seconds this way, compared to 570 seconds when running them in pairs with 24 threads each and 720 seconds when running all 4 one after another with 48 threads each.<br />
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1532Multiple Program Runs in one Slurm Job2019-03-20T18:13:10Z<p>Dieter-anmey-f9d9@rwth-aachen.de: /* Example */</p>
<hr />
<div>In certain circumstances it may be profitable to start multiple OpenMP programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time.<br />
<br />
__TOC__<br />
<br />
== Problem Statement ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory (OpenMP) programs does not always keep pace. In such a case a program cannot profit from such a high number of cores, and thus resources may be wasted.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing nodes between multiple jobs. <br />
But because of hardware characteristics (sharing hardware resources like caches, paths to memory) these jobs may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses a node exclusively.<br />
These program runs will still have an impact on each other, but the effect is under the control of a single user, and when applied repeatedly the total runtime will be more predictable.<br />
In such a case the input data, the execution environment, and the batch job script have to be adjusted properly.<br />
<br />
<br />
== Example ==<br />
Two runs of the Gaussian chemistry code are started within one Slurm job in the following example.<br />
The target machine has two processors with 24 cores each and provides 192 GB of main memory.<br />
Each program run uses 24 threads such that both runs occupy the whole machine.<br />
<br />
The batch job script requests the full node exclusively.<br />
<br />
Each program run is executed in a separate directory such that file IO does not interfere.<br />
Both programs are started asynchronously, and a wait command waits for the termination of both programs.<br />
<br />
In order to make sure that both programs profit from the NUMA architecture in an optimal way, the numactl command is used - see explanation below. <br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by %nprocshared and the amount of main memory for the working array by %mem. If the (fast) file system for scratch files has limitations, the maxdisk parameter also has to be set accordingly.<br />
<br />
<syntaxhighlight><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
<br />
== NUMA Aspects ==<br />
<br />
With the usual NUMA architecture of such multi-core nodes it is important to place the individual program runs carefully - e.g. one program run per NUMA node.<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
<syntaxhighlight><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
ssh ncm0800 numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl -show<br />
numactl --cpubind=1 --membind=1 -- numactl -show<br />
numactl --cpubind=2 --membind=2 -- numactl -show<br />
numactl --cpubind=3 --membind=3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1531Multiple Program Runs in one Slurm Job2019-03-20T18:02:51Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>In certain circumstances it may be profitable to start multiple OpenMP programs at a time in one single batch job.<br />
Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code at a time.<br />
<br />
__TOC__<br />
<br />
== Problem Statement ==<br />
<br />
These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster.<br />
But the scalability of shared-memory (OpenMP) programs does not always keep pace. In such a case a program cannot profit from such a high number of cores, and thus resources may be wasted.<br />
<br />
== Shared or Exclusive Operation ==<br />
<br />
One way of operating a cluster of multi-core nodes is to allow sharing nodes between multiple jobs. <br />
But because of hardware characteristics (sharing hardware resources like caches, paths to memory) these jobs may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run. <br />
<br />
Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses a node exclusively.<br />
These program runs will still have an impact on each other, but the effect is under the control of a single user, and when applied repeatedly the total runtime will be more predictable.<br />
In such a case the input data, the execution environment, and the batch job script have to be adjusted properly.<br />
<br />
<br />
== Example ==<br />
Problem: a program (here Gaussian) does not scale well across all cores of a multi-core node.<br />
With non-exclusive use of such compute nodes, jobs of several users run at the same time and influence each other's runtimes. This makes it hard to reliably estimate the runtime needed for the job's time limit.<br />
One countermeasure is to start multiple program runs within one batch job which uses a node exclusively.<br />
<br />
<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one node, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=24<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<br />
<br />
In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by %nprocshared and the amount of main memory for the working array by %mem. If the (fast) file system for scratch files has limitations, the maxdisk parameter also has to be set accordingly.<br />
<br />
<syntaxhighlight><br />
%nprocshared=24<br />
%mem=70000MB<br />
...<br />
#p ... maxdisk=100GB<br />
</syntaxhighlight><br />
<br />
<br />
== NUMA Aspects ==<br />
<br />
With the usual NUMA architecture of such multi-core nodes it is important to place the individual program runs carefully - e.g. one program run per NUMA node.<br />
<br />
<syntaxhighlight lang="bash"><br />
numactl -H<br />
</syntaxhighlight><br />
<br />
<syntaxhighlight lang="bash"><br />
available: 4 nodes (0-3)<br />
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20<br />
node 0 size: 47820 MB<br />
node 0 free: 37007 MB<br />
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23<br />
node 1 size: 49152 MB<br />
node 1 free: 41 MB<br />
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44<br />
node 2 size: 49152 MB<br />
node 2 free: 47613 MB<br />
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47<br />
node 3 size: 49152 MB<br />
node 3 free: 47554 MB<br />
node distances:<br />
node 0 1 2 3 <br />
0: 10 11 21 21 <br />
1: 11 10 21 21 <br />
2: 21 21 10 11 <br />
3: 21 21 11 10 <br />
</syntaxhighlight><br />
<br />
<syntaxhighlight><br />
numactl --cpubind=0,1 --membind=0,1 -- numactl -show<br />
policy: bind<br />
preferred node: 0<br />
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 <br />
cpubind: 0 1 <br />
nodebind: 0 1 <br />
membind: 0 1 <br />
ssh ncm0800 numactl --cpubind=2,3 --membind=2,3 -- numactl -show<br />
policy: bind<br />
preferred node: 2<br />
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 <br />
cpubind: 2 3 <br />
nodebind: 2 3 <br />
membind: 2 3 <br />
</syntaxhighlight><br />
<br />
<br />
<br />
<br />
<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores of one, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### this is not necessary in the case of Gaussian program runs<br />
### but it may be important in other cases<br />
export OMP_NUM_THREADS=12<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
### display NUMA characteristics<br />
numactl -H<br />
numactl --cpubind=0 --membind=0 -- numactl --show<br />
numactl --cpubind=1 --membind=1 -- numactl --show<br />
numactl --cpubind=2 --membind=2 -- numactl --show<br />
numactl --cpubind=3 --membind=3 -- numactl --show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_Slurm_Job&diff=1530Hybrid Slurm Job2019-03-20T17:14:22Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>[[SLURM|Slurm]] is a popular workload manager / job scheduler. <br />
Here you can find an example job script for launching a program that is parallelized with MPI and OpenMP at the same time.<br />
You may find the toy program useful to get started.<br />
<br />
__TOC__<br />
<br />
== Slurm Job Script ==<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
<br />
== Hybrid Fortran Toy Program ==<br />
You can use this hybrid toy Fortran90 program to test the above job script<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE),threadid<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name,len,ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
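Before submitting, the toy program has to be compiled with an MPI compiler wrapper and OpenMP enabled. A minimal sketch, assuming the source is saved as hello.f90 and the job script as jobscript.sh (both names are hypothetical; the wrapper and flag names vary with the MPI library and compiler — -fopenmp is the GNU flag):<br />

```shell
### compile with the Fortran MPI compiler wrapper, OpenMP enabled (GNU flag shown)
mpif90 -fopenmp hello.f90 -o hello.exe

### submit the job script shown above
sbatch jobscript.sh
```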
<br />
== Job Output Example ==<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight><br />
<br />
== Taking NUMA into Account ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1529Multiple Program Runs in one Slurm Job2019-03-20T17:04:58Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>Work in progress ...<br />
<br />
<br />
Starting a program several times, with roughly equal runtimes, within one batch job on a multicore node, using Gaussian as an example.<br />
<br />
__TOC__<br />
<br />
<br />
== Basic usage ==<br />
Problem: a program (here Gaussian) does not scale well across all cores of a multicore node.<br />
When such compute nodes are used non-exclusively, jobs of several users run at the same time and influence each other's runtimes. This makes it hard to estimate the runtime reliably when specifying the job's time limit.<br />
One countermeasure is to start several program runs within a single batch job that uses one node exclusively.<br />
Given the usual NUMA architecture of such multicore nodes, it is important to place the individual program runs carefully - e.g. one program run per NUMA node.<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp24<br />
export INP2=small2.inp24<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl --show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl --show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
#SBATCH --job-name=run4x12 <br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
#SBATCH --time=00-01:00:00<br />
#SBATCH --mem=180G<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
### use all cores, one thread per core<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### prepare your environment for running gaussian<br />
module load CHEMISTRY gaussian<br />
### make sure this environment variable points to a suitable location<br />
### here the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=....<br />
<br />
export INP1=small1.inp12<br />
export INP2=small2.inp12<br />
export INP3=small3.inp12<br />
export INP4=small4.inp12<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
</syntaxhighlight><br />
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1528Multiple Program Runs in one Slurm Job2019-03-20T16:55:19Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>Work in progress ...<br />
<br />
<br />
Starting a program several times, with roughly equal runtimes, within one batch job on a multicore node, using Gaussian as an example.<br />
<br />
__TOC__<br />
<br />
<br />
== Basic usage ==<br />
Problem: a program (here Gaussian) does not scale well across all cores of a multicore node.<br />
When such compute nodes are used non-exclusively, jobs of several users run at the same time and influence each other's runtimes. This makes it hard to estimate the runtime reliably when specifying the job's time limit.<br />
One countermeasure is to start several program runs within a single batch job that uses one node exclusively.<br />
Given the usual NUMA architecture of such multicore nodes, it is important to place the individual program runs carefully - e.g. one program run per NUMA node.<br />
<br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=run2x24 <br />
<br />
### output and error file paths<br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
<br />
### runtime limit dd-hh:mm:ss<br />
#SBATCH --time=00-01:00:00<br />
<br />
#SBATCH --mem=180G<br />
<br />
### number of processors<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
<br />
### CLAIX-2018<br />
#SBATCH --partition=c18m<br />
<br />
### use project account once accounting is implemented<br />
##SBATCH --account=rwth0303<br />
<br />
### send email at job start and end<br />
#SBATCH --mail-type=ALL<br />
#SBATCH --mail-user=anmey@itc.rwth-aachen.de<br />
<br />
module load CHEMISTRY gaussian<br />
### the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=/home/da026566/hpc/benchmarks/Gaussian/Raabe<br />
<br />
export INP1=small.inp24-10gb<br />
export INP2=small.inp24-10gb<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
<br />
numactl --cpubind=0,1 --membind=0,1 -- numactl --show<br />
numactl --cpubind=2,3 --membind=2,3 -- numactl --show<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &<br />
pid2=$!<br />
<br />
wait $pid1 $pid2<br />
</syntaxhighlight><br />
<syntaxhighlight lang="bash"><br />
#!/usr/local_rwth/bin/zsh<br />
<br />
#SBATCH --job-name=Salvarsan_hexa_S_DFT_fine_CAM_small <br />
<br />
### output and error file paths<br />
#SBATCH --output=%j.log<br />
#SBATCH --error=%j.err<br />
<br />
### require access to the Lustre filesystem (HPCWORK)?<br />
##SBATCH -C hpcwork<br />
<br />
### runtime limit dd-hh:mm:ss<br />
#SBATCH --time=00-01:00:00<br />
<br />
#SBATCH --mem=45G<br />
<br />
### number of processors<br />
#SBATCH --ntasks=1 --nodes=1<br />
#SBATCH --cpus-per-task=48<br />
#SBATCH --threads-per-core=1<br />
<br />
### exclusive usage of a single node<br />
#SBATCH --exclusive<br />
<br />
### CLAIX-2018<br />
#SBATCH --partition=c18m<br />
<br />
### use project account once accounting is implemented<br />
##SBATCH --account=rwth0303<br />
<br />
### send email at job start and end<br />
#SBATCH --mail-type=ALL<br />
#SBATCH --mail-user=anmey@itc.rwth-aachen.de<br />
<br />
module load CHEMISTRY gaussian<br />
### the gaussian module allocates the scratch directory<br />
echo $GAUSS_SCRDIR<br />
<br />
### adjust working directory and input file names and output directory names<br />
export WDIR=/home/da026566/hpc/benchmarks/Gaussian/Raabe<br />
<br />
export INP1=small.inp12-5gb<br />
export INP2=small.inp12-5gb<br />
export INP3=small.inp12-5gb<br />
export INP4=small.inp12-5gb<br />
<br />
export OUT1=run1<br />
export OUT2=run2<br />
export OUT3=run3<br />
export OUT4=run4<br />
<br />
### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx<br />
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx<br />
### Input files are assumed to be in $WDIR/$INPx<br />
<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3<br />
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4<br />
<br />
uptime<br />
date<br />
<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \<br />
numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out ) &<br />
pid1=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \<br />
numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out ) &<br />
pid2=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \<br />
numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out ) &<br />
pid3=$!<br />
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \<br />
export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \<br />
numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out ) &<br />
pid4=$!<br />
<br />
wait $pid1 $pid2 $pid3 $pid4<br />
<br />
date<br />
uptime<br />
cd $WDIR<br />
ls -l $WDIR/$SLURM_JOB_ID/*/*<br />
ls -l $GAUSS_SCRDIR/$SLURM_JOB_ID/*/*<br />
</syntaxhighlight><br />
<br />
<br />
== Links and more Information ==<br />
t.b.a.</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=SLURM&diff=1527SLURM2019-03-20T16:47:21Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>== General ==<br />
<br />
SLURM is a workload manager / job [[scheduler]]. To get an overview of the functionality of a scheduler, go [[Scheduler#General|here]] or to the [[Scheduling_Basics|Scheduling Basics]].<br />
<br />
<br />
__TOC__<br />
<br />
<br />
== #SBATCH Usage ==<br />
<br />
If you are writing a [[jobscript]] for a SLURM batch system, the magic cookie is "#SBATCH". To use it, start a new line in your script with "#SBATCH". Following that, you can put one of the parameters shown below, where the word written in <...> should be replaced with a value.<br />
<br />
Basic settings:<br />
{| class="wikitable" style="width: 40%;"<br />
| Parameter || Function<br />
|-<br />
| --job-name=<name> || job name<br />
|-<br />
| --output=<path> || path to the file where the job (error) output is written to<br />
|}<br />
<br />
Requesting resources:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --time=<runlimit> || runtime limit in the format hours:min:sec; once the time specified is up, the job will be killed by the [[scheduler]]<br />
|-<br />
| --mem=<memlimit> || job memory request per node, usually an integer followed by a prefix for the unit (e. g. --mem=1G for 1 GB)<br />
|}<br />
<br />
Parallel programming (read more [[Parallel_Programming|here]]):<br />
<br />
Settings for OpenMP:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=1 || start a parallel job for a shared-memory system on only one node<br />
|-<br />
| --cpus-per-task=<num_threads> || number of threads to execute OpenMP application with<br />
|-<br />
| --ntasks-per-core=<num_hyperthreads> || number of hyperthreads per core; i. e. any value greater than 1 will turn on hyperthreading (the possible maximum depends on your CPU)<br />
|-<br />
| --ntasks-per-node=1 || for OpenMP, use one task per node only<br />
|}<br />
<br />
Settings for MPI:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=<num_nodes> || start a parallel job for a distributed-memory system on several nodes<br />
|-<br />
| --cpus-per-task=1 || for MPI, use one task per CPU<br />
|-<br />
| --ntasks-per-core=1 || disable hyperthreading<br />
|-<br />
| --ntasks-per-node=<num_procs> || number of processes per node (the possible maximum depends on your nodes)<br />
|}<br />
<br />
Email notifications:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --mail-type=<type> || type can be one of BEGIN, END, FAIL, REQUEUE or ALL (where a mail will be sent each time the status of your process changes)<br />
|-<br />
| --mail-user=<email_address> || email address to send notifications to<br />
|}<br />
<br />
== Job Submission ==<br />
<br />
This command submits the job you defined in your [[Jobscript|jobscript]] to the batch system:<br />
<br />
$ sbatch jobscript.sh<br />
<br />
Just like any other incoming job, your job will first be queued. The scheduler then decides when your job will run. The more resources your job requires, the longer it may wait in the queue.<br />
<br />
You can check the current status of your submitted jobs and their job ids with the following shell command. A job can either be pending <code>PD</code> (waiting for free nodes to run on) or running <code>R</code> (the jobscript is currently being executed). This command will also print the time (hours:min:sec) that your job has been running for.<br />
<br />
$ squeue -u <user_id><br />
<br />
If you submitted a job by accident or realise that it might not be running correctly, you can remove it from the queue, or terminate it while running, by typing:<br />
<br />
$ scancel <job_id><br />
<br />
Furthermore, information about current and past jobs can be accessed via:<br />
$ sacct<br />
with more detailed information at the [https://slurm.schedmd.com/sacct.html Slurm documentation of this command]<br />
<br />
== Array and Chain Jobs ==<br />
<br />
<syntaxhighlight lang="zsh"><br />
<br />
sbatch --array=1-4 -N1 somejob.sh<br />
<br />
</syntaxhighlight><br />
<br />
This creates an array job with 4 subjobs, which may be executed in any order (to limit how many run simultaneously, a limit such as --array=1-4%1 can be used). An explicit order can be forced by either submitting each subjob at the end of the one before (which may prolong pending) or by using the dependencies feature, which results in a chain job.<br />
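A minimal array job script (here assumed to be somejob.sh) might look like the following sketch; myapp.exe and the input file names are hypothetical:<br />

```shell
#!/bin/bash
### Slurm starts this script once per subjob and sets
### SLURM_ARRAY_TASK_ID to 1, 2, 3 or 4
#SBATCH --array=1-4
#SBATCH --nodes=1

### each subjob processes its own input file, e.g. input_3.dat
./myapp.exe input_${SLURM_ARRAY_TASK_ID}.dat
```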
<br />
<syntaxhighlight lang="zsh"><br />
<br />
#SBATCH --dependency=<type><br />
<br />
</syntaxhighlight><br />
<br />
The available conditions for chain jobs are <br />
<br />
{| class="wikitable" style="width: 60%;"<br />
| Condition || Function<br />
|-<br />
| after:<jobID> || job can start once job <jobID> has started execution<br />
|-<br />
| afterany:<jobID> || job can start once job <jobID> has terminated<br />
|-<br />
| afterok:<jobID> || job can start once job <jobID> has terminated successfully<br />
|-<br />
| afternotok:<jobID> || job can start once job <jobID> has terminated upon failure<br />
|-<br />
| singleton || job can start once any previous job with identical name and user has terminated<br />
|}<br />
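For example, a two-step chain job can be set up on the command line by capturing the first job's ID; this is a sketch, and preprocess.sh and compute.sh are hypothetical job scripts:<br />

```shell
### submit the first job; --parsable makes sbatch print only the job ID
jid=$(sbatch --parsable preprocess.sh)

### the second job starts only once the first has terminated successfully
sbatch --dependency=afterok:$jid compute.sh
```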
<br />
== Jobscript Examples ==<br />
<br />
This serial job will run a given executable, in this case "myapp.exe".<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MYJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MYJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 15 min 30 sec<br />
#SBATCH --time=00:15:30<br />
<br />
### Memory your job needs per node, e. g. 1 GB<br />
#SBATCH --mem=1G<br />
<br />
### The last part consists of regular shell commands:<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Execute your application<br />
myapp.exe<br />
</syntaxhighlight><br />
<br />
If you'd like to run a parallel job on a cluster that is managed by SLURM, you have to state this explicitly by launching your application with "srun <my_executable>" in your jobscript.<br />
<br />
This OpenMP job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 24 threads.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=OMPJOB<br />
<br />
### File for the output<br />
#SBATCH --output=OMPJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 30 min<br />
#SBATCH --time=00:30:00<br />
<br />
### Memory your job needs per node, e. g. 500 MB<br />
#SBATCH --mem=500M<br />
<br />
### Use one node for parallel jobs on shared-memory systems<br />
#SBATCH --nodes=1<br />
<br />
### Number of threads to use, e. g. 24<br />
#SBATCH --cpus-per-task=24<br />
<br />
### Number of hyperthreads per core<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Tasks per node (for shared-memory parallelisation, use 1)<br />
#SBATCH --ntasks-per-node=1<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to the value specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
This MPI job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 12 processes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MPIJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MPIJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 50 min<br />
#SBATCH --time=00:50:00<br />
<br />
### Memory your job needs per node, e. g. 250 MB<br />
#SBATCH --mem=250M<br />
<br />
### Use more than one node for parallel jobs on distributed-memory systems, e. g. 2<br />
#SBATCH --nodes=2<br />
<br />
### Number of CPUS per task (for distributed-memory parallelisation, use 1)<br />
#SBATCH --cpus-per-task=1<br />
<br />
### Disable hyperthreading by setting the tasks per core to 1<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Number of processes per node, e. g. 6 (6 processes on 2 nodes = 12 processes in total)<br />
#SBATCH --ntasks-per-node=6<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to 1, as specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
Please find more elaborate SLURM job scripts for <br />
[[hybrid slurm job|running a hybrid MPI+OpenMP program in a batch job]] and for<br />
[[multiple runs in one slurm job|running multiple OpenMP programs at a time in one batch job]].<br />
<br />
<br />
<br />
== Site specific notes ==<br />
<br />
=== RRZE ===<br />
<br />
* <code>--output=</code> ''should not'' be used on RRZE's clusters; the submit filter already sets suitable defaults automatically<br />
* <code>--mem=<memlimit></code> '''must not''' be used on RRZE's clusters<br />
* the first line of the job script ''should be'' <code>#!/bin/bash -l</code>, otherwise <code>module</code> commands won't work in the job script<br />
* to have a clean environment in job scripts, it is recommended to add <code>#SBATCH --export=NONE</code> '''and''' <code>unset SLURM_EXPORT_ENV</code> to the job script. Otherwise, the job will inherit some settings from the submitting shell.<br />
* access to the parallel file system has to be specified by <code>#SBATCH --constraint=parfs</code> or the command line shortcut <code>-C parfs</code><br />
* access to hardware performance counters (e.g. to be able to use <code>likwid-perfctr</code>) has to be requested by <code>#SBATCH --constraint=hwperf</code> or the command line shortcut <code>-C hwperf</code>. Only request that feature if you really want to access the hardware performance counters, as the feature interferes with the automatic system monitoring.<br />
* multiple features have to be requested in a single <code>--constraint=</code> statement, listing all required features separated by ampersand, e.g. <code>hwperf&parfs</code><br />
* for Intel MPI, RRZE recommends the usage of <code>mpirun</code> instead of <code>srun</code>; if <code>srun</code> shall be used, the additional command line argument <code>--mpi=pmi2</code> is required. The command line option <code>-ppn</code> of <code>mpirun</code> only works if you <code>export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off</code> before.<br />
* for <code>squeue</code> the option <code>-u user</code> does not have any effect as you always only see your own jobs<br />
<br />
== References ==<br />
<br />
[https://doku.lrz.de/display/PUBLIC/Example+parallel+job+scripts+on+the+Linux-Cluster/ Advanced SLURM jobscript examples]<br />
<br />
[http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/ Detailed guide to more advanced scripts]<br />
<br />
[https://slurm.schedmd.com/sbatch.html SBATCH documentation]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_Slurm_Job&diff=1526Hybrid Slurm Job2019-03-20T16:46:31Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>[[SLURM|Slurm]] is a popular workload manager / job scheduler. <br />
Here you can find an example job script for launching a program that is parallelized with MPI and OpenMP at the same time.<br />
You may find the toy program useful to get started.<br />
<br />
__TOC__<br />
<br />
== Slurm Job Script ==<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
<br />
== Hybrid Fortran Toy Program ==<br />
You can use this hybrid toy Fortran90 program to test the above job script<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE),threadid<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name,len,ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
<br />
== Job Output Example ==<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight></div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_Slurm_Job&diff=1525Hybrid Slurm Job2019-03-20T16:45:35Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>[[SLURM|Slurm]] is a popular workload manager / job scheduler. <br />
Here you can find an example of a job script that launches a program parallelized with both MPI and OpenMP.<br />
You may find the toy program useful to get started.<br />
<br />
== Slurm Job Script ==<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
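<br />
Note that <code>$SLURM_CPUS_PER_TASK</code> is only set by Slurm when <code>--cpus-per-task</code> is specified. If you adapt this script, a defensive variant of the export line (a sketch, not part of the original script) falls back to a single thread:<br />
<syntaxhighlight lang="bash"><br />
### Fall back to 1 OpenMP thread if --cpus-per-task was not requested<br />
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}<br />
### print the chosen value into the job log<br />
echo $OMP_NUM_THREADS<br />
</syntaxhighlight><br />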
<br />
== Hybrid Fortran Toy Program ==<br />
You can use this hybrid Fortran 90 toy program to test the above job script:<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), threadid, namelen<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name, namelen, ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
<br />
== Job Output Example ==<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight lang="fortran"><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight></div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_Slurm_Job&diff=1524Hybrid Slurm Job2019-03-20T16:45:11Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>[[SLURM|SLURM]] is a popular workload manager / job scheduler. <br />
Here you can find an example of a job script that launches a program parallelized with both MPI and OpenMP.<br />
You may find the toy program useful to get started.<br />
<br />
== Slurm Job Script ==<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
<br />
== Hybrid Fortran Toy Program ==<br />
You can use this hybrid Fortran 90 toy program to test the above job script:<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), threadid, namelen<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name, namelen, ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
<br />
== Job Output Example ==<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight lang="fortran"><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight></div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_Slurm_Job&diff=1523Hybrid Slurm Job2019-03-20T16:44:04Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>[[Slurm|SLURM]] is a popular workload manager / job scheduler. <br />
Here you can find an example of a job script that launches a program parallelized with both MPI and OpenMP.<br />
You may find the toy program useful to get started.<br />
<br />
== Slurm Job Script ==<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
<br />
== Hybrid Fortran Toy Program ==<br />
You can use this hybrid Fortran 90 toy program to test the above job script:<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), threadid, namelen<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name, namelen, ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
<br />
== Job Output Example ==<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight lang="fortran"><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight></div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_slurm_job&diff=1522Hybrid slurm job2019-03-20T16:34:17Z<p>Dieter-anmey-f9d9@rwth-aachen.de: Dieter-anmey-6410@rwth-aachen.de moved page Hybrid slurm job to Hybrid Slurm Job</p>
<hr />
<div>#REDIRECT [[Hybrid Slurm Job]]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_Slurm_Job&diff=1521Hybrid Slurm Job2019-03-20T16:34:16Z<p>Dieter-anmey-f9d9@rwth-aachen.de: Dieter-anmey-6410@rwth-aachen.de moved page Hybrid slurm job to Hybrid Slurm Job</p>
<hr />
<div>Work in progress ....<br />
<br />
<br />
<br />
Short introduction: Sample Page is a page that shows how to layout a Wiki-Page. In the introduction you should describe what this is and what it is used for.<br />
<br />
[[File:ProPE_Logo.PNG|thumb|200px|ProPE Logo]]<br />
<br />
<br />
== Basic usage ==<br />
<syntaxhighlight lang="bash"><br />
$ cd ..<br />
</syntaxhighlight><br />
tut a, b, c<br />
<syntaxhighlight lang="bash"><br />
$ ls -l<br />
</syntaxhighlight><br />
tut d, e und f<br />
<br />
== Tips and Tricks ==<br />
Get the source code of this sample Page and copy it into new pages to start with a reasonable structure.<br />
<br />
== Common Pitfalls ==<br />
blabla ... Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.<br />
<br />
<br />
== Links and more Information ==<br />
For information on how to do things like LaTeX, code highlighting or pictures, check the [[Wiki Syntax]]<br />
<br />
<br />
<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
<br />
You can use this hybrid Fortran 90 toy program to test the above job script:<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), threadid, namelen<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name, namelen, ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight lang="fortran"><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight></div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_runs_in_one_slurm_job&diff=1520Multiple runs in one slurm job2019-03-20T16:33:21Z<p>Dieter-anmey-f9d9@rwth-aachen.de: Dieter-anmey-6410@rwth-aachen.de moved page Multiple runs in one slurm job to Multiple Program Runs in one Slurm Job</p>
<hr />
<div>#REDIRECT [[Multiple Program Runs in one Slurm Job]]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1519Multiple Program Runs in one Slurm Job2019-03-20T16:33:20Z<p>Dieter-anmey-f9d9@rwth-aachen.de: Dieter-anmey-6410@rwth-aachen.de moved page Multiple runs in one slurm job to Multiple Program Runs in one Slurm Job</p>
<hr />
<div>Work in progress ...<br />
<br />
<br />
Short introduction: Sample Page is a page that shows how to layout a Wiki-Page. In the introduction you should describe what this is and what it is used for.<br />
<br />
[[File:ProPE_Logo.PNG|thumb|200px|ProPE Logo]]<br />
<br />
<br />
== Basic usage ==<br />
<syntaxhighlight lang="bash"><br />
$ cd ..<br />
</syntaxhighlight><br />
tut a, b, c<br />
<syntaxhighlight lang="bash"><br />
$ ls -l<br />
</syntaxhighlight><br />
tut d, e und f<br />
<br />
== Tips and Tricks ==<br />
Get the source code of this sample Page and copy it into new pages to start with a reasonable structure.<br />
<br />
== Common Pitfalls ==<br />
blabla ... Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.<br />
<br />
<br />
== Links and more Information ==<br />
For some information how to do some things like LaTeX, Code Highlight or pictures, check the [[Wiki Syntax]]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Multiple_Program_Runs_in_one_Slurm_Job&diff=1518Multiple Program Runs in one Slurm Job2019-03-20T16:31:18Z<p>Dieter-anmey-f9d9@rwth-aachen.de: Created page with "Work in progress ... Short introduction: Sample Page is a page that shows how to layout a Wiki-Page. In the introduction you should describe what this is and what it is used..."</p>
<hr />
<div>Work in progress ...<br />
<br />
<br />
Short introduction: Sample Page is a page that shows how to layout a Wiki-Page. In the introduction you should describe what this is and what it is used for.<br />
<br />
[[File:ProPE_Logo.PNG|thumb|200px|ProPE Logo]]<br />
<br />
<br />
== Basic usage ==<br />
<syntaxhighlight lang="bash"><br />
$ cd ..<br />
</syntaxhighlight><br />
tut a, b, c<br />
<syntaxhighlight lang="bash"><br />
$ ls -l<br />
</syntaxhighlight><br />
tut d, e und f<br />
<br />
== Tips and Tricks ==<br />
Get the source code of this sample Page and copy it into new pages to start with a reasonable structure.<br />
<br />
== Common Pitfalls ==<br />
blabla ... Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.<br />
<br />
<br />
== Links and more Information ==<br />
For some information how to do some things like LaTeX, Code Highlight or pictures, check the [[Wiki Syntax]]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Hybrid_Slurm_Job&diff=1517Hybrid Slurm Job2019-03-20T16:30:58Z<p>Dieter-anmey-f9d9@rwth-aachen.de: Created page with "Work in progress .... Short introduction: Sample Page is a page that shows how to layout a Wiki-Page. In the introduction you should describe what this is and what it is us..."</p>
<hr />
<div>Work in progress ....<br />
<br />
<br />
<br />
Short introduction: Sample Page is a page that shows how to layout a Wiki-Page. In the introduction you should describe what this is and what it is used for.<br />
<br />
[[File:ProPE_Logo.PNG|thumb|200px|ProPE Logo]]<br />
<br />
<br />
== Basic usage ==<br />
<syntaxhighlight lang="bash"><br />
$ cd ..<br />
</syntaxhighlight><br />
tut a, b, c<br />
<syntaxhighlight lang="bash"><br />
$ ls -l<br />
</syntaxhighlight><br />
tut d, e und f<br />
<br />
== Tips and Tricks ==<br />
Get the source code of this sample Page and copy it into new pages to start with a reasonable structure.<br />
<br />
== Common Pitfalls ==<br />
blabla ... Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.<br />
<br />
<br />
== Links and more Information ==<br />
For information on how to do things like LaTeX, code highlighting or pictures, check the [[Wiki Syntax]]<br />
<br />
<br />
<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
<br />
You can use this hybrid Fortran 90 toy program to test the above job script:<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), threadid, namelen<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name, namelen, ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight lang="fortran"><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight></div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=SLURM&diff=1516SLURM2019-03-20T16:23:41Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>== General ==<br />
<br />
SLURM is a workload manager / job [[scheduler]]. To get an overview of the functionality of a scheduler, go [[Scheduler#General|here]] or to the [[Scheduling_Basics|Scheduling Basics]].<br />
<br />
<br />
__TOC__<br />
<br />
<br />
== #SBATCH Usage ==<br />
<br />
If you are writing a [[jobscript]] for a SLURM batch system, the magic cookie is "#SBATCH". To use it, start a new line in your script with "#SBATCH". Following that, you can put one of the parameters shown below, where the word written in <...> should be replaced with a value.<br />
<br />
Basic settings:<br />
{| class="wikitable" style="width: 40%;"<br />
| Parameter || Function<br />
|-<br />
| --job-name=<name> || job name<br />
|-<br />
| --output=<path> || path to the file where the job (error) output is written to<br />
|}<br />
<br />
Requesting resources:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --time=<runlimit> || runtime limit in the format hours:min:sec; once the time specified is up, the job will be killed by the [[scheduler]]<br />
|-<br />
| --mem=<memlimit> || job memory request per node, usually an integer followed by a prefix for the unit (e. g. --mem=1G for 1 GB)<br />
|}<br />
<br />
Parallel programming (read more [[Parallel_Programming|here]]):<br />
<br />
Settings for OpenMP:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=1 || start a parallel job for a shared-memory system on only one node<br />
|-<br />
| --cpus-per-task=<num_threads> || number of threads to execute OpenMP application with<br />
|-<br />
| --ntasks-per-core=<num_hyperthreads> || number of hyperthreads per core; i. e. any value greater than 1 will turn on hyperthreading (the possible maximum depends on your CPU)<br />
|-<br />
| --ntasks-per-node=1 || for OpenMP, use one task per node only<br />
|}<br />
<br />
Settings for MPI:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=<num_nodes> || start a parallel job for a distributed-memory system on several nodes<br />
|-<br />
| --cpus-per-task=1 || for MPI, use one task per CPU<br />
|-<br />
| --ntasks-per-core=1 || disable hyperthreading<br />
|-<br />
| --ntasks-per-node=<num_procs> || number of processes per node (the possible maximum depends on your nodes)<br />
|}<br />
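<br />
The total number of MPI processes such a request yields is the product of the number of nodes and the tasks per node. As a quick sanity check (illustrative shell arithmetic, not a Slurm command; the values match the MPI jobscript example further down):<br />
<syntaxhighlight lang="bash"><br />
### 2 nodes with 6 MPI ranks each give 12 processes in total<br />
NODES=2<br />
NTASKS_PER_NODE=6<br />
echo $(( NODES * NTASKS_PER_NODE ))<br />
</syntaxhighlight><br />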
<br />
Email notifications:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --mail-type=<type> || type can be one of BEGIN, END, FAIL, REQUEUE or ALL (where a mail will be sent each time the status of your process changes)<br />
|-<br />
| --mail-user=<email_address> || email address to send notifications to<br />
|}<br />
<br />
== Job Submission ==<br />
<br />
This command submits the job you defined in your [[Jobscript|jobscript]] to the batch system:<br />
<br />
$ sbatch jobscript.sh<br />
<br />
Just like any other incoming job, your job will first be queued. Then, the scheduler decides when your job will be run. The more resources your job requires, the longer it may be waiting to execute.<br />
<br />
You can check the current status of your submitted jobs and their job ids with the following shell command. A job can either be pending <code>PD</code> (waiting for free nodes to run on) or running <code>R</code> (the jobscript is currently being executed). This command will also print the time (hours:min:sec) that your job has been running for.<br />
<br />
$ squeue -u <user_id><br />
<br />
In case you submitted a job by accident or realised that your job might not be running correctly, you can always remove it from the queue or terminate it while it is running by typing:<br />
<br />
$ scancel <job_id><br />
<br />
Furthermore, information about current and past jobs can be accessed via:<br />
$ sacct<br />
with more detailed information available in the [https://slurm.schedmd.com/sacct.html Slurm documentation of this command]<br />
<br />
== Array and Chain Jobs ==<br />
<br />
<syntaxhighlight lang="zsh"><br />
<br />
sbatch --array=1-4 -N1 somejob.sh<br />
<br />
</syntaxhighlight><br />
<br />
This creates an array job with 4 subjobs, each on one node, which may be executed in any order. An explicit order can be enforced by either submitting each subjob at the end of the previous one (which may prolong queueing) or by using the dependency feature, which results in a chain job.<br />
<br />
<syntaxhighlight lang="zsh"><br />
<br />
#SBATCH --dependency=<type><br />
<br />
</syntaxhighlight><br />
<br />
The available conditions for chain jobs are <br />
<br />
{| class="wikitable" style="width: 60%;"<br />
| Condition || Function<br />
|-<br />
| after:<jobID> || job can start once job <jobID> has started execution<br />
|-<br />
| afterany:<jobID> || job can start once job <jobID> has terminated<br />
|-<br />
| afterok:<jobID> || job can start once job <jobID> has terminated successfully<br />
|-<br />
| afternotok:<jobID> || job can start once job <jobID> has terminated upon failure<br />
|-<br />
| singleton || job can start once any previous job with identical name and user has terminated<br />
|}<br />
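<br />
A two-step chain can be scripted on top of these conditions. The sketch below uses hypothetical script names (<code>first_step.sh</code>, <code>second_step.sh</code>); <code>sbatch --parsable</code> prints only the job id of the submitted job, which makes it easy to capture:<br />
<syntaxhighlight lang="bash"><br />
### jobid=$(sbatch --parsable first_step.sh)<br />
### sbatch --dependency=afterok:${jobid} second_step.sh<br />
### With a job id of e.g. 12345, the second submission thus carries the option:<br />
jobid=12345<br />
echo "--dependency=afterok:${jobid}"<br />
</syntaxhighlight><br />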
<br />
== Jobscript Examples ==<br />
<br />
This serial job will run a given executable, in this case "myapp.exe".<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MYJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MYJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 15 min 30 sec<br />
#SBATCH --time=00:15:30<br />
<br />
### Memory your job needs per node, e. g. 1 GB<br />
#SBATCH --mem=1G<br />
<br />
### The last part consists of regular shell commands:<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Execute your application<br />
myapp.exe<br />
</syntaxhighlight><br />
<br />
If you'd like to run a parallel job on a cluster that is managed by SLURM, you have to launch it through the scheduler: use the command "srun <my_executable>" in your jobscript.<br />
<br />
This OpenMP job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 24 threads.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=OMPJOB<br />
<br />
### File for the output<br />
#SBATCH --output=OMPJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 30 min<br />
#SBATCH --time=00:30:00<br />
<br />
### Memory your job needs per node, e. g. 500 MB<br />
#SBATCH --mem=500M<br />
<br />
### Use one node for parallel jobs on shared-memory systems<br />
#SBATCH --nodes=1<br />
<br />
### Number of threads to use, e. g. 24<br />
#SBATCH --cpus-per-task=24<br />
<br />
### Number of hyperthreads per core<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Tasks per node (for shared-memory parallelisation, use 1)<br />
#SBATCH --ntasks-per-node=1<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to the value specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
This MPI job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 12 processes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MPIJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MPIJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 50 min<br />
#SBATCH --time=00:50:00<br />
<br />
### Memory your job needs per node, e. g. 250 MB<br />
#SBATCH --mem=250M<br />
<br />
### Use more than one node for parallel jobs on distributed-memory systems, e. g. 2<br />
#SBATCH --nodes=2<br />
<br />
### Number of CPUS per task (for distributed-memory parallelisation, use 1)<br />
#SBATCH --cpus-per-task=1<br />
<br />
### Disable hyperthreading by setting the tasks per core to 1<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Number of processes per node, e. g. 6 (6 processes on 2 nodes = 12 processes in total)<br />
#SBATCH --ntasks-per-node=6<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to 1, as specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
Please find more elaborate SLURM job scripts for <br />
[[hybrid slurm job|running a hybrid MPI+OpenMP program in a batch job]] and for<br />
[[multiple runs in one slurm job|running multiple OpenMP programs at a time in one batch job]].<br />
<br />
<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "hello.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun hello.exe<br />
</syntaxhighlight><br />
<br />
You can use this hybrid Fortran 90 toy program to test the above job script:<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), threadid, namelen<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name, namelen, ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight lang="fortran"><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight><br />
<br />
== Site specific notes ==<br />
<br />
=== RRZE ===<br />
<br />
* <code>--output=</code> ''should not'' be used on RRZE's clusters; the submit filter already sets suitable defaults automatically<br />
* <code>--mem=<memlimit></code> '''must not''' be used on RRZE's clusters<br />
* the first line of the job script ''should be'' <code>#!/bin/bash -l</code>; otherwise <code>module</code> commands won't work in the job script<br />
* to have a clean environment in job scripts, it is recommended to add <code>#SBATCH --export=NONE</code> '''and''' <code>unset SLURM_EXPORT_ENV</code> to the job script. Otherwise, the job will inherit some settings from the submitting shell.<br />
* access to the parallel file system has to be requested by <code>#SBATCH --constraint=parfs</code> or the command line shortcut <code>-C parfs</code><br />
* access to hardware performance counters (e.g. to be able to use <code>likwid-perfctr</code>) has to be requested by <code>#SBATCH --constraint=hwperf</code> or the command line shortcut <code>-C hwperf</code>. Only request that feature if you really need to access the hardware performance counters, as it interferes with the automatic system monitoring.<br />
* multiple features have to be requested in a single <code>--constraint=</code> statement, listing all required features separated by ampersand, e.g. <code>hwperf&parfs</code><br />
* for Intel MPI, RRZE recommends the usage of <code>mpirun</code> instead of <code>srun</code>; if <code>srun</code> shall be used, the additional command line argument <code>--mpi=pmi2</code> is required. The command line option <code>-ppn</code> of <code>mpirun</code> only works if you <code>export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off</code> before.<br />
* for <code>squeue</code> the option <code>-u user</code> does not have any effect as you always only see your own jobs<br />
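Putting the notes above together, a job-script header for RRZE's clusters might start as follows. This is a hedged sketch only: the job name, process count, and executable are placeholders, not RRZE specifics.<br />
<br />
```bash
#!/bin/bash -l
### Sketch of an RRZE-style job-script header based on the notes above;
### "myapp.exe" is a placeholder.
#SBATCH --job-name=MYJOB
#SBATCH --time=00:30:00
### clean environment, as recommended above
#SBATCH --export=NONE
### request parallel file system and hardware counters in one statement
#SBATCH --constraint=hwperf&parfs

unset SLURM_EXPORT_ENV

### Intel MPI: RRZE recommends mpirun; -ppn only works with this variable set
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
mpirun -ppn 6 ./myapp.exe
```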
<br />
== References ==<br />
<br />
[https://doku.lrz.de/display/PUBLIC/Example+parallel+job+scripts+on+the+Linux-Cluster/ Advanced SLURM jobscript examples]<br />
<br />
[http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/ Detailed guide to more advanced scripts]<br />
<br />
[https://slurm.schedmd.com/sbatch.html SBATCH documentation]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=HPC_Wiki&diff=1390HPC Wiki2019-02-26T10:06:25Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>Welcome to the HPC Wiki! This aims to be a site-independent HPC documentation: all specific information about computing centers in different locations is bundled in the site-specifics section on the left-hand side, while all other articles are kept as general and site-independent as possible. This way, everybody can use the same knowledge base, regardless of where they are from; if information about the configuration of a particular system or computing center is needed, the site-specifics section gives an overview of where to find it.<br />
<br />
Furthermore, material for the different target groups can be found in the menu on the left-hand side.<br />
<br />
== Categories ==<br />
<br />
[[Getting_Started]] is a basic guide for first-time users. It covers a wide range of topics, from access and login to system-independent Unix concepts and data transfers. While this article gives an overview, all articles in the Basics section are written with inexperienced users in mind, to explain concepts in an easy-to-understand way.<br />
<br />
Similar articles for the User and Developer sections are planned, but not yet finished.<br />
<br />
Look into the [[FAQs]] to see tips and instructions on [[How-to-Contribute]] to this wiki.<br />
<br />
== In Progress ==<br />
General: [[How-to-Contribute]]<br />
<br />
<br />
Basics/HPC-User: [[make]], [[cmake]], [[Ssh_keys]], [[compiler]], [[Modules]], [[Vim]], [[screen/tmux]], [[ssh]] [[python/pip]], [[scp]], [[rsync]], [[git]], [[shell]], [[chmod]], [[tar]], [[sh-file]], [[NUMA]]<br />
<br />
<br />
HPC-Dev: [[Load_Balancing]], [[Performance Engineering]], [[correctness checking]]<br />
<br />
HPC-Programs: [[Measurement-tools]], [[Likwid]], [[Vampir]], [[ScoreP]], [[MUST]]<br />
<br />
<br />
HPC-Pages:<br />
[[Software]], [[Access]], [[Site-specific_documentation]], [[measurement-tools]], [[likwid]]<br />
<br />
== ToDo ==<br />
<br />
expand [[cmake]], link examples to chain jobs in the scheduling articles, expand [[OpenMP]] (Jenni) & [[MPI]] (Jan), Basics Benchmarking, scaling tests, Resource planning, Tickettool, [[How-to-Contribute]] (Stefan), Tools-Overview, Remove RWTH Reference from [[Must]], [[Intel VTune]] and check with local documentation</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=SLURM&diff=1145SLURM2018-12-20T17:50:27Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>== General ==<br />
<br />
SLURM is a workload manager / job [[scheduler]]. To get an overview of the functionality of a scheduler, go [[Scheduler#General|here]] or to the [[Scheduling_Basics|Scheduling Basics]].<br />
<br />
<br />
__TOC__<br />
<br />
<br />
== #SBATCH Usage ==<br />
<br />
If you are writing a [[jobscript]] for a SLURM batch system, the magic cookie is "#SBATCH". To use it, start a new line in your script with "#SBATCH". Following that, you can put one of the parameters shown below, where the word written in <...> should be replaced with a value.<br />
<br />
Basic settings:<br />
{| class="wikitable" style="width: 40%;"<br />
| Parameter || Function<br />
|-<br />
| --job-name=<name> || job name<br />
|-<br />
| --output=<path> || path to the file where the job (error) output is written to<br />
|}<br />
<br />
Requesting resources:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --time=<runlimit> || runtime limit in the format hours:min:sec; once the time specified is up, the job will be killed by the [[scheduler]]<br />
|-<br />
| --mem=<memlimit> || job memory request per node, usually an integer followed by a unit suffix (e. g. --mem=1G for 1 GB)<br />
|}<br />
<br />
Parallel programming (read more [[Parallel_Programming|here]]):<br />
<br />
Settings for OpenMP:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=1 || start a parallel job for a shared-memory system on only one node<br />
|-<br />
| --cpus-per-task=<num_threads> || number of threads to execute OpenMP application with<br />
|-<br />
| --ntasks-per-core=<num_hyperthreads> || number of hyperthreads per core; i. e. any value greater than 1 will turn on hyperthreading (the possible maximum depends on your CPU)<br />
|-<br />
| --ntasks-per-node=1 || for OpenMP, use one task per node only<br />
|}<br />
<br />
Settings for MPI:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=<num_nodes> || start a parallel job for a distributed-memory system on several nodes<br />
|-<br />
| --cpus-per-task=1 || for MPI, use one task per CPU<br />
|-<br />
| --ntasks-per-core=1 || disable hyperthreading<br />
|-<br />
| --ntasks-per-node=<num_procs> || number of processes per node (the possible maximum depends on your nodes)<br />
|}<br />
<br />
Email notifications:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --mail-type=<type> || type can be one of BEGIN, END, FAIL, REQUEUE or ALL (where a mail will be sent each time the status of your process changes)<br />
|-<br />
| --mail-user=<email_address> || email address to send notifications to<br />
|}<br />
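For example, to be notified about every status change of a job, the following two lines could be added to a job script (the address below is a placeholder):<br />
<br />
```bash
### Hypothetical notification settings; replace the address with your own
#SBATCH --mail-type=ALL
#SBATCH --mail-user=max.mustermann@example.org
```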
<br />
== Job Submission ==<br />
<br />
This command submits the job you defined in your [[Jobscript|jobscript]] to the batch system:<br />
<br />
$ sbatch jobscript.sh<br />
<br />
Just like any other incoming job, your job will first be queued. Then, the scheduler decides when your job will be run. The more resources your job requires, the longer it may wait in the queue before it is executed.<br />
<br />
You can check the current status of your submitted jobs and their job ids with the following shell command. A job can either be pending <code>PD</code> (waiting for free nodes to run on) or running <code>R</code> (the jobscript is currently being executed). This command will also print the time (hours:min:sec) that your job has been running for.<br />
<br />
$ squeue -u <user_id><br />
<br />
In case you submitted a job by accident or realised that your job might not be running correctly, you can always remove it from the queue, or terminate it while it is running, by typing:<br />
<br />
$ scancel <job_id><br />
<br />
Furthermore, information about current and past jobs can be accessed via:<br />
$ sacct<br />
with more detailed information at the [https://slurm.schedmd.com/sacct.html Slurm documentation of this command]<br />
<br />
== Jobscript Examples ==<br />
<br />
This serial job will run a given executable, in this case "myapp.exe".<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MYJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MYJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 15 min 30 sec<br />
#SBATCH --time=00:15:30<br />
<br />
### Memory your job needs per node, e. g. 1 GB<br />
#SBATCH --mem=1G<br />
<br />
### The last part consists of regular shell commands:<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Execute your application<br />
myapp.exe<br />
</syntaxhighlight><br />
<br />
If you'd like to run a parallel job on a cluster that is managed by SLURM, you have to state that explicitly: launch your program with "srun <my_executable>" in your jobscript.<br />
<br />
This OpenMP job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 24 threads.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=OMPJOB<br />
<br />
### File for the output<br />
#SBATCH --output=OMPJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 30 min<br />
#SBATCH --time=00:30:00<br />
<br />
### Memory your job needs per node, e. g. 500 MB<br />
#SBATCH --mem=500M<br />
<br />
### Use one node for parallel jobs on shared-memory systems<br />
#SBATCH --nodes=1<br />
<br />
### Number of threads to use, e. g. 24<br />
#SBATCH --cpus-per-task=24<br />
<br />
### Number of hyperthreads per core<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Tasks per node (for shared-memory parallelisation, use 1)<br />
#SBATCH --ntasks-per-node=1<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to the value specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
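Note that <code>$SLURM_CPUS_PER_TASK</code> is only set inside a SLURM job. A small sketch of a fallback that keeps the same export line usable when testing a script interactively; the value 24 below merely simulates what SLURM would set in batch mode:<br />
<br />
```bash
#!/bin/bash
# Simulate the value SLURM would set inside the job ...
SLURM_CPUS_PER_TASK=24
# ... and fall back to 1 thread when the variable is unset.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```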
<br />
This MPI job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 12 processes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MPIJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MPIJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 50 min<br />
#SBATCH --time=00:50:00<br />
<br />
### Memory your job needs per node, e. g. 250 MB<br />
#SBATCH --mem=250M<br />
<br />
### Use more than one node for parallel jobs on distributed-memory systems, e. g. 2<br />
#SBATCH --nodes=2<br />
<br />
### Number of CPUS per task (for distributed-memory parallelisation, use 1)<br />
#SBATCH --cpus-per-task=1<br />
<br />
### Disable hyperthreading by setting the tasks per core to 1<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Number of processes per node, e. g. 6 (6 processes on 2 nodes = 12 processes in total)<br />
#SBATCH --ntasks-per-node=6<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to 1, as specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
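The total process count follows directly from the resource requests: <code>--nodes</code> times <code>--ntasks-per-node</code>. A quick sanity check of the numbers used above:<br />
<br />
```bash
#!/bin/bash
# Values mirror the #SBATCH lines of the MPI example above.
nodes=2
ntasks_per_node=6
echo "total MPI processes: $((nodes * ntasks_per_node))"
```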
<br />
<br />
<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### the number of OpenMP threads <br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
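For hybrid jobs it is worth checking that the requests are self-consistent: <code>--ntasks</code> must equal <code>--nodes</code> times <code>--ntasks-per-node</code>, and each node must provide <code>--ntasks-per-node</code> times <code>--cpus-per-task</code> cores. With the numbers from the script above:<br />
<br />
```bash
#!/bin/bash
# Values mirror the #SBATCH lines of the hybrid example above.
nodes=2; ntasks=4; ntasks_per_node=2; cpus_per_task=3
if [ "$ntasks" -eq $((nodes * ntasks_per_node)) ]; then
  echo "rank count consistent"
fi
echo "cores needed per node: $((ntasks_per_node * cpus_per_task))"
```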
<br />
You can use this hybrid toy Fortran90 program to test the above job script<br />
<syntaxhighlight lang="fortran"><br />
program hello<br />
use mpi<br />
use omp_lib<br />
<br />
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE), threadid, namelen<br />
character*(MPI_MAX_PROCESSOR_NAME) name<br />
<br />
call MPI_INIT(ierror)<br />
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)<br />
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)<br />
call MPI_GET_PROCESSOR_NAME(name, namelen, ierror)<br />
<br />
!$omp parallel private(threadid)<br />
threadid=omp_get_thread_num()<br />
print*, 'node: ', trim(name), ' rank:', rank, ', thread_id:', threadid<br />
!$omp end parallel<br />
<br />
call MPI_FINALIZE(ierror)<br />
<br />
end program<br />
</syntaxhighlight><br />
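To try it out, save the program as, e.g., <code>hello.f90</code>, compile it with MPI and OpenMP support enabled, and submit the job script. The compiler wrapper and flag below are examples for Intel MPI and may differ on your system:<br />
<br />
```bash
# Example commands; wrapper and flag names depend on your MPI installation
mpiifort -qopenmp hello.f90 -o myapp.exe   # e.g. with GNU: mpifort -fopenmp
sbatch jobscript.sh
```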
<br />
When sorted, the program output may look like this:<br />
<syntaxhighlight lang="text"><br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 0 , thread_id: 2<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 0<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 1<br />
node: ncm1018.hpc.itc.rwth-aachen.de rank: 1 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 2 , thread_id: 2<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 0<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 1<br />
node: ncm1019.hpc.itc.rwth-aachen.de rank: 3 , thread_id: 2<br />
</syntaxhighlight><br />
<br />
== Site specific notes ==<br />
<br />
=== RRZE ===<br />
<br />
* <code>--output=</code> ''should not'' be used on RRZE's clusters; the submit filter already sets suitable defaults automatically<br />
* <code>--mem=<memlimit></code> '''must not''' be used on RRZE's clusters<br />
* the first line of the job script ''should be'' <code>#!/bin/bash -l</code>; otherwise <code>module</code> commands won't work in the job script<br />
* to have a clean environment in job scripts, it is recommended to add <code>#SBATCH --export=NONE</code> '''and''' <code>unset SLURM_EXPORT_ENV</code> to the job script. Otherwise, the job will inherit some settings from the submitting shell.<br />
* access to the parallel file system has to be requested by <code>#SBATCH --constraint=parfs</code> or the command line shortcut <code>-C parfs</code><br />
* access to hardware performance counters (e.g. to be able to use <code>likwid-perfctr</code>) has to be requested by <code>#SBATCH --constraint=hwperf</code> or the command line shortcut <code>-C hwperf</code>. Only request that feature if you really need to access the hardware performance counters, as it interferes with the automatic system monitoring.<br />
* multiple features have to be requested in a single <code>--constraint=</code> statement, listing all required features separated by ampersand, e.g. <code>hwperf&parfs</code><br />
* for Intel MPI, RRZE recommends the usage of <code>mpirun</code> instead of <code>srun</code>; if <code>srun</code> shall be used, the additional command line argument <code>--mpi=pmi2</code> is required. The command line option <code>-ppn</code> of <code>mpirun</code> only works if you <code>export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off</code> before.<br />
* for <code>squeue</code> the option <code>-u user</code> does not have any effect as you always only see your own jobs<br />
<br />
== References ==<br />
<br />
[https://www.lrz.de/services/compute/linux-cluster/batch_parallel/example_jobs/ Advanced SLURM jobscript examples]<br />
<br />
[http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/ Detailed guide to more advanced scripts]<br />
<br />
[https://slurm.schedmd.com/sbatch.html SBATCH documentation]<br />
<br />
[https://user.cscs.ch/getting_started/running_jobs/jobscript_generator/#slurm-jobscript-generator SLURM jobscript generator]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=SLURM&diff=1143SLURM2018-12-20T17:44:42Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>== General ==<br />
<br />
SLURM is a workload manager / job [[scheduler]]. To get an overview of the functionality of a scheduler, go [[Scheduler#General|here]] or to the [[Scheduling_Basics|Scheduling Basics]].<br />
<br />
<br />
__TOC__<br />
<br />
<br />
== #SBATCH Usage ==<br />
<br />
If you are writing a [[jobscript]] for a SLURM batch system, the magic cookie is "#SBATCH". To use it, start a new line in your script with "#SBATCH". Following that, you can put one of the parameters shown below, where anything written in <...> should be replaced with an actual value.<br />
<br />
Basic settings:<br />
{| class="wikitable" style="width: 40%;"<br />
| Parameter || Function<br />
|-<br />
| --job-name=<name> || job name<br />
|-<br />
| --output=<path> || path to the file where the job (error) output is written to<br />
|}<br />
<br />
Requesting resources:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --time=<runlimit> || runtime limit in the format hours:min:sec; once the time specified is up, the job will be killed by the [[scheduler]]<br />
|-<br />
| --mem=<memlimit> || job memory request per node, usually an integer followed by a prefix for the unit (e. g. --mem=1G for 1 GB)<br />
|}<br />
<br />
Parallel programming (read more [[Parallel_Programming|here]]):<br />
<br />
Settings for OpenMP:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=1 || start a parallel job for a shared-memory system on only one node<br />
|-<br />
| --cpus-per-task=<num_threads> || number of threads to execute OpenMP application with<br />
|-<br />
| --ntasks-per-core=<num_hyperthreads> || number of hyperthreads per core; i. e. any value greater than 1 will turn on hyperthreading (the possible maximum depends on your CPU)<br />
|-<br />
| --ntasks-per-node=1 || for OpenMP, use one task per node only<br />
|}<br />
<br />
Settings for MPI:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --nodes=<num_nodes> || start a parallel job for a distributed-memory system on several nodes<br />
|-<br />
| --cpus-per-task=1 || for MPI, use one task per CPU<br />
|-<br />
| --ntasks-per-core=1 || disable hyperthreading<br />
|-<br />
| --ntasks-per-node=<num_procs> || number of processes per node (the possible maximum depends on your nodes)<br />
|}<br />
<br />
Email notifications:<br />
{| class="wikitable" style="width: 60%;"<br />
| Parameter || Function<br />
|-<br />
| --mail-type=<type> || type can be one of BEGIN, END, FAIL, REQUEUE or ALL (where a mail will be sent each time the status of your process changes)<br />
|-<br />
| --mail-user=<email_address> || email address to send notifications to<br />
|}<br />
<br />
== Job Submission ==<br />
<br />
This command submits the job you defined in your [[Jobscript|jobscript]] to the batch system:<br />
<br />
$ sbatch jobscript.sh<br />
<br />
Just like any other incoming job, your job will first be queued. Then, the scheduler decides when your job will run. The more resources your job requires, the longer it may have to wait before it starts.<br />
<br />
You can check the current status of your submitted jobs and their job ids with the following shell command. A job can either be pending <code>PD</code> (waiting for free nodes to run on) or running <code>R</code> (the jobscript is currently being executed). This command will also print the time (hours:min:sec) that your job has been running for.<br />
<br />
$ squeue -u <user_id><br />
<br />
In case you submitted a job by accident, or realised that your job might not run correctly, you can always remove it from the queue, or terminate it while running, by typing:<br />
<br />
$ scancel <job_id><br />
<br />
Furthermore, information about current and past jobs can be accessed via:<br />
 $ sacct<br />
which is described in more detail in the [https://slurm.schedmd.com/sacct.html Slurm documentation of this command].<br />
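The submit, monitor and cancel steps above can also be scripted. On success, <code>sbatch</code> prints a confirmation line of the form <code>Submitted batch job <id></code>, from which the job id can be captured and then passed to <code>squeue</code>, <code>scancel</code> or <code>sacct</code>. The following sketch simulates the <code>sbatch</code> reply so that it can be read (and run) outside a cluster; on a real system you would capture the actual command output instead:<br />

```bash
#!/bin/bash
# On a real cluster you would capture the reply of sbatch directly:
#   out=$(sbatch jobscript.sh)
# Here the reply is simulated so the sketch runs anywhere:
out="Submitted batch job 123456"

# The job id is the fourth field of the confirmation line
jobid=$(echo "$out" | awk '{print $4}')
echo "jobid=$jobid"

# With the id at hand (cluster only):
#   squeue -j "$jobid"    # status: PD = pending, R = running
#   scancel "$jobid"      # remove the job from the queue
#   sacct  -j "$jobid"    # accounting data after completion
```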
<br />
== Jobscript Examples ==<br />
<br />
This serial job will run a given executable, in this case "myapp.exe".<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MYJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MYJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 15 min 30 sec<br />
#SBATCH --time=00:15:30<br />
<br />
### Memory your job needs per node, e. g. 1 GB<br />
#SBATCH --mem=1G<br />
<br />
### The last part consists of regular shell commands:<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Execute your application<br />
myapp.exe<br />
</syntaxhighlight><br />
<br />
If you'd like to run a parallel job on a cluster that is managed by SLURM, you have to make that explicit: launch your program with the command "srun <my_executable>" in your jobscript.<br />
<br />
This OpenMP job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 24 threads.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=OMPJOB<br />
<br />
### File for the output<br />
#SBATCH --output=OMPJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 30 min<br />
#SBATCH --time=00:30:00<br />
<br />
### Memory your job needs per node, e. g. 500 MB<br />
#SBATCH --mem=500M<br />
<br />
### Use one node for parallel jobs on shared-memory systems<br />
#SBATCH --nodes=1<br />
<br />
### Number of threads to use, e. g. 24<br />
#SBATCH --cpus-per-task=24<br />
<br />
### Use only one hyperthread per core, i. e. disable hyperthreading<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Tasks per node (for shared-memory parallelisation, use 1)<br />
#SBATCH --ntasks-per-node=1<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to the value specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
This MPI job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 12 processes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/bash<br />
<br />
### Job name<br />
#SBATCH --job-name=MPIJOB<br />
<br />
### File for the output<br />
#SBATCH --output=MPIJOB_OUTPUT<br />
<br />
### Time your job needs to execute, e. g. 50 min<br />
#SBATCH --time=00:50:00<br />
<br />
### Memory your job needs per node, e. g. 250 MB<br />
#SBATCH --mem=250M<br />
<br />
### Use more than one node for parallel jobs on distributed-memory systems, e. g. 2<br />
#SBATCH --nodes=2<br />
<br />
### Number of CPUS per task (for distributed-memory parallelisation, use 1)<br />
#SBATCH --cpus-per-task=1<br />
<br />
### Disable hyperthreading by setting the tasks per core to 1<br />
#SBATCH --ntasks-per-core=1<br />
<br />
### Number of processes per node, e. g. 6 (6 processes on 2 nodes = 12 processes in total)<br />
#SBATCH --ntasks-per-node=6<br />
<br />
### The last part consists of regular shell commands:<br />
### Set the number of threads in your cluster environment to 1, as specified above<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
<br />
<br />
<br />
This hybrid MPI+OpenMP job will start the [[Parallel_Programming|parallel program]] "myapp.exe" with 4 MPI processes and 3 OpenMP threads each on 2 compute nodes.<br />
<syntaxhighlight lang="bash"><br />
#!/bin/zsh<br />
<br />
### Job name<br />
#SBATCH --job-name=HelloHybrid<br />
<br />
### 2 compute nodes<br />
#SBATCH --nodes=2<br />
<br />
### 4 MPI ranks<br />
#SBATCH --ntasks=4<br />
<br />
### 2 MPI ranks per node<br />
#SBATCH --ntasks-per-node=2<br />
<br />
### 3 CPUs (OpenMP threads) per MPI rank<br />
#SBATCH --cpus-per-task=3<br />
<br />
### Set the number of OpenMP threads per MPI rank<br />
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK<br />
<br />
### Change to working directory<br />
cd /home/usr/workingdirectory<br />
<br />
### Run your parallel application<br />
srun myapp.exe<br />
</syntaxhighlight><br />
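As a quick sanity check for such hybrid requests: <code>ntasks</code> must equal <code>nodes</code> times <code>ntasks-per-node</code>, and the total number of CPUs the job occupies is <code>nodes</code> times <code>ntasks-per-node</code> times <code>cpus-per-task</code>. For the values in the script above, this arithmetic can be checked in plain shell:<br />

```bash
#!/bin/bash
# Values taken from the hybrid job script above
nodes=2
ntasks=4
ntasks_per_node=2
cpus_per_task=3

# ntasks must match nodes * ntasks-per-node
if [ "$ntasks" -ne $(( nodes * ntasks_per_node )) ]; then
    echo "inconsistent task layout" >&2
    exit 1
fi

# Total number of CPUs the job occupies: 2 * 2 * 3 = 12
total_cpus=$(( nodes * ntasks_per_node * cpus_per_task ))
echo "total CPUs: $total_cpus"
```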
== Site specific notes ==<br />
<br />
=== RRZE ===<br />
<br />
* <code>--output=</code> ''should not'' be used on RRZE's clusters; the submit filter already sets suitable defaults automatically<br />
* <code>--mem=<memlimit></code> '''must not''' be used on RRZE's clusters<br />
* the first line of the job script ''should be'' <code>#!/bin/bash -l</code>, otherwise <code>module</code> commands won't work in the job script<br />
* to have a clean environment in job scripts, it is recommended to add <code>#SBATCH --export=NONE</code> '''and''' <code>unset SLURM_EXPORT_ENV</code> to the job script. Otherwise, the job will inherit some settings from the submitting shell.<br />
* access to the parallel file system has to be requested by <code>#SBATCH --constraint=parfs</code> or the command line shortcut <code>-C parfs</code><br />
* access to hardware performance counters (e.g. to be able to use <code>likwid-perfctr</code>) has to be requested by <code>#SBATCH --constraint=hwperf</code> or the command line shortcut <code>-C hwperf</code>. Only request this feature if you really need access to the hardware performance counters, as it interferes with the automatic system monitoring.<br />
* multiple features have to be requested in a single <code>--constraint=</code> statement, listing all required features separated by an ampersand, e.g. <code>hwperf&parfs</code><br />
* for Intel MPI, RRZE recommends the usage of <code>mpirun</code> instead of <code>srun</code>; if <code>srun</code> is to be used, the additional command line argument <code>--mpi=pmi2</code> is required. The command line option <code>-ppn</code> of <code>mpirun</code> only works if you <code>export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off</code> beforehand.<br />
* for <code>squeue</code> the option <code>-u user</code> has no effect, as you always see only your own jobs anyway<br />
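Taken together, the RRZE notes above suggest a job script header roughly like the following sketch. The resource values and the application name are placeholders, and the actual launch line is commented out because it only works on the cluster itself:<br />

```bash
#!/bin/bash -l
#SBATCH --job-name=MYJOB
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
### request parallel file system AND hardware counters in a single statement
#SBATCH --constraint=hwperf&parfs
### start with a clean environment in the job script
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

### needed so that the -ppn option of Intel MPI's mpirun is honoured
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off

### actual launch (cluster only):
# mpirun -ppn 4 ./myapp.exe
```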
<br />
== References ==<br />
<br />
[https://www.lrz.de/services/compute/linux-cluster/batch_parallel/example_jobs/ Advanced SLURM jobscript examples]<br />
<br />
[http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/ Detailed guide to more advanced scripts]<br />
<br />
[https://slurm.schedmd.com/sbatch.html SBATCH documentation]<br />
<br />
[https://user.cscs.ch/getting_started/running_jobs/jobscript_generator/#slurm-jobscript-generator SLURM jobscript generator]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=HPC_Wiki&diff=1051HPC Wiki2018-11-06T09:24:26Z<p>Dieter-anmey-f9d9@rwth-aachen.de: /* Rubriken */</p>
<hr />
<div>Welcome to HPC Wiki of the ProPE Project.<br />
<br />
This website is currently work-in-progress and aims to provide site-independent HPC documentation.<br />
<br />
<br />
== Categories ==<br />
<br />
[[Getting_Started]] is a basic guide for first-time users. It covers a wide range of topics, from access and login to system-independent concepts of Unix systems to data transfers.<br />
<br />
[[FAQs]]<br />
<br />
== Todo ==<br />
Create pages with help of the [[Sample_Page|Sample Page]]<br />
<br />
Basics documentation:<br />
<br />
[[Load_Balancing]], [[make]], [[cmake]], [[Ssh_keys]], [[compiler]], [[software-tools]], [[Modules]], [[vi/vim]], [[screen/tmux]], [[ssh]] [[python/pip]], [[scp]], [[rsync]], [[git]], [[ps]], [[shell]], [[chmod]], [[umask]], [[tar]], [[sh-file]]<br />
<br />
<br />
HPC-Programs: [[Measurement-tools]], [[Likwid]], [[Vampir]], [[ScoreP]], [[Must]]<br />
<br />
HPC-Pages:<br />
[[PE-Process]], [[correctness checking]], [[Software]], [[Access]], [[Site-specific_documentation]], [[measurement-tools]], [[likwid]]</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Schedulers&diff=1050Schedulers2018-11-06T09:22:52Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>A [[Batch-Scheduler]] is a program running on a cluster that decides who can use which machines at what time as explained in the [[Getting_Started#Schedulers_or_.22How-To-Run-Applications-on-a-supercomputer.22|Schedulers Section]] of the [[Getting_Started|Getting Started]] and the [[Batch-Scheduler]] article.<br />
<br />
The following list details the schedulers of the different facilities:<br />
<br />
{| class="wikitable" style="width: 40%;"<br />
| IT Center - RWTH Aachen [https://doc.itc.rwth-aachen.de/display/CC/Hardware+of+the+RWTH+Compute+Cluster Hardware] || [[LSF]] migration to [[SLURM]] in progress <br />
|-<br />
| RRZE - Erlangen [https://www.anleitungen.rrze.fau.de/hpc/ HPC] || [[Torque]] or [[SLURM]] depending on the cluster<br />
|-<br />
| ZIH - Dresden [https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/HardwareTaurus Taurus_Hardware] || [[SLURM]]<br />
|}</div>Dieter-anmey-f9d9@rwth-aachen.dehttps://hpc-wiki.info/hpc/index.php?title=Software&diff=1041Software2018-10-04T14:54:52Z<p>Dieter-anmey-f9d9@rwth-aachen.de: </p>
<hr />
<div>The following list details the available software of the different facilities:<br />
<br />
{| class="wikitable" style="width: 40%;"<br />
| IT Center - RWTH Aachen || [https://doc.itc.rwth-aachen.de/display/CC/Installed+software Software RWTH]<br />
|-<br />
| RRZE - Erlangen || [https://www.rrze.fau.de/hard-software/software/ Software RRZE]<br />
|-<br />
| ZIH - Dresden || [https://tu-dresden.de/zih/dienste/service-katalog/arbeitsumgebung/dir_software Software ZIH]<br />
|}<br />
<br />
Furthermore, the Gauss Alliance offers a platform giving an overview of the software installed at German HPC centers: [https://gauss-allianz.de/en/application gauss-allianz.de/en/application]</div>