Multiple Program Runs in one Slurm Job

In certain circumstances it may be profitable to start multiple shared-memory / OpenMP programs at a time within one single batch job. Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code simultaneously using the Slurm batch system.

Problem

Nowadays, the number of cores per processor/socket keeps increasing. Furthermore, compute nodes of HPC clusters tend to have two or even more such chips. Unfortunately, the scalability of shared-memory (OpenMP) programs does not always keep pace with the increasing core count. In such cases, a program cannot benefit from such a high number of cores and thus resources may be wasted. Additional complexity is caused by the NUMA architecture of modern compute nodes, which has to be taken into account when trying to achieve good performance.

Shared or Exclusive Operation

One way of operating a cluster of multi-core nodes is to allow sharing of any given compute node between multiple jobs of multiple users. But because of shared hardware resources like caches and paths to memory, these coexisting jobs may influence each other heavily. Thus, the runtime of any such job is hard to predict and may vary considerably between runs.

Another way is to start multiple program runs with similar runtimes within one single batch job, using one node exclusively. These program runs will still impact each other, but the compute node is under the control of a single user, and when this approach is applied repeatedly, the total runtime of the individual runs becomes more predictable. For that to work efficiently, the user has to adjust the input data, the execution environment and the batch job script accordingly, as explained below.
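
As a minimal sketch of this pattern (assuming an exclusively allocated 48-core node, two hypothetical shared-memory programs a.exe and b.exe, and the NUMA domain numbering of the example machine described below), such a job script essentially looks as follows; the complete Gaussian examples are given in the next sections.

#!/usr/bin/zsh
### request one node exclusively and use all of its cores
#SBATCH --exclusive
#SBATCH --ntasks=1 --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH --threads-per-core=1

### start both runs in the background, each in its own directory and
### each bound to one half of the node (see the NUMA Aspects section)
( cd run1; export OMP_NUM_THREADS=24; \
  numactl --cpubind=0,1 --membind=0,1 -- ./a.exe > run.out 2> run.err ) &
pid1=$!
( cd run2; export OMP_NUM_THREADS=24; \
  numactl --cpubind=2,3 --membind=2,3 -- ./b.exe > run.out 2> run.err ) &
pid2=$!

### wait for the termination of both background runs
wait $pid1 $pid2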

Example 1

Two runs of the Gaussian chemistry code are started within one Slurm job in the following example. The target machine has two Intel Skylake processors with 24 cores each and provides 192 GB of main memory as well as roughly 400 GB of SSD for fast file IO. Each program run uses 24 threads such that both runs together occupy the whole machine.

The batch job script requests the full node exclusively.

Each program run is executed in a separate directory such that file IO does not interfere. Both programs are started asynchronously, and a wait command waits for the termination of both programs.

In order to make sure that both programs profit from the NUMA architecture of a modern computer in an optimal way, the command g09 which launches the Gaussian package is started under the control of the numactl command - see the NUMA Aspects section below for an explanation of this aspect.

#!/usr/bin/zsh

#SBATCH  --job-name=run2x24  
#SBATCH --output=%j.log
#SBATCH --error=%j.err
#SBATCH --time=00-01:00:00
#SBATCH --mem=180G

### exclusive usage of a single node
#SBATCH --exclusive
### use all cores of one node, one thread per core
#SBATCH --ntasks=1 --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH --threads-per-core=1

### prepare your environment for running gaussian
module load CHEMISTRY gaussian
### make sure this environment variable points to a suitable location
### here the gaussian module allocates the scratch directory
echo $GAUSS_SCRDIR

### adjust working directory and input file names and output directory names
export WDIR=....

export INP1=small1.inp24
export INP2=small2.inp24

export OUT1=run1
export OUT2=run2

### this is not necessary in the case of Gaussian program runs
### but it may be important in other cases
export OMP_NUM_THREADS=24

### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx
### Input files are assumed to be in $WDIR/$INPx
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2

### display NUMA characteristics
numactl -H
numactl --cpubind=0,1 --membind=0,1 -- numactl -show
numactl --cpubind=2,3 --membind=2,3 -- numactl -show

### launch 2 program runs
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \
  numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &
pid1=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \
  numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &
pid2=$!

### wait for the termination of both program runs
wait $pid1 $pid2


In the case of the Gaussian chemistry application, some parameters in the input file have to be adjusted. The number of threads has to be specified by %nprocshared and the amount of main memory for the working array by %mem. If the (fast) file system for scratch files has limitations, the maxdisk parameter also has to be set accordingly. The per-run values have to be chosen such that the sum over all simultaneous runs fits into the resources of the job (here, 2 x 70000 MB of memory and 2 x 200 GB of scratch space).

%nprocshared=24
%mem=70000MB
...
#p ... maxdisk=200GB

NUMA Aspects

As modern multi-core compute nodes typically have a NUMA architecture, it is profitable to carefully place the threads of a program close to their data. In the given example with two 24-core processors per compute node, each processor has direct access to half of the main memory, whereas access to the distant half of the memory takes more time; i.e., the compute node has 2 NUMA domains.

The command

numactl -H

provides information about the NUMA characteristics of the machine (when Sub-NUMA Clustering is deactivated):

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 195270 MB
node 0 free: 135391 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 196608 MB
node 1 free: 143410 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

The numbers of the 24 cores of each NUMA domain are listed, along with the size of the attached portion of main memory. It is a bit unfortunate that the term "node" is used here for a NUMA domain (versus: compute node = one computer in a compute cluster).

Also, the relative costs of cores within one NUMA domain accessing memory of another NUMA domain are given in a matrix of node distances. For example, it costs 10 (abstract timing units) if core 2 accesses memory of its own NUMA domain 0, while it costs 21 if the same core accesses memory of NUMA domain 1.

In fact, the machine used for the experiments here has a BIOS setting called Sub-NUMA Clustering turned on. This setting splits each 24-core processor chip into two halves with 12 cores each, each half controlling one quarter of the compute node's main memory, as if the compute node employed 4 chips with 12 cores each.

Still, those two halves of each chip are rather close to each other, as the command

numactl -H

reveals:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20
node 0 size: 47820 MB
node 0 free: 37007 MB
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23
node 1 size: 49152 MB
node 1 free: 41 MB
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44
node 2 size: 49152 MB
node 2 free: 47613 MB
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47
node 3 size: 49152 MB
node 3 free: 47554 MB
node distances:
node   0   1   2   3 
  0:  10  11  21  21 
  1:  11  10  21  21 
  2:  21  21  10  11 
  3:  21  21  11  10

Here, as the matrix of node distances shows, NUMA domains 0 and 1 are very close to each other, as are NUMA domains 2 and 3.

As a consequence, when launching two program runs with 24 threads each in the above example, the first run is bound to NUMA domains 0 and 1 and the second run to NUMA domains 2 and 3. Binding means assigning the threads to the corresponding cores and allocating the memory touched by these threads on the corresponding memory areas.

In order to start a program called program.exe under the control of numactl, the syntax

numactl --cpubind=... --membind=... -- program.exe

is used.

When substituting program.exe with numactl -show, it can be checked whether the placement of the threads works as desired:

numactl --cpubind=0,1 --membind=0,1 -- numactl -show
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 
numactl --cpubind=2,3 --membind=2,3 -- numactl -show
policy: bind
preferred node: 2
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 
cpubind: 2 3 
nodebind: 2 3 
membind: 2 3


Example 2

Using the same setup as in Example 1, it is actually profitable to launch 4 program runs at a time in a single batch job. Here, each program run is bound to a single NUMA domain:

#!/usr/bin/zsh
#SBATCH  --job-name=run4x12  
#SBATCH --output=%j.log
#SBATCH --error=%j.err
#SBATCH --time=00-01:00:00
#SBATCH --mem=180G

### exclusive usage of a single node
#SBATCH --exclusive
### use all cores of one node, one thread per core
#SBATCH --ntasks=1 --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH --threads-per-core=1

### prepare your environment for running gaussian
module load CHEMISTRY gaussian
### make sure this environment variable points to a suitable location
### here the gaussian module allocates the scratch directory
echo $GAUSS_SCRDIR

### adjust working directory and input file names and output directory names
export WDIR=....

export INP1=small1.inp12
export INP2=small2.inp12
export INP3=small3.inp12
export INP4=small4.inp12

export OUT1=run1
export OUT2=run2
export OUT3=run3
export OUT4=run4

### this is not necessary in the case of Gaussian program runs
### but it may be important in other cases
export OMP_NUM_THREADS=12

### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx
### Input files are assumed to be in $WDIR/$INPx

mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4

### display NUMA characteristics
numactl -H
numactl --cpubind=0 --membind=0 -- numactl -show
numactl --cpubind=1 --membind=1 -- numactl -show
numactl --cpubind=2 --membind=2 -- numactl -show
numactl --cpubind=3 --membind=3 -- numactl -show

### launch 4 program runs
( cd $WDIR/$SLURM_JOB_ID/$OUT1; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \
  numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out  ) &
pid1=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \
  numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out  ) &
pid2=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \
  numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out  ) &
pid3=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \
  numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out  ) &
pid4=$!

### wait for the termination of all 4 program runs
wait $pid1 $pid2 $pid3 $pid4

Of course, the input parameters for the Gaussian program have to be adjusted for 4 program runs at a time:

%nprocshared=12
%mem=35000MB
...
#p ... maxdisk=100GB


Timing Experiments

For the timing measurements a single small input data set was used. As a consequence, all program runs had about the same execution time - which is of course optimal for the given scenario.

Running a single program exclusively, the program took approximately 250 seconds with 12 threads, 220 seconds with 24 threads, and 180 seconds with 48 threads.

When launching 2 program runs at a time with 24 threads each, both took about 285 seconds, and when launching 4 program runs at a time with 12 threads each, all 4 took about 515 seconds.

Thus, for optimal throughput it is most profitable to launch 4 programs at a time in this comparison, as 4 program runs would take 570 seconds when running in pairs with 24 threads and 720 seconds when running 4 times in single mode with 48 threads.
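
In other words, the total node wall-clock time needed to complete four comparable runs, using the timings measured above, compares as follows:

four single runs with 48 threads each, one after another:  4 x 180 s = 720 s
two pairs of runs with 24 threads each:                     2 x 285 s = 570 s
four simultaneous runs with 12 threads each:                1 x 515 s = 515 s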


Links and more Information

numactl(8) - Linux man page: https://linux.die.net/man/8/numactl

Gaussian Technical Information: https://gaussian.com/techsupport/