Multiple Program Runs in one Slurm Job

In certain circumstances it can be profitable to start multiple OpenMP programs at the same time within a single batch job. Here you can find explanations and an example launching multiple runs of the Gaussian chemistry code concurrently.

Problem Description

These days, the number of cores per processor chip keeps increasing. Furthermore, in many cases there are two (or sometimes more) such chips in each compute node of an HPC cluster. But the scalability of shared-memory (OpenMP) programs does not always keep pace. In such a case a program cannot profit from such a high number of cores, and thus resources may be wasted.

Shared or Exclusive Operation

One way of operating a cluster of multi-core nodes is to allow multiple jobs to share a node. But because these jobs share hardware resources (like caches and paths to memory), they may influence each other heavily. Thus the runtime of each job is hard to predict and may vary considerably from run to run.

Another possibility is to start multiple program runs with similar runtimes within one single batch job which uses a node exclusively. These program runs will still have an impact on each other, but this is more under the control of a single user, and when applied repeatedly the total runtime will be more predictable. In such a case the input data, the execution environment and the batch job script have to be adjusted properly.
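
The basic pattern of such a job looks roughly as follows. This is only a minimal sketch: the program name ./my_app, its input files and the chosen resource values are placeholders, and the shebang is the one used on the cluster in the examples below. The complete Gaussian example in the next section follows the same structure.

<syntaxhighlight lang="bash">
#!/usr/local_rwth/bin/zsh
### minimal sketch only: two concurrent runs of a hypothetical OpenMP program ./my_app
### within one exclusively used 48-core node
#SBATCH --exclusive
#SBATCH --ntasks=1 --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH --threads-per-core=1
#SBATCH --time=00-01:00:00

### let each run use half of the cores
export OMP_NUM_THREADS=24

### start both runs in the background and remember their process IDs ...
./my_app input1.dat > run1.out 2> run1.err &
pid1=$!
./my_app input2.dat > run2.out 2> run2.err &
pid2=$!

### ... then wait until both runs have finished
wait $pid1 $pid2
</syntaxhighlight>

Careful placement of the individual runs on the node (NUMA binding) is discussed further below.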


Example

Problem: a program (here Gaussian) does not scale well across all cores of a multi-core node. If such compute nodes are used non-exclusively, jobs of several users run at the same time and influence each other's runtimes, which makes it hard to reliably estimate the runtime needed for the compute time limit. One countermeasure is to start multiple program runs within a single batch job that uses a node exclusively. The following batch job script starts two Gaussian program runs at a time on one exclusively used 48-core node, binding each run to half of the node (two NUMA nodes, 24 cores).



<syntaxhighlight lang="bash">
#!/usr/local_rwth/bin/zsh

#SBATCH  --job-name=run2x24  
#SBATCH --output=%j.log
#SBATCH --error=%j.err
#SBATCH --time=00-01:00:00
#SBATCH --mem=180G

### exclusive usage of a single node
#SBATCH --exclusive
### use all cores of one node, one thread per core
#SBATCH --ntasks=1 --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH --threads-per-core=1

### prepare your environment for running gaussian
module load CHEMISTRY gaussian
### make sure this environment variable points to a suitable location
### here the gaussian module allocates the scratch directory
echo $GAUSS_SCRDIR

### adjust working directory and input file names and output directory names
export WDIR=....

export INP1=small1.inp24
export INP2=small2.inp24

export OUT1=run1
export OUT2=run2

### this is not necessary in the case of Gaussian program runs
### but it may be important in other cases
export OMP_NUM_THREADS=24

### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx
### Input files are assumed to be in $WDIR/$INPx
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2

### display NUMA characteristics
numactl -H
numactl --cpubind=0,1 --membind=0,1 -- numactl -show
numactl --cpubind=2,3 --membind=2,3 -- numactl -show
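
### start the two Gaussian runs in the background, each bound to two NUMA nodes (24 cores),
### remember their process IDs and wait for both to finish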

( cd $WDIR/$SLURM_JOB_ID/$OUT1; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \
  numactl --cpubind=0,1 --membind=0,1 -- timex g09 < ../../$INP1 > g09.out 2> g09.err ) &
pid1=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \
  numactl --cpubind=2,3 --membind=2,3 -- timex g09 < ../../$INP2 > g09.out 2> g09.err ) &
pid2=$!

wait $pid1 $pid2
</syntaxhighlight>


In the case of the Gaussian chemistry application some parameters in the input file have to be adjusted. The number of threads has to be specified by %nprocshared and the amount of main memory for the working array by %mem. If the (fast) file system for scratch files has size limitations, the maxdisk parameter also has to be set accordingly. Note that the per-run settings have to fit into the resources requested for the batch job: here two runs with %mem=70000MB each stay below the requested --mem=180G.

<syntaxhighlight>
%nprocshared=24
%mem=70000MB
...
#p ... maxdisk=100GB
</syntaxhighlight>


NUMA Aspects

With the usual NUMA architecture of such multi-core nodes it is important to place the individual program runs carefully, e.g. one program run per NUMA node. The command numactl -H displays the NUMA characteristics of a node:

<syntaxhighlight lang="bash">
numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 6 7 8 12 13 14 18 19 20
node 0 size: 47820 MB
node 0 free: 37007 MB
node 1 cpus: 3 4 5 9 10 11 15 16 17 21 22 23
node 1 size: 49152 MB
node 1 free: 41 MB
node 2 cpus: 24 25 26 30 31 32 36 37 38 42 43 44
node 2 size: 49152 MB
node 2 free: 47613 MB
node 3 cpus: 27 28 29 33 34 35 39 40 41 45 46 47
node 3 size: 49152 MB
node 3 free: 47554 MB
node distances:
node   0   1   2   3 
  0:  10  11  21  21 
  1:  11  10  21  21 
  2:  21  21  10  11 
  3:  21  21  11  10

numactl --cpubind=0,1 --membind=0,1 -- numactl -show
policy: bind
preferred node: 0
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
cpubind: 0 1 
nodebind: 0 1 
membind: 0 1 

numactl --cpubind=2,3 --membind=2,3 -- numactl -show
policy: bind
preferred node: 2
physcpubind: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 
cpubind: 2 3 
nodebind: 2 3 
membind: 2 3
</syntaxhighlight>

The following batch job script starts four Gaussian program runs at a time, binding each run to one of the four NUMA nodes (12 cores per run).

<syntaxhighlight lang="bash">
#!/usr/local_rwth/bin/zsh
#SBATCH  --job-name=run4x12  
#SBATCH --output=%j.log
#SBATCH --error=%j.err
#SBATCH --time=00-01:00:00
#SBATCH --mem=180G

### exclusive usage of a single node
#SBATCH --exclusive
### use all cores of one node, one thread per core
#SBATCH --ntasks=1 --nodes=1
#SBATCH --cpus-per-task=48
#SBATCH --threads-per-core=1

### prepare your environment for running gaussian
module load CHEMISTRY gaussian
### make sure this environment variable points to a suitable location
### here the gaussian module allocates the scratch directory
echo $GAUSS_SCRDIR

### adjust working directory and input file names and output directory names
export WDIR=....

export INP1=small1.inp12
export INP2=small2.inp12
export INP3=small3.inp12
export INP4=small4.inp12

export OUT1=run1
export OUT2=run2
export OUT3=run3
export OUT4=run4

### this is not necessary in the case of Gaussian program runs
### but it may be important in other cases
export OMP_NUM_THREADS=12

### the program will run in $WDIR/$SLURM_JOB_ID/$OUTx
### Scratch files will be put in $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUTx
### Input files are assumed to be in $WDIR/$INPx

mkdir -p $WDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT2
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT3
mkdir -p $WDIR/$SLURM_JOB_ID/$OUT4
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3
mkdir -p $GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4

### display NUMA characteristics
numactl -H
numactl --cpubind=0 --membind=0 -- numactl -show
numactl --cpubind=1 --membind=1 -- numactl -show
numactl --cpubind=2 --membind=2 -- numactl -show
numactl --cpubind=3 --membind=3 -- numactl -show
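
### start the four Gaussian runs in the background, each bound to one NUMA node (12 cores),
### remember their process IDs and wait for all of them to finish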

( cd $WDIR/$SLURM_JOB_ID/$OUT1; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT1; \
  numactl --cpubind=0 --membind=0 -- timex g09 < ../../$INP1 > g09.out  ) &
pid1=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT2; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT2; \
  numactl --cpubind=1 --membind=1 -- timex g09 < ../../$INP2 > g09.out  ) &
pid2=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT3; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT3; \
  numactl --cpubind=2 --membind=2 -- timex g09 < ../../$INP3 > g09.out  ) &
pid3=$!
( cd $WDIR/$SLURM_JOB_ID/$OUT4; \
  export GAUSS_SCRDIR=$GAUSS_SCRDIR/$SLURM_JOB_ID/$OUT4; \
  numactl --cpubind=3 --membind=3 -- timex g09 < ../../$INP4 > g09.out  ) &
pid4=$!

wait $pid1 $pid2 $pid3 $pid4
</syntaxhighlight>
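
Instead of spelling out the four runs one by one, the same pattern can also be written as a loop over the NUMA nodes. The following is only a sketch under the assumptions of the script above (working and scratch directories run1 ... run4 already created), with a hypothetical program ./my_app standing in for the Gaussian invocation:

<syntaxhighlight lang="bash">
### minimal sketch: one program run per NUMA node, expressed as a loop
pids=()
for i in 0 1 2 3; do
    ( cd $WDIR/$SLURM_JOB_ID/run$((i+1)); \
      numactl --cpubind=$i --membind=$i -- timex ./my_app > app.out 2> app.err ) &
    pids+=($!)
done

### wait for all four runs to finish
wait ${pids[@]}
</syntaxhighlight>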


Links and more Information

t.b.a.