Admin Guide: BeeOND
This article describes how local SSDs can be utilized for HPC jobs.
Purpose
Is your parallel filesystem too busy, while the SSDs in your compute nodes sit empty? Here is an idea for this scenario. Parallel filesystems can be rather slow, despite their capacity, when heavily utilized. To take some I/O load off the central parallel filesystem, BeeGFS on Demand (BeeOND) can be used to create a new parallel filesystem from the local SSDs for the runtime of a job. As a prerequisite, you need an empty partition on each of the nodes. From a user's perspective, a new parallel filesystem is created each time a job starts on the allocated nodes and destroyed after the job finishes.
Installation
- Add the official repository to your nodes
- Install the “beeond” package
- Mount an xfs-formatted (or similar) partition at, for example, /mnt/sda1 on every node
- Create an empty directory /mnt/beeond on every node (a combined sketch of these steps follows below)
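A minimal sketch of these steps for an RPM-based node. The repository URL, BeeGFS version, and device name /dev/sda1 are assumptions; adjust them to your distribution and hardware:

# Hypothetical installation sketch; repository URL, BeeGFS version,
# and /dev/sda1 are assumptions -- adapt to your site
wget -O /etc/yum.repos.d/beegfs.repo https://www.beegfs.io/release/beegfs_7.4/dists/beegfs-rhel8.repo
yum install -y beeond
# Format and mount the empty SSD partition that BeeOND will use as storage
mkfs.xfs /dev/sda1
mkdir -p /mnt/sda1 /mnt/beeond
mount /dev/sda1 /mnt/sda1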
Additional scripts
To automatically create and destroy the BeeOND instance, the prologue and epilogue scripts of the batch system have to be modified. In SLURM, the prologue could look like:
# Create BeeOND only when the job occupies whole nodes of its partition
if ( [[ $(( num_cpu % 32 )) == 0 ]] && [[ $SLURM_JOB_PARTITION = "gpuk20" ]] ) || \
   ( [[ $(( num_cpu % 24 )) == 0 ]] && ( [[ $SLURM_JOB_PARTITION = "gputitanxp" ]] || [[ $SLURM_JOB_PARTITION = "gpuv100" ]] ) ) || \
   ( [[ $(( num_cpu % 72 )) == 0 ]] && ( [[ $SLURM_JOB_PARTITION = "express" ]] || [[ $SLURM_JOB_PARTITION = "normal" ]] || [[ $SLURM_JOB_PARTITION = "requeue" ]] ) ) || \
   ( [[ $(( num_cpu % 64 )) == 0 ]] && [[ $SLURM_JOB_PARTITION = "broadwell" ]] ) ; then
    # Only the head node of the job starts the BeeOND instance
    if [ $myhostname == $head_node ] ; then
        logdir="/var/log"
        logfile=$logdir/slurm_beeond.log
        # Write one hostname per line; beeond start expects a nodefile
        nodefile=/tmp/slurm_nodelist.$SLURM_JOB_ID
        echo $job_hosts | tr " " "\n" >> $nodefile 2>&1
        # -d: storage directory on each node, -c: client mount point
        timeout 60 /usr/bin/beeond start -n $nodefile -d /mnt/sda1 -c /mnt/beeond -P -F -L /tmp >> $logfile 2>&1
    fi
fi
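The variables num_cpu, myhostname, head_node, and job_hosts are not set by SLURM itself. One possible way to derive them at the top of the prologue (a sketch; assumes scontrol is available and SLURM exports SLURM_JOB_NODELIST to the prolog environment):

# Hypothetical helper definitions; the exact derivation depends on
# your SLURM configuration and is an assumption
myhostname=$(hostname -s)
job_hosts=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | tr "\n" " ")
head_node=$(echo $job_hosts | awk '{print $1}')
num_cpu=$(scontrol show job "$SLURM_JOB_ID" | grep -o "NumCPUs=[0-9]*" | cut -d= -f2)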
(Only create the BeeOND instance if a user requests all CPU cores of the allocated nodes, i.e. num_cpu is a multiple of the partition's per-node core count. A job script that satisfies this condition is shown below.)
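For example, a job on the normal partition (72 cores per node in the condition above) would trigger the prologue like this; the application name is a placeholder:

#!/bin/bash
#SBATCH --partition=normal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=72
# All cores of both nodes are requested, so the prologue creates BeeOND
# and /mnt/beeond is available as a job-private parallel filesystem
srun ./my_application --input /mnt/beeond/dataset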
The epilogue script:
# $SLURM_JOB_PARTITION is not known in the epilogue, so check for the existence of a nodefile
nodefile=/tmp/slurm_nodelist.$SLURM_JOB_ID
if [ -e $nodefile ] ; then
    logdir="/var/log"
    logfile=$logdir/slurm_beeond.log
    echo "$DATE Stopping beeond" >> $logfile 2>&1
    /usr/bin/beeond stop -n $nodefile -L -d -P -c >> $logfile 2>&1
    rm $nodefile
fi
Additional tools (tell this to your users)
Synchronize a folder
You can use beeond-cp stagein to stage your dataset onto BeeOND.
beeond-cp stagein -n /tmp/slurm_nodelist.$SLURM_JOB_ID -g /scratch/tmp/<username>/dataset -l /mnt/beeond
where /scratch/tmp/<username>/dataset is the path to your dataset. Only changed files in your dataset are copied. Be aware that the synchronization is complete: files deleted on BeeOND will also be deleted in your global dataset.
Use beeond-cp stageout to stage your dataset out of BeeOND.
beeond-cp stageout -n /tmp/slurm_nodelist.$SLURM_JOB_ID -g /scratch/tmp/<username>/dataset -l /mnt/beeond
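Put together, the stage-in/stage-out workflow inside a job script could look like this (the application name is a placeholder):

# Stage the dataset onto the job-private BeeOND filesystem
beeond-cp stagein -n /tmp/slurm_nodelist.$SLURM_JOB_ID -g /scratch/tmp/<username>/dataset -l /mnt/beeond
# Run the job against the fast local copy
srun ./my_application /mnt/beeond/dataset
# Copy the results back before the job (and BeeOND) ends
beeond-cp stageout -n /tmp/slurm_nodelist.$SLURM_JOB_ID -g /scratch/tmp/<username>/dataset -l /mnt/beeond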
Parallel copy
You can use beeond-cp copy to recursively copy directories onto BeeOND in parallel:
beeond-cp copy -n /tmp/slurm_nodelist.$SLURM_JOB_ID dir_1 dir_2 /mnt/beeond
where dir_1 and dir_2 are the directories you want to copy.
Open questions
- How can one ensure that the BeeOND instance is reliably stopped after the job finishes? We see hanging nodes with a non-terminated BeeOND (a defensive stop sketch follows below).
- With "" you can configure the striping. Is there a recommondation for a reasonable default stripe? How can a user configure that? Other sites like the KIT generate different subdirectories with different stripes.