This article describes how local SSDs can be utilized for HPC jobs.

Software

BeeGFS Wiki

Purpose

Is your parallel filesystem too busy while SSD storage on your compute nodes sits empty? Here is an idea for this scenario. Parallel filesystems can become rather slow under heavy load, regardless of their capacity. To take some I/O load off the central parallel filesystem, BeeGFS on Demand (BeeOND) can be used to create a new parallel filesystem from local SSDs for the runtime of a job. As a prerequisite, you need an empty partition on each of the nodes. From a user’s perspective, a new parallel filesystem is created on the allocated nodes each time a job starts and is destroyed after the job finishes.

Installation

Add the official repository to your nodes
Install the “beeond” package
Mount an XFS (or similarly) formatted partition at, for example, /mnt/sda1 on every node
Create an empty folder /mnt/beeond on every node
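
Before wiring BeeOND into the batch system, it can help to verify the two directories on each node. The following is a minimal sketch and not part of the BeeOND tooling: the function name check_beeond_dirs is hypothetical, and the paths in the comments are the examples used above. It only checks that the directories exist and that the mount point is empty; it does not verify that the partition is actually mounted.

```shell
#!/bin/bash
# Hypothetical preflight check (not part of BeeOND): verify that the storage
# directory exists and that the client mount point exists and is empty.
check_beeond_dirs() {
    local data_dir="$1"    # e.g. /mnt/sda1, the mounted local SSD partition
    local mount_dir="$2"   # e.g. /mnt/beeond, the empty folder for the client mount

    if [ ! -d "$data_dir" ]; then
        echo "missing storage directory: $data_dir" >&2
        return 1
    fi
    if [ ! -d "$mount_dir" ]; then
        echo "missing mount point: $mount_dir" >&2
        return 1
    fi
    # An empty directory yields no entries from ls -A.
    if [ -n "$(ls -A "$mount_dir")" ]; then
        echo "mount point is not empty: $mount_dir" >&2
        return 1
    fi
    return 0
}
```

On a node, this could be called as: check_beeond_dirs /mnt/sda1 /mnt/beeond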

Additional scripts

To automatically create and destroy the device, the prologue and epilogue scripts of the batch system have to be modified. In SLURM, the prologue command could look like:

# num_cpu, myhostname, head_node and job_hosts are expected to be set earlier in the prologue.
# Cores per node for each partition; 0 means BeeOND is not used on that partition.
case "$SLURM_JOB_PARTITION" in
    gpuk20)                 cores_per_node=32 ;;
    gputitanxp|gpuv100)     cores_per_node=24 ;;
    express|normal|requeue) cores_per_node=72 ;;
    broadwell)              cores_per_node=64 ;;
    *)                      cores_per_node=0 ;;
esac
if [[ $cores_per_node -gt 0 ]] && (( num_cpu % cores_per_node == 0 )) ; then
    if [ "$myhostname" = "$head_node" ] ; then
      logdir="/var/log"
      logfile=$logdir/slurm_beeond.log
      nodefile=/tmp/slurm_nodelist.$SLURM_JOB_ID
      echo $job_hosts | tr " " "\n" >> "$nodefile" 2>&1
      timeout 60 /usr/bin/beeond start -n "$nodefile" -d /mnt/sda1 -c /mnt/beeond -P -F -L /tmp >> "$logfile" 2>&1
    fi
fi

(The condition ensures that the device is only created if a job requests all CPU cores of its nodes.)
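
The prologue above relies on variables such as job_hosts and head_node that are set elsewhere. As a minimal sketch of the node file step only, assuming job_hosts holds a space-separated list of hostnames (how it is obtained is site-specific and not shown here), the helpers below write one hostname per line and pick the first host as head node. The names write_nodefile and first_host are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the node file passed to "beeond start -n"
# (one hostname per line).

# Write one hostname per line; overwrite so that a rerun does not duplicate entries.
write_nodefile() {
    local hosts="$1" nodefile="$2"
    # $hosts is intentionally unquoted so the shell splits it into words.
    printf '%s\n' $hosts > "$nodefile"
}

# Print the first hostname of a space-separated list.
first_host() {
    # Word splitting on $1 turns the list into positional parameters.
    set -- $1
    echo "$1"
}
```

Unlike the append (>>) used in the prologue, this sketch overwrites the file, which may be preferable if the prologue can run more than once for the same job ID.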

The epilogue script:

# $SLURM_JOB_PARTITION is not known in the epilogue, so check for the existence of a nodefile
nodefile=/tmp/slurm_nodelist.$SLURM_JOB_ID
if [ -e "$nodefile" ] ; then

  logdir="/var/log"
  logfile=$logdir/slurm_beeond.log

  # Use the current date as timestamp for the log entry
  echo "$(date) Stopping beeond"  >> "$logfile" 2>&1
  /usr/bin/beeond stop -n "$nodefile" -L -d -P -c >> "$logfile" 2>&1

  rm -f "$nodefile"
fi

Additional tools (tell this to your users)

Synchronize a folder

You can use beeond-cp stagein to stage your dataset onto BeeOND.

beeond-cp stagein -n /tmp/slurm_nodelist.$SLURM_JOB_ID -g /scratch/tmp/<username>/dataset -l /mnt/beeond

where /scratch/tmp/<username>/dataset is the path to your dataset. Only changes in your dataset will be copied. The synchronization is complete: files deleted on BeeOND will also be deleted in your global dataset.

Use beeond-cp stageout to stage your dataset out of BeeOND.

beeond-cp stageout -n /tmp/slurm_nodelist.$SLURM_JOB_ID -g /scratch/tmp/<username>/dataset -l /mnt/beeond
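
The two commands above are typically used around the actual computation in a job script. The following is a sketch of that stage-in, compute, stage-out pattern, assuming the prologue wrote /tmp/slurm_nodelist.$SLURM_JOB_ID and BeeOND is mounted at /mnt/beeond. The helper names run, stage_in and stage_out, the DRY_RUN switch and the application name my_app are hypothetical; only the beeond-cp invocations are taken from the text above.

```shell
#!/bin/bash
# Sketch of the stage-in / compute / stage-out pattern for a job script.
BEEOND_DIR="/mnt/beeond"

# Print the command instead of executing it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the dataset at the global path $1 onto BeeOND.
stage_in() {
    run beeond-cp stagein -n "/tmp/slurm_nodelist.${SLURM_JOB_ID}" -g "$1" -l "$BEEOND_DIR"
}

# Synchronize the BeeOND contents back to the global path $1.
stage_out() {
    run beeond-cp stageout -n "/tmp/slurm_nodelist.${SLURM_JOB_ID}" -g "$1" -l "$BEEOND_DIR"
}

# Typical use in a job script (commented out so the sketch can be sourced safely):
#   stage_in  /scratch/tmp/<username>/dataset
#   srun ./my_app /mnt/beeond        # my_app is a placeholder
#   stage_out /scratch/tmp/<username>/dataset
```

Setting DRY_RUN=1 prints the commands without executing them, which makes it possible to check the paths before a real run.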

Parallel copy

You can use beeond-cp copy to copy directories recursively onto BeeOND in parallel.

beeond-cp copy -n /tmp/slurm_nodelist.$SLURM_JOB_ID dir_1 dir_2 /mnt/beeond

where dir_1 and dir_2 are the directories you want to copy.
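
Because the device only exists for jobs that requested all CPU cores, a job script may want to check for it before copying. The sketch below assumes, as in the epilogue script, that the presence of the node file signals a running BeeOND. The function names beeond_available and pick_workdir are hypothetical and not part of BeeOND.

```shell
#!/bin/bash
# Hypothetical helpers for user job scripts (not part of BeeOND).

# Succeeds if the node file written by the prologue exists.
# The optional argument overrides the default path, e.g. for testing.
beeond_available() {
    local nodefile="${1:-/tmp/slurm_nodelist.${SLURM_JOB_ID}}"
    [ -e "$nodefile" ]
}

# Prints /mnt/beeond if BeeOND is available, otherwise the given fallback directory.
pick_workdir() {
    local fallback="$1" nodefile="$2"
    if beeond_available "$nodefile"; then
        echo "/mnt/beeond"
    else
        echo "$fallback"
    fi
}
```

A job script could then set workdir=$(pick_workdir /scratch/tmp/<username>/work) and run beeond-cp copy only when beeond_available succeeds.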

Open questions

How can one ensure that the device is reliably stopped after the job finishes? We see hanging nodes with a non-terminated BeeOND.
With "" you can configure the striping. Is there a recommendation for a reasonable default stripe? How can a user configure that? Other sites like KIT generate different subdirectories with different stripes.
