File System Separation (Admin Guide)
At most HPC sites, two separate filesystems, /home and /work, are available on both the gateway servers and all compute nodes. In addition, every compute node provides a separate /scratch filesystem for local, temporary data storage.
On both /home and /work, disk quotas per user are enabled. The available quota and the current quota usage are shown automatically at login. At this point, no file quota exists beyond the limits of the filesystem itself.
/home could have, for example, a quota of 32 GiB for user data. Its content is usually backed up on tape so that, in case of a filesystem problem, the /home filesystem and its data can be restored. The tape backup is usually done daily, and deleted files are kept for a limited period, for example a maximum of 60 days.
/home is usually provided by redundant infrastructure, e.g., two redundant NFS servers, and is hence a network filesystem but not a parallel filesystem. To avoid excessive load on the /home filesystem, it can be mounted read-only on all compute nodes.
/work has different characteristics: it usually has a much larger quota for user data, but its files are not backed up externally. It is provided by several redundant file servers and uses a parallel filesystem. /work can be read from and written to on both gateway servers and all compute nodes.
To mimic the filesystem layout of some other faculties, a link /home/$USER/nobackup -> /work/$USER is added to each home directory.
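The link described above could be created with a small helper like the following. This is only a sketch: the function name link_nobackup and its optional base-directory arguments are hypothetical, and it assumes the user's directories under /home and /work already exist.

```shell
#!/usr/bin/env bash
# Sketch: create the nobackup link for one user.
# link_nobackup, home_base and work_base are hypothetical names;
# the guide itself only specifies /home/$USER/nobackup -> /work/$USER.
link_nobackup() {
    local user="$1"
    local home_base="${2:-/home}"
    local work_base="${3:-/work}"
    local target="$work_base/$user"
    local link="$home_base/$user/nobackup"

    # Refuse to create a link to a work directory that does not exist.
    [ -d "$target" ] || { echo "missing $target" >&2; return 1; }

    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing link that points to a directory.
    ln -sfn "$target" "$link"
}
```

Calling `link_nobackup "$USER"` with the default base directories yields the layout from the text; passing other base directories makes it possible to try the helper in a scratch location first.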
This section describes an alternative setup that uses only one filesystem underneath. The sizes noted are examples for a small cluster: the system consists of 4 JBODs and can store a total of 2 petabytes of HDD data. The cluster’s filesystem is managed by 4 storage servers with 10 terabytes of SSD metadata storage each. Two metadata servers are grouped into one buddy mirror. Together, both buddy mirrors contain the entire namespace. Conversely, all buddies in a buddy mirror contain the same namespace, which makes the system more fail-safe. Every user has a storage limit, which can be queried via the BeeGFS command-line interface:
beegfs-ctl --getquota --uid `id -u $USER`
There are four different partitions set up for different purposes:
|Partition||Backup||Intended use|
|/work/home/||daily||scripts, executables, etc.|
|/work/TEMP||no||large (temporary) files|
|/work/DATA||daily||many small files (e.g. output data)|
There is no speed difference between the partitions.
ZFS is used and is well known for its data protection, owing not only to its copy-on-write semantics but also to its snapshot system. This snapshot system is used in the backup process. As shown in the table above, backups are done daily for the partitions marked "daily".
This backup is organized in the following way:
- HOME, DATA, CONF, and TEMP are mirrored to the backup server using rsync.
  - rsync's --delete flag is used, so files deleted on the cluster are removed from the mirror after one day.
- A ZFS snapshot of the DATA and HOME mirrors is made on the backup server every day.
- The DATA and HOME data is copied to the TSM (Tivoli Storage Manager) tape robot.
This way of making a backup is quite fast, since the snapshot currently completes in about 20 seconds and uses far less space. The HOME directory backup on TSM is stored for 6 months.
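The three backup steps can be sketched as a single script that runs on the backup server. Everything site-specific here is an assumption made for illustration: the host name storage01, the mirror root /backup, the pool name backuppool, the source path of the CONF partition, and the snapshot naming scheme are hypothetical, and the guide does not state which TSM client command is used (dsmc incremental is one common choice). By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the daily backup cycle, run on the backup server.
# Host, path, pool, and snapshot names are hypothetical placeholders.
SRC_HOST="${SRC_HOST:-storage01}"
MIRROR_ROOT="${MIRROR_ROOT:-/backup}"
ZFS_POOL="${ZFS_POOL:-backuppool}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

daily_backup() {
    local stamp part
    stamp="$(date +%Y-%m-%d)"

    # Step 1: mirror all four partitions. --delete removes files from the
    # mirror that no longer exist at the source.
    for part in home DATA CONF TEMP; do
        run rsync -a --delete "$SRC_HOST:/work/$part/" "$MIRROR_ROOT/$part/"
    done

    # Step 2: snapshot only the mirrors of HOME and DATA.
    for part in home DATA; do
        run zfs snapshot "$ZFS_POOL/$part@daily-$stamp"
    done

    # Step 3: hand HOME and DATA to the TSM client for storage on tape.
    for part in home DATA; do
        run dsmc incremental "$MIRROR_ROOT/$part/"
    done
}
```

Running `DRY_RUN=1 daily_backup` prints the eight commands in order, which makes it easy to review the plan before setting `DRY_RUN=0` in a cron job.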
File transfer and access speed is the same for all partitions. Users with very large amounts of data that do not necessarily need to be backed up are asked to store it in /work/TEMP, as that reduces the time the backup takes to complete. Furthermore, users are asked to keep the number of files minimal (e.g. by combining many small text files into a single large one or packing different files into one tarball for storage), as that increases not only the speed of scripts working on this data (HDDs have very limited IOPS) but also the general performance of the filesystem.
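As an illustration of the tarball advice, the following sketch packs a directory of many small files into one compressed archive and reports how many files it contains. The function name bundle_dir and the example paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: pack a directory of many small files into one compressed tarball.
# bundle_dir and its arguments are hypothetical names for illustration.
bundle_dir() {
    local src_dir="$1"
    local archive="$2"

    [ -d "$src_dir" ] || { echo "not a directory: $src_dir" >&2; return 1; }

    # -C switches to the parent directory so the archive stores relative paths.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1

    # Print the number of archived entries that are not directories.
    tar -tzf "$archive" | grep -vc '/$' || true
}
```

For example, `bundle_dir /work/DATA/$USER/run42 /work/DATA/$USER/run42.tar.gz` would replace thousands of small output files with a single file on the filesystem, once the archive has been checked and the originals removed.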