Fair-Share Scheduling in Heterogeneous Clusters (Admin Guide)
Introduction
This article explains the Slurm configuration of a heterogeneous GPU cluster that uses a Slurm fair-share tree. It is divided into a user-oriented part and an administrator-oriented part. The user-oriented part describes how the fair-share tree works and how the priority of a job is determined; the administrator-oriented part describes the relevant Slurm parameters in more detail.
Accounting, Fair Usage and Job Priorities
Slurm saves accounting data for every job and job step that a user submits. This data is then used to calculate job priorities and to give each user a fair share of the cluster’s resources. Usually, the Slurm Database Daemon (slurmdbd) collects this data and stores it in a MySQL database. The scheduling system uses the Fair Tree Fairshare Algorithm to determine the fair-share component of a job’s priority.
Job Priorities
The priority of a user’s job in the queue is determined by the Multifactor Priority Plugin. A job’s priority is an integer between 0 and 4,294,967,295. The larger the number, the higher the job is positioned in the queue and the sooner it will be scheduled. Users can display the priorities of their jobs and the individual contributions of the various factors via
sprio -u $USER
To improve the output of sprio, the following can be put in ~/.bash_profile:
export SPRIO_FORMAT="%.10i %10u %17r %.8Y %.6A %.9F %.9P %.T"
The following factors contribute to a job’s priority:
Factor | Weight | Description |
---|---|---|
Age | 10 000 | The length of time a job has been waiting in the queue, eligible to be scheduled. The contribution reaches its maximum of 10 000 after 5 days. |
Fair-share | 100 000 | The difference between the portion of the computing resources that is proportionally promised to each user and the amount of resources that the user has actually consumed. See below for details. |
#GPUs | 20 000 | The relative number of GPUs, e.g. 1/224 × 20 000 ≈ 89 per GPU. |
Partition | 10 000 | Different partitions give different factors (e.g. compute > develop). |
The weights are unsigned 32-bit integers and the factors are floating-point numbers from 0.0 to 1.0. The priority is the weighted sum of these factors. The weights are still experimental and are being tuned to improve how users experience job prioritization.
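The weighted sum can be sketched in a few lines of Python. This is a minimal illustration, not Slurm's implementation: the weights and the cluster size of 224 GPUs come from the table above, while the function name `job_priority` and the factor values of the example job are hypothetical. Slurm's Multifactor Priority Plugin also supports further terms (for example QOS and nice values) that are omitted here.

```python
from typing import Mapping

# Weights taken from the table above.
WEIGHTS = {"age": 10_000, "fairshare": 100_000, "gpus": 20_000, "partition": 10_000}
TOTAL_GPUS = 224  # cluster size implied by the 1/224 example in the table


def job_priority(factors: Mapping[str, float]) -> int:
    """Return the weighted sum of the given factors, truncated to an integer.

    Hypothetical helper: every factor must lie in [0.0, 1.0].
    """
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name!r} must lie in [0.0, 1.0], got {value}")
    return int(sum(WEIGHTS[name] * value for name, value in factors.items()))


# Hypothetical job: waited half of the maximum age, requests 4 GPUs.
example = {
    "age": 0.5,
    "fairshare": 0.25,
    "gpus": 4 / TOTAL_GPUS,
    "partition": 1.0,
}
print(job_priority(example))  # prints 40357
```

Because the fair-share weight is ten times larger than the age and partition weights, past consumption dominates the result in this configuration.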
The fair-share factor depends on the user’s resource consumption over the last ~60 days. The more resources a user consumes, the lower the fair-share factor becomes, which results in lower priorities. Specifically, usage is quantified by (1.0 * number_allocated_gpus * seconds) for the compute partitions. This number is subtracted from the proportionally promised amount of resources and then normalized to it. The fair-share factors and the corresponding promised and actual usage for all users can be viewed via
sshare -a --format=Account,User,NormShares,NormUsage,FairShare
The column that contains the actual factor is called “FairShare”.
For details, see the official Slurm documentation on sshare and the Fair Tree Fairshare Algorithm.
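The usage accounting and the "subtract and normalize" step described above can be illustrated as follows. This is a rough sketch of the description in this article only: the function names are hypothetical, and the value that sshare reports in the FairShare column is produced by the Fair Tree algorithm, which ranks accounts and users within the tree rather than evaluating this formula directly.

```python
def gpu_seconds(allocated_gpus: int, seconds: float) -> float:
    """Usage charged for one job: 1.0 * number_allocated_gpus * seconds."""
    return 1.0 * allocated_gpus * seconds


def simplified_fairshare(norm_shares: float, norm_usage: float) -> float:
    """(promised - used) / promised, clamped to [0.0, 1.0].

    Illustrative assumption only; NOT the ranking procedure of Fair Tree.
    """
    if norm_shares <= 0.0:
        return 0.0
    return max(0.0, min(1.0, (norm_shares - norm_usage) / norm_shares))


# A job that held 4 GPUs for 2 hours is charged 28800 GPU-seconds.
print(gpu_seconds(4, 2 * 3600))

# Hypothetical user who was promised 25% of the cluster.
print(simplified_fairshare(0.25, 0.125))  # consumed half of the promised share
print(simplified_fairshare(0.25, 0.5))    # consumed twice the promised share
```

The two arguments of `simplified_fairshare` correspond to the NormShares and NormUsage columns of the sshare output.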
Slurm Parameter Definitions
This part explains some of the Slurm parameters that are used to set up the Fair Tree Fairshare Algorithm. For a more detailed explanation, please consult the official Slurm documentation.
- PriorityDecayHalfLife=[number of days]-[number of hours] Sets the half-life with which recorded resource consumption decays, and thereby how far back usage is taken into account by the fair-share algorithm.
- PriorityMaxAge=[number of days]-[number of hours] The maximum queueing time that counts toward the priority calculation. Jobs can wait longer than this, but the additional time does not increase the age factor.
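The effect of both parameters can be sketched numerically. This is a minimal illustration with hypothetical function names: the 14-day half-life is an assumed value chosen for the example, not this cluster's setting, while the 5-day maximum age matches the age factor described in the table of priority factors.

```python
def decay_weight(age_days: float, half_life_days: float) -> float:
    """Fraction of past usage that still counts after age_days."""
    return 0.5 ** (age_days / half_life_days)


def age_factor(wait_days: float, max_age_days: float) -> float:
    """Age factor in [0.0, 1.0]; saturates once the job has waited max_age_days."""
    return min(1.0, wait_days / max_age_days)


HALF_LIFE_DAYS = 14.0  # assumed example value for PriorityDecayHalfLife
MAX_AGE_DAYS = 5.0     # PriorityMaxAge consistent with "maxes out after 5 days"

print(decay_weight(14.0, HALF_LIFE_DAYS))  # usage from 14 days ago counts half
print(decay_weight(28.0, HALF_LIFE_DAYS))  # usage from 28 days ago counts a quarter
print(age_factor(2.5, MAX_AGE_DAYS))       # halfway to the maximum age
print(age_factor(8.0, MAX_AGE_DAYS))       # capped at 1.0
```

A shorter half-life lets heavy users recover their priority faster, while a longer one makes past consumption count for a longer time.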