Fair-Share Scheduling in Heterogeneous Clusters (Admin Guide)

From HPC Wiki


This article explains the Slurm configuration of a heterogeneous GPU cluster that uses a Slurm fair-share tree. It is divided into a user-oriented part and an administrator-oriented part. The user-oriented part describes how the fair-share tree works and how the priority of a job is determined; the administrator-oriented part describes the relevant Slurm parameters in more detail.

Accounting, Fair Usage and Job Priorities

Slurm saves accounting data for every job and job step that a user submits. This data is then used to calculate job priorities and to give each user a fair share of the cluster’s resources. Usually, the Slurm Database Daemon (slurmdbd) collects this data and stores it in a MySQL database. The scheduling system uses the Fair Tree Fairshare Algorithm to determine the priority of a job.

Job Priorities

The priority of a user’s job in the queue is determined by the Multifactor Priority Plugin. A job’s priority is an integer between 0 and 4,294,967,295. The larger the number, the higher the job is positioned in the queue and the sooner it will be scheduled. Users can display the priorities of their jobs, together with the individual contributions of the various factors, with

 sprio -u $USER 

To make the output of sprio easier to read, the following line can be added to ~/.bash_profile:

export SPRIO_FORMAT="%.10i %10u %17r %.8Y %.6A %.9F %.9P %.T" 

These are the factors that contribute to a job’s priority:

Factor | Weight | Description
Age | 10 000 | The length of time a job has been waiting in the queue while eligible to be scheduled. The contribution reaches its maximum of 10 000 after 5 days.
Fair-share | 100 000 | The difference between the portion of the computing resources that is proportionally promised to each user and the amount of resources that the user has actually consumed. See below for details.
#GPUs | 20 000 | The relative number of GPUs requested, e.g. 1/224 * 20 000 ≈ 89 per GPU.
Partition | 10 000 | Different partitions give different factors (e.g. compute > develop).

The weights are unsigned 32-bit integers, and the factors are floating-point numbers from 0.0 to 1.0. The priority is the weighted sum of these factors. The weights are still experimental and are being tuned to improve how users experience job prioritization.
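The weighted sum can be illustrated with a minimal Python sketch. It uses the weights from the table above; the factor values in the example are invented, the function name job_priority is hypothetical, and truncating the result to an integer is an assumption about rounding, not a statement about Slurm internals.

```python
# Weights taken from the table in this article.
WEIGHTS = {
    "age": 10_000,
    "fairshare": 100_000,
    "gpus": 20_000,
    "partition": 10_000,
}
TOTAL_GPUS = 224  # cluster size implied by the 1/224 example in the table


def job_priority(factors):
    """Return the weighted sum of the factors, each clamped to 0.0..1.0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        factor = min(max(factors.get(name, 0.0), 0.0), 1.0)
        total += weight * factor
    return int(total)  # assumption: the fractional part is simply dropped


# Invented example: a 4-GPU job that has waited 2.5 of the 5 days,
# submitted by a user with a fair-share factor of 0.25.
example = {
    "age": 0.5,
    "fairshare": 0.25,
    "gpus": 4 / TOTAL_GPUS,
    "partition": 1.0,
}
print(job_priority(example))
```

Because the fair-share weight is ten times the age and partition weights, a user's past consumption dominates the result in this configuration.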

Fair Share Factor

The fair-share factor depends on the user’s resource consumption over the last ~60 days. The more resources a user consumes, the lower the fair-share factor becomes, which results in lower priorities. Specifically, consumption on the compute partitions is quantified as (1.0 * number_allocated_gpus * seconds). This number is subtracted from the proportionally promised amount of resources and then normalized to it. The fair-share factors and the corresponding promised and actual usage for all users can be viewed with

sshare -a --format=Account,User,NormShares,NormUsage,FairShare

The column that contains the actual factor is called “FairShare”.
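For scripted comparisons, the sshare output can be post-processed. The following Python sketch assumes the command above was run with the additional -P (--parsable2) option, so that fields are separated by "|", and that account-level rows carry an empty User field. The sample text is invented for illustration and is not taken from a real cluster; parse_sshare is a hypothetical helper name.

```python
def parse_sshare(text):
    """Parse pipe-delimited sshare output into a list of per-user dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = [name.strip() for name in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        fields = [field.strip() for field in line.split("|")]
        row = dict(zip(header, fields))
        if row.get("User"):  # skip account-level rows (assumed empty User)
            row["FairShare"] = float(row["FairShare"])
            rows.append(row)
    return rows


# Invented sample in the assumed layout; all values are made up.
SAMPLE = """\
Account|User|NormShares|NormUsage|FairShare
root||0.000000|1.000000|
groupa||0.500000|0.700000|
groupa|alice|0.250000|0.600000|0.250000
groupa|bob|0.250000|0.100000|0.750000
groupb||0.500000|0.300000|
groupb|carol|0.500000|0.300000|1.000000
"""

# Users with the highest fair-share factor come first.
ranked = sorted(parse_sshare(SAMPLE), key=lambda r: r["FairShare"], reverse=True)
for row in ranked:
    print(row["User"], row["FairShare"])
```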

For details, see the official documentation for sshare and for the Fair Tree Fairshare Algorithm.

Slurm Parameter Definitions

This part explains some of the Slurm parameters that are used to set up the Fair Tree Fairshare Algorithm. For a more detailed explanation, please consult the official Slurm documentation.

  • PriorityDecayHalfLife=[number of days]-[number of hours] Sets the half-life with which recorded resource consumption decays, and therefore how far back past usage is taken into account by the fair-share algorithm.
  • PriorityMaxAge=[number of days]-[number of hours] The maximum queueing time that counts towards the priority calculation. Jobs can wait longer than this, but the additional time does not increase the age factor.
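The effect of the two parameters can be sketched in Python. The sketch assumes pure exponential decay and a linearly growing age factor that saturates at 1.0; the helper names are hypothetical, and the half-life of 14-0 is an example value only, not the setting of this cluster. The PriorityMaxAge value of 5-0 matches the 5-day cap mentioned in the factor table.

```python
def parse_days_hours(value):
    """Convert a 'days-hours' string such as '5-0' into hours."""
    days, hours = value.split("-")
    return int(days) * 24 + int(hours)


def decayed_usage(raw_usage, age_hours, half_life_hours):
    """Usage recorded age_hours ago, halved once per elapsed half-life."""
    return raw_usage * 0.5 ** (age_hours / half_life_hours)


def age_factor(wait_hours, max_age_hours):
    """Age factor: grows linearly with waiting time, capped at 1.0."""
    return min(wait_hours / max_age_hours, 1.0)


half_life = parse_days_hours("14-0")  # hypothetical PriorityDecayHalfLife
max_age = parse_days_hours("5-0")     # PriorityMaxAge of 5 days

# A job that used 4 GPUs for one hour, counted as GPU-seconds.
gpu_seconds = 1.0 * 4 * 3600

# After two half-lives (28 days) a quarter of the usage still counts.
print(decayed_usage(gpu_seconds, 28 * 24, half_life))

# Waiting 2.5 days gives half the age factor; 10 days is capped at 1.0.
print(age_factor(60, max_age), age_factor(240, max_age))
```

A shorter half-life lets heavy users recover their priority faster, while a longer one makes past consumption count for longer.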