
Slurm job priorities#

Introduction#

SCITAS Slurm clusters, like most Slurm clusters, do not execute jobs in FIFO (first in, first out) order.

Instead, the order in which jobs are scheduled depends on multiple factors that Slurm combines into a job priority. Jobs mostly run in the order established by this priority.

The exception to this is when a lower-priority job can run without delaying a higher-priority one. This is the case when there are enough resources available in the cluster. A lower-priority job can also end up running sooner if a larger, higher-priority job is waiting for several nodes to become available, but the lower-priority job fits in the currently idle nodes (because it is smaller in both requested resources and time) and can therefore finish without impacting the start time of the larger, higher-priority job. This mechanism is known as backfill scheduling.
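For pending jobs, squeue can also report the start time estimated by the scheduler. The estimate is only indicative (it may show N/A for jobs the scheduler has not yet considered), but it gives an idea of when a queued job is expected to start:

$ squeue -u $USER --start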

Checking job state#

Jobs which are queueing are in the PENDING (PD) state. Running jobs are in the RUNNING (R) state.

Slurm job states can be queried using the squeue utility, as in the following illustrative example:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2846884  standard    water  sciuser  PD       0:00     1 (Dependency)
           2846885  standard    water  sciuser  PD       0:00     1 (Priority)
           2846886  standard    water  sciuser  PD       0:00     1 (Resources)
           2846812  standard main.run  sciuser  R 1-03:15:48      2 h[076,079]
           2846637  standard main.run  sciuser  R 1-15:28:57      2 h[077-078]
           2846636  standard main.run  sciuser  R 1-15:29:15      2 h[073-074]
           2846914  standard     N=72  sciuser  R      23:24      1 h265
           2846917  standard sim1_nor  sciuser  R       6:14      1 h267
           2846916  standard     sim1  sciuser  R       7:42      1 h266
           2846912  standard main.run  sciuser  R      46:56      1 h265
           2846811  standard main.run  sciuser  R 1-03:18:52      1 h075
           2846514  standard main.run  sciuser  R 2-20:35:11      1 h091
           2846513  standard main.run  sciuser  R 2-20:35:22      1 h091
           2846512  standard main.run  sciuser  R 2-20:35:35      1 h091

In the example above, the ST column shows the state of each job. Three jobs are in the pending state (PD). Job ID 2846884 is pending due to a dependency on another job, job ID 2846885 is queueing because it has a lower priority, and job 2846886 is waiting for resources to become available and will eventually run. This documentation page aims to give more insight into why some pending jobs may end up running sooner than others.
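To check only your own jobs, or only the pending ones, squeue accepts user and state filters, for example:

$ squeue -u $USER          # only your jobs
$ squeue -u $USER -t PD    # only your pending jobs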

It is worth noting that queueing many jobs at once is possible, and Slurm will manage the priorities accordingly. A job submitted last can therefore end up running before jobs that were already queueing. What matters is not when or how many jobs were submitted, but each job's computed priority.
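squeue can also display the computed priority directly, which makes it easy to see that submission order does not dictate run order; the %Q format field prints each job's priority:

$ squeue -u $USER -o "%.10i %.9P %.8j %.2t %.10M %.10Q"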

Big submissions

We do not recommend submitting jobs in batches of thousands or more, as this puts a heavy load on the whole batch system. If you need to queue thousands of jobs at a time, please get in contact with us before doing so.

Checking job priority#

Each Slurm job's priority is computed based on several factors such as:

  • The job's age (how long it's been queueing).
  • The job's QOS.
  • The user's Fairshare.
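The weight given to each of these factors is part of the cluster configuration and can be inspected with sprio, which prints the configured weights:

$ sprio -w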

Slurm has a fair-share system in place that influences job priority based on how many computing resources an account has been allocated and how many it has consumed. In other words, the share represents the portion of the cluster each account is "entitled to", expressed as a normalized value between 0.0 and 1.0. Premium accounts all have the same share, and free accounts have a lower share. This share is then divided amongst the users of the same account. The fairshare value changes based on the utilisation of the cluster: the more a user runs, the more their account's fairshare value diminishes.

In addition, a half-life decay factor is in play, so that only relatively recent usage is taken into account: a user's recorded usage decays to half its value after each half-life period. For instance, if an account made no usage of the cluster for two weeks (with the default half-life decay of 1 week), its fairshare would largely "recover" towards its full value automatically.
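The length of the half-life period is set by the PriorityDecayHalfLife parameter of the cluster. If you want to check the value in use, it appears in the Slurm configuration:

$ scontrol show config | grep -i PriorityDecayHalfLife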

While the fairshare factor, and therefore job priority, works on a per-account basis, usage also affects priorities within the same account: two users of the same account will have different fairshare values and priorities. Another consequence of the fairshare working per account is that the usage of one user has an impact on all users of the same account: as the account's usage goes up, its fairshare factor correspondingly goes down.

Slurm job priorities can be queried using the sprio utility. Below, the -S '-Y' option sorts the jobs by priority in descending order:

$ sprio -S '-Y'
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE        QOS                 TRES
        4624550 standard        1630          0         84        534       1000               cpu=13
        4621254 standard         835          0        128        607        100                cpu=1
        4623757 standard         775          0         94        579        100                cpu=3
        4627953 standard         656          0         27        527        100                cpu=3
        4626617 standard         463          0         50        314        100                cpu=0
        4627119 standard         455          0         42        314        100                cpu=0
        4628806 standard         281          0          9        171        100                cpu=1

We can see that job ID 4624550 has the highest priority: 1630 ≈ 1000 (QOS) + 534 (Fairshare) + 84 (Age) + 13 (cpu TRES). (The displayed components are rounded, so their sum can differ slightly from the total priority.)
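sprio can also be restricted to specific jobs or users, or print the normalized (0.0 to 1.0) factors instead of the weighted ones:

$ sprio -u $USER      # only your own jobs
$ sprio -j 4624550    # a single job
$ sprio -n            # normalized factors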

To look up fairshare usage, the sshare utility can be used, which shows all fairshares organised in a tree structure (accounts and users within accounts). Use the -a option to list all users in the cluster.
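For example, the following commands show, respectively, your own associations, every account and user in the cluster, and a longer output with additional usage columns:

$ sshare        # your own account(s)
$ sshare -a     # all accounts and users
$ sshare -l     # additional columns with usage details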

As you can observe, our clusters are designed so that the QOS counts the most; within each QOS tier, the fairshare factor usually has the next biggest impact.
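The priority attached to each QOS can be listed with sacctmgr, for example:

$ sacctmgr show qos format=Name,Priority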

Example#

User A and user B are submitting GPU jobs to the same cluster and partition, but user B's jobs keep running before user A's jobs, even when user A's jobs have been queuing for longer. Is there something wrong with the cluster, or is user B abusing the system?

The reason this happens can be explained by the fair-share system detailed on this page. User B is using a premium account with a (normalized) cluster share of 0.005556 and is the only active user in that account, resulting in a fairshare factor of 0.323529 at the moment of his job submissions. User A is using the master account. The master account is also a premium account, and therefore also has a 0.005556 share of the cluster, but this share is divided among many other students. As a result, user A's fairshare factor is 0.192212 at the moment of her submissions. (The share information is transparent and can be checked by anybody by running "sshare -a".)

The final job priority is then computed by Slurm from multiple factors, including the QOS and the fairshare factor. In this example, these two terms dominate and the priority is effectively:

jobpriority = 10000*fairshare + 100000*qos

Since both are using premium accounts and both are using the same "gpu" QOS, which contributes a factor of 4000, the only thing that differs is the fairshare factor.

This results in user A's jobs having a priority of 5922, while user B's have a priority of 7235. Therefore, user B's jobs end up running sooner than user A's jobs.
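As a quick check of the arithmetic, plugging the two fairshare factors into the formula above (with the QOS term contributing 4000 in both cases) reproduces these priorities; bc is used here for the floating-point arithmetic:

$ echo "10000*0.192212 + 4000" | bc    # user A: ≈ 5922
$ echo "10000*0.323529 + 4000" | bc    # user B: ≈ 7235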
