Slurm job priorities#
Introduction#
SCITAS Slurm clusters, like most Slurm clusters, do not follow a FIFO order for job execution.
Instead, the order in which jobs are scheduled to run depends on multiple factors that Slurm combines into a job priority. Jobs will mostly run in the order established by this priority.
The exception is when a job with lower priority can run without delaying a higher-priority job. This is the case when there are enough resources available in the cluster. A lower-priority job can also end up running sooner if a larger, higher-priority job is waiting for several nodes to become available, while the lower-priority job fits in the currently idle nodes (because it is smaller in both requested resources and time limit) and can therefore finish without impacting the start time of the larger, higher-priority job.
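This backfilling behaviour can be summarised with a minimal sketch; the function and variable names below are ours, and the real scheduler takes many more constraints into account (reservations, partition limits, and so on):

```python
# Simplified illustration of the idea described above, not Slurm's actual
# backfill scheduler: a lower-priority job may start early if it fits in the
# currently idle nodes and is guaranteed to finish before the higher-priority
# job is expected to start.
def can_backfill(idle_nodes, job_nodes, job_time_limit,
                 now, top_job_expected_start):
    fits_in_idle_nodes = job_nodes <= idle_nodes
    finishes_in_time = now + job_time_limit <= top_job_expected_start
    return fits_in_idle_nodes and finishes_in_time

# Example: 2 idle nodes, a 1-node job with a 4-hour time limit, and the
# highest-priority pending job expected to start in 6 hours (times in hours).
print(can_backfill(idle_nodes=2, job_nodes=1, job_time_limit=4,
                   now=0, top_job_expected_start=6))  # True
```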
Checking job state#
Jobs which are queueing are in the PENDING (PD) state, while running jobs are in the RUNNING (R) state. Job states can be queried using the squeue utility, as in the following illustrative example:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2846884 standard water sciuser PD 0:00 1 (Dependency)
2846885 standard water sciuser PD 0:00 1 (Priority)
2846886 standard water sciuser PD 0:00 1 (Resources)
2846812 standard main.run sciuser R 1-03:15:48 2 h[076,079]
2846637 standard main.run sciuser R 1-15:28:57 2 h[077-078]
2846636 standard main.run sciuser R 1-15:29:15 2 h[073-074]
2846914 standard N=72 sciuser R 23:24 1 h265
2846917 standard sim1_nor sciuser R 6:14 1 h267
2846916 standard sim1 sciuser R 7:42 1 h266
2846912 standard main.run sciuser R 46:56 1 h265
2846811 standard main.run sciuser R 1-03:18:52 1 h075
2846514 standard main.run sciuser R 2-20:35:11 1 h091
2846513 standard main.run sciuser R 2-20:35:22 1 h091
2846512 standard main.run sciuser R 2-20:35:35 1 h091
In the example above, the ST column shows the state of each job. Three jobs are in the pending (PD) state: job 2846884 is pending due to a dependency on another job, job 2846885 is queueing because it has a lower priority, and job 2846886 is waiting for resources to become available and will eventually run. This documentation page aims to give more insight into why some pending jobs may end up running sooner than others.
It is worth noting that queueing many jobs at once is possible, and Slurm will manage their priorities accordingly. Therefore, a job submitted last can end up running before other jobs that were already queueing. What matters is not when or how many jobs were submitted, but each job's computed priority.
Big submissions
We do not recommend submitting jobs in batches of thousands or more, as this can put a heavy load on the whole batch system. If you need to queue thousands of jobs at a time, please get in contact with us before doing so.
Checking job priority#
Each Slurm job's priority is computed based on several factors such as:
- The job's age (how long it's been queueing).
- The job's QOS.
- The user's Fairshare.
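As an illustration of how these factors combine, here is a minimal sketch of the weighted-sum idea behind Slurm's multifactor priority; the weights below are placeholder values, not the actual configuration of the SCITAS clusters:

```python
# Each factor is a normalized value between 0.0 and 1.0; the cluster
# configuration assigns a weight to each factor, and the job priority is the
# (truncated) sum of the weighted factors. These weights are placeholders.
WEIGHTS = {"age": 1000, "fairshare": 10000, "qos": 10000, "tres": 1000}

def job_priority(factors):
    """factors: mapping of factor name -> normalized value in [0.0, 1.0]."""
    return int(sum(WEIGHTS[name] * value for name, value in factors.items()))

# A job that has waited for a while, on an account with little recent usage:
print(job_priority({"age": 0.3, "fairshare": 0.8, "qos": 0.1, "tres": 0.05}))
```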
Slurm has a fair-share system in place that influences job priority based on how many computing resources an account has been allocated and how many it has already consumed. In other words, the share represents the fraction of the cluster each account is "entitled to", and is a normalized value between 0.0 and 1.0. Premium accounts all have the same share, and free accounts have a lower one. This share is then divided amongst the users of the same account. The fairshare value changes based on the utilisation of the cluster: the more a user runs, the more their account's fairshare value will diminish.
In addition, there is a half-life decay factor in play, so that only relatively recent usage is taken into account: a user's recorded usage decays to half its value after each half-life period. For instance, with the default half-life decay of one week, an account that makes no usage of the cluster for two weeks will see its recorded usage drop to a quarter of its original value, and its fairshare factor will largely "recover" automatically.
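The decay can be illustrated with a short sketch; the half-life below is the default mentioned above, while the actual value is controlled by the cluster's PriorityDecayHalfLife setting:

```python
# Minimal illustration of the half-life decay of recorded usage, not Slurm's
# actual accounting code.
HALF_LIFE_DAYS = 7.0  # default decay half-life of one week

def decayed_usage(initial_usage, days_without_running):
    """Recorded usage remaining after a period with no new jobs."""
    return initial_usage * 0.5 ** (days_without_running / HALF_LIFE_DAYS)

print(decayed_usage(1000.0, 7))   # 500.0 after one half-life
print(decayed_usage(1000.0, 14))  # 250.0 after two half-lives
```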
While the fairshare factor, and therefore job priority, works on a per-account basis, usage also affects priorities within the same account: two users of the same account will have different fairshare values and priorities. Another consequence of the per-account fairshare is that the usage of one user has an impact on all users of the same account: as the account's usage goes up, the fairshare factor will correspondingly go down.
Slurm job priorities can be queried using the sprio utility. Below, the -S '-Y' option sorts the jobs by priority in descending order:
$ sprio -S '-Y'
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE QOS TRES
4624550 standard 1630 0 84 534 1000 cpu=13
4621254 standard 835 0 128 607 100 cpu=1
4623757 standard 775 0 94 579 100 cpu=3
4627953 standard 656 0 27 527 100 cpu=3
4626617 standard 463 0 50 314 100 cpu=0
4627119 standard 455 0 42 314 100 cpu=0
4628806 standard 281 0 9 171 100 cpu=1
We can see that job ID 4624550 has the highest priority: 1630 ≈ 1000 (QOS) + 534 (Fairshare) + 84 (Age) + 13 (cpu TRES). The displayed factors are rounded, so they may not add up exactly to the total.
To look up fairshare usage, the sshare utility can be used; it shows all fairshares organised in a tree structure (accounts, and users within each account). Use the -a option to list all users in the cluster.
As you can observe, our clusters are configured so that the QOS factor counts the most, and within each QOS tier, the fairshare factor usually has the next biggest impact.
Example#
User A and user B are submitting GPU jobs to the same cluster and partition, but user B's jobs keep running before user A's jobs, even when user A's jobs have been queuing for longer. Is there something wrong with the cluster, or is user B abusing the system?
The reason this happens can be explained by the fair-share system detailed in this page. User B is using a premium account which has a (normalized) cluster share of 0.005556, and he is the only active user in that account, resulting in a fairshare factor of 0.323529 at the moment of his job submissions. User A is using the master account. The master account is also a premium account, and therefore also has a share of 0.005556 of the cluster, but this share is divided among many other students. As a result, user A's fairshare factor is 0.192212 at the moment of her submissions. (The share information is transparent and can be checked by anybody by running "sshare -a".)
The final job priority is then computed by Slurm as a weighted sum of multiple factors, including the QOS and the fairshare factors. Since both users are on premium accounts and both are using the same "gpu" QOS, which has a factor of 4000, the only thing that differs is the fairshare factor.
This results in user A's jobs having a priority of 5922, while user B's have a priority of 7235. Therefore, user B's jobs end up running sooner than user A's jobs.
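As a rough sanity check, the two priorities above are consistent with these fairshare factors if one assumes a fairshare weight of 10000 and treats the QOS factor of 4000 as a direct additive contribution; these values are inferred for illustration only and are not the cluster's published configuration:

```python
# Hypothetical weights chosen only to reproduce the numbers in this example;
# the actual values come from the cluster's priority configuration.
QOS_CONTRIBUTION = 4000    # "gpu" QOS factor quoted above
FAIRSHARE_WEIGHT = 10000   # assumed fairshare weight

def priority(fairshare_factor):
    # Ignores the age and TRES contributions for simplicity.
    return QOS_CONTRIBUTION + int(FAIRSHARE_WEIGHT * fairshare_factor)

print(priority(0.192212))  # user A -> 5922
print(priority(0.323529))  # user B -> 7235
```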