# Slurm job priorities

## Introduction
SCITAS Slurm clusters, like most Slurm clusters, do not follow a FIFO order for job execution.
Instead, the order in which Slurm schedules jobs to run depends on multiple
factors which Slurm uses to compute a job priority. Jobs mostly run in the
order established by this priority.
The exception is when a lower-priority job can run without delaying any higher-priority job. This happens when there are enough idle resources in the cluster, or when a larger, higher-priority job is waiting for several nodes to become available and a lower-priority job is small enough (in both required resources and requested time) to fit in the currently idle nodes and finish before impacting the start time of the larger, higher-priority job.
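This kind of out-of-order start is what Slurm's backfill scheduler provides. If you are curious, you can check which scheduling plugin a cluster is configured with; on most production clusters this reports `sched/backfill`:

```
# show the configured scheduling plugin
$ scontrol show config | grep -i SchedulerType
```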
## Checking the job state
Queued jobs are in the `PENDING` (`PD`) state; running jobs are in the `RUNNING` (`R`) state. You can query the state of a Slurm job using the `squeue` utility. For example:
```
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
2846884  standard    water  sciuser PD       0:00      1 (Dependency)
2846885  standard    water  sciuser PD       0:00      1 (Priority)
2846886  standard    water  sciuser PD       0:00      1 (Resources)
2846812  standard main.run  sciuser  R 1-03:15:48      2 h[076,079]
2846637  standard main.run  sciuser  R 1-15:28:57      2 h[077-078]
2846636  standard main.run  sciuser  R 1-15:29:15      2 h[073-074]
2846914  standard     N=72  sciuser  R      23:24      1 h265
2846917  standard sim1_nor  sciuser  R       6:14      1 h267
2846916  standard     sim1  sciuser  R       7:42      1 h266
2846912  standard main.run  sciuser  R      46:56      1 h265
2846811  standard main.run  sciuser  R 1-03:18:52      1 h075
2846514  standard main.run  sciuser  R 2-20:35:11      1 h091
2846513  standard main.run  sciuser  R 2-20:35:22      1 h091
2846512  standard main.run  sciuser  R 2-20:35:35      1 h091
```
In the example above, the `ST` column shows the state of each job. Three jobs are in the pending state (`PD`): job 2846884 is pending due to a dependency on another job, job 2846885 is queued because it has a lower priority, and job 2846886 is waiting for resources to become available and will eventually run.
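To narrow the output to your own jobs, `squeue` can filter by user and state; for example (adapt the filters to your needs):

```
# show only your own pending jobs and the reason they are waiting
$ squeue -u $USER --states=PENDING
```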
It is worth noting that Slurm will take into account all the queued jobs and users and manage their priorities accordingly. Therefore, a job submitted last can end up running before other jobs that were already queuing. What matters is not when or how many jobs were submitted, but the job's computed priority.
**Big submissions:** We do not recommend submitting jobs in batches of thousands or more, as this can put a big load on the whole batch system. If you find yourself needing to queue thousands of jobs at a time, either use job arrays (see the sketch below) or get in contact with us before doing so.
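As an illustration, a job array submits many similar tasks through a single job while keeping the load on the scheduler low. A minimal sketch, in which the job name, program name, input naming, and resource requests are placeholders to adapt:

```
#!/bin/bash
#SBATCH --job-name=my_array        # placeholder name
#SBATCH --array=0-999              # 1000 tasks in a single submission
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# each task processes its own input, selected via the array index
srun ./my_program input_${SLURM_ARRAY_TASK_ID}.dat
```

Submitted once with `sbatch`, this creates a single array job whose tasks are scheduled individually, instead of a thousand independent submissions.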
## Checking job priority
Each Slurm job's priority is computed based on several factors such as:
- The job's age (how long it's been queueing).
- The job's size (in terms of resources reserved).
- The job's QOS.
- The user's Fairshare.
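Schematically, Slurm's multifactor priority plugin combines these factors as a weighted sum; each factor is normalised between 0.0 and 1.0 and the weights are configured per cluster (a simplified sketch, omitting some terms):

```
Priority = PriorityWeightAge       * age_factor
         + PriorityWeightFairshare * fairshare_factor
         + PriorityWeightJobSize   * job_size_factor
         + PriorityWeightQOS       * qos_factor
         + sum(TRES weights        * TRES factors)
```

The weights configured on a given cluster can be inspected with `sprio -w` or `scontrol show config | grep -i PriorityWeight`.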
Slurm has a fair-share system in place that is meant to influence the job priority based on how many computing resources the account has been allocated, and how many resources have been consumed. In other words, the share represents the part of the cluster each account is "entitled to," normalized between 0.0 and 1.0. Premium accounts all have the same share. This share is then distributed amongst the users in the same account and it changes based on the cluster usage. The more jobs a user runs, the lower their fairshare value will be.
In addition, there is a half-life decay factor in play that takes into account past usage only up to a certain point in time. A user's usage will decay to half its value after this half-life decay period, typically one week. In other words, if for instance an account does not use the cluster for two weeks, their computed usage will fall to a quarter of what it was before those two weeks.
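As a simplified illustration of this decay, usage recorded a time t in the past contributes with a weight of roughly:

```
decayed_usage = recorded_usage * 0.5^(t / half_life)
```

With a one-week half-life, two-week-old usage therefore counts for 0.5² = 0.25 of its recorded value, which is the "quarter" mentioned above.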
While the fairshare factor, and therefore job priority, works on a per-account basis, usage also affects priorities within the same account: two users of the same account will have different fairshare values and priorities, based on their individual usage. Another consequence of the per-account fairshare is that the usage of one member of an account has an impact on all users of that account: as the account's usage goes up, the fairshare factor goes down correspondingly for everyone in the account.
Slurm job priorities can be queried using the `sprio` utility. Below, the `-S '-Y'` option sorts the output by priority in descending order:
```
$ sprio -S '-Y'
  JOBID PARTITION PRIORITY  SITE   AGE FAIRSHARE   QOS TRES
4624550  standard     1630     0    84       534  1000 cpu=13
4621254  standard      835     0   128       607   100 cpu=1
4623757  standard      775     0    94       579   100 cpu=3
4627953  standard      656     0    27       527   100 cpu=3
4626617  standard      463     0    50       314   100 cpu=0
4627119  standard      455     0    42       314   100 cpu=0
4628806  standard      281     0     9       171   100 cpu=1
```
We can see that job 4624550 has the highest priority: 1630 ≈ 1000 (QOS) + 534 (fairshare) + 84 (age) + 13 (CPU TRES); the displayed factors are rounded, so the sum can be off by one. In this case, the QOS and fairshare factors have the biggest impact on the priority.
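To focus on your own jobs, or to see each factor on its normalised 0.0–1.0 scale, `sprio` accepts a few useful options; for example:

```
# priorities of your own pending jobs only
$ sprio -u $USER

# the same factors, normalised between 0.0 and 1.0
$ sprio -u $USER -n
```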
To look up fairshare usage, the `sshare` utility can be used; it shows all fairshares organised in a tree structure (accounts, and the users within each account). Use the `-a` option to list all users in the cluster.
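For example (option names may vary slightly between Slurm versions):

```
# full fairshare tree: all accounts, and the users within them
$ sshare -a

# restrict the output to your own association(s)
$ sshare -U -u $USER
```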
## Example
Alice and Bob are submitting jobs to the same cluster and partition, but Bob's jobs keep running before Alice's, despite Alice's jobs having been in the queue longer. Is there something wrong with the cluster, or is Bob abusing the system?
The reason can be explained by the fair-share system described above. Bob is currently running more jobs, but his account has used fewer resources over the period considered for the fairshare calculation (including the decayed usage of the previous weeks), so his fairshare factor, and therefore the priority of his jobs, is higher. This share and usage information is available to everybody and can be checked by running `sshare -a` on each individual cluster.