
Slurm job priorities#

Introduction#

SCITAS Slurm clusters, like most Slurm clusters, do not follow a FIFO order for job execution.

Instead, the order in which Slurm schedules jobs to run depends on multiple
factors which Slurm uses to compute a job priority. Jobs mostly run in the
order established by this priority.

The exception is when a job with a lower priority can run without delaying any higher-priority job. This is the case when there are enough idle resources in the cluster. A lower-priority job can also start sooner when a larger, higher-priority job is still waiting for several nodes to free up: if the smaller job fits in the currently idle nodes (both in requested resources and in requested time), it can finish without pushing back the start time of the larger, higher-priority job.
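If you want to know when Slurm currently expects a pending job to start, you can ask squeue for its scheduled start time. The estimate can change as the queue evolves (for instance when running jobs finish before their time limit), so treat it as an indication rather than a guarantee:

$ squeue -u $USER --start

The --start option adds a START_TIME column showing the scheduler's projected start time for each of your pending jobs.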

Checking the job state#

Queued jobs are in the PENDING state (shown as PD by squeue), while running jobs are in the RUNNING state (shown as R).

You can query the state of a Slurm job using the squeue utility. For example:

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           2846884  standard    water  sciuser  PD       0:00     1 (Dependency)
           2846885  standard    water  sciuser  PD       0:00     1 (Priority)
           2846886  standard    water  sciuser  PD       0:00     1 (Resources)
           2846812  standard main.run  sciuser  R 1-03:15:48      2 h[076,079]
           2846637  standard main.run  sciuser  R 1-15:28:57      2 h[077-078]
           2846636  standard main.run  sciuser  R 1-15:29:15      2 h[073-074]
           2846914  standard     N=72  sciuser  R      23:24      1 h265
           2846917  standard sim1_nor  sciuser  R       6:14      1 h267
           2846916  standard     sim1  sciuser  R       7:42      1 h266
           2846912  standard main.run  sciuser  R      46:56      1 h265
           2846811  standard main.run  sciuser  R 1-03:18:52      1 h075
           2846514  standard main.run  sciuser  R 2-20:35:11      1 h091
           2846513  standard main.run  sciuser  R 2-20:35:22      1 h091
           2846512  standard main.run  sciuser  R 2-20:35:35      1 h091

In the example above, the ST column shows the state of each job. Three jobs are pending (PD): job 2846884 is waiting on a dependency on another job, job 2846885 is queued because it has a lower priority, and job 2846886 is waiting for resources to become available and will run as soon as they free up.
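A convenient way to focus on your own queued jobs, together with the reason they are still waiting, is to combine the user and state filters of squeue. The format string below is just one possible choice; %r prints the full pending reason:

$ squeue -u $USER -t PENDING -o "%.10i %.9P %.8j %.2t %.20r"

Here -t PENDING (equivalently -t PD) restricts the listing to queued jobs, while the job ID (%i), partition (%P), name (%j) and state (%t) columns match those of the default output.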

It is worth noting that Slurm takes all queued jobs and users into account and manages their priorities accordingly. A job submitted last can therefore end up running before jobs that were already queueing: what matters is not when or how many jobs were submitted, but each job's computed priority.

Big submissions

We do not recommend submitting jobs in batches of thousands or more, as this puts a significant load on the whole batch system. If you find yourself needing to queue thousands of jobs at a time, either use job arrays (see the sketch below) or get in contact with us first.
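As a minimal sketch of a job array, the batch script below submits many similar tasks in a single submission; the script name, array size, resource requests and program invocation are purely illustrative and need to be adapted to your workload:

$ cat array_job.sh
#!/bin/bash
#SBATCH --job-name=param-sweep       # illustrative job name
#SBATCH --array=0-999%50             # 1000 tasks, at most 50 running at once
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00              # per-task time limit, adjust as needed

# Each task of the array gets its own index in SLURM_ARRAY_TASK_ID,
# which can be used to pick the corresponding input file (hypothetical here).
srun ./my_program --input "input_${SLURM_ARRAY_TASK_ID}.dat"

$ sbatch array_job.sh

Slurm then handles the whole array as a single submission, which is far lighter on the scheduler than thousands of individual sbatch calls.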

Checking job priority#

Each Slurm job's priority is computed from several factors, combined into a weighted sum (sketched after this list), such as:

  • The job's age (how long it's been queueing).
  • The job's size (in terms of resources reserved).
  • The job's QOS.
  • The user's Fairshare.
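Under Slurm's multifactor priority plugin, the priority is, roughly speaking, a weighted sum of these factors, each of them normalised between 0.0 and 1.0. The sketch below is simplified (it omits, amongst others, the site, partition and nice contributions), and the actual weights are part of the cluster configuration:

    Priority ≈ PriorityWeightAge       * age_factor
             + PriorityWeightFairshare * fairshare_factor
             + PriorityWeightQOS       * qos_factor
             + sum over TRES ( TRES weight * TRES factor )

The configured weights can be displayed with:

$ sprio -w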

Slurm has a fair-share system in place that is meant to influence the job priority based on how many computing resources the account has been allocated, and how many resources have been consumed. In other words, the share represents the part of the cluster each account is "entitled to," normalized between 0.0 and 1.0. Premium accounts all have the same share. This share is then distributed amongst the users in the same account and it changes based on the cluster usage. The more jobs a user runs, the lower their fairshare value will be.

In addition, a half-life decay is applied to the recorded usage, so that recent usage weighs more than older usage. A user's usage decays to half its value after each half-life period, typically one week. For instance, if an account does not use the cluster for two weeks, its computed usage falls to a quarter of what it was before those two weeks.
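As a simplified sketch of how the decay works (the half-life itself is a cluster configuration parameter, PriorityDecayHalfLife), past usage decays exponentially with time:

    decayed_usage(t) = recorded_usage * 0.5^(t / half_life)

With a one-week half-life, usage from one week ago therefore counts at 50% of its original value, usage from two weeks ago at 25%, and so on.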

While the fairshare factor, and therefore the job priority, is computed on a per-account basis, individual usage also affects priorities within the same account: two users sharing an account will have different fairshare values and priorities, depending on their personal usage. Another consequence of the per-account fairshare is that the usage of one member of an account has an impact on every user of that account: as the account's usage goes up, the fairshare factor goes down correspondingly for everyone in it.

Slurm job priorities can be queried using the sprio utility. Below, the -S '-Y' sorts by priority in descending order:

$ sprio -S '-Y'
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE        QOS                 TRES
        4624550 standard        1630          0         84        534       1000               cpu=13
        4621254 standard         835          0        128        607        100                cpu=1
        4623757 standard         775          0         94        579        100                cpu=3
        4627953 standard         656          0         27        527        100                cpu=3
        4626617 standard         463          0         50        314        100                cpu=0
        4627119 standard         455          0         42        314        100                cpu=0
        4628806 standard         281          0          9        171        100                cpu=1

We can see that job 4624550 has the highest priority, which is essentially the sum of its factors: 1000 (QOS) + 534 (Fairshare) + 84 (Age) + 13 (cpu TRES). The individual contributions shown by sprio are rounded, which is why they add up to 1631 rather than the reported 1630. In this case, the QOS and fairshare factors clearly have the biggest impact on the priority.

To look up fairshare usage, the sshare utility can be used, which shows all fairshares organised in a tree structure (accounts and users within accounts). Use the -a option to list all users in the cluster.
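For example, assuming the default output format, the following commands show your own associations and the whole cluster, respectively; the RawShares, NormShares, RawUsage, EffectvUsage and FairShare columns contain the quantities discussed above, with FairShare being the normalised value that feeds into the job priority:

$ sshare        # your own account(s) and user association
$ sshare -a     # all accounts and users in the cluster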

Example#

Alice and Bob are submitting jobs to the same cluster and partition, but Bob's
jobs keep running before Alice's jobs, despite Alice's jobs having been longer
in the queue. Is there something wrong with the cluster, or is Bob abusing the
system?

The reason this happens is the fair-share system described above. Bob may currently be running more jobs, but his account has consumed fewer resources over the period considered for the fairshare calculation (including the decayed usage from previous weeks), so his fairshare factor, and therefore the priority of his jobs, is higher. This share and usage information is available to everybody and can be checked by running sshare -a on each individual cluster.
