Running Jobs#

Success

Before reading this document, you should have logged in to a cluster at least once and possibly tried out some examples.

Note

In the following examples, <username> means your EPFL GASPAR username.

Batch systems#

The key to using the clusters is to keep in mind that all tasks (or jobs) need to be submitted to a batch system called Slurm. With this scheduler, your jobs will be launched according to factors such as priority, node availability, etc.

Except for rare cases, your jobs won't start as soon as you submit them. It is totally normal to wait a few minutes, hours, or even days(!) depending on the current cluster workload. If your jobs do not start right away, leave them in the queue and Slurm will decide when to run them. Do not try to cancel them and re-submit them later.

All SCITAS clusters use Slurm, which is widely used and open source.

Running jobs with Slurm#

The normal way of working is to create a short script that describes what you need to do and submit it to the batch system using the sbatch command.

Here is a minimal example for submitting a job running the serial code moovit:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --time 1:00:00

$HOME/code/moovit < $HOME/params/moo1

Any line beginning with #SBATCH is a directive to the batch system. Run man sbatch for the full documentation of the sbatch command.

The options in the script above do the following:

  • --nodes 1 specifies the number of nodes to use. For jobs with more than one task it's important to set a value for this parameter. Failing to do so may cause your job to be distributed over multiple nodes, resulting in potentially lower performance.
  • --ntasks 1 specifies the maximum number of tasks (in an MPI sense) to be run per job.
  • --cpus-per-task 1 specifies the number of cores per aforementioned task.
  • --time 1:00:00 specifies the maximum walltime required. Your job will be automatically killed if it exceeds this limit. Note that Slurm accepts several time formats; in this example HH:MM:SS is used (see the examples below).

See the sbatch documentation for more details.
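
As an illustration of the accepted time formats (a real script would contain only one --time directive; the values here are arbitrary):

# 30 minutes
#SBATCH --time 30
# 2 hours (HH:MM:SS)
#SBATCH --time 2:00:00
# 1 day and 12 hours (DD-HH:MM:SS)
#SBATCH --time 1-12:00:00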

Choosing a reasonable time limit

It is in your best interest to set a reasonable yet short value for --time. Slurm will try to optimize resource usage. If, for instance, a 16-node job is scheduled to run in a few hours and Slurm is reserving nodes for it, a small job can still use those nodes if Slurm knows it will end before the big job should start. Asking for 3 days for a job that will finish in 30 minutes will in general lead to longer wait times in the queue.

Choosing a reasonable amount of memory

To optimize your workflow and ensure fairness for all users, it is essential to select an appropriate value for the --mem parameter. Requesting excessive memory, such as 40GB, for a job that only requires 4GB may result in a significant increase in queue time. Moreover, by reserving more resources than you will use, you may prevent other users from running their jobs.
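
As a minimal sketch, a job that needs roughly 4GB could request it explicitly (the value is only an illustration; adapt it to what your application actually uses):

# Total memory for the job, adjust to your needs
#SBATCH --mem 4G

Alternatively, --mem-per-cpu can be used to request memory proportionally to the number of cores; the two options are mutually exclusive.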

This script should be saved to a file, for example moojob.run, and submitted using:

$ sbatch moojob.run

The output will look something like:

$ sbatch moojob.run
Submitted batch job 123456

The number returned is the Job ID and is the key to finding out further information or modifying the job.

Slurm directives can also be given on the command line, overriding what is set in the script itself:

$ sbatch --time=2-00:00:00 moojob.run

would ask for a 2-day time limit, regardless of the 1-hour limit set in the script.

Canceling Jobs#

To cancel a specific job:

$ scancel JOBID

To cancel all your jobs (use with care!):

$ scancel -u $USER

To cancel all your jobs that are not yet running:

$ scancel -u $USER -t PENDING
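
scancel also accepts further filters if you only want to cancel a subset of your jobs; for example, by partition or by job name (<partition> and <jobname> are placeholders):

$ scancel -u $USER -p <partition>
$ scancel -u $USER -n <jobname>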

Getting Job Information#

A number of different tools can be used to query jobs depending on exactly what information is needed.

If the name of a tool begins with a capital S, it is a SCITAS-specific tool. Any tool whose name starts with a lowercase s is part of the base Slurm distribution.

Squeue#

Squeue shows information about all your jobs:

$ Squeue
     JOBID         NAME  ACCOUNT       USER NODE  CPUS  MIN_MEMORY     ST       REASON           START_TIME             NODELIST
    123456         run1   scitas        bob    6    96       32000      R         None  2023-02-03T04:18:37         jst04[32-37]
    123457         run2   scitas        bob    6    16       32000     PD   Dependency                  N/A                     

squeue#

By default, squeue will show you all the jobs from all users. The output can be filtered by passing options to squeue.

To see all the running jobs from the scitas group we run:

$ squeue -t R -A scitas
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456  parallel  gromacs      bob  R      48:43      6 jst04[32-37]
            123457  parallel     pw.x      sue  R   18:06:44      8 jst01[03,11,21],jst04[50,61-64]

See man squeue for all the options.
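
A commonly useful variant is asking for the scheduler's current estimate of when your pending jobs will start:

$ squeue -u $USER --start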

For example, the Squeue command described above is actually a script that calls:

$ squeue -u $USER -o "%.10A %.12j %.8a %.10u %.4D %.5C %.11m %.6t %.12r %.20S %.20N" -S S
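
If you prefer a different layout, you can build your own variant along the same lines; for instance, a minimal alias (the name sq and the chosen columns are just a suggestion) showing job ID, state, elapsed time and node list:

$ alias sq='squeue -u $USER -o "%.10A %.6t %.12M %.20N"'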

scontrol#

The scontrol command will show you everything that the system knows about a running or pending job.

$ scontrol -d show job 87439
   JobId=87439 JobName=PDG
   UserId=user(100000) GroupId=epfl-unit(100000) MCS_label=N/A
   Priority=992 Nice=0 Account=scitas QOS=parallel
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:30:10 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2023-01-09T18:18:43 EligibleTime=2023-01-09T18:18:43
   AccrueTime=2023-01-09T18:18:43
   StartTime=2023-01-10T09:14:17 EndTime=2023-01-10T17:14:17 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-01-10T09:14:17 Scheduler=Main
   Partition=standard AllocNode:Sid=jed:424335
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=jst[005,009,208]
   BatchHost=jst005
   NumNodes=3 NumCPUs=216 NumTasks=216 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=216,mem=1512000M,node=3,billing=216
   Socks/Node=* NtasksPerN:B:S:C=72:0:*:* CoreSpec=*
   JOB_GRES=(null)
   Nodes=jst[005,009,208] CPU_IDs=0-71 Mem=504000 GRES=
   MinCPUsNode=72 MinMemoryCPU=7000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/user/PD_tg_Delta.run 115
   WorkDir=/scratch/user
   StdErr=/scratch/user/out.out
   StdIn=/dev/null
   StdOut=/scratch/user/out.out
   Power=
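
Besides inspecting jobs, scontrol can also act on them; for example, to hold a pending job (prevent it from starting) and release it again:

$ scontrol hold 87439
$ scontrol release 87439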

Sjob#

The Sjob command is particularly useful for finding out information about jobs that have recently finished.

$ Sjob 2649827

JobID           JobName    Cluster    Account  Partition  Timelimit      User     Group
------------ ---------- ---------- ---------- ---------- ---------- --------- ---------
2649827        VTspU4_1        jed     scitas   standard   00:30:00      user epfl-unit
2649827.bat+      batch        jed     scitas
2649827.ext+     extern        jed     scitas
2649827.0    hydra_pmi+        jed     scitas

             Submit            Eligible               Start                 End
------------------- ------------------- ------------------- -------------------
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 

Elapsed    ExitCode      State
---------- -------- ----------
  00:21:47      0:0  COMPLETED
  00:21:47      0:0  COMPLETED
  00:21:47      0:0  COMPLETED
  00:21:47      0:0  COMPLETED

NCPUS        NTasks        NodeList    UserCPU  SystemCPU     AveCPU  MaxVMSize
---------- -------- --------------- ---------- ---------- ---------- ----------
       216          jst[002,010-01+ 1-20:28:46 1-05:04:24                       
        72        1          jst002  00:00.060  00:00.044   00:00:00      6600K 
       216        3 jst[002,010-01+  00:00.001   00:00:00   00:00:00          0 
       216      216 jst[002,010-01+ 1-20:28:46 1-05:04:24 1-00:30:53 948421532K 
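
If you want to choose the fields yourself, the standard Slurm sacct command provides similar accounting information; for example:

$ sacct -j 2649827 --format=JobID,JobName,Elapsed,ExitCode,State,MaxRSS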

Examples of submission scripts#

There are a number of examples available in a git repository. To download these, run the following command from one of the clusters:

git clone https://c4science.ch/source/scitas-examples.git

Enter the directory scitas-examples and choose the example to run by navigating the folders.

We have three categories of examples:

  1. Basic (examples to get you started)
  2. Advanced (including hybrid jobs and job arrays)
  3. Modules (specific examples of installed software).

To run an example, in this case the hybrid HPL benchmark, do:

$ sbatch --qos=debug hpl-hybrid.run

Or, if you do not wish to run on the debug QOS:

$ sbatch hpl-hybrid.run

Running MPI jobs#

On the SCITAS clusters we fully support the Intel and GCC compiler/MPI combinations.

As of February 2023 these are GCC/OpenMPI and Intel/Intel oneAPI MPI. For a precise list of compilers and MPI implementations, check the Software Stack page.

If we have an MPI code we need some way of correctly launching it across multiple nodes. To do this we use the srun command, which is a Slurm built-in job launcher:

$ srun mycode.x

Please note that we do not provide the usual mpirun or mpiexec commands.

To specify how many tasks and the number of nodes, we add the relevant #SBATCH directives to the job script.

For example, to launch our code on 4 nodes with 72 tasks per node, we specify:

#!/bin/bash
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 72
#SBATCH --cpus-per-task 1
#SBATCH --time 1:0:0

module purge
module load <mycompiler>
module load <mympi>

srun /home/bob/code/mycode.x

There is no need to specify the number of tasks when you call srun as it inherits the value from the allocation. In this example, <mycompiler> and <mympi> should be replaced by your choice of compiler and MPI implementation.
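
For instance, with the GCC/OpenMPI stack mentioned above, the module lines might look like the following (the exact module names and versions depend on the cluster's software stack; check module avail or the Software Stack page):

module purge
module load gcc openmpi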

Running OpenMP jobs#

When running an OpenMP code, it is important to set the number of OpenMP threads per process via the variable OMP_NUM_THREADS. If this is not specified, the default value is system dependent.

We can integrate this with Slurm as shown in the following example (1 task, 4 threads per task):

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --time 1:0:0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x

This takes the environment variable set by Slurm and assigns its value to OMP_NUM_THREADS.

In this example, the srun in front of the command is not strictly required since there is only a single task.

Running a hybrid MPI/OpenMP code#

You can also mix the two previous cases if your code supports both shared and distributed memory:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 2
#SBATCH --cpus-per-task 36
#SBATCH --time 1:0:0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x

This would be an example to run 2 MPI tasks with 36 OpenMP threads each on 1 node.

If you run such hybrid jobs, we advise you to read the page on CPU affinity.
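
As a minimal sketch of explicit binding (the best policy depends on your code and the node topology; see the CPU affinity page), the last two lines of the script above could become:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close
srun --cpu-bind=cores mycode.x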

The Debug QOS#

All the clusters have a special QOS for running short jobs. This high priority QOS is provided so you can debug jobs or quickly test input files.

To use this QOS you can either add the #SBATCH -q debug directive to your job script or specify it on the command line:

$ sbatch -q debug myjob.run

Debug Nodes Usage

Please note that the debug QOS is not meant for production runs!

Any such use will result in access to the clusters being revoked.

For more information on the specific limitations of the debug QOS, please read the Debug QOS page.

Interactive Jobs#

There are two main methods for getting interactive (rather than batch) access to the machines. They have different use cases and advantages.

Sinteract#

The Sinteract command allows you to log onto a compute node and run applications directly on it.

This can be especially useful for graphical applications such as MATLAB and COMSOL:

[user@jed ~]$ Sinteract 
Cores:            1
Tasks:            1
Time:             00:30:00
Memory:           4G
Partition:        standard
Account:          scitas-ge
Jobname:          interact
Resource:
QOS:              serial
Reservation:
Constraints:      

salloc: Pending job allocation 2224524
salloc: job 2224524 queued and waiting for resources

Izar cluster

On the Izar cluster, the -g option is necessary to request the desired number of GPUs. For example:

$ Sinteract -g gpu:1

You can find more information on Sinteract here or by running Sinteract -h on our clusters.

salloc#

The salloc command creates an allocation on the system that you can then access via srun.

It allows you to run MPI jobs in an interactive manner and is very useful for debugging:

[username@frontend ]$ salloc -q debug -N 2 -n 2 --mem 2048 
salloc: Granted job allocation 579440
salloc: Waiting for resource configuration
salloc: Nodes jst[017,018] are ready for job

[username@frontend ]$ hostname
frontend

[username@frontend ]$ srun hostname
jst017
jst018

[username@frontend ]$ exit
salloc: Relinquishing job allocation 579440
salloc: Job allocation 579440 has been revoked.

Interactive shell

To gain interactive access to a node, we suggest using Sinteract.

If you wish to achieve a similar result with salloc, you can type the following once your job allocation has been granted:

$ srun --pty bash

or, if you need a graphical display (see the following section for other prerequisites):

$ srun --x11 --pty bash

Use of graphical applications (X11) on the clusters#

To be able to use graphical applications, there are two requirements:

Connection to the cluster

You must connect from your machine to the login node with the -X option. (The use of -Y is unnecessary and highly discouraged as it is a security risk.)

$ ssh -X <username>@<cluster>.epfl.ch

Connection within the cluster (login node to compute nodes)

We've enabled host-based authentication on our clusters. You should be able to log in to the nodes without having to type your password.

However, you can only connect to a specific compute node if you have a job running on it. If you try to connect to a node where you have no running jobs, you will see a message like:

$ ssh jst020
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.91.44.20 port 22
