Running Jobs#
Success
Before reading this document, you should have logged in to a cluster at least once and, ideally, tried out some examples.
Note
In the following examples, `<username>` means your EPFL GASPAR username.
Batch systems#
The key to using the clusters is to keep in mind that all tasks (or jobs) need to be given to a batch system called Slurm. With this scheduler, your jobs will be launched according to different factors such as priority, availability of the nodes, etc.
Except for rare cases, your jobs won't start as soon as you submit them. It is totally normal to wait a few minutes, hours, or even days(!) depending on the current cluster workload. If your jobs do not start right away, leave them in the queue and Slurm will decide when to run them. Do not try to cancel them and re-submit them later.
All SCITAS clusters use Slurm, which is widely used and open source.
Running jobs with Slurm#
The normal way of working is to create a short script that describes what you need to do and submit it to the batch system using the `sbatch` command.
Here is a minimal example for submitting a job running the serial code `moovit`:
```bash
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --time 1:00:00

$HOME/code/moovit < $HOME/params/moo1
```
Any line beginning with `#SBATCH` is a directive to the batch system. Type the command:

```bash
$ man sbatch
```

for the `sbatch` command documentation.
The options in the script above do the following:

- `--nodes 1` specifies the number of nodes to use. For jobs with more than one task it is important to set a value for this parameter; failing to do so may cause your job to be distributed over multiple nodes, with potentially lower performance.
- `--ntasks 1` specifies the maximum number of tasks (in an MPI sense) to be run per job.
- `--cpus-per-task 1` specifies the number of cores per aforementioned task.
- `--time 1:00:00` specifies the maximum walltime required. Your job will be automatically killed if it exceeds this limit. Note that there are different formats to specify the time; in this example it is `HH:MM:SS`.
See the `sbatch` documentation for more details.
Choosing a reasonable time limit
It is in your best interest to set a reasonable yet short value for `--time`.
Slurm will try to optimize resource usage. If, for instance, a 16-node job is
scheduled to run in a few hours and Slurm is reserving nodes for it, a small job
can still use those nodes if Slurm knows it will end before the big job should
start. Asking for 3 days for a job that will finish in 30 minutes will in general
lead to longer wait times in the queue.
Choosing a reasonable amount of memory
To optimize your workflow and ensure fairness for all users, it is essential to select an appropriate value for the `--mem` parameter. Requesting excessive memory, such as 40GB for a job that only requires 4GB, may result in a significant increase in queue time. Moreover, by reserving more resources than you will use, you may prevent other users from running their jobs.
This script should be saved to a file, for example `moojob.run`, and submitted with:
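```bash
$ sbatch moojob.run
```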
The output will look something like:
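```
Submitted batch job 123456
```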
The number returned is the Job ID and is the key to finding out further information or modifying the task.

Slurm directives can also be given on the command line, superseding what you set in the script itself. For example:
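```bash
$ sbatch --time 2-00:00:00 moojob.run  # time given as D-HH:MM:SS
```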
would ask for a 2-day time limit, regardless of the time limit set in the script.
Canceling Jobs#
To cancel a specific job:
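```bash
$ scancel <jobid>
```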
To cancel all your jobs (use with care!), or only those that are not yet running, use `scancel` with your username as sketched below:
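```bash
# cancel all your jobs (use with care!)
$ scancel -u <username>

# cancel only your jobs that are not yet running (still pending)
$ scancel -u <username> -t PENDING
```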
Getting Job Information#

A number of different tools can be used to query jobs, depending on exactly what information is needed.
If the name of a tool begins with a capital S, it is a SCITAS-specific tool. Any tool whose name starts with a lowercase s is part of the base Slurm distribution.
Squeue#
Squeue shows information about all your jobs:
```bash
$ Squeue
JOBID  NAME  ACCOUNT  USER  NODE  CPUS  MIN_MEMORY  ST  REASON      START_TIME           NODELIST
123456 run1  scitas   bob   6     96    32000       R   None        2023-02-03T04:18:37  jst04[32-37]
123457 run2  scitas   bob   6     16    32000       PD  Dependency  N/A
```
squeue#
By default, `squeue` will show you all the jobs from all users. This can be modified by passing options to `squeue`.
To see all the running jobs from the `scitas` group we run:
```bash
$ squeue -t R -A scitas
JOBID  PARTITION  NAME     USER  ST  TIME      NODES  NODELIST(REASON)
123456 parallel   gromacs  bob   R   48:43     6      jst04[32-37]
123457 parallel   pw.x     sue   R   18:06:44  8      jst01[03,11,21],jst04[50,61-64]
```
See `man squeue` for all the options.
For example, the `Squeue` command described above is actually a script that calls:
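The exact options used by the wrapper are not reproduced here; a plausible sketch, assuming standard `squeue` format specifiers, is:

```bash
# show only your own jobs, with columns similar to the Squeue output above
$ squeue -u $USER -o "%.10i %.10j %.10a %.8u %.5D %.5C %.11m %.3t %.10r %.20S %N"
```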
scontrol#
The `scontrol` command will show you everything that the system knows about a running or pending job.
```bash
$ scontrol -d show job 87439
JobId=87439 JobName=PDG
   UserId=user(100000) GroupId=epfl-unit(100000) MCS_label=N/A
   Priority=992 Nice=0 Account=scitas QOS=parallel
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:30:10 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2023-01-09T18:18:43 EligibleTime=2023-01-09T18:18:43
   AccrueTime=2023-01-09T18:18:43
   StartTime=2023-01-10T09:14:17 EndTime=2023-01-10T17:14:17 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-01-10T09:14:17 Scheduler=Main
   Partition=standard AllocNode:Sid=jed:424335
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=jst[005,009,208]
   BatchHost=jst005
   NumNodes=3 NumCPUs=216 NumTasks=216 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=216,mem=1512000M,node=3,billing=216
   Socks/Node=* NtasksPerN:B:S:C=72:0:*:* CoreSpec=*
   JOB_GRES=(null)
   Nodes=jst[005,009,208] CPU_IDs=0-71 Mem=504000 GRES=
   MinCPUsNode=72 MinMemoryCPU=7000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/user/PD_tg_Delta.run 115
   WorkDir=/scratch/user
   StdErr=/scratch/user/out.out
   StdIn=/dev/null
   StdOut=/scratch/user/out.out
   Power=
```
Sjob#
The `Sjob` command is particularly useful to find out information about jobs that have recently finished.
```bash
$ Sjob 2649827
JobID        JobName    Cluster    Account    Partition  Timelimit  User      Group
------------ ---------- ---------- ---------- ---------- ---------- --------- ---------
2649827      VTspU4_1   jed        scitas     standard   00:30:00   user      epfl-unit
2649827.bat+ batch      jed        scitas
2649827.ext+ extern     jed        scitas
2649827.0    hydra_pmi+ jed        scitas

Submit              Eligible            Start               End
------------------- ------------------- ------------------- -------------------
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59

Elapsed    ExitCode State
---------- -------- ----------
00:21:47   0:0      COMPLETED
00:21:47   0:0      COMPLETED
00:21:47   0:0      COMPLETED
00:21:47   0:0      COMPLETED

NCPUS      NTasks   NodeList        UserCPU    SystemCPU  AveCPU     MaxVMSize
---------- -------- --------------- ---------- ---------- ---------- ----------
216                 jst[002,010-01+ 1-20:28:46 1-05:04:24
72         1        jst002          00:00.060  00:00.044  00:00:00   6600K
216        3        jst[002,010-01+ 00:00.001  00:00:00   00:00:00   0
216        216      jst[002,010-01+ 1-20:28:46 1-05:04:24 1-00:30:53 948421532K
```
Examples of submission scripts#
There are a number of examples available in a git repository. To download these, run the following command from one of the clusters:
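```bash
# <repository-url> stands for the address of the scitas-examples
# repository; use the URL given in the SCITAS documentation
$ git clone <repository-url>
```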
Enter the directory `scitas-examples` and choose the example to run by navigating the folders.
We have three categories of examples:
- Basic (examples to get you started)
- Advanced (including hybrid jobs and job arrays)
- Modules (specific examples of installed software).
To run an example, for instance the hybrid HPL, do:
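A sketch, with `<example-script>` standing in for the submission script provided in the example folder:

```bash
$ sbatch -q debug <example-script>
```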
Or, if you do not wish to run on the debug QOS:
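```bash
$ sbatch <example-script>
```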
Running MPI jobs#
On the SCITAS clusters we fully support two compiler/MPI combinations. As of February 2023, these are GCC/OpenMPI and Intel/Intel oneAPI MPI. For a precise list of compilers and MPIs, check the Software Stack page.
If we have an MPI code, we need some way of correctly launching it across multiple nodes. To do this we use the `srun` command, Slurm's built-in job launcher, rather than `mpirun` or `mpiexec`.
To specify how many tasks and the number of nodes, we add the relevant `#SBATCH` directives to the job script.
For example, to launch our code on 4 nodes with 72 tasks per node, we specify:
```bash
#!/bin/bash
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 72
#SBATCH --cpus-per-task 1
#SBATCH --time 1:0:0

module purge
module load <mycompiler>
module load <mympi>

srun /home/bob/code/mycode.x
```
There is no need to specify the number of tasks when you call `srun`, as it inherits the value from the allocation. In this example `<mycompiler>` and `<mympi>` should be replaced by your choice of compiler and MPI implementation.
Running OpenMP jobs#
When running an OpenMP code, it is important to set the number of OpenMP threads per process via the environment variable `OMP_NUM_THREADS`. If this is not specified, the default value is system dependent.
We can integrate this with Slurm as shown in the following example (1 task, 4 threads per task):
```bash
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --time 1:0:0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x
```
Setting `OMP_NUM_THREADS` from `$SLURM_CPUS_PER_TASK` keeps the thread count consistent with the allocation. In this example the `srun` in front of the command is not strictly required.
Running a hybrid MPI/OpenMP code#
You can also mix the two previous cases if your code supports both shared and distributed memory:
```bash
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 2
#SBATCH --cpus-per-task 36
#SBATCH --time 1:0:0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x
```
This example runs 2 MPI tasks with 36 OpenMP threads each on 1 node.
If you run such hybrid jobs we advise you to read the page on CPU affinity.
The Debug QOS#
All the clusters have a special QOS for running short jobs. This high priority QOS is provided so you can debug jobs or quickly test input files.
To use this QOS you can either add the `#SBATCH -q debug` directive to your job script or specify it on the command line:
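```bash
$ sbatch -q debug moojob.run
```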
Debug Nodes Usage
Please note that the debug QOS is not meant for production runs!
Any such use will result in access to the clusters being revoked.
For more information on the specific limitations of the debug QOS, please read Debug QOS.
Interactive Jobs#
There are two main methods for getting interactive (rather than batch) access to the machines. They have different use cases and advantages.
Sinteract#
The `Sinteract` command allows you to log onto a compute node and run applications directly on it.
This can be especially useful for graphical applications such as MATLAB and COMSOL:
```bash
[user@jed ~]$ Sinteract
Cores:       1
Tasks:       1
Time:        00:30:00
Memory:      4G
Partition:   standard
Account:     scitas-ge
Jobname:     interact
Resource:
QOS:         serial
Reservation:
Constraints:
salloc: Pending job allocation 2224524
salloc: job 2224524 queued and waiting for resources
```
Izar cluster
On the Izar cluster, the `-g` option is necessary to request the desired number of GPUs. For example:
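A minimal sketch, assuming the GRES-style `gpu:<count>` syntax (run `Sinteract -h` to confirm the exact form):

```bash
$ Sinteract -g gpu:1
```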
You can find some more information on Sinteract here or by running `Sinteract -h` on our clusters.
salloc#
The `salloc` command creates an allocation on the system that you can then access via `srun`.
It allows you to run MPI jobs in an interactive manner and is very useful for debugging:
```bash
[username@frontend ]$ salloc -q debug -N 2 -n 2 --mem 2048
salloc: Granted job allocation 579440
salloc: Waiting for resource configuration
salloc: Nodes jst[017,018] are ready for job

[username@frontend ]$ hostname
frontend

[username@frontend ]$ srun hostname
jst017
jst018

[username@frontend ]$ exit
salloc: Relinquishing job allocation 579440
salloc: Job allocation 579440 has been revoked.
```
Interactive shell
To gain interactive access on the node, we suggest using Sinteract.
If you wish to achieve a similar result with salloc, you can type the following once your job allocation has been granted:
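```bash
$ srun --pty bash
```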
or, if you need a graphical display (see the following section for other prerequisites):
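```bash
$ srun --pty --x11 bash  # requires an SSH connection with X11 forwarding
```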
Use of graphical applications (X11) on the clusters#
To be able to use graphical applications, there are two requirements:
Connection to the cluster
You must connect from your machine to the login node with the `-X` option.
(The use of `-Y` is unnecessary and highly discouraged as it is a security risk.)
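For example, from your machine (`<cluster>` is a placeholder; use the login node address of your cluster):

```bash
$ ssh -X <username>@<cluster>
```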
Connection within the cluster (login node to compute nodes)
We've enabled host-based authentication on our clusters. You should be able to log in to the nodes without having to type your password.
However, you can only connect to a specific compute node if you have a job running on it. If you try to connect to a node where you have no running jobs, you will see a message like:
```bash
$ ssh jst020
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.91.44.20 port 22
```