Running Jobs#
Success
Before reading this document, you must have logged in to a cluster at least once and, ideally, tried out some examples.
Note
In the following examples, <username> refers to your EPFL GASPAR username.
Batch systems#
All tasks (or jobs) need to be submitted to a batch system called Slurm. With this scheduler, your jobs will be launched based on factors such as priority and node availability.
It's normal for jobs not to start immediately after submission. Depending on the current cluster workload, you might wait minutes, hours, or even days. Leave your jobs in the queue, and Slurm will decide when to run them. Do not cancel and re-submit them.
All SCITAS clusters use Slurm, which is widely used and open source.
Running jobs with Slurm#
Create a short script describing your task and submit it using the sbatch command.
Here is a minimal example for submitting a job running the serial code moovit:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --time 1:00:00
$HOME/code/moovit < $HOME/params/moo1
Lines beginning with #SBATCH are directives to the batch system.
For sbatch command documentation, type:
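man sbatch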
Script Options#
- --nodes 1: Number of nodes to use.
- --ntasks 1: Maximum number of tasks (in an MPI sense) per job.
- --cpus-per-task 1: Number of cores per task.
- --time 1:00:00: Maximum walltime required. The job will be killed if it exceeds this limit. Time can be specified in different formats, e.g., HH:MM:SS.
- (Optional) --account=<your_account>: Specify the Slurm account to use.
Choosing a reasonable time limit
Set a reasonable yet short value for --time; Slurm optimizes resource usage.
For example, a small job can use nodes reserved for a larger job if Slurm knows
it will finish before the larger job starts.
Asking for 3 days for a job that finishes in 30 minutes generally leads to longer wait times.
Choosing a reasonable amount of memory
Select an appropriate value for the --mem
parameter. Requesting excessive memory,
such as 40GB for a job that requires only 4GB, may increase queue time and prevent
other users from running their jobs.
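For example, the following directive (a sketch; adjust the value to what your code actually needs) requests 4GB of memory:
#SBATCH --mem 4G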
Save the script to a file, e.g., moojob.run, and run it using:
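sbatch moojob.run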
The output will look like:
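Submitted batch job 1234567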
The number returned is the Job ID, used for further information or modifications.
Slurm directives can also be given on the command line, overriding the script settings:
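sbatch --time 2-00:00:00 moojob.run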
This sets a 2-day time limit, regardless of the time limit set in the script.
Canceling Jobs#
To cancel a specific job:
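scancel <job_id>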
To cancel all your jobs (use with care!):
scancel -u <username>
To cancel all your jobs that are not yet running:
scancel -u <username> -t PENDING
Getting Job Information#
Different tools can be used to query jobs depending on the required information.
Tools with names starting with a capital S are SCITAS-specific,
while those starting with a lowercase s are part of the base Slurm distribution.
Squeue#
Shows information about all your jobs:
$ Squeue
JOBID NAME ACCOUNT USER NODE CPUS MIN_MEMORY ST REASON START_TIME NODELIST
123456 run1 scitas bob 6 96 32000 R None 2023-02-03T04:18:37 jst04[32-37]
123457 run2 scitas bob 6 16 32000 PD Dependency N/A
squeue#
By default, squeue shows all jobs from your group. Modify this by passing options to squeue
or by unsetting the SQUEUE_ACCOUNT environment variable.
To see all running jobs from your group:
$ squeue -t R
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 parallel gromacs bob R 48:43 6 jst04[32-37]
123457 parallel pw.x sue R 18:06:44 8 jst01[03,11,21],jst04[50,61-64]
See man squeue for all the options.
The squeue command described above gives more information about the jobs you are running.
scontrol#
Shows everything the system knows about a running or pending job:
$ scontrol -d show job 87439
JobId=87439 JobName=PDG
UserId=user(100000) GroupId=epfl-unit(100000) MCS_label=N/A
Priority=992 Nice=0 Account=scitas QOS=parallel
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:30:10 TimeLimit=08:00:00 TimeMin=N/A
SubmitTime=2023-01-09T18:18:43 EligibleTime=2023-01-09T18:18:43
AccrueTime=2023-01-09T18:18:43
StartTime=2023-01-10T09:14:17 EndTime=2023-01-10T17:14:17 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-01-10T09:14:17 Scheduler=Main
Partition=standard AllocNode:Sid=jed:424335
ReqNodeList=(null) ExcNodeList=(null)
NodeList=jst[005,009,208]
BatchHost=jst005
NumNodes=3 NumCPUs=216 NumTasks=216 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=216,mem=1512000M,node=3,billing=216
Socks/Node=* NtasksPerN:B:S:C=72:0:*:* CoreSpec=*
JOB_GRES=(null)
Nodes=jst[005,009,208] CPU_IDs=0-71 Mem=504000 GRES=
MinCPUsNode=72 MinMemoryCPU=7000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/user/PD_tg_Delta.run 115
WorkDir=/scratch/user
StdErr=/scratch/user/out.out
StdIn=/dev/null
StdOut=/scratch/user/out.out
Power=
Sjob#
Useful for finding information about recently finished jobs:
$ Sjob 2649827
JobID JobName Cluster Account Partition Timelimit User Group
------------ ---------- ---------- ---------- ---------- ---------- --------- ---------
2649827 VTspU4_1 jed scitas standard 00:30:00 user epfl-unit
2649827.bat+ batch jed scitas
2649827.ext+ extern jed scitas
2649827.0 hydra_pmi+ jed scitas
Submit Eligible Start End
------------------- ------------------- ------------------- -------------------
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59
Elapsed ExitCode State
---------- -------- ----------
00:21:47 0:0 COMPLETED
00:21:47 0:0 COMPLETED
00:21:47 0:0 COMPLETED
00:21:47 0:0 COMPLETED
NCPUS NTasks NodeList UserCPU SystemCPU AveCPU MaxVMSize
---------- -------- --------------- ---------- ---------- ---------- ----------
216 jst[002,010-01+ 1-20:28:46 1-05:04:24
72 1 jst002 00:00.060 00:00.044 00:00:00 6600K
216 3 jst[002,010-01+ 00:00.001 00:00:00 00:00:00 0
216 216 jst[002,010-01+ 1-20:28:46 1-05:04:24 1-00:30:53 948421532K
Examples of submission scripts#
There are a number of examples available in a git repository. To download these,
run the following command from one of the clusters:
Then enter the scitas-examples directory and choose the example to run by navigating the folders.
We have three categories of examples:
- Basic: examples to get you started
- Advanced: including hybrid jobs and job arrays
- Modules: specific examples of installed software
To run an example, such as the hybrid HPL, do:
Or, if you do not wish to run on the debug QOS:
Running MPI jobs#
On the SCITAS clusters, we fully support the combination of Intel and GCC compilers with MPI.
As of February 2023 we support GCC/OpenMPI and Intel/Intel OneAPI MPI. For a precise list of compilers and MPIs, check the Software Stack page.
To correctly launch an MPI code across multiple nodes, use the srun command,
which is the Slurm built-in job launcher, rather than mpirun or mpiexec.
To specify how many tasks and the number of nodes, we add the relevant #SBATCH
directives to the job script.
For example, to launch our code on 4 nodes with 72 tasks per node, we specify:
#!/bin/bash
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 72
#SBATCH --cpus-per-task 1
#SBATCH --time 1:0:0
module purge
module load <mycompiler>
module load <mympi>
srun /home/bob/code/mycode.x
There is no need to specify the number of tasks when you call srun, as it inherits
the value from the allocation. In this example, <mycompiler> and <mympi> should
be replaced by your choice of compiler and MPI implementation.
Running OpenMP jobs#
When running an OpenMP code, it is important to set the number
of OpenMP threads per process via the OMP_NUM_THREADS environment variable. If this is not
specified, the default value is system dependent.
Integrate this with Slurm as shown in the following example (1 task, 4 threads per task):
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --time 1:0:0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x
This takes the environment variable set by Slurm and assigns its value to
OMP_NUM_THREADS.
In this example, the srun in front of the command is not strictly required.
Running a hybrid MPI/OpenMP code#
You can also combine the two previous cases if your code supports both shared and distributed memory:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 2
#SBATCH --cpus-per-task 36
#SBATCH --time 1:0:0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x
If you run such hybrid jobs, we advise you to read the page on CPU affinity.
The Debug QOS#
All clusters have a special QOS for running short jobs. This high-priority QOS is provided so you can debug jobs or quickly test input files.
To use this QOS, add the #SBATCH -q debug directive to your job script or
specify it on the command line:
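sbatch -q debug moojob.run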
Debug Nodes Usage
Please note that the debug QOS is not meant for production runs!
Any such use will result in access to the clusters being revoked.
For more information on the specific limitations of the debug QOS, please read Debug QOS.
Interactive Jobs#
There are two main methods for getting interactive (rather than batch) access to the machines. They have different use cases and advantages.
Sinteract#
The Sinteract command allows you to log onto a compute node and run
applications directly on it.
This can be especially useful for graphical applications such as MATLAB and
COMSOL:
[user@jed ~]$ Sinteract
Cores: 1
Tasks: 1
Time: 00:30:00
Memory: 4G
Partition: standard
Account: scitas-ge
Jobname: interact
Resource:
QOS: serial
Reservation:
Constraints:
salloc: Pending job allocation 2224524
salloc: job 2224524 queued and waiting for resources
GPU clusters
On the Kuma and Izar clusters, the -g option is necessary to request
the desired number of GPUs. For example on Kuma:
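Sinteract -g gpu:1    # request one GPU; the exact argument format may vary, see Sinteract -h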
You can find more information on Sinteract here
or by running Sinteract -h on our clusters.
salloc#
The salloc command creates an allocation on the system that you can then
access via srun. It allows you to run MPI jobs interactively and is
very useful for debugging:
[username@frontend ]$ salloc -q debug -N 2 -n 2 --mem 2048
salloc: Granted job allocation 579440
salloc: Waiting for resource configuration
salloc: Nodes jst[017,018] are ready for job
[username@frontend ]$ hostname
frontend
[username@frontend ]$ srun hostname
jst017
jst018
[username@frontend ]$ exit
salloc: Relinquishing job allocation 579440
salloc: Job allocation 579440 has been revoked.
Interactive shell
To gain interactive access to a node, we suggest using
Sinteract.
If you wish to achieve a similar result with salloc, you can type the following
once your job allocation has been granted:
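srun --ntasks 1 --pty bash    # one common way to start an interactive shell on an allocated node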
Use of graphical applications (X11) on the clusters#
To use graphical applications, there are two requirements:
Connection to the cluster#
You must connect from your machine to the login node with the -X option.
(The use of -Y is unnecessary and highly discouraged, as it is a security
risk.)
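For example (replace <cluster> with the address of the login node you use):
ssh -X <username>@<cluster>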
Connection within the cluster (login node to compute nodes)#
We've enabled host-based authentication on our clusters. You should be able to log in to the nodes without having to type your password. However, you can only connect to a specific compute node if you have a job running on it. If you try to connect to a node where you have no running jobs, you will see a message like:
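Access denied by pam_slurm_adopt: you have no active jobs on this node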