Running Jobs#

Success

Before reading this document, you should have logged in to a cluster at least once and possibly tried out some examples.

Note

In the following examples, <username> means your EPFL GASPAR username.

Batch systems#

The key to using the clusters is to keep in mind that all tasks (or jobs) need to be submitted to a batch system called Slurm. With this scheduler, your jobs will be launched according to factors such as priority, node availability, etc.

Except for rare cases, your jobs won't start as soon as you submit them. It is totally normal to wait a few minutes, hours, or even days(!) depending on the current cluster workload. If your jobs do not start right away, leave them in the queue and Slurm will decide when to run them. Do not try to cancel them and re-submit them later.

All SCITAS clusters use Slurm, which is widely used and open source.

Running jobs with Slurm#

The normal way of working is to create a short script that describes what you need to do and submit it to the batch system using the sbatch command.

Here is a minimal example for submitting a job running the serial code moovit:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
#SBATCH --time 1:00:00

$HOME/code/moovit < $HOME/params/moo1

Any line beginning with #SBATCH is a directive to the batch system. Run man sbatch for the full documentation of the sbatch command.

The options in the script above do the following:

  • --nodes 1 specifies the number of nodes to use. For jobs with more than one task it's important to set a value for this parameter. Failing to do so may cause your job to be distributed over multiple nodes, resulting in potentially lower performance.
  • --ntasks 1 specifies the maximum number of tasks (in an MPI sense) to be run per job.
  • --cpus-per-task 1 specifies the number of cores per aforementioned task.
  • --time 1:00:00 specifies the maximum walltime required. Your job will be automatically killed if it exceeds this limit. Note that Slurm accepts several time formats; in this example HH:MM:SS is used (see the examples below).

See the sbatch documentation for more details.
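
As an illustration of the accepted time formats (a real script would contain only one --time directive; the values here are arbitrary):

# 30 minutes
#SBATCH --time 30
# 2 hours (HH:MM:SS)
#SBATCH --time 2:00:00
# 1 day and 12 hours (DD-HH:MM:SS)
#SBATCH --time 1-12:00:00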

Choosing a reasonable time limit

It is in your best interest to set a reasonable yet short value for --time. Slurm will try to optimize resource usage. If, for instance, a 16-node job is scheduled to run in a few hours and Slurm is reserving nodes for it, a small job can still use those nodes if Slurm knows it will end before the big job should start. Asking for 3 days for a job that will finish in 30 minutes will in general lead to longer wait times in the queue.

Choosing a reasonable amount of memory

To optimize your workflow and ensure fairness for all users, it is essential to select an appropriate value for the --mem parameter. Requesting excessive memory, such as 40GB, for a job that only requires 4GB may result in a significant increase in queue time. Moreover, by reserving more resources than you will use, you may prevent other users from running their jobs.
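
As a minimal sketch, a job that needs roughly 4GB could request it explicitly (the value is only an illustration; adapt it to what your application actually uses):

# Total memory for the job, adjust to your needs
#SBATCH --mem 4G

Alternatively, --mem-per-cpu can be used to request memory proportionally to the number of cores; the two options are mutually exclusive.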

This script should be saved to a file, for example moojob.run, and submitted using:

$ sbatch moojob.run

The output will look something like:

$ sbatch moojob.run
Submitted batch job 123456

The number returned is the Job ID and is the key to finding out further information or modifying the job.

Slurm directives can also be given on the command line, overriding what is set in the script itself:

$ sbatch --time=2-00:00:00 moojob.run

would ask for a 2-day time limit, regardless of the 1-hour limit set in the script.

Canceling Jobs#

To cancel a specific job:

$ scancel JOBID

To cancel all your jobs (use with care!):

$ scancel -u $USER

To cancel all your jobs that are not yet running:

$ scancel -u $USER -t PENDING
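
scancel also accepts further filters if you only want to cancel a subset of your jobs; for example, by partition or by job name (<partition> and <jobname> are placeholders):

$ scancel -u $USER -p <partition>
$ scancel -u $USER -n <jobname>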

Getting Job Information#

A number of different tools can be used to query jobs depending on exactly what information is needed.

If the name of a tool begins with a capital S, it is a SCITAS-specific tool. Any tool whose name starts with a lowercase s is part of the base Slurm distribution.

Squeue#

Squeue shows information about all your jobs:

$ Squeue
     JOBID         NAME  ACCOUNT       USER NODE  CPUS  MIN_MEMORY     ST       REASON           START_TIME             NODELIST
    123456         run1   scitas        bob    6    96       32000      R         None  2023-02-03T04:18:37         jst04[32-37]
    123457         run2   scitas        bob    6    16       32000     PD   Dependency                  N/A                     

squeue#

By default, squeue will show you all the jobs from all users. The output can be filtered by passing options to squeue.

To see all the running jobs from the scitas group we run:

$ squeue -t R -A scitas
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            123456  parallel  gromacs      bob  R      48:43      6 jst04[32-37]
            123457  parallel     pw.x      sue  R   18:06:44      8 jst01[03,11,21],jst04[50,61-64]

See man squeue for all the options.
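
A commonly useful variant is asking for the scheduler's current estimate of when your pending jobs will start:

$ squeue -u $USER --start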

For example, the Squeue command described above is actually a script that calls:

$ squeue -u $USER -o "%.10A %.12j %.8a %.10u %.4D %.5C %.11m %.6t %.12r %.20S %.20N" -S S
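
If you prefer a different layout, you can build your own variant along the same lines; for instance, a minimal alias (the name sq and the chosen columns are just a suggestion) showing job ID, state, elapsed time and node list:

$ alias sq='squeue -u $USER -o "%.10A %.6t %.12M %.20N"'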

scontrol#

The scontrol command will show you everything that the system knows about a running or pending job.

$ scontrol -d show job 87439
   JobId=87439 JobName=PDG
   UserId=user(100000) GroupId=epfl-unit(100000) MCS_label=N/A
   Priority=992 Nice=0 Account=scitas QOS=parallel
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:30:10 TimeLimit=08:00:00 TimeMin=N/A
   SubmitTime=2023-01-09T18:18:43 EligibleTime=2023-01-09T18:18:43
   AccrueTime=2023-01-09T18:18:43
   StartTime=2023-01-10T09:14:17 EndTime=2023-01-10T17:14:17 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-01-10T09:14:17 Scheduler=Main
   Partition=standard AllocNode:Sid=jed:424335
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=jst[005,009,208]
   BatchHost=jst005
   NumNodes=3 NumCPUs=216 NumTasks=216 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=216,mem=1512000M,node=3,billing=216
   Socks/Node=* NtasksPerN:B:S:C=72:0:*:* CoreSpec=*
   JOB_GRES=(null)
   Nodes=jst[005,009,208] CPU_IDs=0-71 Mem=504000 GRES=
   MinCPUsNode=72 MinMemoryCPU=7000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/user/PD_tg_Delta.run 115
   WorkDir=/scratch/user
   StdErr=/scratch/user/out.out
   StdIn=/dev/null
   StdOut=/scratch/user/out.out
   Power=
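
Besides inspecting jobs, scontrol can also act on them; for example, to hold a pending job (prevent it from starting) and release it again:

$ scontrol hold 87439
$ scontrol release 87439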

Sjob#

The Sjob command is particularly useful for finding out information about jobs that have recently finished.

$ Sjob 2649827

JobID           JobName    Cluster    Account  Partition  Timelimit      User     Group
------------ ---------- ---------- ---------- ---------- ---------- --------- ---------
2649827        VTspU4_1        jed     scitas   standard   00:30:00      user epfl-unit
2649827.bat+      batch        jed     scitas
2649827.ext+     extern        jed     scitas
2649827.0    hydra_pmi+        jed     scitas

             Submit            Eligible               Start                 End
------------------- ------------------- ------------------- -------------------
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 
2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:32:12 2023-03-10T10:53:59 

Elapsed    ExitCode      State
---------- -------- ----------
  00:21:47      0:0  COMPLETED
  00:21:47      0:0  COMPLETED
  00:21:47      0:0  COMPLETED
  00:21:47      0:0  COMPLETED

NCPUS        NTasks        NodeList    UserCPU  SystemCPU     AveCPU  MaxVMSize
---------- -------- --------------- ---------- ---------- ---------- ----------
       216          jst[002,010-01+ 1-20:28:46 1-05:04:24                       
        72        1          jst002  00:00.060  00:00.044   00:00:00      6600K 
       216        3 jst[002,010-01+  00:00.001   00:00:00   00:00:00          0 
       216      216 jst[002,010-01+ 1-20:28:46 1-05:04:24 1-00:30:53 948421532K 
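
If you want to choose the fields yourself, the standard Slurm sacct command provides similar accounting information; for example:

$ sacct -j 2649827 --format=JobID,JobName,Elapsed,ExitCode,State,MaxRSS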

Examples of submission scripts#

There are a number of examples available in a git repository. To download these, run the following command from one of the clusters:

git clone https://c4science.ch/source/scitas-examples.git

Enter the directory scitas-examples and choose the example to run by navigating the folders.

We have three categories of examples:

  1. Basic (examples to get you started)
  2. Advanced (including hybrid jobs and job arrays)
  3. Modules (specific examples of installed software).

To run an example, in this case the hybrid HPL benchmark, do:

$ sbatch --qos=debug hpl-hybrid.run

Or, if you do not wish to run on the debug QOS:

$ sbatch hpl-hybrid.run

Running MPI jobs#

On the SCITAS clusters we fully support the Intel and GCC compiler/MPI combinations.

As of February 2023 these are GCC/OpenMPI and Intel/Intel oneAPI MPI. For a precise list of compilers and MPI implementations, check the Software Stack page.

If we have an MPI code we need some way of correctly launching it across multiple nodes. To do this we use the srun command, which is a Slurm built-in job launcher:

$ srun mycode.x

Please note that we do not provide the usual mpirun or mpiexec commands.

To specify how many tasks and the number of nodes, we add the relevant #SBATCH directives to the job script.

For example, to launch our code on 4 nodes with 72 tasks per node, we specify:

#!/bin/bash
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 72
#SBATCH --cpus-per-task 1
#SBATCH --time 1:0:0

module purge
module load <mycompiler>
module load <mympi>

srun /home/bob/code/mycode.x

There is no need to specify the number of tasks when you call srun as it inherits the value from the allocation. In this example, <mycompiler> and <mympi> should be replaced by your choice of compiler and MPI implementation.
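
For instance, with the GCC/OpenMPI stack mentioned above, the module lines might look like the following (the exact module names and versions depend on the cluster's software stack; check module avail or the Software Stack page):

module purge
module load gcc openmpi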

Running OpenMP jobs#

When running an OpenMP code, it is important to set the number of OpenMP threads per process via the variable OMP_NUM_THREADS. If this is not specified, the default value is system dependent.

We can integrate this with Slurm as shown in the following example (1 task, 4 threads per task):

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --time 1:0:0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x

This takes the environment variable set by Slurm and assigns its value to OMP_NUM_THREADS.

In this example, the srun in front of the command is not strictly required since there is only a single task.

Running a hybrid MPI/OpenMP code#

You can also mix the two previous cases if your code supports both shared and distributed memory:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks 2
#SBATCH --cpus-per-task 36
#SBATCH --time 1:0:0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun mycode.x

This would be an example to run 2 MPI tasks with 36 OpenMP threads each on 1 node.

If you run such hybrid jobs, we advise you to read the page on CPU affinity.
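
As a minimal sketch of explicit binding (the best policy depends on your code and the node topology; see the CPU affinity page), the last two lines of the script above could become:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=close
srun --cpu-bind=cores mycode.x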

The Debug QOS#

All the clusters have a special QOS for running short jobs. This high priority QOS is provided so you can debug jobs or quickly test input files.

To use this QOS you can either add the #SBATCH -q debug directive to your job script or specify it on the command line:

$ sbatch -q debug myjob.run

Debug Nodes Usage

Please note that the debug QOS is not meant for production runs!

Any such use will result in access to the clusters being revoked.

For more information on the specific limitations of the debug QOS, please read the Debug QOS page.

Interactive Jobs#

There are two main methods for getting interactive (rather than batch) access to the machines. They have different use cases and advantages.

Sinteract#

The Sinteract command allows you to log onto a compute node and run applications directly on it.

This can be especially useful for graphical applications such as MATLAB and COMSOL:

[user@jed ~]$ Sinteract 
Cores:            1
Tasks:            1
Time:             00:30:00
Memory:           4G
Partition:        standard
Account:          scitas-ge
Jobname:          interact
Resource:
QOS:              serial
Reservation:
Constraints:      

salloc: Pending job allocation 2224524
salloc: job 2224524 queued and waiting for resources

Izar cluster

On the Izar cluster, the -g option is necessary to request the desired number of GPUs. For example:

$ Sinteract -g gpu:1

You can find more information on Sinteract here or by running Sinteract -h on our clusters.

salloc#

The salloc command creates an allocation on the system that you can then access via srun.

It allows you to run MPI jobs in an interactive manner and is very useful for debugging:

[username@frontend ]$ salloc -q debug -N 2 -n 2 --mem 2048 
salloc: Granted job allocation 579440
salloc: Waiting for resource configuration
salloc: Nodes jst[017,018] are ready for job

[username@frontend ]$ hostname
frontend

[username@frontend ]$ srun hostname
jst017
jst018

[username@frontend ]$ exit
salloc: Relinquishing job allocation 579440
salloc: Job allocation 579440 has been revoked.

Interactive shell

To gain interactive access to a node, we suggest using Sinteract.

If you wish to achieve a similar result with salloc, you can type the following once your job allocation has been granted:

$ srun --pty bash

or, if you need a graphical display (see the following section for other prerequisites):

$ srun --x11 --pty bash

Use of graphical applications (X11) on the clusters#

To be able to use graphical applications, there are two requirements:

Connection to the cluster

You must connect from your machine to the login node with the -X option. (The use of -Y is unnecessary and highly discouraged as it is a security risk.)

$ ssh -X <username>@<cluster>.epfl.ch

Connection within the cluster (login node to compute nodes)

We've enabled host-based authentication on our clusters. You should be able to log in to the nodes without having to type your password.

However, you can only connect to a specific compute node if you have a job running on it. If you try to connect to a node where you have no running jobs, you will see a message like:

$ ssh jst020
Access denied by pam_slurm_adopt: you have no active jobs on this node
Connection closed by 10.91.44.20 port 22
