SCITAS FAQ#
General FAQ#
Connecting to the Clusters#
Why can't I connect to the clusters from home?#
You can, but doing so requires connecting via the EPFL VPN service. See http://network.epfl.ch/vpn for how to use this service.
Why am I asked for a password while sshing from the front end to a node on Izar?#
Once logged into the Izar cluster, you can `ssh` directly to the node(s) running your job(s). You can avoid being asked for the GASPAR password again by creating a passwordless ssh key.
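A sketch of creating such a key (the key file name here is an arbitrary choice; since your home directory is shared with the compute nodes, authorising the key in your own `authorized_keys` is enough):

```bash
# Create ~/.ssh with safe permissions if it doesn't exist yet
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"

# Generate a key pair with an empty passphrase (-N "")
ssh-keygen -t ed25519 -N "" -f "$HOME/.ssh/id_ed25519_izar"

# Authorise the new public key for logins to the nodes
cat "$HOME/.ssh/id_ed25519_izar.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```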
For the other clusters we now have host-based authentication and you should be able to connect from the front end to the nodes without being asked for a password.
SLURM Batch System Questions#
What's the maximum run time of a job?#
For pay-per-use accounts, the maximum run time for a job is 3 days. If your job requires more time, you can request an extension by contacting us. Please provide a clear explanation of why the additional time is needed and include details about your workflow to help us assess the request.
How do I submit a job that requires a run time of more than three days?#
Labs with signed contracts may request a QOS for special needs. To do so, please send a request to 1234@epfl.ch with the subject line "HPC: request new QOS".
Can I submit array jobs and, if so, how?#
Yes, with the `--array` option to `sbatch`. See http://slurm.schedmd.com/job_array.html for the official documentation and our scitas-examples repository for several examples.
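For instance, a minimal array script might look like this (a sketch; the input file naming is made up for illustration):

```bash
#!/bin/bash
#SBATCH --array=0-9            # ten tasks, with indices 0..9
#SBATCH --time=00:10:00

# Slurm exports each task's own index as SLURM_ARRAY_TASK_ID;
# the default of 0 only matters when running outside Slurm.
echo "Processing input file input_${SLURM_ARRAY_TASK_ID:-0}.dat"
```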
What is the difference between `hpc-lab` and `lab`?#
`hpc-lab` is the name of the group used to manage user access to the cluster (in the groups.epfl.ch sense). `lab` is the name of the Slurm account, automatically populated with users from the `hpc-lab` group. You have to use the account name `lab` in your batch scripts.
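For example, at the top of a submission script:

```bash
#SBATCH --account=lab
```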
Is it safe to share nodes with other users?#
Yes! We use cgroups
to limit the amount of CPU and memory assigned to users.
There is no way for users to adversely affect each other.
I have a pay-per-use account and I have run on the debug QOS. Do I have to pay for debug time?#
No. Debug time is free of charge.
What is a `<job id>`?#
It's the unique numerical identifier of a job and is printed when you submit the job (e.g. `Submitted batch job 1234567`). It can also be seen using `squeue`:
[user@cluster jobs]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1234567 serial my_job.job user R 1:02 1 c03
How do I display the used CPU time for my account since a certain point in time?#
You can use the `sreport` tool. Here is an example query where the used time is reported in core hours. Just replace `2018-01-01T00:00:00` with the start time you wish and `scitas-ge` with your account name.
$ sreport cluster AccountUtilizationByUser -t Hour --parsable2 start=2018-01-01T00:00:00 Accounts=scitas-ge Format=Cluster,Account,Login,Used
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-01-01T00:00:00 - 2018-04-29T23:59:59 (10278000 secs)
Use reported in TRES Hours
--------------------------------------------------------------------------------
Cluster|Account|Login|Used
fidis|scitas-ge|user0|156349
fidis|scitas-ge|user1|7
fidis|scitas-ge|user2|22834
fidis|scitas-ge|user3|0
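Since `--parsable2` produces pipe-separated output, it is easy to post-process. A minimal Python sketch that sums the `Used` column (the sample data is copied from the output above):

```python
# Sum the "Used" column of pipe-separated sreport output.
sample = """Cluster|Account|Login|Used
fidis|scitas-ge|user0|156349
fidis|scitas-ge|user1|7
fidis|scitas-ge|user2|22834
fidis|scitas-ge|user3|0"""

lines = sample.splitlines()
header = lines[0].split("|")
used_idx = header.index("Used")  # locate the column by name

total = sum(int(line.split("|")[used_idx]) for line in lines[1:])
print(total)  # -> 179190 core hours for this sample
```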
How do I specify that my multi-node MPI job should use nodes connected to the same switch?#
You can specify the maximum number of switches to be used as follows (in this case one switch)
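For example, with `sbatch`'s `--switches` option (the optional `@` part sets the maximum time the scheduler may wait for such a placement):

```bash
#SBATCH --switches=1@02:00:00   # at most one switch; wait up to two hours for it
```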
Please note that jobs with such requirements may take much longer to schedule than those that can be spread across the cluster. This option should only be used in very specific cases!
Is any form of simultaneous multithreading (SMT) (such as Intel's 'Hyper-Threading' or 'HT') enabled on the clusters?#
In general SMT can decrease performance if there are any shared resources in the CPU. Since parallel codes typically perform similar operations on all cores, any such shared resources would quickly become a bottleneck. SMT/HT is therefore disabled as a general rule on all SCITAS clusters.
Why does my job fail immediately without leaving any trace (output)?#
This usually happens when one specifies a non-existent working directory (for example by using `--chdir /path/that/does/not/exist`).
Why does my job fail after submission with error "Invalid generic resource (gres) specification"?#
Because on Izar it's necessary to specify the `--gres=gpu:X` flag, where `X` is the number of GPUs per node you require.
How do I set up job notification emails?#
Add both of the following to your submission script to set up the notifications:

1. A valid email address, preferably one provided by EPFL.
2. A type of notification. Valid type values are `NONE`, `BEGIN`, `END`, `FAIL`, `REQUEUE`, `ALL` (equivalent to `BEGIN`, `END`, `FAIL`, `REQUEUE`, and `STAGE_OUT`), `STAGE_OUT` (burst buffer stage out and teardown completed), and `TIME_LIMIT`. Multiple type values may be specified in a comma-separated list.
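For example, using `sbatch`'s `--mail-user` and `--mail-type` options (the address below is a placeholder):

```bash
#SBATCH --mail-user=jane.doe@epfl.ch
#SBATCH --mail-type=END,FAIL
```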
Why does my job fail after requeuing with the error "Requested operation is presently disabled for job JOBID"?#
The requeueing possibility must be explicitly requested by the user by adding the option `--requeue` to the batch script. This allows the job to be requeued with the `scontrol requeue <JobID>` command and be dispatched again.
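In a batch script the directive looks like this:

```bash
#SBATCH --requeue
```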
I have many jobs waiting on the queue, but I need the last job I submitted to run first. What can I do to adapt the priority of my jobs relative to each other?#
As a regular user you have two options: either hold jobs, or change a parameter called niceness.
The easiest option is to put every other job on hold (e.g. `scontrol hold 12345,12346,12347`) so that only the job you want to run can be scheduled. Note that you will later have to release the jobs (e.g. `scontrol release 12345,12346,12347`), otherwise they will stay on hold indefinitely.
Alternatively you can alter the order of your own jobs by adapting their niceness. When you check the properties of your job the first few lines are something like:
$ scontrol show job 12345
JobId=12345 JobName=test
UserId=user(1000) GroupId=unit(2000) MCS_label=N/A
Priority=7151 Nice=0 Account=lab-account QOS=parallel
JobState=PENDING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
Please note the `Priority` and `Nice` parameters. Priority is a dynamic value which Slurm uses to define the order of jobs in the queues. Higher-priority jobs, as expected, will run earlier, all else being equal. You cannot change the priority of a job directly, since Slurm adjusts it at regular intervals.
You can, however, change the niceness of the job. The Nice value is subtracted from the Priority, so the higher you set Nice, the lower the final Priority of the job. As a regular user you cannot set negative Nice values, so you cannot boost your important job; instead you have to set higher Nice values for your other jobs.
In the example above you see `Priority=7151` and `Nice=0`, which is the default Nice value. If you wanted job 23456 to run first and the priority of that job is currently 3000, then you would need to raise the niceness of each of your higher-priority jobs by at least 4152 (one more than the difference: 7151 - 3000 + 1). We change all the other jobs at once, as well:
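A sketch with `scontrol update` (assuming, as with `scontrol hold` above, that a comma-separated job list is accepted):

```bash
scontrol update jobid=12345,12346,12347 nice=4152
```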
If you now look at the queue:
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
23456 standard test user PD 0:00 2 (Resources)
12345 standard test user PD 0:00 2 (Priority)
12346 standard test user PD 0:00 2 (Priority)
12347 standard test user PD 0:00 2 (Priority)
Your most recent job is now at the top of your queue. If you want, you can later change the nice values of your jobs once more; otherwise the older jobs will remain at a lower priority relative to the newer ones for a while.
Another user has a lot of jobs running or in the queue, is that abuse?#
Submitting many jobs is generally acceptable, as Slurm uses a fair-share system to ensure that a user can't take over the resources just by submitting many jobs. See Slurm Job Priorities for more details on this.
Why is my job still pending while other user job submissions run much sooner?#
This is quite usual and happens due to jobs being scheduled to run according to their computed priority. The priority depends on many factors, including your job's QOS, but also your account's fairshare. For more details on this, please see Slurm Job Priorities.
I have access to SCITAS through multiple labs. How can I make one the default Slurm account?#
The best way to do this is to make use of the `SBATCH_ACCOUNT` environment variable. Add the following line to your `~/.bashrc` file (or the equivalent for your default shell):
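For Bash this would be:

```bash
export SBATCH_ACCOUNT=<main_lab_account>
```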
where `<main_lab_account>` is the account you want to set as the default, and Slurm will take this information into account at job submission.
Option precedence
Please note that environment variables override anything set in the script with an `#SBATCH` option. You will need to do something like `sbatch --account=<my_other_lab> my_script.sh` for the second lab to be billed.
Where can I find the official `sbatch` documentation?#
You can check the official documentation for details.
File System Questions#
Where is my `/scratch` space?#
Your `/scratch` space is located at `/scratch/<username>`. You can also access it using the `$SCRATCH` environment variable.
Can you recover an important file that was on my scratch area?#
NO. `/scratch` is not backed up, so the file is gone forever. Please note that we automatically delete files on scratch to prevent it from filling up!
I've deleted a file on `/home` or `/work` - how can I recover it?#
If it was deleted in the last seven days then you can use the daily snapshots to get it back. These can be found at `/home/.snapshots/<date>/<username>/`, e.g. `/home/.snapshots/2015-11-11/bob/`.
File System Backup
The `/home` file system is backed up onto tape. If the file was deleted more than a week ago, we may be able to help. The `/work` file system is not backed up by default.
How to display quota and usage information for the /home and /work file systems?#
- `/home`: to get user quota and file system usage for your group members, use the following command:
- `/work`: to get group quota and file system usage for your group members, use the following command:
Why do I get "Disk quota exceeded" ?#
You exceeded your quota on `/home` or `/work`. Even when you free up space, the quota is not recomputed instantly. It's usually pretty fast, but depends on the overall file system usage; at the latest, it is recomputed every week on Sunday. In case of problems please contact us.
How can I edit a file in the clusters using an application on my computer?#
If you wish to manipulate files on the remote file system using software installed on your workstation, you can mount the remote file system using `sshfs`. After installing it, type the following in a terminal:
$ sudo mkdir /media/dest
$ sudo sshfs -o allow_other <username>@<cluster>.hpc.epfl.ch:/scratch/<username> /media/dest
`<username>` is your GASPAR account and `<cluster>` is the cluster whose file system you wish to mount.
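When you are done, you can unmount the directory again:

```bash
sudo umount /media/dest
```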
Software Questions#
I want to use Intel software on my own machine/server. How can I do it?#
Intel now provides its oneAPI suite for free. You have access to the compilers, the MPI library and various tools.
Why do I get the error "module: command not found" or "slmodules: command not found"?#
This is probably because you have Tcsh as your login shell and the environment isn't propagated to the compute nodes.
In order to fix the issue, change the first line of your job script to `#!/bin/bash -l` or `#!/bin/tcsh -l`. The `-l` option tells Bash/Tcsh to launch a login shell, which correctly sources the files in `/etc/profile.d/`.
Why do I get the error "Empty or non-existent file" when loading a module?#
When trying to load certain modules you may get a message along the lines of:
$ module load comsol
Lmod has detected the following error: Unable to load module because of error when evaluating modulefile:
/ssoft/spack/jed_stable/share/spack/lmod/jed/linux-rhel9-x86_64/Core/comsol/6.2.lua: Empty or non-existent file
Please check the modulefile and especially if there is a line number specified in the above message
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
comsol/6.2 /ssoft/spack/jed_stable/share/spack/lmod/jed/linux-rhel9-x86_64/Core/comsol/6.2.lua
Access to some of our modules is restricted, typically due to licensing restrictions. If you see a message like this it means you don't have access to the code. For some codes you can ask for access directly from EPFL at this page. If you see the code there, follow the procedure described to have access to the code. Keep in mind that accepting the conditions is just the first step of the process. You will likely be informed once your access has been approved. You won't be able to load the module until then.
Note that if you signed a license agreement for an earlier version, you may need assistance from the team managing licenses to use a newer version. In this case you may see the above error message even if you have signed the agreement a long time ago. Contact Service Desk for the necessary intervention as the access to the licensed software is not managed by SCITAS.
If you don't see the code on that page, the license is managed at SCITAS. Request access to it through 1234@epfl.ch with the subject "HPC: request access to software".
Group updates
Access to our modules is managed via groups. Groups are not updated on ongoing sessions. You will need to open a new session after having been granted access to a program, in order to load the relevant module.
How can I change my default shell?#
Most systems use Bash by default and most of our documentation assumes your default shell is Bash. You can change your default shell on this page.
Which options should I use to link with the Intel MKL?#
Consult the Intel MKL Link Line Advisor. If you use the Intel compilers, you can pass the `-mkl` flag, which will do the hard work for you.
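For example, compiling and linking a C program with the classic Intel compiler (`prog.c` is a placeholder):

```bash
icc -mkl prog.c -o prog
```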
What compilers/MPI combination do you support?#
SCITAS supports Intel compilers with Intel MPI, and GCC compilers with OpenMPI. Other combinations are not officially supported (where provided, they are maintained on a best-effort basis).
Why does my COMSOL job fail to get a license?#
Occasionally your COMSOL jobs might fail with a message such as:
Could not obtain license for COMSOL ...
License error: -5.
No such product exists.
No such feature exists.
Feature: COMSOL
License path: ...
FlexNet Licensing error:-5,414
(Alternatively, if possible for the task you are doing, you can try to use other equivalent software packages like ANSYS.)
GPUs#
Which cluster can I use to run jobs using GPUs?#
At the moment, we have two GPU-accelerated clusters: Izar and Kuma.
How do I submit jobs to the GPU nodes?#
You need to pass the following options:
where `X` is the number of GPUs required per node.
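A minimal sketch requesting two GPUs on one node:

```bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
```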
QOS / Partition#
Kuma#
You will find the information about our QOS and partitions configuration here.
Jed#
You will find the information about our QOS and partitions configuration here.
Izar#
You can find the details on the QOS structure here.
Helvetios#
The Helvetios cluster has been removed from pay-per-use usage and will remain available for educational purposes only. It uses the same software stack and QOS as the Jed cluster.
I need Help !#
Please find here the instructions to contact the SCITAS Support Team.