Memory allocation#
Introduction#
Modern cluster nodes have many CPU cores and significant amounts of RAM. Calculating the amount of RAM each core can use without posing problems to other cores running similar tasks (let's call it the Maximum Memory per CPU core) is straightforward: divide the node's total RAM by its number of cores.
In some cases, your job may very well need only one core and 200 GB of RAM (a value far exceeding this Maximum Memory per CPU core). But most jobs don't need that much RAM, and it would be easy to submit a job that asks for a few cores and a large amount of memory even when you don't need it. In this situation the node would be underused, since the few remaining cores may not have enough free RAM to run any significant job. As such, a small number of jobs asking for a few cores but large amounts of memory could keep most of the resources of a cluster idle. This would affect others, but it would also significantly delay the execution of your own jobs, since a node could not run many (or perhaps not even more than one) such job.
This document will teach you how to effectively set up your Slurm jobs so you can run the jobs you need without wasting resources.
Note
To discourage overallocation of RAM we activated a Slurm option, MaxMemPerCPU, which adjusts the number of cores of your job based on the RAM you ask for. This may have cost implications. More on this in a later section of this document.
Checking how much RAM my job needs#
The first step to effectively select an amount of RAM is to know how much your job may use.
Some codes have an option like Mem=3000M, which may give you a hint. However, it is important to know whether your code treats this value as a hard limit or merely as a suggestion (one that may be impossible to honour, given constraints of the method or of the problem size).
Even if the job is limited to the memory you set in your input, this amount typically covers only the calculation itself: the program you run also needs some RAM of its own, and the Slurm job itself needs some more.
As such, if you know your job needs 3000 MB of RAM, you should ask Slurm for a bit more, for instance by adding #SBATCH --mem=4000M to your script. If, after running your job, you open the slurm-<JOBID>.out file and see a message like:
/var/spool/slurmd/job12345/slurm_script: line 7: 48740 Killed python my_job.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=12345.batch. Some of your
processes may have been killed by the cgroup out-of-memory handler.
then you should ask for more RAM. Increase it conservatively and try again.
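For reference, a minimal job script with an explicit memory request could look like the sketch below; the job name, time limit and my_job.py are placeholders to adapt to your own case:

#!/bin/bash
# placeholder job name; adapt to your own job
#SBATCH --job-name=my_job
#SBATCH --ntasks=1
# adjust the time limit to your expected run time
#SBATCH --time=01:00:00
# a bit more than the ~3000 MB the code itself needs
#SBATCH --mem=4000M

python my_job.py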
Jobs that have finished#
For jobs that have completed the Slurm accounting tool is likely the best option. Here's a sample of its standard output:
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12344 t1 standard scitas-ge 1 COMPLETED 0:0
12344.batch batch scitas-ge 1 COMPLETED 0:0
12344.extern extern scitas-ge 1 COMPLETED 0:0
12345 t2 standard scitas-ge 1 OUT_OF_ME+ 0:125
12345.batch batch scitas-ge 1 OUT_OF_ME+ 0:125
12345.extern extern scitas-ge 1 COMPLETED 0:0
In its most basic form it can help you diagnose, just like the message in Slurm's output file, that job 12345 died with an OUT_OF_MEMORY error, whereas your previous job finished without problems.
A more useful sacct output includes the actual memory usage. The next command shows all your jobs that started since yesterday (-S now-1day) and that are still running or have ended by now (-E now):
$ sacct -S now-1day -E now --format=JobID,JobName,AllocCPUS,Elapsed,ReqMem,MaxVMSize,State
JobID JobName AllocCPUS Elapsed ReqMem MaxVMSize State
------------ ---------- ---------- ---------- ---------- ---------- ----------
1234567 t1 18 09:21:29 110000M COMPLETED
1234567.bat+ batch 18 09:21:29 11330344K COMPLETED
1234567.ext+ extern 18 09:21:29 0 COMPLETED
1234568 t2 18 15:28:02 110000M RUNNING
1234568.bat+ batch 18 15:28:02 RUNNING
1234568.ext+ extern 18 15:28:02 RUNNING
1234569 t3 18 00:00:01 110000M FAILED
1234569.bat+ batch 18 00:00:01 3144K FAILED
1234569.ext+ extern 18 00:00:01 0 COMPLETED
In these jobs we asked for 18 cores (AllocCPUS) and 110'000 MB of RAM (ReqMem). For the job that completed, the maximum RAM usage was roughly 11'300 MB (MaxVMSize). We asked for substantially more than needed: in future jobs we could lower the --mem parameter without any risk.
The job that failed in its early stages obviously used very little RAM. The calculation that is still running shows no value in the MaxVMSize column, since memory usage is only reported at the end of the job.
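With such numbers in hand, the completed job could be resubmitted with a memory request closer to its observed peak while keeping a safety margin. A sketch based on the values above (the time limit and my_job.py are placeholders):

#!/bin/bash
#SBATCH --ntasks=18
# adjust the time limit to your expected run time
#SBATCH --time=10:00:00
# the observed peak was ~11'300 MB, so keep a comfortable margin
#SBATCH --mem=15000M

srun python my_job.py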
Ongoing jobs#
For jobs that are still running, the best option is to log in to the node running the job (or one of the nodes, if using the parallel QOS) and check the memory usage of the individual processes:
$ ps ux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
my_user 2447781 0.0 0.0 12980 3308 ? SN 02:26 0:00 /bin/bash /var/spool/slurmd/job1234567/slurm_script
my_user 2447805 0.0 0.0 327216 8552 ? SNl 02:26 0:00 srun python3 my_mpi_job.py
my_user 2447821 99.6 0.7 8102076 4114324 ? RNl 02:26 702:01 python3 my_mpi_job.py
my_user 2447822 99.6 0.5 7062716 3080716 ? RNl 02:26 701:47 python3 my_mpi_job.py
my_user 2447823 99.6 0.5 7063184 3077608 ? RNl 02:26 702:03 python3 my_mpi_job.py
my_user 2447824 99.6 0.5 7062720 3079704 ? RNl 02:26 702:02 python3 my_mpi_job.py
In the example above, we can see that the Slurm script launches one python job, which itself uses 4 MPI processes. Each of these processes is using between 0.5% and 0.7% of the node's RAM (%MEM) or, in a different presentation, between 7062716 KB and 8102076 KB (the VSZ column), i.e. roughly 7 to 8 GB per process.
For a case like this, asking for --mem=35000 (i.e. 35'000 MB) is likely a good option: four processes at up to roughly 8 GB each add up to about 32 GB, and the rest is margin.
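If you are not sure which node(s) your job is running on, squeue can tell you before you log in. A quick sketch, with 1234567 standing in for your own job ID and the node name being whatever squeue reports:

$ squeue -j 1234567 -o "%.10i %.9P %.8T %.6D %R"
$ ssh <node-reported-by-squeue>
$ ps ux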
Note
RAM usage may fluctuate at different steps of the job. You need to ask Slurm for a bit more RAM than the peak usage you expect. You may need to carefully check the RAM usage throughout the job, or at specific sections of it, rather than just at the random moment you logged in to the node.
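If you want more than a single snapshot, one possibility is to record the memory usage periodically from within the job script itself. A minimal sketch, where the 60-second interval and the mem_usage.log file name are arbitrary choices and python my_job.py stands in for your actual workload:

# record a timestamped snapshot of your processes every 60 seconds
( while true; do
      echo "=== $(date) ===" >> mem_usage.log
      ps -u "$USER" -o pid,rss,vsz,comm >> mem_usage.log
      sleep 60
  done ) &
MONITOR_PID=$!

srun python my_job.py

# stop the monitoring loop once the work is done
kill "$MONITOR_PID"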
MaxMemPerCPU and reassignment of the number of cores#
In our cluster we recently activated a Slurm option called MaxMemPerCPU, which sets the maximum amount of RAM you can request per CPU core. When your request exceeds that amount, Slurm automatically adjusts the number of cores assigned to your job.
Say we have nodes with 5 GB of RAM per core. If you submit a job with sbatch --ntasks=8 --mem=42G, you are asking for more than the 8 × 5 GB = 40 GB that eight cores allow, so Slurm will instead assign 9 cores (42 divided by 5, rounded up) to your job.
Slurm will inform you of this with a message like:
$ srun -p debug --mem 42G -n 8 hostname
[...]
srun: info: [MEMORY] ⚠️ WARNING: The amount of memory you asked for corresponds to 9 cpus.
srun: info: [MEMORY] ⚠️ WARNING: For this reason, your job will be assigned 9 cpus instead of 8.0.
[...]
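If you want to check the MaxMemPerCPU value in force, scontrol can show it. The first command below prints the cluster-wide value; partitions may define their own limits, visible in the second:

$ scontrol show config | grep -i MaxMemPerCPU
$ scontrol show partition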
The cost implications this may have are discussed in more detail in the Billing page of our documentation.
My job needs a lot of RAM per core, what can I do?#
If you need a lot of RAM per core, you are likely going to be using Jed. On Jed we have three types of nodes. They all have 72 cores and when it comes to RAM:
- standard nodes have 512 GB of RAM;
- bigmem nodes have 1 TB of RAM;
- hugemem nodes have 2 TB of RAM.
This equates to roughly 7000 MB, 14'000 MB and 28'000 MB of RAM per core, which are the values we defined for MaxMemPerCPU on each of these node types. You may note these are slightly below the actual amount of RAM per core, since we reserve some RAM for the system itself (e.g. the OS, GPFS, Slurm, etc.).
If your job needs about 10 GB of RAM per core and you want to use 8 cores, then your best option is to use the bigmem nodes, selecting them when you submit the job.
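A sketch of such a submission, assuming the bigmem nodes are selected through a partition named bigmem (check the exact partition name on the cluster you use) and asking for a bit more than the strict 8 × 10 GB:

#!/bin/bash
# assumed partition name for the 1 TB nodes; check the exact name on your cluster
#SBATCH --partition=bigmem
#SBATCH --ntasks=8
# a bit over 11 GB per core, leaving a margin above the ~10 GB needed
#SBATCH --mem=90000M
# adjust the time limit to your expected run time
#SBATCH --time=12:00:00

srun python my_job.py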
If you opted to use the standard nodes, this job would be assigned 13 cores instead. For many codes this is no issue and the job would simply finish faster. For some others this may be a problem.