Memory allocation#

Introduction#

Modern cluster nodes have many CPU cores and significant amounts of RAM. Calculating the amount of RAM each core can use without posing problems to other cores running similar tasks (let's call it the Maximum Memory per CPU core) is trivial: roughly speaking, divide the RAM of the node by its number of cores.

In some cases, your job may very well need only one core and 200 GB of RAM (a value largely exceeding such a Maximum Memory per CPU core). But most jobs don't need that much RAM, and it would be easy to submit a job that asks for a few cores and a large amount of memory even when you don't need it. In this situation the node would be underused, since the few remaining cores may not have enough free RAM to run any significant job. As such, a small number of jobs asking for a few cores but large amounts of memory could keep most of the resources of a cluster idle. This would affect others, but it would also significantly delay the execution of your own jobs, since a node could not run many such jobs (perhaps not even more than one).

This document will teach you how to effectively set up your Slurm jobs so you can run the jobs you need without wasting resources.

Note

To discourage overallocation of RAM, we activated a Slurm option, MaxMemPerCPU, which adjusts the number of cores of your job based on the RAM you ask for. This may have cost implications. More on this in a later section of this document.

Checking how much RAM my job needs#

The first step to effectively select an amount of RAM is to know how much your job may use.

Some codes have an option like Mem=3000M, which may give you a hint. However, it's important to know whether your code treats this as a hard limit or merely as a suggestion (one that may be impossible to comply with, given constraints of the method or of the problem size).

Even if the job is limited to the memory you put in your input, this amount is typically what is reserved for the calculation itself. The program you run will also need some RAM and the Slurm job itself needs some more memory.

As such, if you know your job needs 3000 MB of RAM, you should ask Slurm for a bit more, for instance by adding #SBATCH --mem=4000M to your script. If, after running your job, you open the slurm-<JOBID>.out file and see a message like:

/var/spool/slurmd/job12345/slurm_script: line 7: 48740 Killed      python my_job.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=12345.batch. Some of your
processes may have been killed by the cgroup out-of-memory handler.

then you should ask for more RAM. Increase it conservatively and try again.
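For reference, a minimal batch script with an explicit memory request could look like the sketch below (the script name, the single-core --ntasks value and the time limit are assumptions; the python command is the one from the example above):

$ cat my_job.sh
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4000M

# The calculation itself needs about 3000 MB; the extra margin covers the
# memory used by the program itself and by the Slurm job.
python my_job.py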

Jobs that have finished#

For jobs that have completed, the Slurm accounting tool sacct is likely the best option. Here's a sample of its standard output:

$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12344                t1   standard  scitas-ge          1  COMPLETED      0:0
12344.batch       batch             scitas-ge          1  COMPLETED      0:0
12344.extern     extern             scitas-ge          1  COMPLETED      0:0
12345                t2   standard  scitas-ge          1 OUT_OF_ME+    0:125
12345.batch       batch             scitas-ge          1 OUT_OF_ME+    0:125
12345.extern     extern             scitas-ge          1  COMPLETED      0:0

In its most basic form it can help you diagnose, just like the message in Slurm's output file, that job 12345 died with an OUT_OF_MEMORY error, whereas your previous job finished with no problems.

A more useful sacct output includes the actual memory usage. The next command shows all your jobs that have started since yesterday (-S now-1day) and that are still running or have ended by now (-E now):

$ sacct -S now-1day -E now --format=JobID,JobName,AllocCPUS,Elapsed,ReqMem,MaxVMSize,State
JobID           JobName  AllocCPUS    Elapsed     ReqMem  MaxVMSize      State
------------ ---------- ---------- ---------- ---------- ---------- ----------
1234567              t1         18   09:21:29    110000M             COMPLETED
1234567.bat+      batch         18   09:21:29             11330344K  COMPLETED
1234567.ext+     extern         18   09:21:29                     0  COMPLETED
1234568              t2         18   15:28:02    110000M               RUNNING
1234568.bat+      batch         18   15:28:02                          RUNNING
1234568.ext+     extern         18   15:28:02                          RUNNING
1234569              t3         18   00:00:01    110000M                FAILED
1234569.bat+      batch         18   00:00:01                 3144K     FAILED
1234569.ext+     extern         18   00:00:01                     0  COMPLETED

In these jobs we asked for 18 cores (AllocCPUS) and 110'000 MB of RAM (ReqMem). For the job that completed, the maximum memory usage was roughly 11'300 MB (MaxVMSize). We asked for substantially more than needed: in future jobs we could adjust the --mem parameter to a much smaller value with essentially no risk.
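For instance, when resubmitting the completed job above, a value comfortably above the observed ~11'300 MB peak would still cut the request by a large factor (the script name is a placeholder):

$ sbatch --ntasks=18 --mem=16000M my_script.sh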

The job that failed in its early stages obviously used very little RAM. The calculation that is still running shows no value in the MaxVMSize column, since the memory usage is only reported at the end of the job.

Ongoing jobs#

For jobs that are still running, the best option is to log in to the node running the job (or one of the nodes, if using the parallel QOS) and check the memory usage of the individual processes.
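If you are not sure which node your job landed on, squeue can tell you; a minimal sketch, replacing <JOBID> with your own job ID (the node name shown is just an example):

$ squeue -j <JOBID> --noheader --format=%N
node123
$ ssh node123

Once on the node, ps shows the memory usage of each of your processes: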

$ ps ux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
my_user  2447781  0.0  0.0  12980  3308 ?        SN   02:26   0:00 /bin/bash /var/spool/slurmd/job1234567/slurm_script
my_user  2447805  0.0  0.0 327216  8552 ?        SNl  02:26   0:00 srun python3 my_mpi_job.py
my_user  2447821 99.6  0.7 8102076 4114324 ?     RNl  02:26 702:01 python3 my_mpi_job.py
my_user  2447822 99.6  0.5 7062716 3080716 ?     RNl  02:26 701:47 python3 my_mpi_job.py
my_user  2447823 99.6  0.5 7063184 3077608 ?     RNl  02:26 702:03 python3 my_mpi_job.py
my_user  2447824 99.6  0.5 7062720 3079704 ?     RNl  02:26 702:02 python3 my_mpi_job.py

In the example above, we can see that the Slurm script launches one python job, which itself uses 4 MPI processes. Each of these processes is using between 0.5% and 0.7% of the node's RAM (%MEM) or, in a different presentation, between 7062716 KB and 8102076 KB (the VSZ column, respectively roughly 7 and 8 GB) per process. For a case like this, asking for --mem=35000 is likely a good option.
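If you prefer not to log in to the node, Slurm's sstat command can report the memory usage of a running job step, provided job accounting gathers these statistics (which the sacct output above suggests it does). A minimal sketch, replacing <JOBID> with your own job ID:

$ sstat -j <JOBID>.batch --format=JobID,MaxRSS,MaxVMSize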

Note

RAM usage may fluctuate at different steps in the job. You need to ask Slurm for a bit more RAM than the peak you expect to use. You may need to carefully check the RAM usage along the job, or at specific sections of it, rather than just at the random moment you happened to log in to the node.

MaxMemPerCPU and reassignment of the number of cores#

In our cluster we recently activated a Slurm option called MaxMemPerCPU, which sets a maximum amount of RAM you can ask for per CPU core. When your request exceeds that amount, Slurm automatically adjusts the number of cores assigned to your job.

Say we have nodes with 5 GB of RAM per core. If you submit a job with sbatch --ntasks=8 --mem=42G, then Slurm will instead assign 9 cores to your job. Slurm will inform you of this with a message like:

$ srun -p debug --mem 42G -n 8 hostname
[...]
srun: info: [MEMORY]     ⚠️  WARNING: The amount of memory you asked for corresponds to 9 cpus.
srun: info: [MEMORY]     ⚠️  WARNING: For this reason, your job will be assigned 9 cpus instead of 8.0.
[...]
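The number of cores corresponds to the requested memory divided by MaxMemPerCPU, rounded up. You can check the arithmetic yourself before submitting; a quick sketch in the shell, assuming the hypothetical 5 GB (5120 MB) per core from the example above:

$ echo $(( (42 * 1024 + 5120 - 1) / 5120 ))
9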

The cost implications this may have are discussed in more detail on the Billing page of our documentation.

My job needs a lot of RAM per core, what can I do?#

If you need a lot of RAM per core, you are likely going to be using Jed. On Jed we have three types of nodes. They all have 72 cores; when it comes to RAM:

  • standard nodes have 512 GB of RAM;
  • bigmem nodes have 1 TB of RAM;
  • hugemem nodes have 2 TB of RAM.

This equates to roughly 7000 MB, 14'000 MB and 28'000 MB of RAM per core, which are the values we defined for MaxMemPerCPU on each of these node types. You may note this is slightly below the actual amount of RAM per core, since we reserve some RAM for the system itself (i.e. the OS, GPFS, Slurm, etc.).
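As a quick sanity check of these values, spreading the 512 GB of a standard node over its 72 cores gives roughly 7'281 MB per core:

$ echo $(( 512 * 1024 / 72 ))
7281

The MaxMemPerCPU of 7000 MB is set slightly below that figure to leave room for the system.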

If your job needs about 10 GB of RAM per core and you want to use 8 cores, then your best option is to use the bigmem nodes. You could do that by submitting your job with:

$ sbatch --mem=85000 --ntasks=8 -p bigmem -q bigmem <your script>

If you opted to use the standard nodes, this job would be assigned 13 cores instead. For many codes this is no issue and the job would simply finish faster. For some others this may be a problem.
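The same request can also be written as #SBATCH directives in the batch script itself; a minimal sketch (the time limit and the program line are placeholders):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --mem=85000
#SBATCH --partition=bigmem
#SBATCH --qos=bigmem
#SBATCH --time=12:00:00

srun python3 my_mpi_job.py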


Last update: July 4, 2023