Kuma

Kuma Cluster

Research Cluster

This cluster is reserved for pay-per-use accounts. Master's students and courses cannot use Kuma. For the educational GPU cluster, see Izar.

Useful info

Connecting to the cluster

To connect to the cluster, run:

ssh <username>@kuma.hpc.epfl.ch

When you connect to this cluster, the host key fingerprint you are shown should match one of the following:

ECDSA
    MD5:94:cb:c8:73:22:30:70:ea:53:36:9e:4b:fd:33:e0:b6
    SHA256:vpM/BzmJapiUU3o6hbm2zlKFN93D8QE3xObVdh8x4hM
ED25519
    MD5:46:48:27:b0:b3:07:a8:68:ca:a5:4c:cf:1a:c2:c6:c4
    SHA256:VU3simBjo2CoUePsABLhZ/HpW+anz231EU3rfurZDFo
RSA
    MD5:14:41:97:e2:16:33:a9:cd:d9:2e:07:37:a6:39:31:ae
    SHA256:u3v9urAmgx03w1xUZR6WOxyXAoDoyTcBbbiYbR4IeMc
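
The comparison against this list can be scripted. The following is a minimal sketch, not an official tool: the helper name is_known_kuma_fingerprint is hypothetical, and only the SHA256 fingerprints listed above are checked.

```shell
# Hypothetical helper: succeeds if $1 is one of the SHA256 fingerprints
# published above for kuma.hpc.epfl.ch.
is_known_kuma_fingerprint() {
    [ -n "$1" ] || return 1
    printf '%s\n' \
        "SHA256:vpM/BzmJapiUU3o6hbm2zlKFN93D8QE3xObVdh8x4hM" \
        "SHA256:VU3simBjo2CoUePsABLhZ/HpW+anz231EU3rfurZDFo" \
        "SHA256:u3v9urAmgx03w1xUZR6WOxyXAoDoyTcBbbiYbR4IeMc" |
        grep -Fxq -- "$1"
}

# Usage (needs network access and OpenSSH's ssh-keyscan and ssh-keygen):
#   ssh-keyscan kuma.hpc.epfl.ch 2>/dev/null | ssh-keygen -lf - |
#   while read -r _bits fp _rest; do
#       if is_known_kuma_fingerprint "$fp"; then echo "OK       $fp"
#       else echo "UNKNOWN  $fp"; fi
#   done
```

If any line is reported as UNKNOWN, do not accept the host key before checking with the cluster administrators.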

QOS

The standard QOS are:

  • normal for jobs using up to 8 nodes, with a time limit of 3 days. This is the default;
  • long for jobs using up to 8 nodes, with a time limit of 7 days;
  • build for compiling your code, with up to 16 cores on 1 node, no GPU, and a time limit of 4 hours;
  • debug for debugging jobs on up to 2 nodes, with a high priority and a time limit of 1 hour.

Choose one with -q <qos> or --qos <qos>.
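
As an illustration of the limits above, here is a small sketch that chooses between normal and long from a job's node count and wall time. The helper pick_qos and the script name job.sh are hypothetical; build and debug are left out because they are selected by purpose rather than by size.

```shell
# Hypothetical helper: print "normal" or "long" depending on the limits
# listed above. Arguments: number of nodes, wall time in whole hours.
pick_qos() {
    nodes=$1
    hours=$2
    if [ "$nodes" -le 8 ] && [ "$hours" -le 72 ]; then
        echo normal        # up to 8 nodes, up to 3 days
    elif [ "$nodes" -le 8 ] && [ "$hours" -le 168 ]; then
        echo long          # up to 8 nodes, up to 7 days
    else
        echo "no standard QOS fits $nodes nodes for $hours h" >&2
        return 1
    fi
}

# Usage on a Kuma frontend (job.sh is a placeholder for your batch script):
#   sbatch --qos "$(pick_qos 4 100)" job.sh
```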

Partitions

Kuma has three kinds of partitions, which differentiate nodes by GPU type and use case:

  • h100, to use the NVIDIA H100 GPU nodes (with FP64 capabilities). You can request up to 16 cores per GPU;
  • l40s, to use the NVIDIA L40S GPU nodes (with FP32 capabilities). You can request up to 8 cores per GPU;
  • mig12gb or mig24gb, to use the MIG instances on the H100 GPU nodes. You can request up to 2 or 5 cores per MIG instance (for mig12gb and mig24gb respectively).

There is no default partition, so you must always specify one with -p <partition> or --partition <partition>.

MIG instances have reduced compute and memory capacity (roughly 10 GB of VRAM). They are ideal for debugging or coding sessions, as well as for any other job that is too small to make effective use of a full GPU.
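
Putting the partition and QOS options together, a batch script for one H100 GPU might look like the sketch below. The file name kuma_job.sh and the nvidia-smi workload are placeholders, and requesting the GPU with --gres=gpu:1 is an assumption about how GPUs are exposed on Kuma; check the cluster's job submission documentation for the exact syntax, especially for MIG partitions.

```shell
# Write a minimal batch script to kuma_job.sh (arbitrary file name).
cat > kuma_job.sh <<'EOF'
#!/bin/bash
# Kuma has no default partition, so one must be given.
#SBATCH --partition=h100
# normal is the default QOS; it is written out here for clarity.
#SBATCH --qos=normal
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Assumption: one GPU is requested through the generic "gpu" resource.
#SBATCH --gres=gpu:1
# The h100 partition allows up to 16 cores per GPU.
#SBATCH --cpus-per-task=16
#SBATCH --time=01:00:00

# Placeholder workload: list the GPU(s) visible to the job.
nvidia-smi
EOF

# Submit from a Kuma frontend:
#   sbatch kuma_job.sh
```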

Automatic assignment of RAM per core

We automatically assign 5900 MB of RAM per CPU core associated with the job. You cannot ask for more RAM than this, even by specifying --mem.
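
Because RAM follows the core count, it can help to work out the numbers before submitting. The sketch below uses the 5900 MB figure stated above; the function names are hypothetical.

```shell
# RAM assigned automatically per CPU core on Kuma, in MB (from the text above).
MB_PER_CORE=5900

# RAM in MB that a job receives for a given number of cores.
mem_for_cores() {
    echo $(( $1 * MB_PER_CORE ))
}

# Smallest core count whose automatic RAM covers a target in MB
# (integer ceiling division).
cores_for_mem() {
    echo $(( ($1 + MB_PER_CORE - 1) / MB_PER_CORE ))
}

mem_for_cores 16      # prints 94400
cores_for_mem 64000   # prints 11
```

For example, a single H100 GPU requested with its maximum of 16 cores comes with 94400 MB of RAM.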

Hardware characteristics

This cluster has the following configuration:

| Type | Count | Model | CPU | Memory | Storage | Naming | GPUs per node | GPU model |
|---|---|---|---|---|---|---|---|---|
| Frontend | 2 | ThinkSystem SR675 V3 Version: 03 | AMD EPYC 9334 @ 2.7 GHz | 384 GB | 6.4 TB (NVMe) | kuma[1-2] | NA | NA |
| Compute node H100 | 84 | ThinkSystem SR675 V3 Version: 03 | AMD EPYC 9334 @ 2.7 GHz | 371 GB | 6.4 TB (NVMe) | kh[001-084] | 4 | NVIDIA H100 94GB |
| Compute node L40S | 20 | ThinkSystem SR675 V3 Version: 03 | AMD EPYC 9334 @ 2.7 GHz | 371 GB | 7.6 TB x3 (NVMe) | kl[001-020] | 8 | NVIDIA L40S 48GB |
| Admin server | 2 | ThinkSystem SR630 V3 Version: 06 | Intel(R) Xeon(R) Silver 4416+ CPU @ 2.00 GHz | 256 GB | 1920 GB (SCSI) | kadmin[1-2] | NA | NA |
| Proxy server | 1 | ThinkSystem SR630 V3 Version: 06 | Intel(R) Xeon(R) Silver 4416+ CPU @ 2.00 GHz | 256 GB | 1920 GB (SCSI) | ksmartproxy1 | NA | NA |
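
The totals implied by this configuration can be checked with a few lines of shell arithmetic. This is only a worked example using the node and GPU counts listed above together with the per-GPU core limits from the Partitions section; the variable names are arbitrary.

```shell
# Node counts and GPUs per node, from the hardware list above.
H100_NODES=84; H100_GPUS_PER_NODE=4
L40S_NODES=20; L40S_GPUS_PER_NODE=8

# Total GPUs of each type in the cluster.
H100_TOTAL=$(( H100_NODES * H100_GPUS_PER_NODE ))
L40S_TOTAL=$(( L40S_NODES * L40S_GPUS_PER_NODE ))

# Cores requestable on one node if every GPU is taken with its
# maximum core count (16 per H100 GPU, 8 per L40S GPU).
H100_CORES_PER_NODE=$(( H100_GPUS_PER_NODE * 16 ))
L40S_CORES_PER_NODE=$(( L40S_GPUS_PER_NODE * 8 ))

echo "H100 GPUs: $H100_TOTAL"                    # H100 GPUs: 336
echo "L40S GPUs: $L40S_TOTAL"                    # L40S GPUs: 160
echo "Cores per H100 node: $H100_CORES_PER_NODE" # Cores per H100 node: 64
echo "Cores per L40S node: $L40S_CORES_PER_NODE" # Cores per L40S node: 64
```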

H100

Kuma H100 GPU node

L40S

Kuma L40S GPU node

Admin servers

Kuma admin servers

Frontend node

Kuma frontend node