Kuma Beta Opening
After a successful restricted beta with more than 80'000 jobs submitted, we are pleased to announce that Kuma, the new GPU-based cluster, is available for testing starting now!
This marks an important milestone as we transition from the Izar cluster, which will soon be reassigned to educational purposes, to the much more powerful Kuma cluster.
You can now connect to the login node at `kuma.hpc.epfl.ch` to begin testing your codes.
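For instance, assuming standard SSH access with your EPFL account (`<username>` is a placeholder):

```bash
# Connect to the Kuma login node (replace <username> with your EPFL account name)
ssh <username>@kuma.hpc.epfl.ch
```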
Beta Test Rules
- Duration: The beta testing period, open to all pay-per-use accounts, is planned to end on September 30th.
- Free Jobs: All jobs during the beta phase are free and no invoices will be issued. However, you can track your usage with the `sausage` command, based on the expected U1 prices (see below).
- Production pricing: After the beta phase, the following prices will apply:
  - H100: CHF 0.5174 per GPU per hour
  - L40S: CHF 0.2141 per GPU per hour
  These are projected prices and may be refined before validation.
- All jobs are preemptible: As per our contract, acceptance tests must be conducted during the beta phase. Consequently, we will be using a QoS configuration with preemption attributes. This means that any running job may be stopped and requeued automatically to free resources.
- Real-Time Bug Fixes: We will address bugs in real-time, which may cause service interruptions with minimal notice. Running jobs may be cancelled and unsaved data lost.
- Mandatory Partition Specification: When submitting jobs, it is mandatory to specify the partition (`h100` or `l40s`), as shown in the sketch below.
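As a sketch of a beta-phase submission under these rules (the job name, script, and resource values below are illustrative placeholders, not prescriptions):

```bash
#!/bin/bash
#SBATCH --job-name=kuma-beta-test
#SBATCH --partition=h100        # mandatory: h100 or l40s
#SBATCH --gpus=1                # one GPU is enough for a first test
#SBATCH --time=01:00:00         # override the default wall-time (see below)
#SBATCH --requeue               # beta jobs are preemptible and may be requeued

# my_gpu_code is a placeholder for your own executable
srun ./my_gpu_code
```

At the projected U1 rates, such a job would correspond to roughly 1 GPU x 1 hour x CHF 0.5174, i.e. about CHF 0.52, which you can cross-check with `sausage`; during the beta, no invoice is issued.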
Default Job Properties
- Job Duration Limit: During the beta test period, the maximum job duration will be limited to 24 hours. The allocation policies will be adjusted during this time, so we advise you to regularly review the QoS configuration using the command `sacctmgr show qos`.
- Wall-time: The default wall-time is now set to 5 minutes. Please choose a value that fits your needs with the `--time` option.
- Memory: The default RAM per core is set to 5900 MB, which you may adjust if necessary.
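A minimal sketch of inspecting the QoS limits and overriding the wall-time and memory defaults on the command line (the values are illustrative):

```bash
# Review the current QoS configuration; limits may change during the beta
sacctmgr show qos

# Override the 5-minute default wall-time and the 5900 MB per-core memory default
srun --partition=l40s --gpus=1 --time=02:00:00 --mem-per-cpu=8G nvidia-smi
```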
Technical Specifications of Kuma
- Partition `h100`: 84 nodes, each equipped with 4 NVIDIA H100 SXM5 GPUs, 94 GB GPU RAM (HBM2e) per GPU, memory bandwidth of 2.4 TB/s, interconnected with NVLink at 900 GB/s.
- Partition `l40s`: 20 nodes, each equipped with 8 NVIDIA L40S GPUs, 48 GB GPU RAM (GDDR6) per GPU, memory bandwidth of 864 GB/s.
- Connectivity: Each compute node has two 200 Gb/s InfiniBand HDR connections.
- Storage: `/scratch` storage on a "Full Flash" infrastructure.
- Usage example: If your software depends on NCCL (e.g. PyTorch), use the option `--gpus` or `--gpus-per-node` and avoid `--gpus-per-task`, as sketched below.
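A hedged sketch of a multi-GPU NCCL/PyTorch submission following this recommendation (the script name and resource values are illustrative and should be adapted to your setup):

```bash
#!/bin/bash
#SBATCH --partition=h100
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4       # preferred over --gpus-per-task for NCCL-based codes
#SBATCH --ntasks-per-node=4     # one task (rank) per GPU
#SBATCH --time=04:00:00

# train.py is a placeholder for your own PyTorch/NCCL training script
srun python train.py
```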
Software Stack
There is a main stack including GCC 13.2.0, OpenMPI 5.0.3, OpenBLAS, and CUDA 12.4, as well as minimal support for an NVHPC stack.
The content of the software stack differs slightly between the two partitions. For this reason, we strongly advise you to compile directly on the nodes.
On the front node, the default stack is the one from the `h100` partition; to load the `l40s` stack you can use `slmodules`:
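Since the stacks differ per partition, a minimal sketch is to request an interactive shell on a node of the target partition and build there; the module names are illustrative (run `module avail` on the node), and the exact `slmodules` invocation for switching stacks on the front node is described in the SCITAS documentation:

```bash
# Interactive shell on an l40s node, so the matching stack is picked up
srun --partition=l40s --gpus=1 --time=00:30:00 --pty bash

# Inside the allocation: load the toolchain (illustrative names) and build
module load gcc openmpi cuda
nvcc --version
```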
Container Support
Containers can be used on Kuma with `apptainer`/`singularity`; the `--nv` option should be added when launching `apptainer`/`singularity` to enable NVIDIA GPU support.
For more information about containers on our clusters:
Example:
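As a hedged sketch using a public CUDA base image (the image URI is a placeholder; any GPU-enabled image of yours works the same way):

```bash
# Pull a CUDA-enabled image (placeholder URI)
apptainer pull cuda.sif docker://nvidia/cuda:12.4.1-base-ubuntu22.04

# Run nvidia-smi inside the container with NVIDIA support (--nv)
srun --partition=h100 --gpus=1 --time=00:10:00 \
     apptainer exec --nv cuda.sif nvidia-smi
```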
We look forward to your participation in testing Kuma and appreciate your cooperation in making this a successful launch. If you encounter any issues or have any feedback, feel free to contact us by sending an email to 1234@epfl.ch with “HPC” in the subject line. Happy computing! 😊🚀