
Announcements

Behind the Scenes of SCITAS Platforms - February 2026 Release

At SCITAS, the DevOps team is responsible for purchasing, installing in the data center, deploying, and operating the computing, visualization, and storage infrastructures that serve more than 1,700 EPFL researchers and students every day.

These platforms support a very wide range of scientific activities: algorithm training and inference, meteorology, computational chemistry, fluid and solid mechanics, bioengineering, nuclear fusion, and many others.

Operating such infrastructures means constantly balancing conflicting constraints:

  • laboratories that require maximum stability,
  • others that need very recent software stacks,
  • security requirements that cannot be compromised,
  • and vendor-imposed hardware and firmware updates tied to support contracts.

At the same time, we are fully aware that any production downtime has a direct and negative impact on research and teaching schedules. In academia, deadlines are increasingly tight, and lost compute time is rarely recoverable.

A multi-year transformation of our practices

Over the past few years, we have engaged in a deep transformation of our operational methods and practices, with clear objectives:

  • reduce time to resolution for critical production incidents,
  • deploy critical changes faster without compromising stability,
  • lower the change failure rate,
  • and reduce maintenance operations requiring full production outages to the strict minimum.

Much of this work is invisible to users by design. However, it is starting to have a very concrete impact: in 2024 and 2025, we reached exceptionally high availability levels, with very few critical incidents and short resolution times.

How we deliver changes

All changes are grouped into formal releases. We plan to deploy 4 to 5 releases per year, and when needed, we also apply hotfixes to resolve critical issues within the same day.

As some of you may have noticed, between December 2025 and today, we successfully performed, for the first time, a complete upgrade of all our storage systems without any production downtime.

February 2026: a first without planned downtime

In February 2026, we will deploy the first release of the year with no planned production outage. Because this is a first, we decided to explicitly communicate these changes. In the long term, our goal is to make this blog the primary channel for sharing such information.

Deployment schedule

  • JED: 2026-02-17 to 2026-02-19
  • KUMA: 2026-02-24 to 2026-02-26

What’s new for users in this release

Among other improvements and fixes, this release introduces changes that enable a new GPU partition based on MIG (Multi-Instance GPU). This will allow us to offer low-cost virtual GPU instances, particularly well suited for lightweight workloads, testing, prototyping, and teaching, while making more efficient use of our hardware resources.
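As a rough sketch of what requesting such an instance could look like once the partition is live (the partition name, resource request, and script name below are hypothetical; the actual values will be documented separately):

```shell
# Hypothetical example: request one low-cost MIG GPU slice for a short
# test job. "mig" as the partition name and the GRES request are
# placeholders; check the SCITAS documentation for the real values.
sbatch --partition=mig \
       --gres=gpu:1 \
       --time=00:30:00 \
       --wrap="python my_prototype.py"
```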

The full changelog for this release is available below.

We hope this gives you a better view of the work happening behind the scenes and why we believe these changes matter for your daily research and teaching activities.

Changelog

v4.0.0 (2026-02-05)

Highlights

  • kuma/slurm : enable gpu profiling epi/prolog (!765) (7d69008)

  • lua/maxcpupergpu : add logic to limit cores per gpu (!777) (fbeffee)

  • These changes enable a new GPU partition based on MIG (Multi-Instance GPU), offering low-cost virtual GPU instances suitable for lightweight workloads, testing, and teaching.

Bug fixes

  • vault : use certbot cert for scitasmgmt (!886) (ad916ba)

  • sinfo2influx : Ignore unknown extra lines in the reason field (!872) (46d8d87)

  • hpc_jobs : fix merge strategy for cron::job::multiple (!865) (97bc05a)

  • packages/compute : add ninja-build, a build system (!793) (4d285a3)

  • fqdn : Fix monitoring01 host config definition (!862) (1f6367c)

  • docker : compose_config error if undef (!853) (def1460)

  • docker : fix docker compose service gpfs interaction (!852) (f3e9d44)

  • merging sshd class with development (a7d2037)

  • dns : ood1 is now on jed by default (!844) (d8822b5)

  • dns : change openondemand on ondemand (!839) (8efc51c)

  • docker : compose: wrong type for compose_env_file (!834) (d03c1e7)

  • network : dhpc typo to dhcp (!837) (e67dc72)

  • gpfs : Enable the deployment of GPFS on new clusters (!836) (eed1fd2)

  • letsencrypt : Enable renew with Cron (!835) (1a76059)

  • hostkeys : Give the node's ssh host keys a name (!816) (64d05fc)

  • dns : ood2 typo again (f2aa869)

  • dns : ood2 reverse typo (!821) (939cb12)

  • docker-compose : Fix typos in Puppet recipe (!817) (1918a3a)

  • slurm/kuma : Add MaxMemPerCPU to limit RAM usage (!771) (72442ef)

  • telegraf : do not check nfs4 mount (!805) (4722baf)

  • slurm : do not reload slurm but use sigusr2 when logrotate runs (!755) (5d458f0)

  • quota : set fallback defaults to local home quotas (!794) (2c55b04)

  • nfs : increase smartproxy nfs share threads (!795) (dcadde5)

  • lmod : suppress error when there is no stack (!801) (8b2ee92)

  • hiera : nhc lookup regex (!800) (d1d36dd)

  • slurm/prolog : Stop the GPU monitoring services before reloading the nvidia module (!788) (f6ba8f9)

  • kuma/nhc : maintenance: kuma admin node ib port (!723) (4eee693)

  • nhc : only mark compute node as offline if needed (!721) (2c24245)

  • selinux deactivated without reboot (!779) (3309fc9)

  • allow apt as package provider (!778) (e8c0954)

  • slurmdbd : correct charset (!757) (8cbb085)

  • slurmctld : replace deprecated ensure_packages with stdlib::ensure_packages (!758) (8e828c3)

  • vars : resolve unknown variable warnings (!759) (3746c08)

  • url : fix gitlab url (!761) (a7d70cf)

  • insights : Remove insights-client from the motd (!742) (4592bc9)

  • checkzone : Avoid errors with squash commits (!753) (63a183b)

  • slurmctld : HOTFIX: Return to service set to 1 (!880) (57c5060)

  • HOTFIX: add ood ips from kuma network (!846) (2008dd0)

  • sausage : HOTFIX: bump to 1.2.2.1 (!841) (8b847c7)

  • jed,kuma : HOTFIX: fix regex issue on nodes with nhc.conf (!831) (80fdbda)

  • slurm/prolog : HOTFIX: Stop the GPU monitoring services before reloading the nvidia module (!788) (69db6a3)

  • slurm/kuma : HOTFIX: Add MaxMemPerCPU to limit RAM usage (!771) (5868090)

  • hosts : HOTFIX: scoldap IP has change (!791) (781c14b)

  • slurm : HOTFIX: epilog/prolog profiling gpu: fix reloading all missing modules (!767) (7021134)

  • slurm : epilog: fix typo in gpu profiling (!766) (2124000)

  • packages : HOTFIX: Add libglvnd-opengl package to resolve libOpenGL.so.0 dependency issue (!764) (11a40db)

  • packages : HOTFIX: Add numactl-devel into compute group (!747) (687e954)

  • packages : HOTFIX: Add numactl-devel and libglvnd-opengl packages as requested by Daniel (!746) (a6dc0b4)

Chores

  • dns : add testnfs-deb13 future VM IP in DNS (!885) (50ed8d2)

  • prometheus : use last version of community module (!833) (b9a6b56)

  • MR : Add info section to template for openproject OP code and AI assistance (!876) (6ca759e)

  • dns : Add mtc004 node (!867) (37ce9f4)

  • staging dns use cluster addr (!871) (4332bd5)

  • dns : Add Alias for zaphod ui container (!870) (999666b)

  • dns : Make s3.hpc.epfl.ch as alias for scitas-object (!869) (8b4be9d)

  • package : add tmux and numactl libs (!722) (d893558)

  • dns : add taloa01.staging (!859) (d6a4495)

  • dns : Re-add kuma2 behind kuma.hpc.epfl.ch (!849) (357fb9c)

  • dns : Re-add jed2 behind jed.hpc.epfl.ch (!845) (5ab7c78)

  • dns : Remove jed2 and kuma2 from their respective round-robin entries (!842) (c72ebf9)

  • dns : add staging vms (!838) (0b7bc93)

  • dns : Re-add jed1 and kuma1 to their respective round-robin entries (!829) (08fe809)

  • dns : Remove jed1 and kuma1 from their respective round-robin entries (!826) (07c8f9b)

  • dns : Add monitoring02 (!819) (37b361f)

  • dns : add chat.hpc.epfl.ch (d7ee3b8)

  • slurm : bump slurmrestd plugin and data_parser for 24.11+ version (!803) (b7c017e)

  • dns : Add cname for openondemand on ood2 (!807) (9fdd9de)

  • clean 2025 maintenance (!745) (c6c661f)

  • packages : add torrent client (INC0718696) (!695) (6e2c063)

  • dns : Add cname for openondemand on ood2 (c08b90f)

  • account : cs471 QOS TASK0256401 (!799) (a88ab90)

  • dns : add manticore nodes to cluster zone (!790) (19263b0)

  • docker : Add new version for Mattermost docker image: 10.11 (!787) (1db4636)

  • dns : Add whispers to containers01 (!784) (0ece13d)

  • dns : Add ood2 (!772) (93ba424)

  • dns : Add scitas-object to hpc.epfl.ch domain and permit usage from vm network (!751) (d02df61)

  • dns : Re-add kuma2 to kuma's rr-dns (fa3a9da)

  • dns : add manticore dns addresses to bind (!743) (6aebeb4)

  • dns : Add admin.zaphod to bind (!740) (ba1f6f5)

Features

  • test infrastructure for cfgmgmt (!823) (e6b2150)

  • slurm : allow override of default configuration (!857) (265bb68)

  • fqdn : add crontab for scratch_dir_cleanup script to admin nodes (!866) (a9be83e)

  • cron : add scratch cleanup in hpc_jobs crontab and fix kadmin crontab management (!863) (36ac674)

  • lua/maxcpupergpu : add logic to limit cores per gpu (!777) (fbeffee)

  • Add monitoring01 config (!818) (74f2280)

  • Add Debian compatibility for profile::base (!858) (f4d408d)

  • Open OnDemand deployement (!806) (cf3ed0c)

  • docker : Add more parameters to docker-compose files and fix Debian package removal bugs (!851) (9feacc4)

  • object : letsencrypt certificates (!843) (695a538)

  • compute : add cgroup and nvidia exporter (!815) (be12822)

  • sausage : HOTFIX: bump to 2.0.0 (!776) (eeec605)

  • docker/runner : prune docker system instead of image (!812) (1d6cee4)

  • common : prometheus node exporter everywhere (!814) (a650d2d)

  • docker : misc improvements (!830) (4bd6c1f)

  • dns : add hal test vm address (!832) (db9d16c)

  • dns : add hal to hpc.epfl.ch (!825) (5d814bb)

  • scitas_puppet_docker : Add Docker Compose deployement (!810) (3765c13)

  • ldap : use ldap.epfl.ch instead of scolap.epfl.ch (!811) (7b280c0)

  • izar : add bio-468 qos for a course TASK0257143 (!809) (3d46929)

  • izar : add cs-500 qos TASK0257105 (!808) (0d3c596)

  • clustername : modify clustername definition for vms hostgroup (!802) (25498c3)

  • letsencrypt (!796) (0ad40e1)

  • kuma/admin : add a cron to fetch the scitas-hpcusers (!769) (7a2cec6)

  • common : parametrize 'git' attribute from user resource (!770) (c6d91fb)

  • ssh : Adjust the limits on the number of unauthenticated connections (!749) (762744d)

  • add quota for local home (!774) (27fd3a9)

  • security : add fail2ban and rkhunter classes (!773) (f443f2e)

  • add ganeti basic host installation (!780) (f4f2ab7)

  • cloud : move cloud clusters to RHEL9.4 (!756) (6d50c98)

  • slurm : make sackd service state configurable via sackd_ensure parameter (!760) (6857294)

  • checkzone : Prevent zone updates without a serial change (!752) (15c376f)

  • gpfs : add a flag for gpfs to not automount /archive on compute nodes (!735) (16adf00)

  • nhc : Add nhc check for /export FS presence on all nodes of kuma jed and izar (!732) (a96300a)

  • HOTFIX: add new partition academic for master and courses (!854) (1b87084)

  • HOTFIX: SSH host based authentication for custom hosts (!840) (87e921c)

  • sausage : HOTFIX: bump to 2.0.0 (!776) (dcec57a)

  • gpfs : HOTFIX: Update gpfs to version 5.2.3-4 on jed and kuma (!827) (3b719a4)

  • kuma/slurm : enable gpu profiling epi/prolog (!765) (7d69008)

Refactoring

  • profiling_gpu : Move common code to a single file (!789) (306d5c5)

  • profiling_gpu : HOTFIX: Move common code to a single file (!789) (128fa9c)

Unknown

  • Revert "feat(cron): add scratch cleanup in hpc_jobs crontab and fix kadmin... (!864) (b8fbc1c)

  • Revert "feat(sausage): HOTFIX: bump to 2.0.0 (!776)"

This reverts commit eeec60522e005e53da4574a3bae91fba3ae2ae94. (6daab18)

  • Revert "chore(dns): Add cname for openondemand on ood2"

This reverts commit c08b90fb29c70d8fbeecf40d71552e536038d27a. (f587ff9)

  • Revert "feat(sausage): HOTFIX: bump to 2.0.0 (!776)"

This reverts commit dcec57a6ecc0c170d890cd1502d493626e83c4d7. (f673f79)

Helvetios: Helvetios down - run your jobs on Jed

Helvetios is currently fully unavailable due to a major network failure. An on-site intervention is required to restore a minimal service.

If you have already copied your data to the central storage, you can continue running your jobs on the Jed cluster (jed.hpc.epfl.ch) using the new academic partition.

How to proceed

Submit your jobs on Jed using the academic partition, either:

  • On the command line:
--partition=academic
  • In your submission script:
#SBATCH --partition=academic
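Putting the two options together, a minimal job script for Jed might look like this (the resource requests and program name are illustrative; adjust them to your workload):

```shell
#!/bin/bash
# Minimal example job script for the academic partition on Jed.
#SBATCH --partition=academic
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

srun ./my_program
```

Submit it with `sbatch job.sh` from a Jed login node.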

Important note about the software stack

If you are using our software stack, please recompile your code on Jed, as the software stack available there is more recent and module versions may differ from Helvetios.
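A typical recompilation session on Jed might look as follows (the module names are examples only; run `module avail` to see what is actually installed):

```shell
# On jed.hpc.epfl.ch: start from a clean module environment, then load
# the toolchain and rebuild. Module names/versions are illustrative and
# may differ from those you used on Helvetios.
module purge
module load gcc openmpi
make clean && make
```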

We will let you know as soon as Helvetios is restored to a minimal operational state so that you can copy your data off the cluster.

Thank you for your understanding and cooperation.

Helvetios: Severe cluster issues

Our aging Helvetios academic cluster regularly experiences severe storage and network issues as the hardware starts failing.

We already had to isolate Helvetios from our central storage system earlier this year to mitigate the impact of these issues on the other clusters.

All data stored on Helvetios is at risk of being lost in the event of a fatal failure!

Please ensure you have a copy of ALL important data on a separate, reliable storage system. Do NOT rely solely on this cluster to store critical files.

We highly recommend copying important data currently stored on Helvetios to our central storage system (accessible from all our other clusters).
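One way to do this is with rsync from a Helvetios front end, pushing the data to the central storage via a Jed login node (the paths below are illustrative; adapt them to your own directories):

```shell
# Copy a project directory from Helvetios local storage to your home
# on the central storage, reachable through jed.hpc.epfl.ch.
# -a preserves permissions and timestamps, -v is verbose, -P resumes
# partial transfers and shows progress.
rsync -avP ~/important_project/ jed.hpc.epfl.ch:~/important_project/
```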

We remind you that Helvetios runs on obsolete hardware that is no longer supported by the vendors. SCITAS can only provide best-effort support for its maintenance. We are working on the next CPU computing solution for students and courses.

Thank you for your understanding and cooperation.

Helvetios: Back to production in degraded state!

The Helvetios cluster is now available again, but in a reduced and isolated configuration. Please read the key changes and required actions below carefully.

Current Status

  • The cluster is back online with only 24 nodes currently provisioned.
  • Helvetios is no longer connected to the central storage, meaning:
      • /home, /scratch, and /work are now local to the cluster
      • The /work filesystem is no longer shared
  • All data previously stored in /scratch has been lost and cannot be recovered.

We plan to gradually increase the number of available nodes as soon as we confirm system stability.

Why These Changes?

  • Helvetios is based on unsupported, obsolete hardware, and SCITAS can only provide best-effort support for its maintenance.
  • Recent network issues on Helvetios have caused disruptions and performance degradation across all production clusters by impacting the central storage.
  • To protect the integrity of the production environment, we had to isolate Helvetios from shared storage.

What You Need to Do

  • Manually copy your SSH keys
  • Data previously stored in /home or /work (when it was part of the central storage) will need to be restored manually.
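For the SSH keys: since /home is now local to the cluster, your authorized keys must be re-installed there. A minimal sketch, assuming you already have a key pair on your workstation (the hostname is shown for illustration):

```shell
# Append your public key to ~/.ssh/authorized_keys on Helvetios.
ssh-copy-id helvetios.hpc.epfl.ch

# Equivalent manual alternative:
cat ~/.ssh/id_ed25519.pub | \
    ssh helvetios.hpc.epfl.ch 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
```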

We understand this situation may cause inconvenience, and we appreciate your patience as we continue to maintain access to this legacy system under challenging conditions.

Happy computing! πŸ˜ŠπŸš€

Kuma Cluster Full Production & Pricing – Nov 1st

We are excited to announce the successful completion of the beta testing phase for the Kuma GPU cluster, and we are preparing to enter full production starting from November 1st, 2024. Your participation in the beta phase has been invaluable, with approximately 450,000 GPU-hours of jobs executed. This extensive testing allowed us to identify and resolve various hardware and software issues, ensuring that Kuma is largely ready for production.

Kuma Beta Opening

After a successful restricted beta with more than 80'000 jobs submitted, we are pleased to announce that Kuma, the new GPU-based cluster, is available for testing starting now! This marks an important milestone as we transition from the Izar cluster, which will soon be reassigned to educational purposes, to the much more powerful Kuma cluster. You can now connect to the login node at kuma.hpc.epfl.ch to begin testing your codes.

New Archiving Service Now Available

We are happy to announce the launch of our new archiving service, designed to provide long-term low-cost storage for your research data.

Accessible from the frontend nodes of our Izar and Jed clusters, this service utilizes a reliable magnetic tape system to ensure your data is preserved for a minimum of 10 years.