
SCITAS blog

Security Update - SCITAS Infrastructure Status

We would like to inform you that a critical Linux kernel vulnerability (CVE-2026-31431) has recently been disclosed. This vulnerability allows any user with access to a system to gain root privileges with minimal effort. Such a flaw carries significant risks, including potential data compromise affecting users and laboratories, identity impersonation, and the installation of malicious software.

At the EPFL level, no unified policy has been enforced, and each unit has been asked to assess and manage the risk independently. Within SCITAS, we have decided to execute an emergency procedure by temporarily restricting access to the infrastructure while preserving running jobs. This approach allows us to conduct a thorough analysis of system logs and maintain full control over the situation.

We are fully aware that this unexpected disruption has a strong impact on your work. However, we have prioritized the security and integrity of both data and infrastructure.

At this stage, we have no indication of any system compromise. A complete verification of the software stack binaries has been carried out, and running jobs continue to execute without issue. In addition, all passwords and selected certificates for critical services have been rotated as a precautionary measure.

We are now progressively restoring access by restarting the frontends and compute nodes. We expect to be able to partially restore the service before noon.

We will continue to keep you informed as the situation evolves.

Progressive Restoration of SCITAS Clusters

We would like to inform you that we have started bringing patched compute nodes back into production. The frontends of both Kuma and Jed are now accessible to all users.

Users with running jobs on compute nodes that have not yet been patched will not be able to connect to these nodes via SSH.

Throughout this intervention, no running jobs were stopped, and we have not identified any evidence of data leakage.

We have made every effort to strike a balance between security and service availability. We sincerely apologize for this unexpected disruption, especially during such an important period for research activities.

Please do not hesitate to report any anomalies or issues you may encounter.

Partial Access Restoration - Data Retrieval on SCITAS Clusters

We would like to inform you that the frontends of the Jed and Kuma clusters have now been reopened in order to allow users to retrieve their data.

At this stage, users with running jobs on a given cluster will not be able to connect to its corresponding frontend. This is because the compute nodes hosting these jobs have not yet been redeployed and therefore remain vulnerable.

As a temporary workaround, users running jobs on Kuma may retrieve their data via the Jed frontend, and vice versa, as long as the data are not located on scratch.

Please note that the host keys of the frontends have been changed. When connecting, you should expect new SSH fingerprints. The expected fingerprints are available on the following pages:

During our investigation, we identified 193 private SSH keys stored in user home directories that could grant access to the clusters. To mitigate any risk of identity impersonation, access using these keys has been disabled. As a consequence, affected users will need to log in using their Gaspar password and then deploy a new, clean public SSH key.
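On the user side, the host-key change and the key replacement described above might look like the following sketch. It assumes a standard OpenSSH client. The username, the key file name, and the choice of an Ed25519 key are illustrative assumptions, not SCITAS requirements. The fingerprint displayed at the first connection should still be compared with the one published by SCITAS before it is accepted.

```shell
# Sketch only: CLUSTER_HOST comes from the posts on this blog; GASPAR_USER
# and NEW_KEY are hypothetical placeholders to adapt.
CLUSTER_HOST="jed.hpc.epfl.ch"
GASPAR_USER="your_gaspar_username"
NEW_KEY="$HOME/.ssh/id_ed25519_scitas"

# Remove the outdated host key from ~/.ssh/known_hosts so that the next
# connection prompts for the new fingerprint instead of failing.
forget_old_host_key() {
    ssh-keygen -R "$CLUSTER_HOST"
}

# Create a fresh key pair; ssh-keygen asks for a passphrase interactively.
create_new_key() {
    ssh-keygen -t ed25519 -f "$NEW_KEY" -C "$GASPAR_USER@scitas"
}

# Install the new public key on the frontend, authenticating with the
# Gaspar password (assumes password login is enabled for your account).
deploy_new_key() {
    ssh-copy-id -i "$NEW_KEY.pub" "$GASPAR_USER@$CLUSTER_HOST"
}
```

Running `forget_old_host_key`, `create_new_key`, and `deploy_new_key` in that order from your own workstation covers both changes. Keeping the new private key off the clusters avoids recreating the situation described above.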

If you have an urgent need to access a specific cluster where you currently have running jobs, you may contact us by opening a support ticket to request the termination of your jobs.

We appreciate your understanding and cooperation as we continue to restore the service safely.

Security Incident - Preventive Measures on SCITAS Clusters

We would like to inform you that, following the disclosure of a critical Linux kernel vulnerability (CVE-2026-31431), we have taken immediate preventive measures across the SCITAS infrastructure.

As of this morning:

  • All login frontends have been temporarily disabled to prevent new interactive access.
  • Running jobs are currently continuing on compute nodes.
  • We are actively analyzing system logs to assess any potential impact.

At this stage, there is no confirmed evidence of data compromise. However, investigations are ongoing.

We are currently awaiting the availability of official security patches from the operating system vendors. As soon as they are released, we will proceed with their deployment, which will require a coordinated restart of the clusters.

Depending on the outcome of our analysis, additional measures may be taken, including partial or full reinstallation of affected systems if deemed necessary.

We understand the impact this situation may have on your work and appreciate your understanding as we prioritize the security and integrity of the infrastructure.

We will keep you informed as the situation evolves.

CARLA 2026 – Call for Papers and Workshops

We are pleased to share the announcement of CARLA 2026, the Latin America High Performance Computing Conference.

Since 2014, CARLA has become one of the leading international events promoting excellence in High-Performance Computing (HPC) and AI at scale, bringing together researchers, technologists, and practitioners from across Latin America and around the world.

The 2026 edition will take place September 21–25, 2026, and will be hosted for the first time in Córdoba, Argentina. The event will be organized locally by the Universidad Nacional de Córdoba (UNC) through its Supercomputing Center (CCAD-UNC).

SCITAS Participation and HPC DevOps School

CARLA 2026 will be supported by SCITAS. Following the success of previous editions of the HPC DevOps School, first held in Chile and then in Jamaica, SCITAS will have the pleasure of organizing and contributing to the third edition of this initiative during the conference.

The Development Operations (DevOps) School for HPC is an integral, standalone educational activity organized as part of the CARLA program. The school is designed to provide participants with both hands-on experience and conceptual understanding in operating and optimizing large-scale computing infrastructures.

Led by internationally recognized instructors, the program offers high-quality training focused on practical aspects of managing, operating, and improving HPC/AI factories and large computing clusters.

Conference Tracks

CARLA 2026 will feature four technical tracks:

  • High-Performance Computing (HPC)
  • Artificial Intelligence at HPC Scale
  • HPC Applications
  • Education for High-Performance Computing (new track)

Accepted papers will be published in the Springer Communications in Computer and Information Science (CCIS) proceedings series, continuing the conference’s tradition of high scientific quality.

Paper Submission

Submissions must be:

  • Original contributions
  • Written in English
  • Formatted according to Springer LNCS guidelines
  • Maximum length: 15 pages

All submissions will undergo single-blind peer review by at least three independent experts.

More information:

Call for Workshops

CARLA 2026 also invites workshop proposals addressing emerging topics in HPC, AI, data science, and related fields.

Workshops may be organized as half-day or full-day events and may include:

  • peer-reviewed paper presentations
  • invited talks
  • panel discussions
  • interactive sessions

Proposal Requirements

Workshop proposals (maximum 1 page, PDF format) should include:

  • workshop title
  • organizers and affiliations
  • relevance to the CARLA community
  • preferred duration (half-day or full-day)
  • submission format
  • planned activities (keynotes, panels, discussions, etc.)
  • tentative schedule

Proposals should be sent to the CARLA 2026 Workshop Chairs:

Important Dates

Main Conference

  • Abstract submission deadline: May 10, 2026 (AOE)
  • Paper submission deadline: May 24, 2026 (AOE)
  • Acceptance notification: July 12, 2026
  • Camera-ready submission: July 19, 2026 (AOE)
  • Conference dates: September 21–25, 2026

Workshops

  • Workshop proposal submission: March 22, 2026
  • Notification of acceptance: March 30, 2026
  • Workshop sessions: September 21–25, 2026

Researchers interested in HPC and large-scale computational science are encouraged to contribute and participate in this international event.

Behind the Scenes of SCITAS Platforms - February 2026 Release

At SCITAS, the DevOps team is responsible for purchasing the computing, visualization, and storage infrastructures, installing them in the data center, and deploying and operating them. These infrastructures serve more than 1,700 EPFL researchers and students every day.

These platforms support a very wide range of scientific activities, including algorithm training and inference, meteorology, computational chemistry, fluid and solid mechanics, bioengineering, nuclear fusion, and many others.

Operating such infrastructures means constantly arbitrating between conflicting constraints:

  • laboratories that require maximum stability,
  • others that need very recent software stacks,
  • security requirements that cannot be compromised,
  • and vendor-imposed hardware and firmware updates tied to support contracts.

At the same time, we are fully aware that any production downtime has a direct and negative impact on research and teaching schedules. In academia, deadlines are increasingly tight, and lost compute time is rarely recoverable.

A multi-year transformation of our practices

Over the past few years, we have engaged in a deep transformation of our operational methods and practices, with clear objectives:

  • reduce time to resolution for critical production incidents,
  • deploy critical changes faster without compromising stability,
  • lower the change failure rate,
  • and keep maintenance operations that require a full production outage to the strict minimum.

Much of this work is invisible to users by design. However, it is starting to have a very concrete impact: in 2024 and 2025, we reached exceptionally high availability levels, with very few critical incidents and short resolution times.

How we deliver changes

All changes are grouped into formal releases. We plan to deploy 4 to 5 releases per year, and when needed, we also apply hotfixes to resolve critical issues within the same day.

As some of you may have noticed, between December 2025 and today, we successfully performed, for the first time, a complete upgrade of all our storage systems without any production downtime.

February 2026: a first without planned downtime

In February 2026, we will deploy the first release of the year with no planned production outage. Because this is a first, we decided to explicitly communicate these changes. In the long term, our goal is to make this blog the primary channel for sharing such information.

Deployment schedule

  • JED: 2026-02-17 to 2026-02-19
  • KUMA: 2026-02-24 to 2026-02-26

What’s new for users in this release

Among other improvements and fixes, this release introduces changes that enable a new GPU partition based on MIG (Multi-Instance GPU). This will allow us to offer low-cost virtual GPU instances, particularly well suited for lightweight workloads, testing, prototyping, and teaching, while making more efficient use of our hardware resources.

The full changelog for this release is available below.

We hope this gives you a better view of the work happening behind the scenes and why we believe these changes matter for your daily research and teaching activities.

Changelog

v4.0.0 (2026-02-05)

Highlights

  • kuma/slurm : enable gpu profiling epi/prolog (!765) (7d69008)

  • lua/maxcpupergpu : add logic to limit cores per gpu (!777) (fbeffee)

  • These changes enable a new GPU partition based on MIG (Multi-Instance GPU), offering low-cost virtual GPU instances suitable for lightweight workloads, testing, and teaching.

Bug fixes

  • vault : use certbot cert for scitasmgmt (!886) (ad916ba)

  • sinfo2influx : Ignore unknown extra lines in the reason field (!872) (46d8d87)

  • hpc_jobs : fix merge strategy for cron::job::multiple (!865) (97bc05a)

  • packages/compute : add ninja-build, a build system (!793) (4d285a3)

  • fqdn : Fix monitoring01 host config definition (!862) (1f6367c)

  • docker : compose_config error if undef (!853) (def1460)

  • docker : fix docker compose service gpfs interaction (!852) (f3e9d44)

  • merging sshd class with development (a7d2037)

  • dns : ood1 is now on jed by default (!844) (d8822b5)

  • dns : change openondemand on ondemand (!839) (8efc51c)

  • docker : compose: wrong type for compose_env_file (!834) (d03c1e7)

  • network : dhpc typo to dhcp (!837) (e67dc72)

  • gpfs : Enable the deployment of GPFS on new clusters (!836) (eed1fd2)

  • letsencrypt : Enable renew with Cron (!835) (1a76059)

  • hostkeys : Give the node's ssh host keys a name (!816) (64d05fc)

  • dns : ood2 typo again (f2aa869)

  • dns : ood2 reverse typo (!821) (939cb12)

  • docker-compose : Fix typos in Puppet recipe (!817) (1918a3a)

  • slurm/kuma : Add MaxMemPerCPU to limit RAM usage (!771) (72442ef)

  • telegraf : do not check nfs4 mount (!805) (4722baf)

  • slurm : do not reload slurm but use sigusr2 when logrotate runs (!755) (5d458f0)

  • quota : set fallback defaults to local home quotas (!794) (2c55b04)

  • nfs : increase smartproxy nfs share threads (!795) (dcadde5)

  • lmod : suppress error when there is no stack (!801) (8b2ee92)

  • hiera : nhc lookup regex (!800) (d1d36dd)

  • slurm/prolog : Stop the GPU monitoring services before reloading the nvidia module (!788) (f6ba8f9)

  • kuma/nhc : maintenance: kuma admin node ib port (!723) (4eee693)

  • nhc : only mark compute node as offline if needed (!721) (2c24245)

  • selinux deactivated without reboot (!779) (3309fc9)

  • allow apt as package provider (!778) (e8c0954)

  • slurmdbd : correct charset (!757) (8cbb085)

  • slurmctld : replace deprecated ensure_packages with stdlib::ensure_packages (!758) (8e828c3)

  • vars : resolve unknown variable warnings (!759) (3746c08)

  • url : fix gitlab url (!761) (a7d70cf)

  • insights : Remove insights-client from the motd (!742) (4592bc9)

  • checkzone : Avoid errors with squash commits (!753) (63a183b)

  • slurmctld : HOTFIX: Return to service set to 1 (!880) (57c5060)

  • HOTFIX: add ood ips from kuma network (!846) (2008dd0)

  • sausage : HOTFIX: bump to 1.2.2.1 (!841) (8b847c7)

  • jed,kuma : HOTFIX: fix regex issue on nodes with nhc.conf (!831) (80fdbda)

  • slurm/prolog : HOTFIX: Stop the GPU monitoring services before reloading the nvidia module (!788) (69db6a3)

  • slurm/kuma : HOTFIX: Add MaxMemPerCPU to limit RAM usage (!771) (5868090)

  • hosts : HOTFIX: scoldap IP has change (!791) (781c14b)

  • slurm : HOTFIX: epilog/prolog profiling gpu: fix reloading all missing modules (!767) (7021134)

  • slurm : epilog: fix typo in gpu profiling (!766) (2124000)

  • packages : HOTFIX: Add libglvnd-opengl package to resolve libOpenGL.so.0 dependency issue (!764) (11a40db)

  • packages : HOTFIX: Add numactl-devel into compute group (!747) (687e954)

  • packages : HOTFIX: Add numactl-devel and libglvnd-opengl packages as requested by Daniel (!746) (a6dc0b4)

Chores

  • dns : add testnfs-deb13 future VM IP in DNS (!885) (50ed8d2)

  • prometheus : use last version of community module (!833) (b9a6b56)

  • MR : Add info section to template for openproject OP code and AI assistance (!876) (6ca759e)

  • dns : Add mtc004 node (!867) (37ce9f4)

  • staging dns use cluster addr (!871) (4332bd5)

  • dns : Add Alias for zaphod ui container (!870) (999666b)

  • dns : Make s3.hpc.epfl.ch as alias for scitas-object (!869) (8b4be9d)

  • package : add tmux and numactl libs (!722) (d893558)

  • dns : add taloa01.staging (!859) (d6a4495)

  • dns : Re-add kuma2 behind kuma.hpc.epfl.ch (!849) (357fb9c)

  • dns : Re-add jed2 behind jed.hpc.epfl.ch (!845) (5ab7c78)

  • dns : Remove jed2 and kuma2 from their respective round-robin entries (!842) (c72ebf9)

  • dns : add staging vms (!838) (0b7bc93)

  • dns : Re-add jed1 and kuma1 to their respective round-robin entries (!829) (08fe809)

  • dns : Remove jed1 and kuma1 from their respective round-robin entries (!826) (07c8f9b)

  • dns : Add monitoring02 (!819) (37b361f)

  • dns : add chat.hpc.epfl.ch (d7ee3b8)

  • slurm : bump slurmrestd plugin and data_parser for 24.11+ version (!803) (b7c017e)

  • dns : Add cname for openondemand on ood2 (!807) (9fdd9de)

  • clean 2025 maintenance (!745) (c6c661f)

  • packages : add torrent client (INC0718696) (!695) (6e2c063)

  • dns : Add cname for openondemand on ood2 (c08b90f)

  • account : cs471 QOS TASK0256401 (!799) (a88ab90)

  • dns : add manticore nodes to cluster zone (!790) (19263b0)

  • docker : Add new version for Mattermost docker image: 10.11 (!787) (1db4636)

  • dns : Add whispers to containers01 (!784) (0ece13d)

  • dns : Add ood2 (!772) (93ba424)

  • dns : Add scitas-object to hpc.epfl.ch domain and permit usage from vm network (!751) (d02df61)

  • dns : Re-add kuma2 to kuma's rr-dns (fa3a9da)

  • dns : add manticore dns addresses to bind (!743) (6aebeb4)

  • dns : Add admin.zaphod to bind (!740) (ba1f6f5)

Features

  • test infrastructure for cfgmgmt (!823) (e6b2150)

  • slurm : allow override of default configuration (!857) (265bb68)

  • fqdn : add crontab for scratch_dir_cleanup script to admin nodes (!866) (a9be83e)

  • cron : add scratch cleanup in hpc_jobs crontab and fix kadmin crontab management (!863) (36ac674)

  • lua/maxcpupergpu : add logic to limit cores per gpu (!777) (fbeffee)

  • Add monitoring01 config (!818) (74f2280)

  • Add Debian compatibility for profile::base (!858) (f4d408d)

  • Open OnDemand deployement (!806) (cf3ed0c)

  • docker : Add more parameters to docker-compose files and fix Debian package removal bugs (!851) (9feacc4)

  • object : letsencrypt certificates (!843) (695a538)

  • compute : add cgroup and nvidia exporter (!815) (be12822)

  • sausage : HOTFIX: bump to 2.0.0 (!776) (eeec605)

  • docker/runner : prune docker system instead of image (!812) (1d6cee4)

  • common : prometheus node exporter everywhere (!814) (a650d2d)

  • docker : misc improvements (!830) (4bd6c1f)

  • dns : add hal test vm address (!832) (db9d16c)

  • dns : add hal to hpc.epfl.ch (!825) (5d814bb)

  • scitas_puppet_docker : Add Docker Compose deployement (!810) (3765c13)

  • ldap : use ldap.epfl.ch instead of scolap.epfl.ch (!811) (7b280c0)

  • izar : add bio-468 qos for a course TASK0257143 (!809) (3d46929)

  • izar : add cs-500 qos TASK0257105 (!808) (0d3c596)

  • clustername : modify clustername definition for vms hostgroup (!802) (25498c3)

  • letsencrypt (!796) (0ad40e1)

  • kuma/admin : add a cron to fetch the scitas-hpcusers (!769) (7a2cec6)

  • common : parametrize 'git' attribute from user resource (!770) (c6d91fb)

  • ssh : Adjust the limits on the number of unauthenticated connections (!749) (762744d)

  • add quota for local home (!774) (27fd3a9)

  • security : add fail2ban and rkhunter classes (!773) (f443f2e)

  • add ganeti basic host installation (!780) (f4f2ab7)

  • cloud : move cloud clusters to RHEL9.4 (!756) (6d50c98)

  • slurm : make sackd service state configurable via sackd_ensure parameter (!760) (6857294)

  • checkzone : Prevent zone updates without a serial change (!752) (15c376f)

  • gpfs : add a flag for gpfs to not automount /archive on compute nodes (!735) (16adf00)

  • nhc : Add nhc check for /export FS presence on all nodes of kuma jed and izar (!732) (a96300a)

  • HOTFIX: add new partition academic for master and courses (!854) (1b87084)

  • HOTFIX: SSH host based authentication for custom hosts (!840) (87e921c)

  • sausage : HOTFIX: bump to 2.0.0 (!776) (dcec57a)

  • gpfs : HOTFIX: Update gpfs to version 5.2.3-4 on jed and kuma (!827) (3b719a4)

  • kuma/slurm : enable gpu profiling epi/prolog (!765) (7d69008)

Refactoring

  • profiling_gpu : Move common code to a single file (!789) (306d5c5)

  • profiling_gpu : HOTFIX: Move common code to a single file (!789) (128fa9c)

Unknown

  • Revert "feat(cron): add scratch cleanup in hpc_jobs crontab and fix kadmin... (!864) (b8fbc1c)

  • Revert "feat(sausage): HOTFIX: bump to 2.0.0 (!776)" (reverts commit eeec60522e005e53da4574a3bae91fba3ae2ae94) (6daab18)

  • Revert "chore(dns): Add cname for openondemand on ood2" (reverts commit c08b90fb29c70d8fbeecf40d71552e536038d27a) (f587ff9)

  • Revert "feat(sausage): HOTFIX: bump to 2.0.0 (!776)" (reverts commit dcec57a6ecc0c170d890cd1502d493626e83c4d7) (f673f79)

Helvetios: Helvetios down - run your jobs on Jed

Helvetios is currently fully unavailable due to a major network failure. An on-site intervention is required to restore a minimal service.

If you have already copied your data to the central storage, you can continue running your jobs on the Jed cluster (jed.hpc.epfl.ch) using the new academic partition.

How to proceed

Submit your jobs on Jed using the academic partition, either:

  • On the command line:
      --partition=academic
  • In your submission script:
      #SBATCH --partition=academic
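As an illustration, a minimal submission script using this partition could look like the sketch below. Only the `--partition=academic` line comes from this post. The job name, the resource requests, and the time limit are placeholder assumptions to adapt to your own workload.

```shell
#!/bin/bash
#SBATCH --partition=academic
#SBATCH --job-name=academic-test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# Slurm sets SLURM_JOB_PARTITION inside a job; the fallback text is only
# used when the script is run outside Slurm.
partition="${SLURM_JOB_PARTITION:-not running under Slurm}"
echo "Host: $(uname -n), partition: $partition"
```

Saved under a name of your choice (for example `job.sh`, a hypothetical file name), the script can be submitted with `sbatch job.sh`. The line it writes to the job output shows whether the job ran in the intended partition.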

Important note about the software stack

If you are using our software stack, please recompile your code on Jed, as the software stack available there is more recent and module versions may differ from Helvetios.

We will inform you as soon as Helvetios is restored to a minimal operational state, so that you can copy your data as early as possible.

Thank you for your understanding and cooperation.

Helvetios: Severe cluster issues

Our aging Helvetios academic cluster regularly experiences severe storage and network issues as the hardware starts failing.

We already had to isolate Helvetios from our central storage system earlier this year to mitigate the impact of these issues on the other clusters.

All data stored on Helvetios is at risk of being lost in the event of a fatal failure!

Please ensure you have a copy of ALL important data on a separate, reliable storage system. Do NOT rely solely on this cluster to store critical files.

We highly recommend copying important data currently stored on Helvetios to our central storage system (accessible from all our other clusters).
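One possible way to make such a copy is sketched below. Since Helvetios is isolated from the central storage, the sketch assumes the transfer goes over SSH through the frontend of another cluster (here `jed.hpc.epfl.ch`). The username, the source directory, and the destination path are hypothetical placeholders, not actual SCITAS paths.

```shell
# Placeholders (hypothetical): adapt the username and both paths.
GASPAR_USER="your_gaspar_username"
SOURCE_DIR="$HOME/important_results/"
DEST="$GASPAR_USER@jed.hpc.epfl.ch:/path/on/central/storage/"

# Copy SOURCE_DIR to DEST over SSH. Extra rsync options are passed through,
# so "--dry-run" previews the transfer without copying anything.
copy_to_central_storage() {
    rsync --archive --verbose --partial "$@" "$SOURCE_DIR" "$DEST"
}
```

Running `copy_to_central_storage --dry-run` first lists what would be transferred. Running it again without the option performs the copy, and `--partial` keeps partially transferred files so that an interrupted transfer can resume faster.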

We remind you that Helvetios runs on obsolete hardware that is no longer supported by the vendors. SCITAS can only provide best-effort support for its maintenance. We are working on the next CPU computing solution for students and courses.

Thank you for your understanding and cooperation.