Behind the Scenes of SCITAS Platforms - February 2026 Release
At SCITAS, the DevOps team is responsible for purchasing, installing in the data center, deploying, and operating the computing, visualization, and storage infrastructures that serve more than 1,700 EPFL researchers and students every day.
These platforms support a wide range of scientific activities, including algorithm training and inference, meteorology, computational chemistry, fluid and solid mechanics, bioengineering, nuclear fusion, and many others.
Operating such infrastructures means constantly arbitrating between conflicting constraints:
- laboratories that require maximum stability,
- others that need very recent software stacks,
- security requirements that cannot be compromised,
- and vendor-imposed hardware and firmware updates tied to support contracts.
At the same time, we are fully aware that any production downtime has a direct and negative impact on research and teaching schedules. In academia, deadlines are increasingly tight, and lost compute time is rarely recoverable.
A multi-year transformation of our practices
Over the past few years, we have been overhauling our operational methods and practices, with clear objectives:
- reduce time to resolution for critical production incidents,
- deploy critical changes faster without compromising stability,
- lower the change failure rate,
- and keep maintenance operations that require a full production outage to a strict minimum.
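Two of these objectives, time to resolution and change failure rate, can be computed directly from deployment and incident records. The following is a minimal sketch under stated assumptions: the `Change` and `Incident` record types and all sample values are hypothetical and do not reflect SCITAS data or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical record of one deployed change.
    name: str
    caused_incident: bool


@dataclass
class Incident:
    # Hypothetical record of one critical incident.
    opened: datetime
    resolved: datetime


def change_failure_rate(changes: list[Change]) -> float:
    """Fraction of deployed changes that led to an incident."""
    if not changes:
        return 0.0
    failed = sum(1 for c in changes if c.caused_incident)
    return failed / len(changes)


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Average time between opening and resolving an incident."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved - i.opened for i in incidents), timedelta(0))
    return total / len(incidents)


# Illustrative values only.
changes = [
    Change("a", False),
    Change("b", True),
    Change("c", False),
    Change("d", False),
]
incidents = [
    Incident(datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 10, 0)),
    Incident(datetime(2025, 3, 1, 14, 0), datetime(2025, 3, 1, 17, 0)),
]

cfr = change_failure_rate(changes)
mttr = mean_time_to_resolution(incidents)
print(f"change failure rate: {cfr:.0%}")
print(f"mean time to resolution: {mttr}")
```

Tracking both numbers per release, rather than per year, makes it easier to see whether a given release improved or degraded stability.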
Much of this work is invisible to users by design. However, it is starting to have a very concrete impact: in 2024 and 2025, we reached exceptionally high availability levels, with very few critical incidents and short resolution times.
How we deliver changes
All changes are grouped into formal releases. We plan to deploy 4 to 5 releases per year, and when needed, we also apply hotfixes to resolve critical issues within the same day.
As some of you may have noticed, between December 2025 and today, we successfully performed, for the first time, a complete upgrade of all our storage systems without any production downtime.
February 2026: a first without planned downtime
In February 2026, we will deploy the first release of the year with no planned production outage. Because this is a first, we decided to communicate these changes explicitly. In the long term, our goal is to make this blog the primary channel for sharing such information.
Deployment schedule
- JED: 2026-02-17 to 2026-02-19
- KUMA: 2026-02-24 to 2026-02-26
What's new for users in this release
Among other improvements and fixes, this release introduces changes that enable a new GPU partition based on MIG (Multi-Instance GPU). This will allow us to offer low-cost virtual GPU instances, particularly well suited for lightweight workloads, testing, prototyping, and teaching, while making more efficient use of our hardware resources.
The full changelog for this release is available below.
We hope this gives you a better view of the work happening behind the scenes and why we believe these changes matter for your daily research and teaching activities.
Changelog
v4.0.0 (2026-02-05)
Highlights
- kuma/slurm: enable gpu profiling epi/prolog (!765) (7d69008)
- lua/maxcpupergpu: add logic to limit cores per gpu (!777) (fbeffee)
- These changes enable a new GPU partition based on MIG (Multi-Instance GPU), offering low-cost virtual GPU instances suitable for lightweight workloads, testing, and teaching.
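To illustrate what a cores-per-GPU limit means in practice, here is a minimal sketch in Python. The actual change (`lua/maxcpupergpu`) is a Lua script whose code is not reproduced here; the function name, the limit of 8 cores per GPU, and the request values below are hypothetical and chosen only for illustration.

```python
def check_cpus_per_gpu(cpus: int, gpus: int, max_cpus_per_gpu: int) -> bool:
    """Return True if a request stays within the per-GPU core limit."""
    if gpus <= 0:
        # No GPU requested: this hypothetical rule does not apply.
        return True
    return cpus <= gpus * max_cpus_per_gpu


# Hypothetical limit: at most 8 cores per requested GPU.
LIMIT = 8

within_limit = check_cpus_per_gpu(cpus=16, gpus=2, max_cpus_per_gpu=LIMIT)
over_limit = check_cpus_per_gpu(cpus=17, gpus=2, max_cpus_per_gpu=LIMIT)
cpu_only = check_cpus_per_gpu(cpus=4, gpus=0, max_cpus_per_gpu=LIMIT)

print(within_limit, over_limit, cpu_only)
```

A limit of this kind keeps a job that holds a small GPU slice from also occupying most of the node's cores, which would otherwise leave the remaining slices unusable.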
Bug fixes
- vault: use certbot cert for scitasmgmt (!886) (ad916ba)
- sinfo2influx: Ignore unknown extra lines in the reason field (!872) (46d8d87)
- hpc_jobs: fix merge strategy for cron::job::multiple (!865) (97bc05a)
- packages/compute: add ninja-build, a build system (!793) (4d285a3)
- fqdn: Fix monitoring01 host config definition (!862) (1f6367c)
- docker: compose_config error if undef (!853) (def1460)
- docker: fix docker compose service gpfs interaction (!852) (f3e9d44)
- merging sshd class with development (a7d2037)
- dns: ood1 is now on jed by default (!844) (d8822b5)
- dns: change openondemand on ondemand (!839) (8efc51c)
- docker: compose: wrong type for compose_env_file (!834) (d03c1e7)
- network: dhpc typo to dhcp (!837) (e67dc72)
- gpfs: Enable the deployment of GPFS on new clusters (!836) (eed1fd2)
- letsencrypt: Enable renew with Cron (!835) (1a76059)
- hostkeys: Give the node's ssh host keys a name (!816) (64d05fc)
- dns: ood2 typo again (f2aa869)
- dns: ood2 reverse typo (!821) (939cb12)
- docker-compose: Fix typos in Puppet recipe (!817) (1918a3a)
- slurm/kuma: Add MaxMemPerCPU to limit RAM usage (!771) (72442ef)
- telegraf: do not check nfs4 mount (!805) (4722baf)
- slurm: do not reload slurm but use sigusr2 when logrotate runs (!755) (5d458f0)
- quota: set fallback defaults to local home quotas (!794) (2c55b04)
- nfs: increase smartproxy nfs share threads (!795) (dcadde5)
- lmod: suppress error when there is no stack (!801) (8b2ee92)
- hiera: nhc lookup regex (!800) (d1d36dd)
- slurm/prolog: Stop the GPU monitoring services before reloading the nvidia module (!788) (f6ba8f9)
- kuma/nhc: maintenance: kuma admin node ib port (!723) (4eee693)
- nhc: only mark compute node as offline if needed (!721) (2c24245)
- selinux deactivated without reboot (!779) (3309fc9)
- allow apt as package provider (!778) (e8c0954)
- slurmdbd: correct charset (!757) (8cbb085)
- slurmctld: replace deprecated ensure_packages with stdlib::ensure_packages (!758) (8e828c3)
- vars: resolve unknown variable warnings (!759) (3746c08)
- url: fix gitlab url (!761) (a7d70cf)
- insights: Remove insights-client from the motd (!742) (4592bc9)
- checkzone: Avoid errors with squash commits (!753) (63a183b)
- slurmctld: HOTFIX: Return to service set to 1 (!880) (57c5060)
- HOTFIX: add ood ips from kuma network (!846) (2008dd0)
- sausage: HOTFIX: bump to 1.2.2.1 (!841) (8b847c7)
- jed,kuma: HOTFIX: fix regex issue on nodes with nhc.conf (!831) (80fdbda)
- slurm/prolog: HOTFIX: Stop the GPU monitoring services before reloading the nvidia module (!788) (69db6a3)
- slurm/kuma: HOTFIX: Add MaxMemPerCPU to limit RAM usage (!771) (5868090)
- hosts: HOTFIX: scoldap IP has change (!791) (781c14b)
- slurm: HOTFIX: epilog/prolog profiling gpu: fix reloading all missing modules (!767) (7021134)
- slurm: epilog: fix typo in gpu profiling (!766) (2124000)
- packages: HOTFIX: Add libglvnd-opengl package to resolve libOpenGL.so.0 dependency issue (!764) (11a40db)
- packages: HOTFIX: Add numactl-devel into compute group (!747) (687e954)
- packages: HOTFIX: Add numactl-devel and libglvnd-opengl packages as requested by Daniel (!746) (a6dc0b4)
Chores
- dns: add testnfs-deb13 future VM IP in DNS (!885) (50ed8d2)
- prometheus: use last version of community module (!833) (b9a6b56)
- MR: Add info section to template for openproject OP code and AI assistance (!876) (6ca759e)
- dns: Add mtc004 node (!867) (37ce9f4)
- staging dns use cluster addr (!871) (4332bd5)
- dns: Add Alias for zaphod ui container (!870) (999666b)
- dns: Make s3.hpc.epfl.ch as alias for scitas-object (!869) (8b4be9d)
- package: add tmux and numactl libs (!722) (d893558)
- dns: add taloa01.staging (!859) (d6a4495)
- dns: Re-add kuma2 behind kuma.hpc.epfl.ch (!849) (357fb9c)
- dns: Re-add jed2 behind jed.hpc.epfl.ch (!845) (5ab7c78)
- dns: Remove jed2 and kuma2 from their respective round-robin entries (!842) (c72ebf9)
- dns: add staging vms (!838) (0b7bc93)
- dns: Re-add jed1 and kuma1 to their respective round-robin entries (!829) (08fe809)
- dns: Remove jed1 and kuma1 from their respective round-robin entries (!826) (07c8f9b)
- dns: Add monitoring02 (!819) (37b361f)
- dns: add chat.hpc.epfl.ch (d7ee3b8)
- slurm: bump slurmrestd plugin and data_parser for 24.11+ version (!803) (b7c017e)
- dns: Add cname for openondemand on ood2 (!807) (9fdd9de)
- clean 2025 maintenance (!745) (c6c661f)
- packages: add torrent client (INC0718696) (!695) (6e2c063)
- dns: Add cname for openondemand on ood2 (c08b90f)
- account: cs471 QOS TASK0256401 (!799) (a88ab90)
- dns: add manticore nodes to cluster zone (!790) (19263b0)
- docker: Add new version for Mattermost docker image: 10.11 (!787) (1db4636)
- dns: Add whispers to containers01 (!784) (0ece13d)
- dns: Add ood2 (!772) (93ba424)
- dns: Add scitas-object to hpc.epfl.ch domain and permit usage from vm network (!751) (d02df61)
- dns: Re-add kuma2 to kuma's rr-dns (fa3a9da)
- dns: add manticore dns addresses to bind (!743) (6aebeb4)
- dns: Add admin.zaphod to bind (!740) (ba1f6f5)
Features
- test infrastructure for cfgmgmt (!823) (e6b2150)
- slurm: allow override of default configuration (!857) (265bb68)
- fqdn: add crontab for scratch_dir_cleanup script to admin nodes (!866) (a9be83e)
- cron: add scratch cleanup in hpc_jobs crontab and fix kadmin crontab management (!863) (36ac674)
- lua/maxcpupergpu: add logic to limit cores per gpu (!777) (fbeffee)
- Add monitoring01 config (!818) (74f2280)
- Add Debian compatibility for profile::base (!858) (f4d408d)
- Open OnDemand deployement (!806) (cf3ed0c)
- docker: Add more parameters to docker-compose files and fix Debian package removal bugs (!851) (9feacc4)
- object: letsencrypt certificates (!843) (695a538)
- compute: add cgroup and nvidia exporter (!815) (be12822)
- sausage: HOTFIX: bump to 2.0.0 (!776) (eeec605)
- docker/runner: prune docker system instead of image (!812) (1d6cee4)
- common: prometheus node exporter everywhere (!814) (a650d2d)
- docker: misc improvements (!830) (4bd6c1f)
- dns: add hal test vm address (!832) (db9d16c)
- dns: add hal to hpc.epfl.ch (!825) (5d814bb)
- scitas_puppet_docker: Add Docker Compose deployement (!810) (3765c13)
- ldap: use ldap.epfl.ch instead of scolap.epfl.ch (!811) (7b280c0)
- izar: add bio-468 qos for a course TASK0257143 (!809) (3d46929)
- izar: add cs-500 qos TASK0257105 (!808) (0d3c596)
- clustername: modify clustername definition for vms hostgroup (!802) (25498c3)
- letsencrypt (!796) (0ad40e1)
- kuma/admin: add a cron to fetch the scitas-hpcusers (!769) (7a2cec6)
- common: parametrize 'git' attribute from user resource (!770) (c6d91fb)
- ssh: Adjust the limits on the number of unauthenticated connections (!749) (762744d)
- add quota for local home (!774) (27fd3a9)
- security: add fail2ban and rkhunter classes (!773) (f443f2e)
- add ganeti basic host installation (!780) (f4f2ab7)
- cloud: move cloud clusters to RHEL9.4 (!756) (6d50c98)
- slurm: make sackd service state configurable via sackd_ensure parameter (!760) (6857294)
- checkzone: Prevent zone updates without a serial change (!752) (15c376f)
- gpfs: add a flag for gpfs to not automount /archive on compute nodes (!735) (16adf00)
- nhc: Add nhc check for /export FS presence on all nodes of kuma jed and izar (!732) (a96300a)
- HOTFIX: add new partition academic for master and courses (!854) (1b87084)
- HOTFIX: SSH host based authentication for custom hosts (!840) (87e921c)
- sausage: HOTFIX: bump to 2.0.0 (!776) (dcec57a)
- gpfs: HOTFIX: Update gpfs to version 5.2.3-4 on jed and kuma (!827) (3b719a4)
- kuma/slurm: enable gpu profiling epi/prolog (!765) (7d69008)
Refactoring
- profiling_gpu: Move common code to a single file (!789) (306d5c5)
- profiling_gpu: HOTFIX: Move common code to a single file (!789) (128fa9c)
Unknown
- Revert "feat(cron): add scratch cleanup in hpc_jobs crontab and fix kadmin... (!864) (b8fbc1c)
- Revert "feat(sausage): HOTFIX: bump to 2.0.0 (!776)"
This reverts commit eeec60522e005e53da4574a3bae91fba3ae2ae94. (6daab18)
- Revert "chore(dns): Add cname for openondemand on ood2"
This reverts commit c08b90fb29c70d8fbeecf40d71552e536038d27a. (f587ff9)
- Revert "feat(sausage): HOTFIX: bump to 2.0.0 (!776)"
This reverts commit dcec57a6ecc0c170d890cd1502d493626e83c4d7. (f673f79)