Jed scratch full

Affected systems: Jed

History log

  • 2023-12-13 09:35 A user reports errors on Jed: No space left on device.
  • 2023-12-13 09:43 A sysadmin team member sees the report, checks our monitoring system (which shows no anomalies), and proceeds to check the health of Jed manually.
  • 2023-12-13 09:45 The sysadmin team member finds that /scratch is at 77% of its storage capacity, but its filesystem inodes are nearly exhausted (98% used).
  • 2023-12-13 09:47 The rest of the sysadmin team is alerted and discusses possible mitigations.
  • 2023-12-13 10:00 A new clean-up policy is implemented and shared with the team.
  • 2023-12-13 10:04 The clean-up policy is started on Jed scratch.
  • 2023-12-13 10:25 All SCITAS users are notified via email about the possible disruption.

User-visible consequences

Some users may have experienced an error such as: No space left on device when creating files or directories on Jed's scratch space.

Root causes

  • Inode usage on the Jed /scratch filesystem reached 100%. Two users had more than 100M files each, and one of them had more than 200M.
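
As an illustration of how per-user file counts like the ones above could be gathered, here is a minimal sketch that walks a directory tree and counts entries per owner UID. The function name `count_files_by_owner` is hypothetical and not part of any tooling mentioned in this report. Walking hundreds of millions of files this way would be very slow, so on a large scratch filesystem the filesystem's own quota or reporting tools are likely a better fit; treat this as an approximation for small subtrees only.

```python
import os
from collections import Counter


def count_files_by_owner(root: str) -> Counter:
    """Count directory entries (files, directories, symlinks) under root, per owner UID.

    This approximates inode consumption: hard links to the same inode are
    counted once per link, and the root directory itself is not counted.
    """
    counts: Counter = Counter()
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        st = entry.stat(follow_symlinks=False)
                    except OSError:
                        continue  # entry vanished or is unreadable; skip it
                    counts[st.st_uid] += 1
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
        except OSError:
            continue  # directory vanished or is unreadable; skip it
    return counts
```

Calling `count_files_by_owner(path).most_common(5)` on a subtree would list the five UIDs owning the most entries there.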

Action items

  • Improve monitoring system to detect scratch filesystem inodes filling up, in addition to our existing raw capacity monitoring.
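
A minimal sketch of what such a check could look like, assuming a POSIX system where `os.statvfs` reports inode counts for the filesystem in question. The function names and the 90% threshold are illustrative assumptions, not the team's actual monitoring configuration. It reports block usage and inode usage side by side, so a filesystem at 77% capacity but 98% inode usage would still raise an alert.

```python
import os
from typing import Optional


def usage_percent(total: int, free: int) -> Optional[float]:
    """Return the used fraction as a percentage, or None if total is not reported."""
    if total <= 0:
        return None  # some filesystems report 0 total inodes
    return 100.0 * (total - free) / total


def check_filesystem(path: str, threshold: float = 90.0) -> dict:
    """Check both block and inode usage of the filesystem containing path.

    The threshold is a hypothetical default; "alert" is True when either
    block usage or inode usage is at or above it.
    """
    st = os.statvfs(path)
    block_pct = usage_percent(st.f_blocks, st.f_bfree)
    inode_pct = usage_percent(st.f_files, st.f_ffree)
    return {
        "path": path,
        "block_pct": block_pct,
        "inode_pct": inode_pct,
        "alert": any(
            pct is not None and pct >= threshold
            for pct in (block_pct, inode_pct)
        ),
    }
```

For example, `check_filesystem("/scratch")` could be run periodically and its `alert` field forwarded to the existing alerting pipeline. Whether `os.statvfs` returns meaningful inode counts depends on the filesystem, so this should be verified on the target system first.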

Lessons learned

What went wrong

  • We only found out about this issue when a user notified the team. Our monitoring system alerts on storage capacity filling up, but it does not detect filesystem inodes running out.
  • No auto-remediation was in place.

What went well

  • The sysadmin team swiftly identified and remediated the issue, which caused no cascading effects on other systems.

Where we got lucky

  • Inode usage on the filesystem did not stay at 100% for long, so most users likely did not see the errors.
  • The issue happened during working hours.