Jed scratch full#
Affected systems: Jed
History log#
- 2023-12-13 09:35 We receive a user report about errors on Jed with: No space left on device.
- 2023-12-13 09:43 Sysadmin team member sees the report, checks our monitoring system, which does not show any anomalies, and proceeds to check the health of Jed manually.
- 2023-12-13 09:45 Sysadmin team member finds that the /scratch storage is at 77%, but the filesystem inodes are near full (98%).
- 2023-12-13 09:47 Rest of the sysadmin team is alerted and discusses possible mitigations.
- 2023-12-13 10:00 A new clean-up policy is implemented and shared with the team.
- 2023-12-13 10:04 The clean-up policy is started on Jed scratch.
- 2023-12-13 10:25 All SCITAS users are notified via email about the possible disruption.
User-visible consequences#
Some users may have experienced an error such as "No space left on device" when creating files or directories on Jed's scratch space.
Root causes#
- Jed /scratch filesystem inodes reaching 100%. Two users had > 100M files, with one user having > 200M files.
Action items#
- Improve monitoring system to detect scratch filesystem inodes filling up, in addition to our existing raw capacity monitoring.
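As an illustration of the kind of check this action item calls for, here is a minimal sketch that reports inode usage alongside raw capacity using the standard statvfs interface. It is not our actual monitoring code: the names FsUsage, percent_used, fs_usage and alerts, as well as the 90% threshold, are assumptions made for this example.

```python
import os
from typing import List, NamedTuple


class FsUsage(NamedTuple):
    """Usage percentages for one mounted filesystem (hypothetical helper type)."""
    capacity_pct: float
    inode_pct: float


def percent_used(total: int, free: int) -> float:
    """Return the used share of a resource in percent.

    Some filesystems report a total of 0 for inodes; in that case
    0.0 is returned so that the check does not divide by zero.
    """
    if total <= 0:
        return 0.0
    return 100.0 * (total - free) / total


def fs_usage(path: str) -> FsUsage:
    """Read block and inode counters for the filesystem containing `path`."""
    st = os.statvfs(path)
    return FsUsage(
        # f_bfree counts all free blocks, including any reserved for root,
        # so this figure can differ slightly from the Use% shown by df.
        capacity_pct=percent_used(st.f_blocks, st.f_bfree),
        inode_pct=percent_used(st.f_files, st.f_ffree),
    )


def alerts(usage: FsUsage, threshold_pct: float = 90.0) -> List[str]:
    """Return one message per resource at or above the (assumed) threshold."""
    messages = []
    if usage.capacity_pct >= threshold_pct:
        messages.append(f"capacity at {usage.capacity_pct:.0f}%")
    if usage.inode_pct >= threshold_pct:
        messages.append(f"inodes at {usage.inode_pct:.0f}%")
    return messages
```

With the figures from this incident, alerts(FsUsage(77.0, 98.0)) returns a single message about inodes, which a capacity-only check would have missed. On a live system the call would look like alerts(fs_usage("/scratch")).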
Lessons Learned#
What went wrong#
- We only found out about this issue when a user notified the team. Our monitoring system alerts on storage capacity filling up, but it does not detect the filesystem running out of inodes.
- No auto-remediation was in place.
What went well#
- The sysadmin team was able to swiftly identify and remediate the issue, which caused no cascading effects on the rest of the systems.
Where we got lucky#
- The filesystem did not stay at 100% for very long, meaning that the issues were likely not seen by most users.
- The issue happened during working hours.