At 07:43 Wed 12feb2025 PST the psinject process on h1hwinj1 crashed and did not restart automatically. H1 was not locked at this time.
The failure to restart was traced to an expired leap-seconds file on h1hwinj1. This is a Scientific Linux 7 machine; the OS is obsolete and updated tzdata packages are not available. As a workaround to get psinject running again, I hand-copied the Debian 12 versions of the /usr/share/zoneinfo/[leapseconds, leap-seconds.list] files. At this point monit was able to successfully restart the psinject systemd process.
Mike Thomas at LLO told me he had seen this problem several weeks ago and implemented the same solution (hand copy of the leap-seconds files). Both sites will schedule an upgrade of their hwinj machines to Rocky Linux post O4.
Timeline is (all times PST):
07:43 psinject process fails, H1 is unlocked
08:41 H1 ready for observe, but blocked due to lack of hw-injections
09:34 psinject problem resolved, H1 able to transition to observe.
Lost observation time: 53 minutes.
We cannot definitively say why psinject failed today at LHO and several weeks ago at LLO. Mike suspects a local cache expired, causing the code to re-read the leapseconds file and discover that it had expired on 28dec2024.
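For reference, the expiry date of a leap-seconds.list file can be read from its "#@" line, which holds the expiration time as an NTP timestamp (seconds since 1900-01-01). The following is a minimal sketch of such a check, not the code psinject uses; the function names leap_file_expiry and is_expired are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX_OFFSET = 2208988800


def leap_file_expiry(text):
    """Return the expiry time (UTC) found in leap-seconds.list content, or None.

    Hypothetical helper: looks for the first line starting with '#@',
    which carries the file's expiration as an NTP timestamp.
    """
    for line in text.splitlines():
        if line.startswith("#@"):
            ntp_seconds = int(line[2:].split()[0])
            return datetime.fromtimestamp(
                ntp_seconds - NTP_TO_UNIX_OFFSET, tz=timezone.utc
            )
    return None


def is_expired(text, now=None):
    """Return True if the file content has expired as of 'now' (default: current UTC time)."""
    expiry = leap_file_expiry(text)
    if expiry is None:
        raise ValueError("no '#@' expiry line found")
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expiry
```

To check the installed file, one could pass the contents of /usr/share/zoneinfo/leap-seconds.list to is_expired(). Running a check like this periodically on both hwinj machines would give advance warning before the next expiry date.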
Post Script:
While investigating the loss of the h1calinj CW_EXC testpoint related to this failure, EJ found that the root file system on h1vmboot1 (primary boot server) was 100% full. We deleted some 2023 logs to quickly bring this down to 97% full. At this time we don't think this had anything to do with the psinject issue.
This has happened before; details are in FRS30046.
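Since the full root file system has now recurred, a simple usage check could flag it earlier. The sketch below is an assumption about how this might be done and is not an existing script on h1vmboot1; the function names and the 90% threshold are hypothetical.

```python
import shutil


def percent_used(path="/"):
    """Return the percentage of the file system at 'path' that is in use.

    Computed as used/total, so it may differ slightly from the
    Use% column reported by df, which excludes reserved blocks.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """Return True if usage at 'path' is at or above 'threshold' percent (hypothetical limit)."""
    return percent_used(path) >= threshold
```

A cron job or monit check calling is_nearly_full("/") could raise an alert well before the file system reaches 100%.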