Reports until 11:21, Wednesday 12 February 2025
H1 CDS
david.barker@LIGO.ORG - posted 11:21, Wednesday 12 February 2025 - last comment - 13:16, Wednesday 12 February 2025(82769)
Hardware Injection process failed at 07:43 PST and failed to restart

At 07:43 Wed 12feb2025 PST the psinject process on h1hwinj1 crashed and did not restart automatically. H1 was not locked at this time.

The failure to restart was tracked to an expired leap-seconds file on h1hwinj1. This is a Scientific Linux 7 machine; the OS is obsolete and updated tzdata packages are not available. As a workaround to get psinject running again, I hand-copied the debian12 versions of the /usr/share/zoneinfo/[leapseconds, leap-seconds.list] files. At this point monit was able to successfully restart the psinject systemd process.
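For reference, a leap-seconds.list file carries its own expiry date on a line beginning with "#@", given as seconds since the NTP epoch (1900-01-01). The sketch below shows one way to check that date in Python. It is not how psinject reads the file. The function names and the sample string are illustrative assumptions, with the sample value chosen to correspond to 28 December 2024.

```python
from datetime import datetime, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX_OFFSET = 2208988800

# Hypothetical sample of the expiry line, not copied from h1hwinj1.
SAMPLE_LEAP_FILE = "# comment line\n#@\t3944332800\n"


def leap_file_expiry(text):
    """Return the expiry time found in leap-seconds.list content, or None."""
    for line in text.splitlines():
        # The expiry line has the form "#@ <seconds since 1900-01-01>".
        if line.startswith("#@"):
            ntp_seconds = int(line[2:].split()[0])
            unix_seconds = ntp_seconds - NTP_TO_UNIX_OFFSET
            return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return None


def is_expired(text, now=None):
    """Return True if the file content has expired (or has no expiry line)."""
    expiry = leap_file_expiry(text)
    if expiry is None:
        # Assumption: a file without an expiry line is treated as unusable.
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expiry
```

To check a real machine, read /usr/share/zoneinfo/leap-seconds.list and pass its content to is_expired(). Running a check like this periodically would flag an upcoming expiry before a process trips over it.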

In conversation, Mike Thomas at LLO said he had seen this problem several weeks ago and implemented the same solution (hand copy of the leap-seconds files). Both sites will schedule an upgrade of their hwinj machines to Rocky Linux post O4.

Timeline is (all times PST):

07:43 psinject process fails, H1 is unlocked

08:41 H1 ready for observe, but blocked due to lack of hw-injections

09:34 psinject problem resolved, H1 able to transition to observe.

Lost observation time: 53 minutes.
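The lost-time figure is counted from when H1 was ready for observe (08:41) to when the psinject problem was resolved (09:34), not from the crash itself, since H1 was unlocked at 07:43. A minimal sketch of the arithmetic follows; the helper name is an illustrative assumption.

```python
from datetime import datetime


def minutes_between(start, end, fmt="%H:%M"):
    """Whole minutes from start to end, both given as same-day HH:MM strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Ready for observe until psinject resolved: the lost observation time.
lost_minutes = minutes_between("08:41", "09:34")

# Crash until psinject resolved: the total time psinject was down.
outage_minutes = minutes_between("07:43", "09:34")

print(lost_minutes, outage_minutes)  # prints "53 111"
```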

We cannot say definitively why psinject failed today at LHO and several weeks ago at LLO. Mike suspects a local cache expired, causing the code to re-read the leapseconds file and discover that it had expired on 28dec2024.

Postscript:

While investigating the loss of the h1calinj CW_EXC testpoint related to this failure, EJ found that the root file system on h1vmboot1 (primary boot server) was 100% full. We deleted some 2023 logs to quickly bring this down to 97% full. At this time we don't think this had anything to do with the psinject issue.
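A full root file system of this kind can be caught early with a simple usage check. The sketch below uses Python's standard shutil.disk_usage. The function names are illustrative assumptions, and this is not a tool currently running on h1vmboot1. The percentage can differ slightly from what df reports, because df excludes blocks reserved for root from its calculation.

```python
import shutil


def percent_used(used, total):
    """Used space as a percentage of total space."""
    return 100.0 * used / total


def filesystem_percent_used(path="/"):
    """Percentage of the file system containing path that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.used, usage.total)


def is_nearly_full(path="/", threshold=95.0):
    """Return True if usage is at or above threshold percent (hypothetical limit)."""
    return filesystem_percent_used(path) >= threshold
```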

Comments related to this report
david.barker@LIGO.ORG - 13:16, Wednesday 12 February 2025 (82770)

This has happened before; details are in FRS30046.