Reports until 17:57, Wednesday 09 May 2018
H1 GRD
jameson.rollins@LIGO.ORG - posted 17:57, Wednesday 09 May 2018 - last comment - 18:02, Wednesday 09 May 2018(41919)
guardian watchdog timeout on reload

[TJ, Jamie]

Today one of the guardian nodes (ALIGN_IFO) was killed by systemd because it took too long to reload its code, which made it late for its watchdog check-in.  The slow step was actually not loading the code, but committing the new code to the code git archive.

The guardian code archives live on NFS (/ligo/cds), and for whatever reason access to this resource can be very slow.  When access is slow, guardian can run late on its systemd watchdog notification, which causes systemd to kill the process.  We ran into this problem yesterday on system boot, when archive access was so slow that many of the nodes weren't able to check in on time and were killed.  We resolved that issue yesterday by increasing the startup timeout (systemd property 'TimeoutStartSec=') to 3 minutes, to ride out the massive traffic jam on boot.

Since the issue that TJ ran into today happened on a reload during normal operation, the relevant timeout is the main loop watchdog timeout, not the startup timeout.  I've increased the watchdog timeout to 20 seconds ('WatchdogSec=20s'), which should be plenty of time to access the archive on NFS, without letting us suffer a dead node for too long if something else goes wrong.
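
For reference, here is a minimal sketch of what a watchdog check-in looks like from the service side.  It is not the guardian implementation; the helper names sd_notify and watchdog_interval are just for illustration.  It assumes the standard systemd notification protocol, where the service sends "WATCHDOG=1" as a datagram to the socket named in $NOTIFY_SOCKET, and where systemd exports the configured timeout in $WATCHDOG_USEC.  Any blocking step in the main loop (such as a slow NFS commit) that lasts longer than the timeout means a missed check-in.

```python
import os
import socket


def sd_notify(state, environ=os.environ):
    """Send a notification string to systemd.

    Returns False when NOTIFY_SOCKET is unset (i.e. not run under systemd).
    """
    path = environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # Abstract-namespace socket: the leading '@' stands for a NUL byte.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), path)
    return True


def watchdog_interval(environ=os.environ):
    """Return a check-in period in seconds (half the watchdog timeout).

    Returns None when no watchdog is configured for the service.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    return int(usec) / 1e6 / 2
```

With 'WatchdogSec=20s' the interval comes out as 10 seconds, so the main loop would call sd_notify("WATCHDOG=1") at least that often.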

We can probably increase the timeout even more if need be, but if it's really taking more than 30 s to access files on the NFS then maybe there's something really wrong with how the NFS is configured...

Comments related to this report
jameson.rollins@LIGO.ORG - 18:02, Wednesday 09 May 2018 (41920)

I discovered that it's also possible for guardian to inform systemd that it needs more time, and to temporarily extend the watchdog timeout.  The following would extend the timeout to 20 seconds temporarily:

sd_notify("EXTEND_TIMEOUT_USEC=20000000")

This would give the code reload and archive commit more time, while keeping the normal run watchdog timeout tight.
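
A rough sketch of how that could be wrapped around a slow step, under some assumptions: notify is any callable that delivers a state string to systemd (for example a binding of sd_notify), and extend_timeout_message and run_with_extended_timeout are hypothetical names, not guardian functions.  One caveat worth checking against sd_notify(3) for the systemd version in use: the man page describes EXTEND_TIMEOUT_USEC= in terms of the startup, runtime and shutdown timeouts, and lists a separate WATCHDOG_USEC= message for changing the watchdog interval at runtime, so the message string may need to be swapped.

```python
def extend_timeout_message(seconds):
    """Build the notification string asking systemd for more time."""
    return "EXTEND_TIMEOUT_USEC={}".format(int(seconds * 1_000_000))


def run_with_extended_timeout(notify, seconds, slow_step):
    """Ask for more time, then run the slow step and return its result.

    notify: callable taking one state string (assumed to reach systemd).
    slow_step: zero-argument callable, e.g. the reload + archive commit.
    """
    notify(extend_timeout_message(seconds))
    return slow_step()
```

For example, run_with_extended_timeout(notify, 20, commit_to_archive) would send the same message as above before running a (hypothetical) commit_to_archive function.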

I think having the watchdog timeout be even as long as 30 seconds or so is still probably ok, so we'll stick with that solution for now.  I just mention this as an alternative.