aLIGO LHO Logbook

H1 GRD (SEI)

thomas.shaffer@LIGO.ORG - posted 16:42, Thursday 06 September 2018 - last comment - 21:37, Thursday 06 September 2018(43865)

GRD SEI Crashing on Reload

This morning LLO reported that their SEI nodes would crash when reloaded. Since we just had a big earthquake, we had time to test. I tried random nodes for the most part:

ISI_HAM4	OK
HPI_HAM4	OK
SEI_HAM4	OK
ISI_ETMY_ST1	Crash
ISI_ETMY_ST2	Crash
HPI_ETMY	Crash
SEI_ETMY	OK
ISI_HAM3	Crash
HPI_HAM3	Crash

The chamber managers seemed to be okay, but the ISIs and HPIs would most likely crash. The crashes were all because of a 20sec watchdog timeout. Doing a guardctrl restart would bring the nodes back without issue.

It's odd that we are just seeing this now, Jim has reloaded ISI_ETMX_ST2 a few times and not had an issue. There are a few other nodes listed above that crashed today, but have been reloaded since the new Guardian update as well. This could just be because they got lucky and finished whatever they needed to just before the timeout.

Attached a small bit of the log.

Images attached to this report

Comments related to this report

keith.thorne@LIGO.ORG - 21:07, Thursday 06 September 2018 (43871)

Link

We are seeing the same problem at LLO  LLO log 40575.  Arnaud created a local FRS issue on this FRS 11410, but should be turned into an Integration Issue

jameson.rollins@LIGO.ORG - 21:37, Thursday 06 September 2018 (43872)

Link

When this was reported at LLO I suspected that this was a systemd watchdog timeout issue. Thank you for confirming, TJ. The work-around fix is trivial.

A relevant question though is why are the nodes taking so long to reload that they're tripping the timeout watchdog. The two "intensive" parts of the reload are reading all the code files from USERAPPS, and then committing them to the node archive. Both of those things involve NFS read/write. Could there be an issue with the NFS mount, either in general or specifically on h1guardian1?