We had some system problems over the weekend; here is a summary of the timeline. (These appear to be unrelated events.)
The last issue caused Guardian problems with the ALS_XARM node. We did the following to try to fix it:
Powering up h1guardian without the epics-gateway gave even more CA connection errors, this time with the HWS IOC and the h1seib3 FEC. These were very confusing, and they seemed to go away when the h1slow-h1fe EPICS gateway was restarted, which added to the confusion. We need to reproduce this error.
After Patrick restarted the h1ecatx1 IOC, the Guardian errors went away.
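For future reference, one way to narrow down whether CA connection errors like these are gateway-related would be to probe an affected channel both through the gateway and directly. A minimal sketch using the standard EPICS caget tool and CA environment variables (the channel name and IOC host here are placeholders, not the actual failing channels):

    # Query via the normal CA address list (which goes through the gateway):
    caget -w 5 H1:EXAMPLE-CHANNEL

    # Query the IOC host directly, bypassing the gateway (host is a placeholder):
    EPICS_CA_ADDR_LIST=<ioc-host> EPICS_CA_AUTO_ADDR_LIST=NO caget -w 5 H1:EXAMPLE-CHANNEL

If the direct query succeeds while the gateway query times out, that would point at the gateway rather than the IOC.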
Rebooting the entire guardian machine just because one node was having a problem seems like extreme overkill to me. I would not recommend that as a solution, since it obviously kills all the other guardian processes, causing them to lose their state and current channel connections. I don't see any reason to disturb the other nodes because one is having trouble. Any problem that would supposedly be fixed by rebooting the machine should also be fixed by just killing and restarting the affected node process.
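To sketch what I mean (assuming the standard guardctrl subcommands), recovering a single misbehaving node without touching anything else would look like:

    # Check the status of just the affected node, then restart it,
    # leaving every other guardian node running with its state intact:
    guardctrl status ALS_XARM
    guardctrl restart ALS_XARM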
The actual problem with the node is not specified, but the only issue I know of that would cause a node to become unresponsive and immune to a simple "guardctrl restart" is the EPICS mutex thread lock issue, which has been reported at both LLO and LHO, and in both cases was resolved without rebooting the entire machine. Presumably the issue being reported here is somehow different? It would be good to have a better description of what exactly the problem was.
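If the node really was wedged in that state and ignoring "guardctrl restart", the kind of single-node recovery I have in mind is to kill the stuck process by hand and then restart only that node. A rough sketch, not the documented fix (the process match pattern is a guess at how the node shows up in the process table):

    # Find the PID of the stuck node process:
    pgrep -af ALS_XARM

    # Force-kill the stuck process, then bring only that node back up:
    kill -9 <PID>
    guardctrl restart ALS_XARM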