Reports until 15:52, Monday 16 May 2016
H1 CDS (GRD)
david.barker@LIGO.ORG - posted 15:52, Monday 16 May 2016 (27227)
guardian EPICS CA problems this morning, details

T.J, Ed, Jim, Patrick, Dave:

The ALS_XARM node was reporting no connection to the epics channels H1:ALS-X_LOCK_ERROR_FLAG and H1:SYS-MOTION_Y_SHUTTER_A_OPEN, which was preventing any further IFO locking. Logged into h1guardian0 as user controls, we issued caget commands and were able to get the value of H1:ALS-X_LOCK_ERROR_FLAG, but not of H1:SYS-MOTION_Y_SHUTTER_A_OPEN (reported data "not complete"). Both caget commands complained that two IOCs were serving the data: the h1slow-h1fe epics gateway (which spans the slow controls 10.105 network and the front end 10.101 network) and the IOC itself, h1ecatx1. By contrast, caget on a control room workstation could get both channels.
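The manual caget checks above could be scripted roughly as follows. This is a minimal sketch, not the procedure we actually ran: it assumes the `caget` command-line tool is on the PATH, that it exits non-zero when a channel cannot be read, and that the duplicate-server complaint contains the text in `DUPLICATE_WARNING` (an assumption about the CA client message, to be adjusted to whatever your EPICS version prints). The names `run_caget` and `check_channel` are illustrative only.

```python
import subprocess

# Assumed wording of the CA client warning printed when two servers
# answer for the same channel; adjust to match the local EPICS version.
DUPLICATE_WARNING = "Identical process variable names on multiple servers"

# The two channels named in this entry.
CHANNELS = [
    "H1:ALS-X_LOCK_ERROR_FLAG",
    "H1:SYS-MOTION_Y_SHUTTER_A_OPEN",
]


def run_caget(channel, timeout_s=2.0):
    """Run `caget -w <timeout> <channel>`; return (returncode, stdout, stderr)."""
    proc = subprocess.run(
        ["caget", "-w", str(timeout_s), channel],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


def check_channel(channel, runner=run_caget):
    """Summarize one caget attempt as a small dict."""
    returncode, stdout, stderr = runner(channel)
    combined = stdout + stderr  # the warning may land on either stream
    return {
        "channel": channel,
        "connected": returncode == 0 and bool(stdout.strip()),
        "duplicate_servers": DUPLICATE_WARNING in combined,
        "output": stdout.strip(),
    }


if __name__ == "__main__":
    for name in CHANNELS:
        print(check_channel(name))
```

Running the same script on h1guardian0 and on a control room workstation would give a side-by-side record of which host sees which channel, and whether the duplicate-server complaint appears on both.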

Since restarting the end-x Beckhoff slow controls system seemed the most impactful step, we tried the less invasive options first, in this sequence:

reload ALS_XARM node

restart ALS_XARM node

restart h1slow-h1fe gateway

reboot h1guardian0 (we then remembered that we had decided power cycles were preferred)

power-cycle h1guardian0

restart h1ecatx1 epics IOC*

* Patrick reminded us that the Beckhoff IOC could be restarted without the PLCs needing a restart, so this was far less impactful than anticipated. This ultimately fixed the problem.

h1guardian0 was scheduled for a reboot tomorrow to load patches, so this reboot was brought forward one day.

We still don't know why the h1ecatx1 IOC was accepting all caget requests from the control room network, only some caget requests from h1guardian0, and none from the guardian nodes.

Postscript: when we power-cycled h1guardian0 we kept the epics gateway off to ensure that CA clients would connect directly to h1ecatx1 and not go through the gateway. The nodes did not start up the same way as after the previous reboot: more nodes failed to connect to their channels. The h1ecatx1 channels were in the list, but so were channels on h1hwsmsr (a 10.105 machine) and h1seib3 (on the same front end network). After about 15 minutes TJ reported that these channels had reconnected. The only thing we had done in the meantime was restart the epics gateway, so either that fixed it or we simply waited long enough.
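To tell "the gateway restart fixed it" apart from "we waited long enough" next time, one option would be to record when each channel comes back relative to the restart. The sketch below is only an illustration under assumptions: `wait_for_reconnect` is a hypothetical helper, and `is_connected` is whatever connection probe is available on the host (for example a wrapper around a caget call). The 15-minute default timeout simply mirrors the delay seen here.

```python
import time


def wait_for_reconnect(channels, is_connected, timeout_s=900.0, poll_s=10.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll until every channel connects or the timeout expires.

    Returns a dict mapping each channel to the elapsed seconds at which it
    was first seen connected, or None if it never connected in time.
    """
    start = clock()
    reconnected = {ch: None for ch in channels}
    while True:
        elapsed = clock() - start
        for ch in channels:
            # Only probe channels that have not yet been seen connected.
            if reconnected[ch] is None and is_connected(ch):
                reconnected[ch] = elapsed
        all_back = all(t is not None for t in reconnected.values())
        if all_back or elapsed >= timeout_s:
            return reconnected
        sleep(poll_s)
```

Started immediately before the gateway restart, the returned times would show whether the channels came back within seconds of the restart or only after a long, unrelated delay.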