Reports until 20:11, Tuesday 18 March 2014
H1 SYS (CDS, SYS)
jameson.rollins@LIGO.ORG - posted 20:11, Tuesday 18 March 2014 - last comment - 12:13, Wednesday 19 March 2014(10855)
supervised guardian nodes unable to talk to h1ecatx1 Beckhoff

For some reason (maybe related to the CDS bogeyman?) the supervised guardian nodes (i.e. the nodes running under the guardian supervision infrastructure on h1guardian0) are unable to talk to any of the h1ecatx1 TwinCAT IOC channels.

Sheila first noticed this when guard:ALS_XARM was not able to connect to the H1:ALS-X_LOCK_ERROR_STATUS channel.  The truly weird thing, though, is that all other channel access methods can access the h1ecatx1 channels just fine.  We can caget channels from the command line and from ezca in python.  We can do this on operator workstations and even from the terminal on h1guardian0.  It's only the supervised guardian nodes that can't connect to the channels.

I tried reloading the code and restarting the guardian nodes, but nothing helped.  The problem is the same regardless of which node is used.  Note:

I'm at a loss.

There's clearly something different about the environment inside the guardian supervision infrastructure that makes this kind of failure even possible, although I honestly have no idea what the issue could be.

I'm going to punt on spending any more time diagnosing the problem and chalk it up to the other weirdness for now.  Hopefully things will fix themselves tomorrow.

Comments related to this report
sheila.dwyer@LIGO.ORG - 22:45, Tuesday 18 March 2014 (10858)

The other thing to note is that I did an svn update on that computer right before it crashed.  It might be worth looking at what was included in the update to see if it changed the behavior of the IOC somehow.

cyrus.reed@LIGO.ORG - 12:13, Wednesday 19 March 2014 (10866)

(As also discussed in person.)  This may be due to a difference in the environment setup between the Guardian supervisors and a login shell.  The EPICS gateway processes are still in place on the FE subnet, as we have not changed the data concentrator or FE systems to broadcast directly to other subnets.  So the channel behavior will depend on the setting of the EPICS_CA_ADDR_LIST environment variable, specifically on whether CA traverses the gateway or routes through the core switch.  The problem described sounds a lot like the issue the gateway has with reconnecting to Beckhoff IOCs; if the Guardian processes are connecting to the gateway, this would explain the behavior described.  Jamie was going to look at the Guardian environment setup as time permits, to see how it differs from the current cdscfg setup.
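
One way to compare the two environments is to read the EPICS settings straight out of a running process and diff them against a login shell. The following is only a rough sketch: it assumes a Linux host where /proc/<pid>/environ is readable for the process in question, and the helper names (epics_vars, parse_proc_environ, read_process_env, diff_env) are made up for illustration and are not part of guardian or cdscfg.

```python
import os


def epics_vars(environ):
    """Return only the EPICS_* entries of an environment mapping."""
    return {k: v for k, v in environ.items() if k.startswith("EPICS_")}


def parse_proc_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for item in raw.split(b"\0"):
        if b"=" in item:
            key, _, value = item.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_process_env(pid):
    """Read the environment a running process was started with (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_proc_environ(f.read())


def diff_env(a, b):
    """Map each variable whose value differs to a (value_in_a, value_in_b) pair.

    A variable that is missing on one side shows up as None on that side.
    """
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Run from a login shell on h1guardian0, something like diff_env(epics_vars(os.environ), epics_vars(read_process_env(pid))), with pid being the process ID of one of the supervised nodes, should show whether EPICS_CA_ADDR_LIST (or EPICS_CA_AUTO_ADDR_LIST) differs between the two. Note that /proc only reflects the environment at process start, not any changes the process made to it afterwards.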