We had some system problems over the weekend; here is a summary of the timeline. (These appear to be unrelated events.)
The last issue caused Guardian problems with the ALS_XARM node. We did the following to try to fix it:
Powering up h1guardian without the epics-gateway gave even more CA connection errors, this time with the HWS IOC and the h1seib3 FEC. These were very confusing, and they seemed to go away when the h1slow-h1fe EPICS gateway was restarted, which added to the confusion. We need to reproduce this error.
After Patrick restarted the h1ecatx1 IOC, the Guardian errors went away.
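For future reference, one way to narrow down whether CA connection errors like these are gateway-related would be to probe an affected channel both through the gateway and directly. A minimal sketch using the standard EPICS caget tool and CA environment variables (the channel name and IOC host here are placeholders, not the actual failing channels):

    # Query via the normal CA address list (which goes through the gateway):
    caget -w 5 H1:EXAMPLE-CHANNEL

    # Query the IOC host directly, bypassing the gateway (host is a placeholder):
    EPICS_CA_ADDR_LIST=<ioc-host> EPICS_CA_AUTO_ADDR_LIST=NO caget -w 5 H1:EXAMPLE-CHANNEL

If the direct query succeeds while the gateway query times out, that would point at the gateway rather than the IOC.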
Rebooting the entire guardian machine just because one node was having a problem seems like extreme overkill to me. I would not recommend that as a solution, since it obviously kills all the other guardian processes, causing them to lose their state and current channel connections. I don't see any reason to disturb the other nodes because one is having trouble. Any problem that would supposedly be fixed by rebooting the machine should also be fixed by just killing and restarting the affected node process.
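To sketch what I mean (assuming the standard guardctrl subcommands), recovering a single misbehaving node without touching anything else would look like:

    # Check the status of just the affected node, then restart it,
    # leaving every other guardian node running with its state intact:
    guardctrl status ALS_XARM
    guardctrl restart ALS_XARM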
The actual problem with the node is not specified, but the only issue I know of that would cause a node to become unresponsive and immune to a simple "guardctrl restart" is the EPICS mutex thread lock issue, which has been reported at both LLO and LHO, and in both cases was resolved without rebooting the entire machine. Presumably the issue being reported here is somehow different? It would be good to have a better description of what exactly the problem was.
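If the node really was wedged in that state and ignoring "guardctrl restart", the kind of single-node recovery I have in mind is to kill the stuck process by hand and then restart only that node. A rough sketch, not the documented fix (the process match pattern is a guess at how the node shows up in the process table):

    # Find the PID of the stuck node process:
    pgrep -af ALS_XARM

    # Force-kill the stuck process, then bring only that node back up:
    kill -9 <PID>
    guardctrl restart ALS_XARM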