Reports until 12:32, Tuesday 10 March 2015
H1 GRD (GRD, ISC, SUS, SYS)
jeffrey.kissel@LIGO.ORG - posted 12:32, Tuesday 10 March 2015 (17173)
An Exercise in Guardian Management Initialization
J. Kissel, B. Weaver, J. Rollins

We (Betsy and I) had begun returning all the corner station SUS to ALIGNED while Jamie was restarting all of the Guardian code. After the dust settled, we noticed that ITMY and PRM were stalled in their SAFE state. After a little poking around in the guardian logs, we discovered that, upon initialization, the LSC_CONFIGS manager took possession of SUS_ITMY and SUS_PRM as expected. However, a minute or so later, when the IFO_ALIGN manager initialized, it superseded that control. Neither the LSC_CONFIGS nor the IFO_ALIGN manager's initialization state is coded to recognize that a subordinate has stalled in SAFE and to re-issue its commands, so the SUS remained stalled in SAFE.

This *didn't* happen on any of the other SUS because the SUS were already in the ALIGNED state -- the state that IFO_ALIGN (who had gained management this time) expects them to be in -- so no action was taken. Betsy and I had not yet gotten to requesting ALIGNED of PRM and ITMY. Therefore the nodes went down in the SAFE condition, restarted, initialized, found the SUS in the SAFE state, jumped to SAFE, and stalled because their manager didn't tell them to continue up to the requested state of ALIGNED.
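
The sequence above can be reproduced in a toy model. This is only a sketch in plain Python: the Node and Manager classes, the simulate_restart function, and the SUS_OTHER placeholder are hypothetical and are not the Guardian API. It assumes that a restarted node stalls whenever it finds itself in a state other than its request, that LSC_CONFIGS claims only PRM and ITMY, and that a manager's initialization only claims nodes and never revives them.

```python
# Toy model only: hypothetical classes, NOT the Guardian API.

class Node:
    """A subordinate node that should end up in its requested state."""

    def __init__(self, name, state, request="ALIGNED"):
        self.name = name
        self.state = state
        self.request = request
        self.manager = None
        self.stalled = False

    def restart(self):
        # Simplification: after a restart the node re-enters the state it
        # finds itself in. If that is not the requested state, it waits
        # for its manager to tell it to continue, i.e. it stalls.
        self.stalled = self.state != self.request


class Manager:
    """A manager whose initialization claims nodes but never checks them."""

    def __init__(self, name):
        self.name = name

    def initialize(self, nodes):
        for node in nodes:
            # The most recent manager to initialize wins possession.
            node.manager = self.name


def simulate_restart():
    prm = Node("SUS_PRM", "SAFE")
    itmy = Node("SUS_ITMY", "SAFE")
    other = Node("SUS_OTHER", "ALIGNED")  # placeholder for an already-ALIGNED SUS
    nodes = [prm, itmy, other]
    for node in nodes:
        node.restart()
    Manager("LSC_CONFIGS").initialize([prm, itmy])  # initializes first
    Manager("IFO_ALIGN").initialize(nodes)          # initializes later, supersedes
    return {node.name: node for node in nodes}


if __name__ == "__main__":
    for name, node in simulate_restart().items():
        status = "stalled" if node.stalled else "ok"
        print(name, node.state, node.manager, status)
```

In this model PRM and ITMY end up owned by IFO_ALIGN and still stalled in SAFE, while the already-ALIGNED placeholder needs no action, which matches what we saw.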

In summary, there are several flaws that caused this issue:
- PRM and ITMY (and probably other SUS) are doubly managed by LSC_CONFIGS and IFO_ALIGN.
- LSC_CONFIGS lost management of PRM and ITMY, and didn't acknowledge it or throw a warning. Nor did LSC_CONFIGS notice the stall and tell the SUS to keep going.
- IFO_ALIGN doesn't know how to handle a stalled subordinate node; again, upon initialization, it doesn't acknowledge that the node is stalled, throw a warning, or do anything about it.
- Globally, the ISC managers were not properly initialized to send out the proper commands to their subordinates upon a reboot.
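
One possible shape for the missing initialization checks is sketched below. This is plain Python with hypothetical names (Subordinate, check_subordinates), not Guardian code, and it is not a proposal for the actual fix. It assumes a manager can read each subordinate's current manager, state, request, and stalled flag, and that re-issuing the request is enough to relieve a stall.

```python
# Sketch only: hypothetical names, NOT the Guardian API.
from dataclasses import dataclass


@dataclass
class Subordinate:
    name: str
    state: str
    request: str
    manager: str
    stalled: bool


def check_subordinates(manager_name, subordinates):
    """One initialization pass for a manager.

    Returns (warnings, revived): warning strings for lost management or
    stalled nodes, and the names of the nodes that were told to continue.
    """
    warnings, revived = [], []
    for sub in subordinates:
        if sub.manager != manager_name:
            # Another manager has taken possession: warn, do not touch it.
            warnings.append(
                f"{manager_name}: lost management of {sub.name} to {sub.manager}"
            )
            continue
        if sub.stalled:
            warnings.append(
                f"{manager_name}: {sub.name} stalled in {sub.state}, "
                f"re-issuing request {sub.request}"
            )
            # Assumption: re-issuing the request lets the node continue.
            sub.state = sub.request
            sub.stalled = False
            revived.append(sub.name)
    return warnings, revived


if __name__ == "__main__":
    subs = [
        Subordinate("SUS_PRM", "SAFE", "ALIGNED", "IFO_ALIGN", True),
        Subordinate("SUS_ITMY", "SAFE", "ALIGNED", "IFO_ALIGN", True),
    ]
    for manager in ("LSC_CONFIGS", "IFO_ALIGN"):
        messages, revived = check_subordinates(manager, subs)
        print(manager, messages, revived)
```

With a pass like this, LSC_CONFIGS would at least have reported losing PRM and ITMY, and IFO_ALIGN would have noticed the stall and told the SUS to keep going.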

We've relieved the stall by asking LSC_CONFIGS directly to go to DOWN (even though LSC_CONFIGS itself is managed by ISC_LOCK). LSC_CONFIGS then regained control of SUS_PRM and SUS_ITMY, and they traversed their graphs to ALIGNED as expected.

We'll work with the ISC team to sort out the management confusion and flesh out the proper initialization sequence.