Something happened at lock loss and the IFO did not reach the defined DOWN state.
Symptoms:
Dave, Jamie, Sheila, operators, and others are investigating.
Let me try to give a slightly more detailed narrative as we were able to reconstruct it:
So what's the takeaway:
bug number 1: guardian should have caught the NDS connection error during the NDS restart and gone into a "connection error" (CERROR) state. In that case, it would have continually checked the NDS connection until it was re-established, at which point it would have continued normal operation. This is in contrast to the ERROR state where it waits for operator intervention. I will work on fixing this for the next release.
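The difference between the two behaviors can be sketched as a small state loop. This is only an illustration under stated assumptions, not guardian's actual implementation: `ConnectionLost`, `run_node`, `step`, and `check_connection` are hypothetical names made up for this example.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical exception: the data server (e.g. NDS) cannot be reached."""


def run_node(step, check_connection, max_cycles, retry_delay=0.0):
    """Run step() for max_cycles cycles and return the state after each cycle.

    RUN    : normal operation, step() is executed every cycle.
    CERROR : connection lost; keep polling check_connection() and resume
             on its own once the connection is back.
    ERROR  : any other failure; stay put until an operator intervenes.
    """
    state = "RUN"
    history = []
    for _ in range(max_cycles):
        if state == "RUN":
            try:
                step()
            except ConnectionLost:
                state = "CERROR"   # recoverable without a human
            except Exception:
                state = "ERROR"    # needs operator intervention
        elif state == "CERROR":
            if check_connection():
                state = "RUN"      # connection re-established, carry on
            else:
                time.sleep(retry_delay)
        # In ERROR nothing happens: the node waits for a human.
        history.append(state)
    return history
```

With a connection drop handled this way, the node would sit in CERROR only for as long as the server is away, instead of parking in ERROR and missing whatever happens in the meantime.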
bug number 2: The operators didn't know or didn't respond to the fact that the IMC_LOCK guardian had gone into ERROR. This is not good, since we need to respond quickly to these things to keep the IFO operating robustly. I propose we set up an alarm in case any guardian node goes into ERROR. I'll work with Dave et al. to get this set up.
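As a rough sketch of what such an alarm check could look like: the functions below are hypothetical, and the node states are passed in as a plain dictionary. How the states would actually be read from the running nodes is left out here and is an assumption of the example.

```python
def nodes_in_error(node_states):
    """Return the sorted names of all nodes whose state is ERROR.

    node_states is a hypothetical mapping of node name -> state string.
    """
    return sorted(name for name, state in node_states.items() if state == "ERROR")


def alarm_message(node_states):
    """Build an alarm string if any node is in ERROR, otherwise return None."""
    bad = nodes_in_error(node_states)
    if not bad:
        return None
    return "Guardian node in ERROR: " + ", ".join(bad)
```

For example, `alarm_message({"IMC_LOCK": "ERROR", "ISC_LOCK": "DOWN"})` would flag only IMC_LOCK, and the resulting string could be handed to whatever audible or verbal alarm system is in use.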
As an aside, I'm going to be working over the next week to clean up the guardian and SDF/SPM situation to eliminate all the spurious warnings. We've got too many yellow lights on the guardian screen, which means that we're now in the habit of just ignoring them. They're supposed to be there to inform us of problems that require human intervention. If we just leave them yellow all the time they end up having zero effect and we're left with a noisy alarm situation that everyone just ignores.
A series of events led to the ISC_LOCK Guardian not recognizing that there was a lockloss.
To prevent this from happening in the future, Jamie will have Guardian keep waiting for the NDS server to reconnect, rather than stopping and waiting for user intervention before becoming active again. I also added a verbal alarm for Guardian nodes in ERROR to alert operators/users that action is required.
(If I missed something here please let me know)