I didn't fully anticipate all of the effects of separating DOWN from the rest of the graph. In particular, one really bad unanticipated effect was that after lockloss, when ISC_LOCK jumps to the LOCKLOSS state, it doesn't find any path from LOCKLOSS to the last requested state, which causes it to just stall out in LOCKLOSS and not proceed to DOWN. In other words, DOWN was not run after the lockloss this morning that ended last night's 10 hour lock.
When I came in this morning I therefore found a bit of a poo show that I then had to clean up. None of the control signals had been shut off, multiple SUS and SEI systems were tripped, and the bounce/roll modes were rung up. Evan and I eventually wrangled everything back under control, and we're now back to locking.
I have reconnected DOWN to the rest of the graph. NOTE, however, that this problem is not inherent in the fact that DOWN was disconnected. It's just that once you do something like that you remove the ability of guardian to find the right path for you, so you have to be careful to make sure you have all the appropriate jumps to get you where you need to be. I'll rethink things.
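To illustrate the stall, here is a minimal sketch of path finding on a directed state graph. This is not guardian's actual code: `find_path` is a hypothetical helper, and the LOCKING and LOCKED states and all the edges are made up for the example. It only shows that once DOWN is cut off from the graph, there is no path from LOCKLOSS to the requested state, so nothing gets run unless an explicit jump is added.

```python
from collections import deque

def find_path(edges, start, goal):
    """Breadth-first search over directed edges.

    Returns the list of states from start to goal, or None if no path exists.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Hypothetical, heavily simplified graph with DOWN connected.
connected_edges = [
    ("LOCKLOSS", "DOWN"),
    ("DOWN", "LOCKING"),
    ("LOCKING", "LOCKED"),
]

# Same graph with DOWN disconnected from everything else.
disconnected_edges = [
    ("LOCKING", "LOCKED"),
]

path_connected = find_path(connected_edges, "LOCKLOSS", "LOCKED")
path_disconnected = find_path(disconnected_edges, "LOCKLOSS", "LOCKED")

print(path_connected)     # ['LOCKLOSS', 'DOWN', 'LOCKING', 'LOCKED']
print(path_disconnected)  # None
```

In the disconnected case the search returns nothing, which corresponds to the node sitting in LOCKLOSS with nowhere to go.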
Some notable issues:
-
HAM3 had tripped, which somehow sent IMC_LOCK into "christmas mode", where it was continually jumping between DOWN and MOVE_TO_OFFLINE. This was due to some questionable logic in the IMC_LOCK guardian that sends it to MOVE_TO_OFFLINE when the HAM2 or HAM3 SEI are not in their nominal states. It's doing this even from the DOWN state, which makes no sense and was causing this problem. I'll try to clean up the logic here.
-
The "slow" LSC IMC feedback (IMC_MCL) was not shut off, even after we had recovered ISC_LOCK and IMC_LOCK into their respective DOWN states. This was causing signals to continually be sent to MC2, even though it was misaligned, which was causing HAM3 to continually trip during isolation, which in turn led to the "christmas mode" above.
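As a rough sketch of the kind of cleanup the IMC_LOCK logic might need, the SEI check could simply skip the redirect when the node is already in DOWN. This is not the actual IMC_LOCK code: `sei_redirect`, its arguments, and the LOCKED state name are all hypothetical, and which states should be exempt is an assumption.

```python
# States from which a redirect to MOVE_TO_OFFLINE is pointless (assumption).
EXEMPT_STATES = {"DOWN"}

def sei_redirect(current_state, ham2_nominal, ham3_nominal):
    """Return the state to jump to, or None to stay where we are."""
    if ham2_nominal and ham3_nominal:
        # Both SEI platforms are nominal, nothing to do.
        return None
    if current_state in EXEMPT_STATES:
        # Already in DOWN: jumping away would only start the
        # DOWN <-> MOVE_TO_OFFLINE ping-pong.
        return None
    return "MOVE_TO_OFFLINE"

# HAM3 tripped while the node is locked: redirect.
from_locked = sei_redirect("LOCKED", True, False)
# HAM3 tripped while the node is already in DOWN: stay put.
from_down = sei_redirect("DOWN", True, False)
```

With a check like this, a tripped HAM3 would still pull the node out of a locked state, but would no longer bounce it out of DOWN.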
Lessons learned:
-
We need to be very careful when working on things that might affect the prompt running of the ISC_LOCK DOWN state. DOWN needs to be run promptly after lockloss, to prevent ringing up the bounce/roll modes that take a while to damp out. Both yesterday and today we had different issues that resulted in DOWN not being run after lockloss, both of which resulted in extra hours of recovery.
-
We should consider adding triggering to the front end to shut off all the control signals after a lockloss, and not rely on something external to do it. This is a natural extension of the loop engage triggering, but we hadn't quite gotten there yet.
-
Seemingly innocuous guardian changes can sometimes have unanticipated far-reaching effects. We should be particularly careful when adjusting the graph logic, as I did yesterday.