This morning the corner 3 ETMX coil driver took the ISI down. I finally understand what's going on.
1. It seems that some relays on the ETMX coil drivers are going bad, but these are just on a circuit that produces the status bits. The thermal protection relay is also part of this, but, I think, provides independent protection of the coil driver.
2. The coil driver binary bits produced by these circuits are already bypassed at the top level of the model, and haven't been connected to the watchdogs since 2017.
3. We've added a status bit at the top level of the model to monitor these bits, and connecting that to the guardian is what has bitten us now. The status bit records the bit word coming out of the binary inputs, then some math is done to overwrite the bits that would have gotten passed to the watchdogs. The chamber guardian watches the new status bit, and if it sees the COILMON_STATUS_ALL_OK go bad, it just puts both ST1 and ST2 guardians to READY. Doing that ungracefully is what has been causing the trips: both stages are turned off at the same time, and you can't do that and expect good results.
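As a rough illustration of item 3, here is a minimal Python sketch of recording the raw bit word for monitoring while forcing the bits that go downstream to a good state. The bit count, the polarity (1 = OK), and every name here are assumptions for illustration only; this is not the actual model logic.

```python
# Hypothetical sketch: bit count, polarity (1 = OK) and all names are
# assumptions, not taken from the real front-end model.
N_BITS = 3                      # assumed number of coil driver status bits
ALL_OK = (1 << N_BITS) - 1      # word with every assumed bit set to "OK"


def split_status(raw_word):
    """Return (status_all_ok, word_passed_downstream)."""
    raw_word &= ALL_OK                   # keep only the assumed status bits
    status_all_ok = raw_word == ALL_OK   # monitor-only copy of the real state
    bypassed_word = raw_word | ALL_OK    # force every bit good downstream
    return status_all_ok, bypassed_word
```

In this sketch a bad relay changes only the monitor flag; the word passed on is always all-good, which matches the idea that the watchdogs themselves no longer see these bits.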
At this point we should just change the guardian behavior. If the STATUS_ALL_OK bit goes bad, it should just put up a notification. We should think about adding a flag, so we can ignore chambers where relays start going bad and the STATUS_ALL_OK is no longer a useful notification. We have swapped out the "broken" coil drivers on ETMX; I think we can swap them back in once we have a guardian fix.
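The proposed behavior could look something like the sketch below: a bad status bit produces only a message, and a per-chamber ignore flag suppresses it. This is plain Python, not guardian code; the function name, its arguments, and the message text are all hypothetical, and the real guardian notification call would take the place of the returned string.

```python
# Hypothetical sketch: function and argument names are illustrative only and
# are not the real guardian API or channel names.
def coilmon_response(chamber, status_all_ok, ignored_chambers=frozenset()):
    """Return a notification string, or None if nothing should be reported.

    Never requests a state change, so ST1 and ST2 are left running.
    """
    if status_all_ok:
        return None                      # nothing to report
    if chamber in ignored_chambers:
        return None                      # known-bad relays: suppress the message
    return f"{chamber}: COILMON_STATUS_ALL_OK is bad, check coil driver status"
```

For example, `coilmon_response("ETMX", False, {"ETMX"})` returns `None`, so a chamber with known-bad relays stays quiet, while any other chamber with a bad bit gets a message and nothing else.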
Longer term, we should implement a hardware fix for the broken relays. Marc and Richard have a plan, but even if we had a design and approvals, we probably wouldn't have the people power to fix all of the coil drivers immediately.
FWIW - the "_TEST" inputs were used to test the WD system functionality. They are no longer needed since the WD inputs have been disconnected.