Reports until 15:32, Thursday 19 May 2022
H1 SEI
jim.warner@LIGO.ORG - posted 15:32, Thursday 19 May 2022 - last comment - 08:13, Friday 20 May 2022(63236)
ETMX coil driver work today

This morning the corner 3 ETMX coil driver took the ISI down. I finally understand what's going on.

1. It seems that some relays on the ETMX coil drivers are going bad, but these are just on a circuit that produces the status bits. The thermal protection relay is also part of this, but, I think, provides independent protection of the coil driver.

2. The coil driver binary bits produced by these circuits are already bypassed at the top level of the model, and haven't been connected to the watchdogs since 2017.

3. We've added a status bit at the top level of the model to monitor these bits, and connecting that to the guardian is what has bitten us now. The status bit records the bit word coming out of the binary inputs, then some math is done to overwrite the bits that would have gotten passed to the watchdogs. The chamber guardian watches the new status bit, and if it sees COILMON_STATUS_ALL_OK go bad, it just puts both the ST1 and ST2 guardians to READY. Doing that ungracefully is what has been causing the trips: both stages are turned off at the same time, and you can't do that and expect good results.

At this point we should just change the guardian behavior. If the STATUS_ALL_OK bit goes bad, it should just put up a notification. We should think about adding a flag, so we can ignore chambers where relays start going bad and the STATUS_ALL_OK is no longer a useful notification. We have swapped out the "broken" coil drivers on ETMX. I think we can swap them back in when we have a guardian fix.
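As a rough illustration of the proposed behavior, here is a minimal plain-Python sketch. It is not the real guardian code: the function name coilmon_action, the ignored set, and the message text are all hypothetical, and how the result gets posted as a guardian notification is left out.

```python
# Hypothetical sketch only: none of these names come from the real guardian code.

def coilmon_action(chamber, status_all_ok, ignored=frozenset()):
    """Return a notification string, or None if nothing should be reported.

    This never requests a state change: a bad status bit alone should not
    send ST1 and ST2 to READY.
    """
    # 'ignored' stands in for the proposed per-chamber flag for chambers
    # whose relays are known to be going bad.
    if status_all_ok or chamber in ignored:
        return None
    return f"{chamber}: COILMON_STATUS_ALL_OK is bad, check coil driver relays"


print(coilmon_action("ETMX", False))
print(coilmon_action("ETMX", False, ignored={"ETMX"}))
```

The point of the sketch is only that the bad bit maps to a message, and that a flagged chamber maps to nothing at all.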

Longer term, we should implement a hardware fix for the broken relays. Marc and Richard have a plan, but even if we had a design and approvals, we probably wouldn't have the people power to fix all of the coil drivers immediately.

 

Comments related to this report
jeffrey.kissel@LIGO.ORG - 16:57, Thursday 19 May 2022 (63239)
In order to increase our understanding, Jim and I followed the binary IO signal from the coil drivers on its journey through the BSC ISI Simulink model.

For uber clarity, which we all like when anyone mentions changes to watchdogs or action that surrounds watchdogs, I show you that journey.

Fact 1: There are 4 binary readbacks per coil driver, and for a BSC ISI, with two stages of 6 actuators, that means there need to be 12 bits that report the status of this thermal overload relay condition. They're grouped by corner, so the first chassis drives ST1 H1, ST2 H1, ST1 V1, and ST2 V1, the second chassis drives the two H2 and two V2 corners, and the third chassis drives the H3s and V3s. See D0901301.

Fact 2: There are 3 halves of 64 channel cards worth of binary readbacks for a BSC ISI, and the 12 coil driver thermal relay state readbacks are on the first 24 bits of the lower 32 channels of the first 64 channel card (BIO 1). 

Fact 3: Each chassis's four readbacks are spaced out on the 24 readback channels, so 
    - the first group of 4, corner 1, readbacks on bits 0-3 (counting starting at 0), then 4-7 are unused. 
    - the second group of 4, corner 2, readbacks on bits 8-11, then 12-15 are unused
    - the third group of 4, corner 3, readbacks on bits 16-19, then 20-23 are unused

Fact 4: The last 8 channels on the 32-bit lower half, 24-31 begin the collection of readbacks of other switch states unrelated to the coil drivers, namely the gain and whitening switch readbacks for the ST1 L4Cs and ST2 GS13s.
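The layout in Facts 3 and 4 can be sketched in a few lines of Python. The function and constant names here are made up for illustration; only the bit positions come from the facts above.

```python
# Bit offset of each corner's group of 4 readbacks in the lower 32-bit word.
# (Names are illustrative; the offsets are from Fact 3.)
CORNER_OFFSETS = {1: 0, 2: 8, 3: 16}


def coil_driver_bits(word):
    """Return {corner: [b0, b1, b2, b3]} for the 12 coil driver readbacks."""
    return {
        corner: [(word >> (offset + i)) & 1 for i in range(4)]
        for corner, offset in CORNER_OFFSETS.items()
    }


def other_switch_bits(word):
    """Return bits 24-31 (gain and whitening readbacks) as an 8-bit integer."""
    return (word >> 24) & 0xFF


# Example: all relays report 1 except the second readback of corner 3 (bit 17).
word = 0x0F0F0F & ~(1 << 17)
print(coil_driver_bits(word))
```

Bits 4-7, 12-15, and 20-23 are simply never looked at, which matches their being unused.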

Fact 5: In the early days, the 12 coil driver thermal relay states were "ANDed" into two bits, creating one as a trigger for the ST1 watchdog, and one as a trigger for the ST2 watchdog.
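A sketch of that AND'ing, in Python, under one loudly stated assumption: that within each group of 4 the bits follow the order listed in Fact 1 (ST1 H, ST2 H, ST1 V, ST2 V). The actual bit-to-actuator assignment should be checked against D0901301 and the model; the names below are illustrative.

```python
CORNER_OFFSETS = (0, 8, 16)

# ASSUMPTION: bit order within each group of 4 is ST1 H, ST2 H, ST1 V, ST2 V.
ST1_POSITIONS = (0, 2)
ST2_POSITIONS = (1, 3)


def stage_flags(word):
    """Return (st1_ok, st2_ok): the AND of each stage's 6 relay bits (1 = OK)."""
    def all_set(positions):
        return all(
            (word >> (offset + p)) & 1
            for offset in CORNER_OFFSETS
            for p in positions
        )
    return all_set(ST1_POSITIONS), all_set(ST2_POSITIONS)


# With the assumed ordering, clearing bit 16 drops only the ST1 flag.
print(stage_flags(0x0F0F0F & ~(1 << 16)))
```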

Fact 6: In 2017, the SEI team "bypassed the watchdog" by putting a new COILMON library block at the top level of each model that 
    (a) intercepts the whole 32 bit word from the lower half of the first BIO card,
    (b) In parallel to what's sent to the main library block, 
          (1) makes a copy of the bit word and sends it into a STATUS subsystem that 
          (2) has the option to add in a bit word to modify the readback bit word H1:ISI-ETMX_COILMON_STATUS_TEST_IN, altering the information encoded,
          (3) has a copy of the bit-word decoding for all 12 of the coil driver relay states, with a logical AND of ALL the stages' readbacks, which feeds into a 60 second "wait" block to spit out a warning flag, H1:ISI-ETMX_COILMON_STATUS_ALL_OK
    (c) But in the main path to the main library block, uses a logical AND to apply a "bit mask" to that 32 bit word, 
        1111 1111 0000 0000 0000 0000 0000 0000 (binary)
        = 4278190080 (decimal)
        = 0xFF000000 (hex)
    where one must recall that binary is read from right to left, so bit 0 is the rightmost digit: bits 0 through 23 are 0, and bits 24 through 31 are 1. This forces all of the lowest bits, 0 through 23, to zero, regardless of the real state, but leaves the upper gain and whitening status bits, 24-31, untouched and reporting the same status as before, then
    (d) adds in the binary equivalent of "1 for each of the 12 coil driver monitor channels," i.e. 
        0000 1111 0000 1111 0000 1111 
        = 986895 (decimal)
        = 0xF0F0F (hex) 
       with bit 0 having value 1, and bit 23 having value 0.
And it's this "bits 0-3, 8-11, 16-19 are all set to 1" status of the channel H1:ISI-ETMX_COILMON_BIO_OUT that gets fed into the main library block, feeding each stage's watchdog an "all is good!" message at all times regardless of the *actual* state of the coil driver switch relay status.
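The arithmetic in (c) and (d), and the difference between the two paths, can be checked with a short Python sketch. The function names are made up for illustration, and the parallel path is simplified: the TEST_IN input and the 60 second wait are left out.

```python
KEEP_MASK = 0xFF000000   # 4278190080: keeps bits 24-31, zeros bits 0-23
FORCED_OK = 0x0F0F0F     # 986895: bits 0-3, 8-11, 16-19 set to 1


def bio_out(raw_word):
    """Main path: mask off bits 0-23, then add the hard-coded "all good" bits."""
    return (raw_word & KEEP_MASK) + FORCED_OK


def status_all_ok(raw_word):
    """Parallel path (simplified): AND of the 12 real relay bits."""
    return (raw_word & FORCED_OK) == FORCED_OK


# Example raw word: some gain/whitening bits set, one relay bit (16) gone bad.
raw = 0xA5000000 | (0x0F0F0F & ~(1 << 16))
print(hex(bio_out(raw)))     # watchdog path still sees all 12 bits set
print(status_all_ok(raw))    # status path sees the real state
```

Because the mask zeros bits 0-23 first, the addition in bio_out never carries into the upper bits, so it behaves the same as a bitwise OR here.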

Confusingly, inside the main library block, there is *also* the option to add values to the binary readbacks, with channels like 
    - H1:ISI-ETMX_BIO_IN_BIO_IN_TEST (for the lower 32 bits of the 1st card)
    - H1:ISI-ETMX_BIO_IN_BIO_IN_TEST1 (for the upper 32 bits of the 1st card)
    - H1:ISI-ETMX_BIO_IN_BIO_IN_TEST2 (for the lower 32 bits of the 2nd card)
These were probably the first hacks to fool the watchdog, which eventually became the top-level COILMON block, and were just never removed.

Those are all the facts you need to know, and attached is a screenshot journey to aid the proof of these facts.
    - Attachment 1, top level model, focused on where the BIO card readbacks come in,
    - Attachment 2, top level model, focused on where the lower 32 bits of BIO card gets intercepted
    - Attachment 3, inside the interceptor COILMON block
    - Attachment 4, inside the parallel path COILMON_STATUS block
    - Attachment 5, zooming back out to the top level, a repeat of the top level model, focused on where the lower 32 bits of BIO card gets intercepted
    - Attachment 6, diving into the main library block, to focus on where all binary input gets decoded in the BIO_IN,
    - Attachment 7, into the BIO_IN block, focusing on the initial decoding
    - Attachment 8, then, after decoding, the AND'ing into ST1 and ST2 WD flags
    - Attachment 9, zooming back out, a repeat of the main library part's BIO_IN block to show the flag tag,
    - Attachment 10, then up and back out and over for where the ST1 flag is sent into the ST1 block,
    - Attachment 11, then into the ST1, where the flag comes into the ST1_WD block, then
    - Attachment 12, finally inside the ST1_WD block 

MEDM screens don't really convey this journey very well, which adds to the confusion, but the message is that 
Guardian watches the parallel STATUS path channel H1:ISI-ETMX_COILMON_STATUS_ALL_OK, whereas the ST1 and ST2 watchdogs watch the old path, H1:ISI-ETMX_COILMON_BIO_OUT, which has been hard-coded to be fooled in the front-end.

As such, it's the *guardian* that needs changing, NOT anything in the front-end. (Unless we want to be kind to us in 2024, and clean up all the hacks that make the journey confusing. But almost all will say "not worth it.")
Images attached to this comment
brian.lantz@LIGO.ORG - 08:13, Friday 20 May 2022 (63244)

FWIW - the "_TEST" inputs were used to test the WD system functionality. They are no longer needed, since the WD inputs have been disconnected.