At 13:25:11 PDT on Wed 21 May 2025, all models on h1sush7 stopped running. The SWWD to h1iopseih7 started its countdown with 100% IPC receive errors.
I set a bypass time of 999999 on h1iopseih7 to interrupt the countdown.
We first restarted all the models. This did not clear the error, and we found a PCI bus issue in which only one of the four ADCs was visible.
As a precautionary measure we fenced h1sush7 from the Dolphin fabric.
Next we power cycled the computer. This fixed the PCI bus issue: all cards were visible and all models started running.
I cleared the SWWDs and handed the system over to the control room.
Note: TIMING errors flashed intermittently during the recovery; we have not yet tracked down their cause.
[Wed May 21 13:25:11 2025] h1iopsush7: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Wed May 21 13:25:11 2025] h1susauxh7: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Wed May 21 13:25:11 2025] h1sussqzin: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Wed May 21 13:25:11 2025] h1susfc1: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
Added this crash to FRS20317.