Lockloss @ 05:50 UTC, caused by Dolphin glitch.
The first event I noticed was the HEPI HAM1 watchdog tripping, followed by the IOP DACKILLs tripping for iopsusb123, iopsush2a, iopsush34, and iopsush56 and a very red CDS overview (attached). Called Dave and we're now working on recovery.
EDIT: Attached CDS overview screenshot at time of glitch.
At first glance Dave believes the glitch originated from the OMC model, causing everything else to trip. He's isolating it from the Dolphin network and restarting it now.
Ryan S, Dave:
After a diag reset, which cleared a bunch of cached IPC errors, we were left with:
1. Every receiver of IPCs originating from h1omc0 was continuously bad, including those at the end stations.
2. The IOP DACKILLs were permanently asserted on h1sus[b123, h2a, h34, h56]
So it looks like the IX Dolphin card on h1omc0 has gone offline and caused a glitch which took down the corner station SUS listed above.
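The reasoning above (every persistently bad receiver shares one sender, so the sender is the suspect) can be sketched in Python. This is only an illustration: the `(receiver, sender)` pairs and the receiver names are hypothetical placeholders, not output from any real CDS tool.

```python
def common_sender(bad_receivers):
    """Return the sender host shared by all failing receivers, or None.

    bad_receivers: iterable of (receiver_name, sender_host) pairs for the
    IPC channels that are still bad after a diag reset (hypothetical format).
    """
    senders = {sender for _receiver, sender in bad_receivers}
    if len(senders) == 1:
        return next(iter(senders))
    return None  # no errors, or errors spread over several senders


# Hypothetical receiver names, for illustration only.
bad = [
    ("corner_receiver_a", "h1omc0"),
    ("corner_receiver_b", "h1omc0"),
    ("end_station_receiver", "h1omc0"),
]
print(common_sender(bad))
```

If the remaining errors pointed at more than one sender, the function returns `None` and a single faulty Dolphin card would be a less obvious explanation.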
Recovery:
Log into h1omc0 and check it can see its IO Chassis [it can] and see if the dmesg logs show anything from this time [they don't].
Fence h1omc0 from Dolphin and reboot.
When h1omc0 came back and its models restarted, all the outstanding IPC receive errors cleared.
Onto SUS. For each we safed the SUS, bypassed the SEI SWWD receivers, stopped all the models then started all the models. When the IOP was running again, I unbypassed the corresponding SEI SWWDs.
This worked well for h1susb123, h1sush2a and h1sush34. It did not work for h1sush56: the IOP model failed to restart.
I stopped the partially started h1sush56 models, checked the IO Chassis was visible [it was], fenced it from Dolphin and rebooted.
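As a rough summary of the SUS recovery order described above, here is a minimal Python sketch that only builds the list of steps per front end. The function name `recover_sus_host` and the `iop_starts` callback are made up for illustration; no real CDS commands are run or implied.

```python
from typing import Callable, List


def recover_sus_host(host: str, iop_starts: Callable[[str], bool]) -> List[str]:
    """Return the ordered recovery steps for one SUS front end.

    iop_starts is a hypothetical callback reporting whether the IOP model
    came back after the model restart.
    """
    steps = [
        f"safe the SUS on {host}",
        f"bypass SEI SWWD receivers for {host}",
        f"stop all models on {host}",
        f"start all models on {host}",
    ]
    if iop_starts(host):
        steps.append(f"unbypass SEI SWWDs for {host}")
    else:
        # Fallback used for h1sush56. The log does not say what was done
        # after the reboot, so the sketch stops there.
        steps += [
            f"stop partially started models on {host}",
            f"check IO Chassis is visible from {host}",
            f"fence {host} from Dolphin",
            f"reboot {host}",
        ]
    return steps


# Outcome of the IOP restart as reported in this entry.
iop_ok = {"h1susb123": True, "h1sush2a": True, "h1sush34": True, "h1sush56": False}
for host in iop_ok:
    for step in recover_sus_host(host, lambda h: iop_ok[h]):
        print(step)
```

Keeping the steps as plain strings makes the ordering easy to review; the key point is that the SWWD bypass goes on before the models stop and comes off only once the IOP is running.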
As h1sush56 came back from reboot, we saw a lot of IPC flashes on various systems. I had seen one or two flashes, but h1sush56 flashed the IPCs for many seconds until its IOP model got going. From that point onwards the Dolphin network was good, no new IPC errors.