J. Kissel, for D. Barker
Dave came in this morning to find that the LSC front-end computer had failed. He hard-booted the computer and its IO chassis. Even though he tried his darnedest to gracefully take the computer out of the Dolphin network first, the restart broke all front-end processes in the corner station and even froze the h1sush2a computer entirely. We're now in the process of recovering.
Details on this event:
Jeff, Corey, Fil, Dave:
The h1lsc0 models were not running; the computer could not see its IO Chassis. In the CER, no activity was reported near the ISC chassis at the time of the crash (09:15 PST), and we found the IO Chassis fully powered.
On h1lsc0 I closed the models, issued the disconnect-from-dolphin command, and then powered down. On power-up, h1lsc0 glitched the Dolphin fabric in the corner station, so all models except for sus-aux stopped running. Around the time the h1ioplsc0 model started, h1sush2a froze up: its console was blank and there was no sshd service. I waited for h1lsc0 to fully start up to verify that its problem was transient.
I rebooted h1sush2a using the front panel reset button. As expected, when it came back it had glitched the dolphin fabric and taken down the lsc models.
I then restarted all models on all dolphin'ed corner station front-end computers (except for h1sush2a, which had just restarted). The procedure was /etc/kill_models.sh followed by /etc/start_models.sh. The restart order is in the attached restart log.
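For anyone repeating the restart procedure, here is a minimal sketch of how the per-computer commands could be generated. It only builds the command lines and does not run anything. The use of ssh, the `&&` chaining of the two scripts, the function name restart_commands, and the host list are all assumptions for illustration; the actual restart order is the one in the attached restart log.

```python
from typing import Iterable, List

# Script paths as given in this entry.
KILL_SCRIPT = "/etc/kill_models.sh"
START_SCRIPT = "/etc/start_models.sh"


def restart_commands(hosts: Iterable[str], skip: Iterable[str] = ()) -> List[List[str]]:
    """Build (but do not execute) one ssh command per front-end computer.

    Assumption: the scripts are invoked remotely over ssh and chained with
    '&&' so the start script only runs if the kill script succeeded.
    """
    skipped = set(skip)
    commands = []
    for host in hosts:
        if host in skipped:
            # e.g. a computer that was just rebooted and needs no restart
            continue
        commands.append(["ssh", host, f"{KILL_SCRIPT} && {START_SCRIPT}"])
    return commands


if __name__ == "__main__":
    # Illustrative host list only; not the real corner station restart order.
    hosts = ["h1lsc0", "h1sush2a"]
    for cmd in restart_commands(hosts, skip=["h1sush2a"]):
        print(" ".join(cmd))
```

Keeping the command construction separate from execution lets the list be reviewed against the intended restart order before anything touches the Dolphin fabric.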
Once all the models were running again, I cleared the IPC and CRC errors.
Jeff and Corey are recovering the systems.
A small consolation is that the new Dolphin PCIe Gen2 hardware uses a switch-based method to disable ports that seems to be more robust.
Recovery complete -- see LHO aLOG 40580.