WP7502
Jeff K, Corey, TJ, Jamie, Dave:
All front end computers with uptimes approaching or exceeding 208 days were rebooted. The sequence was: stop all models on the computers to be rebooted (leaving PSL till last), then reboot all the computers. Dolphin'ed machines waited until the last computer was rebooted before starting their models.
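For reference, uptime in whole days can be read straight from /proc/uptime on each front end (a generic sketch; the 208-day threshold check and the list of hosts to loop over are left to the operator):

```shell
# Print this machine's uptime in whole days.
# /proc/uptime's first field is seconds since boot (Linux-specific).
uptime_days() {
  awk '{ print int($1 / 86400) }' /proc/uptime
}
uptime_days
```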
I had a repeat of yesterday's h1susex problem: after a reboot it lost communication with its IO Chassis. Today's sequence was:
remotely rebooted h1susex; it did not come back (no ping response)
remotely reset h1susex via its IPMI management port; it booted but lost communication with its IO Chassis
at the EX end station, powered h1susex down, power cycled the IO Chassis, and powered h1susex back on. This time the models started.
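The remote IPMI reset step looks roughly like the following (a sketch only: the management-port hostname, interface, and credentials are placeholders, not the real site values, and ipmitool is assumed to be the tool in use):

```shell
# Power-cycle a front end through its BMC when the OS is unresponsive.
# Set DRY_RUN=1 to print the command instead of touching hardware.
ipmi_reset() {
  bmc_host=$1
  cmd="ipmitool -I lanplus -H $bmc_host -U admin chassis power reset"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$cmd"
  else
    $cmd
  fi
}
```

After the reset, `ipmitool ... chassis power status` reports whether the chassis has come back on.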
Despite removing h1susex from the Dolphin fabric, h1seiex and h1iscex glitched and had their models restarted. Ditto for h1oaf0 and h1lsc0 in the corner station.
Machines rebooted (as opposed to just having their models restarted) were:
h1psl0, h1seih16, h1seih23, h1seih45, h1seib1, h1seib2, h1seib3, h1sush2b, h1sush34, h1sush56, h1susb123, h1susauxh2, h1susauxh56, h1asc0, h1susauxey, h1seiex, h1iscex
Some guardian nodes stopped running as a result of these restarts. Jamie and TJ are investigating.
guardian processes killed for watchdog timeout when front ends were rebooted
At roughly 10:05 AM local time, 45 of the guardian nodes went dead (list at bottom). The time was coincident with the front end reboots. Technically the guardian nodes did not crash; systemd killed the processes because they did not check in within their 3-second watchdog timeout:
This is both good and bad. It's good that systemd has this watchdog facility to catch and kill potentially hung processes. But it's bad that the guardian processes did not check in in time. The guardian loop runs at 16 Hz, and it checks in with the watchdog once per cycle, so missing three full seconds of cycles is kind of a big deal. There were even logs, from the main daemon process, reporting EPICS connection errors right up until the point it was killed. If the daemon was writing those logs it should have also been checking in with the watchdog.
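For context, the systemd watchdog check-in is just a WATCHDOG=1 datagram sent to $NOTIFY_SOCKET at least once per WatchdogSec interval (here 3 s). A minimal sketch of the per-cycle check-in, assuming the daemon speaks the plain sd_notify protocol directly rather than through a helper library:

```python
import os
import socket

def pet_watchdog():
    """Send WATCHDOG=1 to systemd's notify socket, if one is configured.

    systemd kills the service (per WatchdogSec= in the unit file) if no
    datagram arrives within the timeout, which is what happened here.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False  # not running under systemd supervision
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"WATCHDOG=1", addr)
    return True
```

Called once per 16 Hz cycle, roughly 48 of these check-ins should land inside every 3-second window, so a kill means the loop stalled for a long time.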
Clearly the issue is EPICS related. The only connection that I am aware of between guardian and the front end models is EPICS. But all the EPICS connections from guardian to the front ends are done via ezca in the guardian-worker process, and even if that process got hamstrung, it shouldn't affect the execution of the daemon.
Very confusing. I'll continue to look into the issue and see if I can reproduce it.
Here's the full list of nodes that died (times are service start times, not death times; I'll try to change those to the last service status change instead):
While I was restarting the dead nodes listed above, two had to be restarted a second time: ISI_ITMX_ST1 and ISI_ITMY_ST1. Both failed with the same error: "GuardDaemonError: worker exited unexpectedly, exit code: -11". I didn't think much of it at the time because I had restarted large groups of nodes at once and thought that may have been the issue. They came back after another restart without any problems.
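For what it's worth, -11 is the subprocess-style encoding of "killed by signal 11", i.e. the worker segfaulted. A quick way to decode such codes (this is the standard Python convention for return codes, not guardian-specific code):

```python
import signal

def describe_exit(code):
    """Decode a subprocess-style return code: negative values mean the
    process died on that signal; non-negative is a normal exit status."""
    if code < 0:
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"

print(describe_exit(-11))  # killed by SIGSEGV
```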
But the SUS_MC2 node just crashed 3 seconds after getting a request for aligned, with the same error code as the ISI_ITMs (screenshot attached). I restarted it and it seems to be okay now.