Reports until 12:26, Tuesday 24 April 2018
H1 CDS
david.barker@LIGO.ORG - posted 12:26, Tuesday 24 April 2018 - last comment - 16:02, Tuesday 24 April 2018(41636)
Front ends with long up-times rebooted

WP7502

Jeff K, Corey, TJ, Jamie, Dave:

All front end computers with up-times approaching or exceeding 208 days were rebooted. The sequence was: stop all models on the computers before rebooting (leaving the PSL until last), then reboot all computers. Dolphin'ed machines waited until the last computer was rebooted before their models were started.

I had a repeat of yesterday's h1susex problem: after a reboot it lost communication with its IO Chassis. Today's sequence was:

Remotely rebooted h1susex; it did not come back (no ping response).

Remotely reset h1susex via its IPMI management port; it booted but again lost communication with the IO Chassis.

At the EX end station, powered h1susex down, power cycled the IO Chassis, and powered h1susex back up. This time the models started.
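For reference, the remote-reset step is the kind of thing a stock ipmitool invocation does; the BMC hostname and account below are hypothetical placeholders, not the actual CDS management credentials. The helper only prints the command, so the sequence can be reviewed before touching hardware:

```shell
#!/bin/sh
# Build (but do not run) an ipmitool command for a BMC power action.
# $1 = BMC hostname (hypothetical naming), $2 = status | reset | cycle
build_ipmi_cmd() {
    echo "ipmitool -I lanplus -H $1 -U admin chassis power $2"
}

# Remotely reset a hung front end via its IPMI management port:
build_ipmi_cmd h1susex-ipmi reset
# -> ipmitool -I lanplus -H h1susex-ipmi -U admin chassis power reset
```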

Despite h1susex having been removed from the Dolphin fabric, h1seiex and h1iscex glitched and had their models restarted; ditto for h1oaf0 and h1lsc0 in the corner station.

Machines rebooted (as opposed to just having their models restarted) were:

h1psl0, h1seih16, h1seih23, h1seih45, h1seib1, h1seib2, h1seib3, h1sush2b, h1sush34, h1sush56, h1susb123, h1susauxh2, h1susauxh56, h1asc0, h1susauxey, h1seiex, h1iscex

Some guardian nodes stopped running as a result of these restarts. Jamie and TJ are investigating.

Comments related to this report
jameson.rollins@LIGO.ORG - 14:07, Tuesday 24 April 2018 (41642)

guardian processes killed for watchdog timeout when front ends were rebooted

At roughly 10:05 AM local time, 45 of the guardian nodes went dead (list at bottom), coincident with the front end reboots. Technically the guardian nodes did not crash; systemd killed the processes because they failed to check in within their 3-second watchdog timeout:

...
2018-04-24_17:06:09.417388Z ISI_HAM5 CERROR: State method raised an EzcaConnectionError exception.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: Current state method will be rerun until the connection error clears.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: If CERROR does not clear, try setting OP:STOP to kill worker, followed by OP:EXEC to resume.
2018-04-24_17:06:10.018595Z ISI_HAM5 EZCA CONNECTION ERROR. attempting to reestablish...
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: State method raised an EzcaConnectionError exception.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: Current state method will be rerun until the connection error clears.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: If CERROR does not clear, try setting OP:STOP to kill worker, followed by OP:EXEC to resume.
2018-04-24_17:06:10.018595Z ISI_HAM5 EZCA CONNECTION ERROR. attempting to reestablish...
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: State method raised an EzcaConnectionError exception.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: Current state method will be rerun until the connection error clears.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: If CERROR does not clear, try setting OP:STOP to kill worker, followed by OP:EXEC to resume.
2018-04-24_17:06:10.018595Z ISI_HAM5 EZCA CONNECTION ERROR. attempting to reestablish...
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: State method raised an EzcaConnectionError exception.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: Current state method will be rerun until the connection error clears.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: If CERROR does not clear, try setting OP:STOP to kill worker, followed by OP:EXEC to resume.
2018-04-24_17:06:10.018595Z ISI_HAM5 EZCA CONNECTION ERROR. attempting to reestablish...
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: State method raised an EzcaConnectionError exception.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: Current state method will be rerun until the connection error clears.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: If CERROR does not clear, try setting OP:STOP to kill worker, followed by OP:EXEC to resume.
2018-04-24_17:06:10.018595Z ISI_HAM5 EZCA CONNECTION ERROR. attempting to reestablish...
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: State method raised an EzcaConnectionError exception.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: Current state method will be rerun until the connection error clears.
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: If CERROR does not clear, try setting OP:STOP to kill worker, followed by OP:EXEC to resume.
2018-04-24_17:06:10.018595Z ISI_HAM5 EZCA CONNECTION ERROR. attempting to reestablish...
2018-04-24_17:06:10.018595Z ISI_HAM5 CERROR: State method raised an EzcaConnectionError exception.
2018-04-24_17:06:10.014956Z guardian@ISI_HAM5.service: Watchdog timeout (limit 3s)!
2018-04-24_17:06:10.015011Z guardian@ISI_HAM5.service: Killing process 1797 (guardian ISI_HA) with signal SIGABRT.
2018-04-24_17:06:10.015060Z guardian@ISI_HAM5.service: Killing process 2839 (guardian-worker) with signal SIGABRT.
2018-04-24_17:06:10.084021Z guardian@ISI_HAM5.service: Main process exited, code=dumped, status=6/ABRT
2018-04-24_17:06:10.084278Z guardian@ISI_HAM5.service: Unit entered failed state.
2018-04-24_17:06:10.084289Z guardian@ISI_HAM5.service: Failed with result 'watchdog'.
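The kill at the end of the excerpt is systemd's per-service watchdog firing. The actual guardian@.service unit file is not shown in this log, but a unit configured along these (assumed) lines would reproduce the behaviour; on watchdog timeout systemd sends SIGABRT by default, matching the "code=dumped, status=6/ABRT" above:

```ini
# Hypothetical sketch of the relevant guardian@.service settings
[Service]
Type=notify
# The service must send WATCHDOG=1 at least every 3 seconds or be killed
WatchdogSec=3
# Consistent with the nodes staying dead until manually restarted
Restart=no
```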

This is both good and bad. It's good that systemd has this watchdog facility to catch and kill potentially dead processes. But it's bad that the guardian processes did not check in in time. The guardian loop runs at 16 Hz and checks in with the watchdog once per cycle, so missing three seconds' worth of cycles is kind of a big deal. The main daemon process was even emitting log messages reporting EPICS connection errors right up until the point it was killed; if the daemon was producing those logs it should also have been checking in with the watchdog.
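The once-per-cycle check-in uses systemd's notification protocol: a datagram containing "WATCHDOG=1" sent to the unix socket named in the NOTIFY_SOCKET environment variable. A minimal sketch of that check-in (not guardian's actual implementation, which isn't shown here) looks like:

```python
import os
import socket

def sd_notify(message: str) -> bool:
    """Send a state notification to systemd's notify socket.

    A guardian-style daemon would call sd_notify("WATCHDOG=1") once per
    16 Hz cycle; if systemd hears nothing for WatchdogSec (3 s here), it
    kills the service, as seen in the log above. Returns False when not
    running under systemd (NOTIFY_SOCKET unset).
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    # Abstract-namespace sockets are flagged with a leading '@'.
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), addr)
    return True
```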

Clearly the issue is EPICS related: the only connection I am aware of between guardian and the front end models is EPICS. But all the EPICS connections from guardian to the front ends are made via ezca in the guardian-worker process, and even if that process got hamstrung it shouldn't affect the execution of the daemon.
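The claim that a hamstrung worker shouldn't stall the daemon can be sketched as follows. A thread stands in for the separate guardian-worker process (an assumption made so the example is self-contained; it is not guardian's actual architecture), but the principle is the same: the daemon only ever waits on the worker with a one-cycle timeout, so it stays free to pet the watchdog.

```python
import queue
import threading
import time

def hung_worker(q: "queue.Queue[str]") -> None:
    # Stand-in for the guardian-worker: pretend every ezca/EPICS call
    # blocks indefinitely because channel access went away mid-reboot.
    time.sleep(60)

def run_daemon_cycles(n_cycles: int = 8) -> int:
    """Run n daemon cycles against a hung worker; return cycles completed.

    Because the daemon never blocks on the worker for longer than one
    16 Hz cycle, it remains free to send WATCHDOG=1 on every pass.
    """
    q: "queue.Queue[str]" = queue.Queue()
    threading.Thread(target=hung_worker, args=(q,), daemon=True).start()
    completed = 0
    for _ in range(n_cycles):
        try:
            q.get(timeout=1.0 / 16)  # one 16 Hz cycle's worth of waiting
        except queue.Empty:
            pass  # worker is stuck; the daemon carries on regardless
        # ... here the real daemon would pet the systemd watchdog ...
        completed += 1
    return completed
```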

Very confusing. I'll continue to look into the issue and see if I can reproduce it.

Here's the full list of nodes that died (times shown are service start times, not death times; I'll try to change these to the time of the last service status change instead):

ALIGN_IFO                enabled    failed     2018-04-20 09:39:57-07:00
DIAG_EXC                 enabled    failed     2018-04-20 09:39:51-07:00
DIAG_MAIN                enabled    failed     2018-04-20 09:40:03-07:00
DIAG_SDF                 enabled    failed     2018-04-20 09:39:51-07:00
HPI_BS                   enabled    failed     2018-04-20 09:39:52-07:00
HPI_HAM1                 enabled    failed     2018-04-20 09:39:52-07:00
HPI_HAM2                 enabled    failed     2018-04-20 09:39:52-07:00
HPI_HAM3                 enabled    failed     2018-04-20 09:39:52-07:00
HPI_HAM4                 enabled    failed     2018-04-20 09:39:52-07:00
HPI_HAM5                 enabled    failed     2018-04-20 09:39:52-07:00
HPI_HAM6                 enabled    failed     2018-04-20 09:39:52-07:00
HPI_ITMX                 enabled    failed     2018-04-20 09:39:52-07:00
HPI_ITMY                 enabled    failed     2018-04-20 09:39:52-07:00
ISI_BS_ST1               enabled    failed     2018-04-20 09:39:53-07:00
ISI_BS_ST1_BLND          enabled    failed     2018-04-20 09:39:51-07:00
ISI_BS_ST1_SC            enabled    failed     2018-04-20 09:39:51-07:00
ISI_BS_ST2               enabled    failed     2018-04-20 09:39:53-07:00
ISI_BS_ST2_BLND          enabled    failed     2018-04-20 09:39:51-07:00
ISI_BS_ST2_SC            enabled    failed     2018-04-20 09:39:51-07:00
ISI_HAM2                 enabled    failed     2018-04-20 09:39:53-07:00
ISI_HAM2_SC              enabled    failed     2018-04-20 09:39:51-07:00
ISI_HAM3                 enabled    failed     2018-04-20 09:39:53-07:00
ISI_HAM3_SC              enabled    failed     2018-04-20 09:39:51-07:00
ISI_HAM4                 enabled    failed     2018-04-20 09:39:53-07:00
ISI_HAM4_SC              enabled    failed     2018-04-20 09:39:51-07:00
ISI_HAM5                 enabled    failed     2018-04-20 09:39:53-07:00
ISI_HAM5_SC              enabled    failed     2018-04-20 09:39:51-07:00
ISI_HAM6                 enabled    failed     2018-04-20 09:39:53-07:00
ISI_HAM6_SC              enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMX_ST1             enabled    failed     2018-04-20 09:39:53-07:00
ISI_ITMX_ST1_BLND        enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMX_ST1_SC          enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMX_ST2             enabled    failed     2018-04-20 09:39:53-07:00
ISI_ITMX_ST2_BLND        enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMX_ST2_SC          enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMY_ST1             enabled    failed     2018-04-20 09:39:53-07:00
ISI_ITMY_ST1_BLND        enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMY_ST1_SC          enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMY_ST2             enabled    failed     2018-04-20 09:39:53-07:00
ISI_ITMY_ST2_BLND        enabled    failed     2018-04-20 09:39:51-07:00
ISI_ITMY_ST2_SC          enabled    failed     2018-04-20 09:39:51-07:00
SEI_HAM2                 enabled    failed     2018-04-20 09:39:45-07:00
SEI_HAM3                 enabled    failed     2018-04-20 09:39:46-07:00
SEI_HAM4                 enabled    failed     2018-04-20 09:39:46-07:00
SUS_ETMX                 enabled    failed     2018-04-20 09:39:46-07:00
SUS_TMSX                 enabled    failed     2018-04-20 09:39:46-07:00
thomas.shaffer@LIGO.ORG - 16:02, Tuesday 24 April 2018 (41648)GRD

While I was restarting the dead nodes listed above, two had to be restarted a second time: ISI_ITMX_ST1 and ISI_ITMY_ST1. Both failed with the same "GuardDaemonError: worker exited unexpectedly, exit code: -11" (i.e. the worker was killed by signal 11, SIGSEGV). I didn't think much of it at the time, since I had restarted large groups of nodes at once and thought that might have been the issue. They came back after another restart without any problems.

But the SUS_MC2 node just crashed 3 seconds after getting a request for aligned, with the same error code as the ISI_ITMs (screenshot attached). After another restart it seems to be okay now.

Images attached to this comment