H1 SYS
jameson.rollins@LIGO.ORG - posted 23:21, Monday 23 February 2015 (16880)
SUS_PRM guardian hung, could not restart using guardctrl

The SUS_PRM guardian became unresponsive, but has now been restored.

The control room reported that the SUS_PRM guardian had become completely unresponsive.  The log was showing the following epicsMutex error:

2015-02-24T03:45:51.412Z SUS_PRM [ALIGNED.run] USERMSG: Alignment not enabled or offsets have changed from those saved.
epicsMutex pthread_mutex_unlock failed: error Invalid argument
epicsMutexOsdUnlockThread _main_ (0x2999f20) can't proceed, suspending.

The first line is the last legitimate guardian log message before the hang.  The node was not responding to guardctrl stop or restart commands.  I then tried killing the node via the guardctrl pass-through to the underlying runit supervision system:

controls@h1guardian0:~ 0$ guardctrl sv kill SUS_PRM
controls@h1guardian0:~ 0$ 
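
For reference, guardctrl sv is presumably a thin pass-through to runit's sv command, so the above maps roughly onto raw invocations like the following (the service directory here is only a placeholder; guardctrl normally supplies the real location):

# placeholder SVDIR -- substitute the actual guardian runit service directory
SVDIR=/path/to/guardian/services sv status SUS_PRM   # report supervised state
SVDIR=/path/to/guardian/services sv term SUS_PRM     # ask runsv to send SIGTERM
SVDIR=/path/to/guardian/services sv kill SUS_PRM     # escalate to SIGKILL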

The guardctrl sv kill did kill the main node process, but unfortunately it left the worker subprocess orphaned, which I then had to kill manually:

controls@h1guardian0:~ 0$ ps -eFH | grep SUS_PRM
...
controls 18783     1  4 130390 37256  7 Feb19 ?        05:13:18   guardian SUS_PRM (worker)        
controls@h1guardian0:~ 0$ kill 18783
controls@h1guardian0:~ 0$
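
A quicker (hypothetical) cleanup for a leftover worker like this would be a single pkill on the command-line pattern; this is not a guardctrl command, and the pattern needs to be narrow enough to hit only the intended node:

# only safe once the supervised main process is already down, otherwise runit
# will just respawn it; -f matches against the full command line
pkill -f 'guardian SUS_PRM'
# verify nothing matching the node is left (no output means clean)
pgrep -f 'guardian SUS_PRM'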

After everything was cleared out, I was able to restart the node normally:

controls@h1guardian0:~ 0$ guardctrl restart SUS_PRM
stopping node SUS_PRM...
ok: down: SUS_PRM: 49s, normally up
starting node SUS_PRM...
controls@h1guardian0:~ 0$

At this point SUS_PRM appears to be back to functioning normally.

However, I have no idea why this happened or what it means.  This is the first time I've seen this issue.  The setpoint monitoring in the new guardian version installed last week means that nodes with monitoring enabled (such as SUS_PRM) are doing many more EPICS reads per cycle than they were previously.  Since the monitored channels aren't changing very much, these additional reads shouldn't incur much of a performance hit.  I will investigate and continue monitoring.
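
One rough way to bound the cost of those extra reads (the channel names below are hypothetical examples, not the actual monitored list) is to time a batch channel access get against a few setpoint channels from the command line:

# time a handful of CA gets; scale by the monitored-channel count and the
# node cycle rate to estimate the added per-cycle load
time caget H1:SUS-PRM_M1_OPTICALIGN_P_OFFSET \
           H1:SUS-PRM_M1_OPTICALIGN_Y_OFFSET \
           H1:SUS-PRM_M1_DAMP_P_GAIN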
