SUS_PRM guardian became unresponsive, but has now been restored.
The control room reported that the SUS_PRM guardian had become completely unresponsive. The log was showing the following epicsMutex error:
2015-02-24T03:45:51.412Z SUS_PRM [ALIGNED.run] USERMSG: Alignment not enabled or offsets have changed from those saved.
epicsMutex pthread_mutex_unlock failed: error Invalid argument
epicsMutexOsdUnlockThread _main_ (0x2999f20) can't proceed, suspending.
The first line is the last legitimate guardian log message before the hang. The node was not responding to guardctrl stop or restart commands, so I tried killing the node via the guardctrl interface to the underlying runit supervision system:
controls@h1guardian0:~ 0$ guardctrl sv kill SUS_PRM
controls@h1guardian0:~ 0$
This did kill the main process, but it unfortunately left the worker subprocess orphaned, which I then had to kill manually:
controls@h1guardian0:~ 0$ ps -eFH | grep SUS_PRM
...
controls 18783 1 4 130390 37256 7 Feb19 ? 05:13:18 guardian SUS_PRM (worker)
controls@h1guardian0:~ 0$ kill 18783
controls@h1guardian0:~ 0$
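In case this recurs, leftover workers like this can also be spotted programmatically: a process whose parent PID has dropped to 1 has been reparented to init, which is exactly what happens when the main guardian process dies out from under its worker. A minimal sketch (my own illustration, not part of guardctrl), scanning /proc on Linux:

```python
import os

def find_orphans(name_fragment):
    """Return PIDs of processes reparented to init (PPID 1)
    whose command line contains name_fragment."""
    orphans = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            with open(f"/proc/{pid}/stat") as f:
                stat = f.read()
            with open(f"/proc/{pid}/cmdline") as f:
                cmdline = f.read().replace("\0", " ")
        except OSError:
            continue  # process exited while we were scanning
        # PPID is field 4 of /proc/<pid>/stat; the comm field (in
        # parentheses) can contain spaces, so split after the last ')'.
        ppid = int(stat.rpartition(")")[2].split()[1])
        if ppid == 1 and name_fragment in cmdline:
            orphans.append(pid)
    return orphans
```

Something like find_orphans("SUS_PRM") would then have flagged the leftover worker (PID 18783 above), ready to be passed to os.kill.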
After everything was cleared out, I was able to restart the node normally:
controls@h1guardian0:~ 0$ guardctrl restart SUS_PRM
stopping node SUS_PRM...
ok: down: SUS_PRM: 49s, normally up
starting node SUS_PRM...
controls@h1guardian0:~ 0$
At this point SUS_PRM appears to be back to functioning normally.
However, I have no idea why this happened or what it means. This is the first time I've seen this issue. The setpoint monitoring in the new guardian version installed last week means that nodes with monitoring enabled (such as SUS_PRM) are doing many more EPICS reads per cycle than they were previously. Since the channels being monitored aren't changing very much, these additional reads shouldn't incur much of a performance hit. I will investigate and continue monitoring.
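The reasoning above is that monitor-style updates only cost something when a channel actually changes: the client library pushes new values from a callback thread, and the per-cycle "read" is then just a local lookup rather than a network round trip. A toy sketch of that pattern (purely illustrative, not guardian's actual implementation; the channel name is made up), which also shows why such a cache needs a lock shared between the callback thread and the main loop:

```python
import threading

class ChannelCache:
    """Monitor-style cache: a client library would call on_update()
    from its callback thread only when a channel's value changes;
    per-cycle reads are then cheap local dict lookups."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def on_update(self, channel, value):
        # Invoked by the (hypothetical) monitor callback on change.
        with self._lock:
            self._values[channel] = value

    def read(self, channel):
        # What a per-cycle setpoint comparison would call.
        with self._lock:
            return self._values.get(channel)

cache = ChannelCache()
cache.on_update("EXAMPLE:OFFSET", 42.0)  # hypothetical channel name
```

Incidentally, the lock here is the same kind of shared mutex the epicsMutex error above complains about, which is why a corrupted or mis-handled mutex can hang a whole node.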