H1 SYS (SEI, SYS)
jameson.rollins@LIGO.ORG - posted 17:43, Tuesday 27 May 2014 (12096)
guardian memory issue likely identified; stopgap inserted into ISI guardian code

As Dave reported earlier (alog 12078), we again experienced memory issues on the h1guardian0 machine over the weekend, this time associated with the BSC ISI guardian nodes.  The symptoms are the same as what we saw previously with the so-called "CSD" nodes: node worker process memory usage and thread count shoot through the roof.  The problem seems to have in fact started Thursday evening, roughly around the time I restarted all the ISI nodes to incorporate a seemingly innocuous improvement to the handling of the ISI "MASTERSWITCH" (alog 12036).  This turns out to be the key to the whole problem.

Later, on Friday, Arnaud reported that some of the ISI nodes went into error because of channel access errors associated with the masterswitch (alog 12058).  Associating this with the other issues was the eureka! moment for me.  Here's a summary of the issue:

The problem has to do with so-called "IFO-rooted" channels.  Guardian modules can specify a channel access prefix, which is prepended to all further channel access calls (e.g. the prefix 'ISI-ETMX_ST1' will cause calls to "ezca['FOO']" to be converted to the channel 'H1:ISI-ETMX_ST1_FOO').  This has a couple of benefits, most notably that we can use the exact same code for systems that differ only in the prefix.  However, sometimes it is necessary for a prefix-specified module to monitor the state of a channel that is not in its domain (e.g. ISI_ETMX needs to look at the state of the ISI masterswitch at 'H1:ISI-ETMX_MASTERSWITCH').  This is done by specifying the full channel starting with a colon (ezca[':ISI-ETMX_MASTERSWITCH']), which tells the ezca module to ignore the channel prefix for this channel.  We've been noticing problems with these IFO-rooted channels for a while now, but it now appears that those problems and the memory/thread usage problems are in fact related: each IFO-rooted channel call is somehow creating a new PV object that is not being managed efficiently.
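
To make the name resolution concrete, here is a minimal sketch of the logic (the function is hypothetical and for illustration only, not the actual ezca internals):

    # Hypothetical sketch of how ezca resolves channel names; NOT the
    # real ezca implementation.
    def resolve_channel(name, ifo='H1', prefix='ISI-ETMX_ST1'):
        if name.startswith(':'):
            # "IFO-rooted" channel: ignore the node prefix and
            # prepend only the IFO name
            return ifo + name
        # normal case: prepend both the IFO and the node prefix
        return '%s:%s_%s' % (ifo, prefix, name)

    assert resolve_channel('FOO') == 'H1:ISI-ETMX_ST1_FOO'
    assert resolve_channel(':ISI-ETMX_MASTERSWITCH') == 'H1:ISI-ETMX_MASTERSWITCH'

It is the IFO-rooted path that is misbehaving in the PV bookkeeping.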

Last Thursday I modified the ISI code to start monitoring the ISI masterswitch, which is an IFO-rooted channel for those nodes, after which the ISI nodes started having memory/thread issues.  Similarly, the CSD nodes connect only to IFO-rooted channels (thousands of them), hence all the problems we've been seeing with those nodes.  (Note: the CSD problem can be easily solved by properly stripping prefixes from the channels they access.)
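
For example (an illustrative sketch only, not the actual CSD code), stripping a known root turns a full channel name back into a prefix-relative one, so the normal prefixed access path is used instead of the IFO-rooted one:

    # Hypothetical example of stripping a redundant root from a full
    # channel name before handing it to ezca.
    full = 'H1:ISI-ETMX_ST1_FOO'
    root = 'H1:ISI-ETMX_ST1_'
    if full.startswith(root):
        name = full[len(root):]   # 'FOO'
        # ezca[name] now resolves through the node prefix as usual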

The solution for the ISI nodes was a targeted change to the function that checks the current value of the masterswitch, so that the switch state is not checked for the ISI stages.  This is a temporary change until we can get the core ezca handling of IFO-rooted channels fixed.
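
Schematically, the stopgap looks something like the following (names and structure here are illustrative, not the committed patch):

    # Illustrative sketch of the stopgap, not the actual committed
    # code: skip the masterswitch read entirely for the ISI stage
    # nodes, since reading the IFO-rooted channel leaks PV objects
    # and threads.
    def masterswitch_on(ezca, skip_check=True):
        if skip_check:
            # temporary: assume the switch is on rather than polling
            # the IFO-rooted channel through the broken code path
            return True
        return ezca[':ISI-ETMX_MASTERSWITCH'] != 0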

To illustrate the change, here is an ISI node running with the IFO-rooted masterswitch check enabled.  After just a couple of minutes, the memory usage (SZ, RSS) and thread count (C) are both creeping up:

UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
controls 19314  1488  1 91613 27120   8 15:47 ?        00:00:01       guardian ISI_ITMY_ST1
controls 19342 19314 13 112312 34300  8 15:47 ?        00:00:13         guardian ISI_ITMY_ST1 (worker)

In contrast, here is the same node after the masterswitch check was removed, and after more time has elapsed than in the example above:

UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
controls 20932  1488  0 91621 27004   7 15:49 ?        00:00:05       guardian ISI_ITMY_ST1
controls 20967 20932  9 109968 24760  0 15:49 ?        00:00:48         guardian ISI_ITMY_ST1 (worker)

Note that the memory usage and thread count are lower and stable after a longer running time.  This is pretty conclusive evidence to me that these IFO-rooted channels are indeed the cause of the problem.

I have committed the patch (userapps svn r8075) and restarted all nodes for BSC ISI stages (ISI_{BS,{I,E}TM{X,Y}}_ST{1,2}).  This should eliminate both the resource problems with the ISI nodes and the EzcaErrors associated with the masterswitch channels.

Now it's off to figure out how to fix the problem upstream in the Ezca object...
