We are setting up a new guardian host machine. The new machine (currently "h1guardian1", but to be renamed "h1guardian0" after the transition is complete) is running Debian 9 "stretch", with all CDS software installed from pre-compiled packages in the new CDS Debian software archives. It has been configured with a completely new "guardctrl" system that will manage all the guardian nodes under the default systemd process manager. A full description of the new setup will come in a future log, after the transition is complete.
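Roughly speaking, each node runs as an instance of a systemd template unit. Here's a minimal sketch of the shape of such a unit; the unit name, paths, and user below are illustrative, not the exact files guardctrl installs:

    # /etc/systemd/system/guardian@.service (illustrative sketch only)
    [Unit]
    Description=Guardian node %i
    After=network.target

    [Service]
    Type=simple
    User=guardian
    # launch the guardian daemon for the node named by the instance
    ExecStart=/usr/bin/guardian %i
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target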
The new system is basically ready to go, and I am now beginning the process of transferring guardian nodes over to the new host. For each node to be transferred, I will stop the process on the old machine, and start it fresh on the new system.
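The per-node move is mechanically simple. Roughly (the guardctrl subcommands here are illustrative of the new interface, and SEI_HAM1 is just an example node name):

    # on the old host: stop the node's process under the old supervision setup

    # on the new host (h1guardian1): put the node under guardctrl/systemd
    $ guardctrl enable SEI_HAM1
    $ guardctrl start SEI_HAM1
    $ guardctrl status SEI_HAM1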
I plan on starting with the SUS and SEI nodes in HAM1, and will move through the system, ending with HAM6.
There's been a bit of a hitch with the guardian upgrade. The new machine (h1guardian1) has been set up and configured. The new supervision system and control interface are fully in place, and all HAM1 and HAM2 SUS and SEI nodes have been moved to the new configuration. The configuration is currently documented in the guardian GitLab wiki.
Unfortunately, node processes are occasionally segfaulting spontaneously, with no apparent trigger. The failures are happening at a rate of roughly one every six hours. I configured systemd to catch and log coredumps from the segfaults for inspection (using the systemd-coredump utility). After we caught the next segfault (which happened only a couple of hours later), Jonathan Hanks and I started digging into the core to see what we could ferret out. It appears to be some sort of memory corruption error, but we have not yet determined where in the stack the problem originates. I suspect it's in pcaspy (the Python bindings to the EPICS portable channel access server), but it could be in EPICS itself. I think it's unlikely to be in python2.7, although we aren't ruling anything out.
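For reference, the coredump capture amounts to something like the following (a sketch; the package and settings below are standard Debian 9/systemd fare, but verify locally):

    # install the systemd coredump handler; this points kernel.core_pattern
    # at systemd-coredump so cores land in the journal
    $ apt-get install systemd-coredump

    # make sure cores aren't size-limited for services
    # (in /etc/systemd/system.conf)
    DefaultLimitCORE=infinity

    # inspect captured cores
    $ coredumpctl list
    $ coredumpctl info <PID>
    $ coredumpctl gdb <PID>    # open the core directly in gdb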
We then set up the processes to run under Electric Fence ("efence") to try to catch any memory out-of-bounds errors. This morning I found two processes that had been killed by efence, but I have not yet inspected the core files in depth.
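For the record, putting a systemd-managed process under efence is just a matter of preloading the efence malloc; a drop-in along these lines (the unit name and library path are illustrative, and the path varies with Debian's multiarch layout):

    # /etc/systemd/system/guardian@.service.d/efence.conf (sketch)
    [Service]
    Environment="LD_PRELOAD=/usr/lib/libefence.so.0.0"

    # then reload and restart the affected nodes
    $ systemctl daemon-reload
    $ systemctl restart guardian@SEI_HAM1.service

Below are the coredump summaries from coredumpctl on h1guardian1.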
This does not bode well for the upgrade. Best case, we figure out what's causing the segfaults early in the week; even then there won't be enough time to fix the issue, test, and deploy before the end of the week. A de-scoped plan would be to do just a basic guardian core upgrade in the existing configuration on h1guardian0, and delay the move to Debian 9 and systemd until we can fully resolve the segfault issue.
Here is the full list of nodes currently running under the new system:
If any of these nodes show up white on the guardian overview screen, it's likely because they have crashed. Please let me know and I will deal with them ASAP.
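If you want to check on the host directly, crashed units are easy to spot under systemd (the commands are standard; guardian unit naming as sketched above):

    $ systemctl list-units --state=failed
    $ coredumpctl list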
After implementing the efence stuff above, we came in to find more coredumps the next day. On a cursory inspection of the coredumps, we noted that they all showed completely different stack traces. This is highly unusual and pathological, and prompted Jonathan to question the integrity of the physical RAM itself. We swapped out the RAM with a new 16G ECC stick and let it run for another 24 hours.
When we next checked, we discovered only two efence core dumps, indicating roughly a factor of three increase in the mean time to failure (MTTF). However, unlike the previous scattershot of stack traces, these all showed identical "mprotect" failures, which seemed to point to a side effect of efence itself running into the per-process limit on memory map areas. We increased "max_map_count" (/proc/sys/vm/max_map_count) by a factor of 4, again left it running overnight, and came back to no more coredumps. We cautiously declared victory.
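This makes sense: efence puts every allocation in its own mmap'd region with mprotect'd guard pages, so it chews through map areas far faster than a normal malloc. The bump itself was along these lines (the default on this kernel is 65530, so 4x is about 262120; exact numbers illustrative):

    # check the current limit
    $ sysctl vm.max_map_count
    vm.max_map_count = 65530

    # raise it ~4x now, and persist it across reboots
    $ sysctl -w vm.max_map_count=262120
    $ echo 'vm.max_map_count = 262120' > /etc/sysctl.d/90-efence.conf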
I then started moving the remaining guardian nodes over to the new machine. I completed the new setup by removing efence and rebooting the new machine a couple of times to work out the kinks. Everything seemed to be running ok...
Until more segfaults/coredumps appeared. A couple of hours after the last reboot of the new h1guardian1 machine, there were three segfaults, all with completely different stack traces. I'm now wondering if efence was somehow masking the problem. My best guess is that efence slowed the processes down quite a bit (by increasing system call times), which increased the MTTF by a similar factor. Or the slower processes were less likely to hit some memory corruption race condition.
I'm currently running memtest on h1guardian1 to see if anything shows up, but it's passed all tests so far...
16 segfaults overnight, after rebooting the new guardian machine at about 9pm yesterday. I'll be reverting guardian to the previous configuration today.
Interestingly, though, almost all of the stack traces are of the same type, unlike before, when they were all different. Here's the trace we're seeing in 80% of the instances:
Here's the second most common trace: