H1 GRD
jameson.rollins@LIGO.ORG - posted 09:41, Wednesday 25 April 2018 - last comment - 12:14, Thursday 26 April 2018(41658)
Why the guardian upgrade took so long

By Jameson Rollins and Jonathan Hanks

segfaults after upgrade

Soon after the new guardian machine (h1guardian1 [0]) was moved into production (i.e. after all guardian nodes were moved to the new machine) we started seeing segfaults in the guardian node processes. Running nodes under efence and valgrind never produced a crash in either case, presumably either because the MTTF was increased significantly or because the crashes were circumvented entirely by serialized system calls (valgrind).

Adding to the confusion was the inability to reproduce the crashes in a test environment. 50 test guardian nodes running under the new production environment, subscribing to hundreds of front-end channels (but with no writes), and with test clients subscribed to all control channels, failed to show any crashes in two weeks of continuous running. (See below for the later-discovered reason why.)

The following steps were taken to help diagnose the problem:

Inspection of the coredumps [1] generally turned up no useful information other than that the problem was a memory corruption error, likely in the EPICS CAS (or in the pcaspy python wrapping of it). The relatively long MTTF pointed to a threading race condition.

[0] Configuration of the new h1guardian1 machine:

[1] systemd-coredump is a very useful package. All core dump files are logged and archived, and the coredumpctl command provides access to those logs and an easy means for viewing them with gdb. Unfortunately the log files are cleared out by default after 3 days, and there doesn't seem to be a way to increase the expiration time. So be sure to back up the coredump files from /var/lib/systemd/coredump/ for later inspection.

libasan

In an attempt to get more informative error reporting with less impact on performance, Jonathan compiled python2.7 and pcaspy with libasan, the address sanitizer. libasan is similar to valgrind in that it wraps all memory allocation calls to detect memory errors that commonly lead to segfaults. But it's much faster and doesn't serialize the code, thereby leaving in place the threads that were likely triggering the crashes.

(As an aside, libtsan, the thread sanitizer, is basically impossible to use with python, since the python core itself seems to not be particularly thread safe. Running guardian with python libtsan caused guardian to crash immediately after launch with ~20k lines of tsan log output (yes, really). So this was abandoned as an avenue of investigation.)

Once we finally got guardian running under libasan [0], we started to observe libasan-triggered aborts. The libasan abort logs were consistent:

==20277==ERROR: AddressSanitizer: heap-use-after-free on address 0x602001074d90 at pc 0x7fa95a7ec0f1 bp 0x7fff99bc1660 sp 0x7fff99bc0e10
WRITE of size 8 at 0x602001074d90 thread T0
    #0 0x7fa95a7ec0f0 in __interceptor_strncpy (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x6f0f0)
    #1 0x7fa94f6acd78 in aitString::copy(char const*, unsigned int, unsigned int) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x2bd78)
    #2 0x7fa94f6a8fd3  (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x27fd3)
    #3 0x7fa94f69a1e0 in gdd::putConvert(aitString const&) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x191e0)
    #4 0x7fa95001bcc3 in gdd_putConvertString pcaspy/casdef_wrap.cpp:4136
    #5 0x7fa95003320d in _wrap_gdd_putConvertString pcaspy/casdef_wrap.cpp:7977
    #6 0x564b9ecccda4 in call_function ../Python/ceval.c:4352
...
    #67 0x564b9eb3fce9 in _start (/opt/python/python-2.7.13-asan/bin/python2.7+0xd9ce9)
0x602001074d90 is located 0 bytes inside of 8-byte region [0x602001074d90,0x602001074d98)
freed by thread T3 here:
    #0 0x7fa95a840370 in operator delete[](void*) (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc3370)
    #1 0x7fa94f6996de in gdd::setPrimType(aitEnum) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x186de)
previously allocated by thread T0 here:
    #0 0x7fa95a83fd70 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc2d70)
    #1 0x7fa94f6acd2c in aitString::copy(char const*, unsigned int, unsigned int) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x2bd2c)
Thread T3 created by T0 here:
    #0 0x7fa95a7adf59 in __interceptor_pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x30f59)
    #1 0x564b9ed5e942 in PyThread_start_new_thread ../Python/thread_pthread.h:194
SUMMARY: AddressSanitizer: heap-use-after-free (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x6f0f0) in __interceptor_strncpy
Shadow bytes around the buggy address:
  0x0c0480206960: fa fa fd fa fa fa 00 fa fa fa fd fd fa fa fa fa
  0x0c0480206970: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c0480206980: fa fa fd fd fa fa fd fa fa fa fa fa fa fa fd fd
  0x0c0480206990: fa fa fa fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c04802069a0: fa fa fa fa fa fa 00 fa fa fa fa fa fa fa fd fa
=>0x0c04802069b0: fa fa[fd]fa fa fa fd fd fa fa fd fd fa fa 00 fa
  0x0c04802069c0: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c04802069d0: fa fa 00 fa fa fa fd fa fa fa fd fd fa fa fd fa
  0x0c04802069e0: fa fa fa fa fa fa fd fa fa fa 00 fa fa fa fd fd
  0x0c04802069f0: fa fa 00 fa fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c0480206a00: fa fa fd fd fa fa fd fd fa fa fd fd fa fa 00 fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==20277==ABORTING

Here's a stack trace from a similar crash (couldn't find the trace from the exact same process, but the libasan aborts are all identical):

    Stack trace of thread 18347:
    #0  0x00007fe5c173efff __GI_raise (libc.so.6)
    #1  0x00007fe5c174042a __GI_abort (libc.so.6)
    #2  0x00007fe5c24ae329 n/a (libasan.so.3)
    #3  0x00007fe5c24a39ab n/a (libasan.so.3)
    #4  0x00007fe5c249db57 n/a (libasan.so.3)
    #5  0x00007fe5c2442113 __interceptor_strncpy (libasan.so.3)
    #6  0x00007fe5b7bf2d79 strncpy (libgdd.so.3.15.3)
    #7  0x00007fe5b7beefd4 _ZN9aitString4copyEPKcj (libgdd.so.3.15.3)
    #8  0x00007fe5b7be01e1 _Z10aitConvert7aitEnumPvS_PKvjPK18gddEnumStringTable (libgdd.so.3.15.3)
    #9  0x00007fe5b8561cc4 gdd_putConvertString (_cas.so)
    #10 0x00007fe5b857920e _wrap_gdd_putConvertString (_cas.so)
    #11 0x000055f7f0c86da5 call_function (python2.7)

"strncpy" is the known-problematic string copy function, in this case used to copy strings into the EPICS GDD type used by the channel access server.

GDB backtraces of the core files show that the string being copied was always "seconds". The only place that the string "seconds" is used in guardian is as the value of the "units" sub-record given to pcaspy for the EXECTIME and EXECTIME_LAST channels.
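
For reference, here is a minimal pcaspy sketch (illustrative only, not guardian's actual channel definitions; the prefix and everything beyond the two channel names is assumed) showing how a string like "seconds" gets attached to a PV: it is supplied as the 'unit' field of the PV database, and the CAS copies it into the "units" sub-record whenever a client requests the DBR_CTRL_* form of the channel.

# Minimal pcaspy sketch (illustrative, not guardian's actual code).  The
# 'unit' field in the pvdb is the only string-typed sub-record in play; the
# CAS copies it whenever a client requests the DBR_CTRL_* form of a PV.
from pcaspy import SimpleServer, Driver

class NodeDriver(Driver):
    """Trivial driver; a real application would override read()/write()."""
    pass

prefix = 'TEST:NODE-'   # hypothetical prefix, not a real guardian channel
pvdb = {
    'EXECTIME':      {'type': 'float', 'prec': 3, 'unit': 'seconds'},
    'EXECTIME_LAST': {'type': 'float', 'prec': 3, 'unit': 'seconds'},
}

server = SimpleServer()
server.createPV(prefix, pvdb)
driver = NodeDriver()

while True:
    # service channel access requests; CTRL-record reads copy the unit string
    server.process(0.1)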

[0] systemd drop-in file used to run guardian under libasan python/pcaspy (~guardian/.config/systemd/user/guardian@.service.d/instrumented.conf):

[Service]
Type=simple
WatchdogSec=0
Environment=PYTHONPATH=/opt/python/pcaspy-0.7.1-asan/build/lib.linux-x86_64-2.7:/home/guardian/guardian/lib:/usr/lib/python2.7/dist-packages
Environment=ASAN_OPTIONS=abort_on_error=1:disable_coredump=0
ExecStartPre=
ExecStart=
ExecStart=/opt/python/python-2.7.13-asan/bin/python -u -t -m guardian %i

revelation about unseen crashes in the test setup

The discovery that the crash was caused by copying the string "seconds" in CAS led to the revelation about why the test setup had not been reproducing the crashes. The "units" sub-record is part of the EPICS DBR_CTRL_* records and is the only sub-record being used of string type. The test clients were only subscribing to the base records of all the guardian channels, not the DBR_CTRL records. MEDM, on the other hand, subscribes to the CTRL records. Guardian overview screens are open all over the control room, subscribing to all the CTRL records of all the production guardian nodes.

CTRL record subscriptions involve copying the "units" string sub-record, and therefore trigger the crashes. No CTRL record subscriptions, no crashes.
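
To illustrate the difference, here is a sketch using pyepics (an assumption; the actual test clients are not shown here). A plain subscription transfers only the value, while a CTRL-form subscription, like the one MEDM effectively makes, requires the server to also ship the control metadata, including the "units" string:

# Sketch of base-record vs CTRL-record subscriptions using pyepics (an
# assumption; the channel name is hypothetical).
from __future__ import print_function
import time
import epics

CHANNEL = 'TEST:NODE-EXECTIME'   # hypothetical channel name

def on_value(value=None, **kw):
    # plain monitor: value only, no "units" sub-record is transferred
    print('value update:', value)

def on_ctrl(value=None, units=None, **kw):
    # CTRL-form monitor (what MEDM effectively uses): the server must also
    # copy the control metadata, including the "units" string
    print('ctrl update:', value, units)

pv_plain = epics.PV(CHANNEL)
pv_plain.add_callback(on_value)

pv_ctrl = epics.PV(CHANNEL, form='ctrl')
pv_ctrl.add_callback(on_ctrl)

time.sleep(10)   # callbacks fire asynchronously while we wait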

pcaspy and threading

So this all led us to take a closer look at how exactly guardian was using pcaspy.

The pcaspy documentation implies that pcaspy is thread-safe. The package even provides a helper function that runs the server in a separate thread for you. The implication is that it is safe to run the server in a separate thread and push/pull channel updates between the main thread and the CAS thread. Guardian was originally written to run the pcaspy.Server in a separate thread explicitly because of this implication in the documentation.

The main surface for threading issues in the guardian usage of pcaspy was between client writes, which trigger pcaspy.Driver.setParam() and pcaspy.Driver.updatePVs() calls inside of pcaspy.Driver.write(), and status channel updates pushed from the main daemon thread into the driver, which also trigger updatePVs(). At Jonathan's suggestion all guardian interaction with the core pcaspy CAS functions (Driver.setParam(), Driver.updatePVs()) was wrapped with locks. We were skeptical that this would actually solve the problem, though, since pcaspy itself provides no means to lock its internal reading of the updated PVs for shipment out over the EPICS CTRL records (initiated during pcaspy.Server.process()). And in fact this turned out to be correct; crashes persisted even after the locks were in place.
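
Here is a sketch of the pattern in question, under the assumption that the pcaspy ServerThread helper and a lock-wrapped driver stand in for guardian's actual code: the lock serializes our own setParam()/updatePVs() calls, but Server.process() running in the CAS thread reads the PV data without ever taking that lock, which is why the crashes persisted.

# Illustrative sketch only (not the actual guardian patch): a lock around our
# own driver calls cannot protect the reads that Server.process() performs
# internally in the CAS thread.
import threading
from pcaspy import SimpleServer, Driver
from pcaspy.tools import ServerThread

class LockedDriver(Driver):
    def __init__(self):
        super(LockedDriver, self).__init__()
        self.lock = threading.Lock()

    def write(self, reason, value):
        # client writes arrive via the CAS thread
        with self.lock:
            self.setParam(reason, value)
            self.updatePVs()
        return True

    def push_status(self, reason, value):
        # hypothetical hook for status updates pushed from the main daemon thread
        with self.lock:
            self.setParam(reason, value)
            self.updatePVs()

server = SimpleServer()
server.createPV('TEST:NODE-', {'EXECTIME': {'prec': 3, 'unit': 'seconds'}})
driver = LockedDriver()

cas = ServerThread(server)   # pcaspy helper: runs server.process() in its own thread
cas.start()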

We then started looking into ways to get rid of the separate pcaspy thread altogether. The main daemon loop runs at 16 Hz, and the main logic in the loop takes only about 5 ms to run. This leaves ~57 ms to run Server.process(), which should be plenty of time to process the CAS without slowing things down noticeably. Moving the CAS select processing into the dead time of the main loop forces the main loop to keep track of its own timing. This has the added benefit of allowing us to drop the separate clock thread that had been keeping track of timing, eliminating two separate threads instead of just one.
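
A minimal sketch of what that restructuring looks like (again an illustration under the assumptions above, not the guardian patch itself): the main loop owns its own 16 Hz timing and spends whatever is left of each cycle in Server.process(), so no separate CAS or clock thread is needed.

# Minimal single-threaded sketch (not the actual guardian patch): the CAS is
# serviced in the dead time of the 16 Hz main loop instead of its own thread.
import time
from pcaspy import SimpleServer, Driver

class NodeDriver(Driver):
    pass

CYCLE = 1.0 / 16   # 62.5 ms per main-loop cycle

server = SimpleServer()
server.createPV('TEST:NODE-', {'EXECTIME': {'prec': 3, 'unit': 'seconds'}})
driver = NodeDriver()

while True:
    cycle_start = time.time()

    # ... main daemon logic would run here (~5 ms in guardian) ...
    driver.setParam('EXECTIME', time.time() - cycle_start)
    driver.updatePVs()

    # spend the remainder of the cycle servicing channel access in this thread
    remaining = CYCLE - (time.time() - cycle_start)
    if remaining > 0:
        server.process(remaining)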

So a patch was prepared to eliminate the separate CAS thread from guardian, and it was tested on about a half dozen nodes. No crashes were observed after a day of running (far exceeding the previous MTTF).

conclusions

A new version of guardian was wrapped up and put into production, and we have seen no spontaneous segfaults in nearly a week. We will continue to monitor the system to confirm that the behavior has not been adversely affected in any way by the elimination of the CAS thread (full lock recovery would be the best test of that), but we're fairly confident the issue has been resolved.

pcaspy and CAS are not thread-safe. This is the main takeaway. It's possible that guardian is the most intensive user of this library out there, which is why this has not been seen previously. I will report the issue to the EPICS community. We should be more aware of how we use this library in the future, and avoid running the server in a separate thread.

Multi-threading is tricky.

Comments related to this report
jameson.rollins@LIGO.ORG - 10:31, Wednesday 25 April 2018 (41660)

A quick followup about systemd-coredump.  It's possible to control the expiration of the coredump files by dropping the following file into the file system:

root@h1guardian1:~# cat /etc/tmpfiles.d/00_coredump.conf
d /var/lib/systemd/coredump 0755 root root -
root@h1guardian1:~#

The final "-" tells systemd-tmpfiles to put no expiration on files in this directory.  The "00" prefix is needed to make sure it always sorts lexically before any config that would set the expiration on this path (the system default is in /usr/lib/tmpfiles.d/systemd.conf).

jameson.rollins@LIGO.ORG - 12:14, Thursday 26 April 2018 (41689)