We are setting up a new guardian host machine. The new machine (currently "h1guardian1", but to be renamed "h1guardian0" after the transition is complete) is running Debian 9 "stretch", with all CDS software installed from pre-compiled packages in the new CDS Debian software archives. It has been configured with a completely new "guardctrl" system that will manage all the guardian nodes under the default systemd process manager. A full description of the new setup will come in a future log, after the transition is complete.
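Roughly speaking, each node runs as an instance of a systemd template unit. Here's a minimal sketch of the shape of such a unit; the unit name, paths, and user below are illustrative, not the exact files guardctrl installs:

    # /etc/systemd/system/guardian@.service (illustrative sketch only)
    [Unit]
    Description=Guardian node %i
    After=network.target

    [Service]
    Type=simple
    User=guardian
    # launch the guardian daemon for the node named by the instance
    ExecStart=/usr/bin/guardian %i
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target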
The new system is basically ready to go, and I am now beginning the process of transferring guardian nodes over to the new host. For each node to be transferred, I will stop the process on the old machine, and start it fresh on the new system.
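The per-node move is mechanically simple. Roughly (the guardctrl subcommands here are illustrative of the new interface, and SEI_HAM1 is just an example node name):

    # on the old host: stop the node's process under the old supervision setup

    # on the new host (h1guardian1): put the node under guardctrl/systemd
    $ guardctrl enable SEI_HAM1
    $ guardctrl start SEI_HAM1
    $ guardctrl status SEI_HAM1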
I plan on starting with the SUS and SEI nodes in HAM1, and will move through the system, ending with HAM6.
There's been a bit of a hitch with the guardian upgrade. The new machine (h1guardian1) has been set up and configured. The new supervision system and control interface are fully in place, and all HAM1 and HAM2 SUS and SEI nodes have been moved to the new configuration. The configuration is currently documented in the guardian GitLab wiki.
Unfortunately, node processes are occasionally segfaulting spontaneously, with no apparent trigger. The failures are happening at a rate of roughly one every six hours. I configured systemd to catch and log coredumps from the segfaults for inspection (using the systemd-coredump utility). After we caught the next segfault (which happened only a couple of hours later), Jonathan Hanks and I started digging into the core to see what we could ferret out. It appears to be some sort of memory corruption error, but we have not yet determined where in the stack the problem originates. I suspect it's in pcaspy (the Python bindings to the EPICS portable channel access server), but it could be in EPICS itself. I think it's unlikely to be in python2.7, although we aren't ruling anything out.
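For reference, the coredump capture amounts to something like the following (a sketch; the package and settings below are standard Debian 9/systemd fare, but verify locally):

    # install the systemd coredump handler; this points kernel.core_pattern
    # at systemd-coredump so cores land in the journal
    $ apt-get install systemd-coredump

    # make sure cores aren't size-limited for services
    # (in /etc/systemd/system.conf)
    DefaultLimitCORE=infinity

    # inspect captured cores
    $ coredumpctl list
    $ coredumpctl info <PID>
    $ coredumpctl gdb <PID>    # open the core directly in gdb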
We then set up the processes to run under Electric Fence ("efence") to try to catch any memory out-of-bounds errors. This morning I found two processes that had been killed by efence, but I have not yet inspected the core files in depth.
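For the record, putting a systemd-managed process under efence is just a matter of preloading the efence malloc; a drop-in along these lines (the unit name and library path are illustrative, and the path varies with Debian's multiarch layout):

    # /etc/systemd/system/guardian@.service.d/efence.conf (sketch)
    [Service]
    Environment="LD_PRELOAD=/usr/lib/libefence.so.0.0"

    # then reload and restart the affected nodes
    $ systemctl daemon-reload
    $ systemctl restart guardian@SEI_HAM1.service

Below are the coredump summaries from coredumpctl on h1guardian1.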
This does not bode well for the upgrade. Best case, we figure out what's causing the segfaults early in the week; even then there won't be enough time to fix the issue, test, and deploy before the end of the week. A de-scoped plan would be to do just a basic guardian core upgrade in the existing configuration on h1guardian0, and delay the move to Debian 9 and systemd until we can fully resolve the segfault issue.
Here is the full list of nodes currently running under the new system:
If any of these nodes show up white on the guardian overview screen, it's likely because they have crashed. Please let me know and I will deal with them ASAP.
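If you want to check on the host directly, crashed units are easy to spot under systemd (the commands are standard; guardian unit naming as sketched above):

    $ systemctl list-units --state=failed
    $ coredumpctl list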
After implementing the efence stuff above, we came in to find more coredumps the next day. On a cursory inspection of the coredumps, we noted that they all showed completely different stack traces. This is highly unusual and pathological, and prompted Jonathan to question the integrity of the physical RAM itself. We swapped out the RAM with a new 16G ECC stick and let it run for another 24 hours.
When we next checked, we discovered only two efence core dumps, indicating roughly a factor of three increase in the mean time to failure (MTTF). However, unlike the previous scattershot of stack traces, these all showed identical "mprotect" failures, which seemed to point to a side effect of efence itself running into the per-process limit on memory map areas. We increased "max_map_count" (/proc/sys/vm/max_map_count) by a factor of 4, again left it running overnight, and came back to no more coredumps. We cautiously declared victory.
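This makes sense: efence puts every allocation in its own mmap'd region with mprotect'd guard pages, so it chews through map areas far faster than a normal malloc. The bump itself was along these lines (the default on this kernel is 65530, so 4x is about 262120; exact numbers illustrative):

    # check the current limit
    $ sysctl vm.max_map_count
    vm.max_map_count = 65530

    # raise it ~4x now, and persist it across reboots
    $ sysctl -w vm.max_map_count=262120
    $ echo 'vm.max_map_count = 262120' > /etc/sysctl.d/90-efence.conf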
I then started moving the remaining guardian nodes over to the new machine. I completed the new setup by removing efence and rebooting the new machine a couple of times to work out the kinks. Everything seemed to be running ok...
Until more segfaults/coredumps appeared. A couple of hours after the last reboot of the new h1guardian1 machine, there were three segfaults, all with completely different stack traces. I'm now wondering if efence was somehow masking the problem. My best guess is that efence slowed the processes down quite a bit (by increasing system call times), which increased the MTTF by a similar factor. Or the slower processes were less likely to hit some memory corruption race condition.
I'm currently running memtest on h1guardian1 to see if anything shows up, but it's passed all tests so far...
16 segfaults overnight, after rebooting the new guardian machine at about 9pm yesterday. I'll be reverting guardian to the previous configuration today.
Interestingly, though, almost all of the stack traces are of the same type, unlike before, when they were all different. Here's the trace we're seeing in 80% of the instances:
Here's the second most common trace: