The Diode chiller filters are fine. No debris or discoloration. The Crystal chiller filters are dirty due to the 70W amp install work. Replacement filters have been ordered and will be installed when the 70W changes are complete. Closing FAMIS task #8306.
Chandra reiterated not to adjust purge air valves without first contacting someone from the vacuum group.
Corey reminded the group to get work permits signed off before starting work.
Thanks to the new BRS summary pages, Michael and Krishna noticed BRSX had crashed over the weekend. I logged in this morning and the C code had crashed again. I screen capped (attached png) the message this time, but the error message window has the annoying trait that you can't resize it, so I had to copy the message to a notepad window first.
I just started a chiller at Mid-Y; the temperature in the VEA was at 69.3 degrees F. The temperature is slowly coming down. I have the set point in that area at 67 F.
0310 hrs. local: PT246B = 9.41 x 10^-10 Torr, PT245 = 9.34 x 10^-7 Torr, PT210B = 8.26 x 10^-9 Torr, TC3 = 44.5 °C, TC4 = 62.1 °C.
05:05 Local time
PT246B = 9.31 x 10^-10 Torr.
PT245B = 9.97 x 10^-7 Torr.
PT210B = 9.54 x 10^-9 Torr.
TC3 = 45.4 °C
TC4 = 63.6 °C
Briefly checking purge-air supplies at Corner Station, X-end and Y-end. Will also be in Y-mid VEA making temperature measurements of CP4 bake using hand-held T/C probe. Chandra R. and Gerardo M. are my "phone buddies". I will make a comment to this entry when I leave the site.
I noticed that the sole remaining original pneumatic valve on the right drying tower of the Kobelco unit has finally worn out. It is leaking air (fairly significantly) past its actuator shaft. It is still opening/closing for now but is quite wasteful -> I will try to replace it this week (sooner rather than later). All of the "RUNNING" lamps on the X-end purge-air compressor control panel are burned out. Otherwise, the X-end purge air is good. I found a "switching error" on the Y-end purge-air drying tower control panel, so I reset it and it is switching normally (for now). This is a nuisance but is so infrequent that we can limp along until one of us runs out of higher priorities. Also, one of the compressor pressure relief valves is beginning to fail. I couldn't decipher which for sure, but it is coming from the bottom three compressors, i.e. #3, #4 or #5. Again, a relatively low priority at this time.

Upon entering the Y-mid VEA, I noticed that the portable air compressor used to supply compressed air to CP4's turbo backing cart (isolation valve - 1 pneumatic solenoid, 1 electric) was running continuously. Investigating, I found that the 1/4" poly line had split open and was leaking 50 psi air into the room -> I cut out the split section, installed a union and then also reduced the supply pressure down to 40 psi from the as-found 50 psi.

Using a hand-held rigid Type K thermocouple probe, I measured the SST surface temperatures at several locations. All of the 13 zones comprising the independently heated (manual variacs + flexible heaters) 10" gate valve, turbo pump and RGA were within the target 45C-55C -> I upped the variac outputs by an additional 2% to 14%. I peeled back the insulating blankets encompassing CP4, GV11 and GV12 to gain access to "probe" a few areas of interest within the enclosed volume. Of most interest were the "far" sides of the 44" gate valve bodies.
I assumed that, being somewhat shielded, they might be significantly cooler than the sides exposed to the source of heated air. In summary, what I found was that all SST surfaces were between the values measured by TC3 and TC4 (note TC1 and TC2 measure air temperature, while TC3 and TC4 are anchored and in good thermal contact with the SST). In fact, using the average of these two values might be a good choice as a single-temperature representation. Also, I noticed a significant opening/leak past the insulating blankets on the BT side of GV12, where all four blankets are supposed to meet/overlap. From ground level they seemed fine, but when viewed from above the opening was very obvious - I plugged this hole with some rockwool. And finally, the turbo foreline setpoint to isolate is currently set at 2 x 10^-1 Torr, which is about 10X its current value. 1915 hrs. local time -> Kyle leaving site now.
Loaded visit! Thanks, Kyle. Tagging VE here.
Working at MY. Kyle and Gerardo are my phone buddies.
Removed the restrictor plate from the return duct. The system is circulating full flow now and, so far, so good on the temperature profiles. We suspect the restricted return flow was not allowing the bottom section of the enclosure to heat up enough to match the other zones. Full flow (2000 cfm) compensates for leaks in the enclosure. Set point upped to 70C + 10C error, at a 1C/hr ramp rate and 1 hr hold time. We will be back on site tomorrow to check on the system and raise the set point again.
I checked on all three purge air systems while on site. EY is running three compressors. EX is running four compressors. All three systems are switching towers. I did not measure D.P. or go in VEAs.
Leaving site now. Will monitor MY closely from home.
During CP4's bake-out, we would like to send cell phone alarms if the temperature's rate of change exceeds 1 degC per hour. I'm testing a python script I wrote yesterday which accumulates data and reports the differences between the current value and those from various times in the past. During the testing phase, both the code and its companion EPICS IOC are running on my workstation (zotws6).
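The core of such a rate-of-change check can be sketched as follows. This is only an illustrative sketch, not the actual script: the class and names (RateMonitor, LOOKBACKS_S, etc.) are hypothetical, and the real code reads its values from EPICS channels and sends the alarms itself.

```python
from collections import deque

# Hypothetical sketch of the rate-of-change alarm logic described above.
RATE_LIMIT_C_PER_HR = 1.0          # alarm threshold: 1 degC per hour
LOOKBACKS_S = (600, 1800, 3600)    # compare against 10 min, 30 min, 1 hr ago

class RateMonitor:
    def __init__(self, max_age_s=3600):
        self.max_age_s = max_age_s
        self.samples = deque()     # accumulated (timestamp_s, temperature) pairs

    def add(self, t, temp):
        """Record a new sample and discard ones older than the longest lookback."""
        self.samples.append((t, temp))
        while self.samples and t - self.samples[0][0] > self.max_age_s:
            self.samples.popleft()

    def rates(self, t, temp):
        """Return degC-per-hour rates vs. each lookback window with data."""
        out = {}
        for lb in LOOKBACKS_S:
            # most recent stored sample that is at least `lb` seconds old
            past = [(ts, v) for ts, v in self.samples if t - ts >= lb]
            if past:
                ts, v = past[-1]
                out[lb] = (temp - v) / ((t - ts) / 3600.0)
        return out

    def alarm(self, t, temp):
        """True if any window's rate of change exceeds the threshold."""
        return any(abs(r) > RATE_LIMIT_C_PER_HR for r in self.rates(t, temp).values())
```

For example, a sample of 50.0 C followed an hour later by 52.0 C gives a 2 degC/hr rate against the 1 hr window and would trip the alarm, while 50.5 C (0.5 degC/hr) would not.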
I modified the dog clamps on 3 of the 4 PMC feet that hold the PMC to the PSL table. The 4th foot attachment point is located between the ISS and the PMC, and will be addressed at a later date.
For 3 of the 4 feet:
- Cheryl, Matt, Ed
I wanted to note that in order to tighten a couple of the PMC dog clamps, I had to disconnect the cable to the accelerometer that's near the PMC. The accelerometer appears to be very strongly attached to the PSL table top. When I went to remove the cable, I found that it wasn't fully secured to the top of the accelerometer, and I don't know if that signal has been having noise issues, but wanted to log that. The cable is now reconnected and secure.
State of the ISIs: HAM2-isolated-damped: HAM3-damped: HAM4-damped: HAM5-damped: HAM6-locked: ITMX-damped: ITMY-isolated-damped: BS-damped: ETMX-locked: ETMY-locked.
Terry, Sheila, Nutsinee, Daniel
This afternoon there was a lot of work around HAM6, Terry got a beatnote for the FSS using our temporary laser, Daniel swapped in vac cables (480818), and we installed the new long green fiber that we received from MIT today.
We are getting about 50% transmission from the fiber. Before installing it Terry inspected both ends of the fiber using the fiber microscope. We used only 2.7mW into the fiber.
J. Oberling, E. Merilh, M. Heintze
A frustrating day with little progress. We started this morning by finishing the setup for the beam caustic measurement. We installed an iris to block the ASE from the FE, a pump light filter to block any 808nm pump light, and a 95% output coupler to dump most of the FE beam into a beam dump. This done, we began the measurement. It was immediately apparent that the 200mm lens installed in L15 was too much for the Wincam, so we swapped it for a 300mm lens.

We then noticed that we had a very nice LG01 mode instead of the TEM00 mode we expect. This is very strange, as we would not be able to lock the PMC with the mode out of the FE being that bad, and we had the PMC locked yesterday. We noticed the beam clipping on the bottom of PBS02 (which could be causing the mode issue), and upon relocking the PMC found that only 13W was now incident on it. Something in the alignment had clearly shifted. Our first suspect was AMP_PBS01, which did not look very secure in its mount, so we removed it, mounted it better and reinstalled it. We then spent the rest of the day realigning AMP_PBS01 to the PMC. At the end of this alignment we have 25.7W incident on the PMC, with 3.6W reflected and 22.1W transmitted.

The beam is still slightly clipping on PBS02, so apparently the new pick-off has changed the launch angle out of the FE; since the beam still hits mirror M33, we will fix this by shaving a few mm off of a couple of spare mounts and re-mounting WP02 and PBS02 to this new beam line. The FE was left running with the PMC locked over the weekend to see if our alignment shifts again; if so, then we have another issue to hunt down before proceeding with the beam caustic measurement.
We are setting up a new guardian host machine. The new machine (currently "h1guardian1", but to be renamed "h1guardian0" after the transition is complete) is running Debian 9 "stretch", with all CDS software installed from pre-compiled packages from the new CDS debian software archives. It has been configured with a completely new "guardctrl" system that will manage all the guardian nodes under the default systemd process manager. A full description of the new setup will come in a future log, after the transition is complete.
The new system is basically ready to go, and I am now beginning the process of transferring guardian nodes over to the new host. For each node to be transferred, I will stop the process on the old machine, and start it fresh on the new system.
I plan on starting with SUS and SEI in HAM1, and will move through the system ending with HAM6.
There's been a bit of a hitch with the guardian upgrade. The new machine (h1guardian1) has been set up and configured. The new supervision system and control interface are fully in place, and all HAM1 and HAM2 SUS and SEI nodes have been moved to the new configuration. The configuration is currently documented in the guardian gitlab wiki.
Unfortunately, node processes are occasionally segfaulting, spontaneously and for no apparent reason. The failures are happening at a rate of roughly one every 6 hours. I configured systemd to catch and log coredumps from segfaults for inspection (using the systemd-coredump utility). After we caught our next segfault (which happened only a couple of hours later), Jonathan Hanks and I started digging into the core to see what we could ferret out. It appears to be some sort of memory corruption error, but we have not yet determined where in the stack the problem is coming from. I suspect that it's in the pcaspy EPICS portable channel access python bindings, but it could be in EPICS. I think it's unlikely that it's in python2.7 itself, although we aren't ruling anything out.
We then set up the processes to be run under electric fence to try to catch any memory out-of-bounds errors. This morning I found two processes that had been killed by efence, but I have not yet inspected the core files in depth. Below are the coredump summaries from coredumpctl on h1guardian1.
This does not bode well for the upgrade. Best case we figure out what we think is causing the segfaults early in the week, but there still won't be enough time to fix the issue, test, and deploy before the end of the week. A de-scoped agenda would be to just do a basic guardian core upgrade in the existing configuration on h1guardian0 and delay the move to Debian 9 and systemd until we can fully resolve the segfault issue.
Here is the full list of nodes currently running under the new system:
HPI_HAM1 enabled active
HPI_HAM2 enabled active
ISI_HAM2 enabled active
ISI_HAM2_CONF enabled active
SEI_HAM2 enabled active
SUS_IM1 enabled active
SUS_IM2 enabled active
SUS_IM3 enabled active
SUS_IM4 enabled active
SUS_MC1 enabled active
SUS_MC2 enabled active
SUS_MC3 enabled active
SUS_PR2 enabled active
SUS_PR3 enabled active
SUS_PRM enabled active
SUS_RM1 enabled active
SUS_RM2 enabled active
If any of these nodes show up white on the guardian overview screen, it's likely because they have crashed. Please let me know and I will deal with them asap.
guardian@h1guardian1:~$ coredumpctl info 11512
PID: 11512 (guardian SUS_MC)
UID: 1010 (guardian)
GID: 1001 (controls)
Signal: 11 (SEGV)
Timestamp: Sat 2018-03-03 11:56:20 PST (4h 50min ago)
Command Line: guardian SUS_MC3 /opt/rtcds/userapps/release/sus/common/guardian/SUS_MC3.py
Executable: /usr/bin/python2.7
Control Group: /user.slice/user-1010.slice/user@1010.service/guardian.slice/guardian@SUS_MC3.service
Unit: user@1010.service
User Unit: guardian@SUS_MC3.service
Slice: user-1010.slice
Owner UID: 1010 (guardian)
Boot ID: 870fed33cb4446e298e142ae901c1830
Machine ID: 699a2492538f4c09861889afeedf39ab
Hostname: h1guardian1
Storage: /var/lib/systemd/coredump/core.guardianx20SUS_MC.1010.870fed33cb4446e298e142ae901c1830.11512.1520106980000000000000.lz4
Message: Process 11512 (guardian SUS_MC) of user 1010 dumped core.
Stack trace of thread 11512:
#0 0x00007f1255965646 strlen (libc.so.6)
#1 0x00007f12567c86ab EF_Printv (libefence.so.0.0)
#2 0x00007f12567c881d EF_Exitv (libefence.so.0.0)
#3 0x00007f12567c88cc EF_Exit (libefence.so.0.0)
#4 0x00007f12567c7837 n/a (libefence.so.0.0)
#5 0x00007f12567c7f30 memalign (libefence.so.0.0)
#6 0x00007f1241cba02d new_epicsTimeStamp (_cas.x86_64-linux-gnu.so)
#7 0x0000556e57263b9a call_function (python2.7)
#8 0x0000556e57261d45 PyEval_EvalCodeEx (python2.7)
#9 0x0000556e5727ea7e function_call.lto_priv.296 (python2.7)
#10 0x0000556e57250413 PyObject_Call (python2.7)
...
guardian@h1guardian1:~$ coredumpctl info 11475
PID: 11475 (guardian SUS_MC)
UID: 1010 (guardian)
GID: 1001 (controls)
Signal: 11 (SEGV)
Timestamp: Sat 2018-03-03 01:33:51 PST (15h ago)
Command Line: guardian SUS_MC1 /opt/rtcds/userapps/release/sus/common/guardian/SUS_MC1.py
Executable: /usr/bin/python2.7
Control Group: /user.slice/user-1010.slice/user@1010.service/guardian.slice/guardian@SUS_MC1.service
Unit: user@1010.service
User Unit: guardian@SUS_MC1.service
Slice: user-1010.slice
Owner UID: 1010 (guardian)
Boot ID: 870fed33cb4446e298e142ae901c1830
Machine ID: 699a2492538f4c09861889afeedf39ab
Hostname: h1guardian1
Storage: /var/lib/systemd/coredump/core.guardianx20SUS_MC.1010.870fed33cb4446e298e142ae901c1830.11475.1520069631000000000000.lz4
Message: Process 11475 (guardian SUS_MC) of user 1010 dumped core.
Stack trace of thread 11475:
#0 0x00007fa7579b5646 strlen (libc.so.6)
#1 0x00007fa7588186ab EF_Printv (libefence.so.0.0)
#2 0x00007fa75881881d EF_Exitv (libefence.so.0.0)
#3 0x00007fa7588188cc EF_Exit (libefence.so.0.0)
#4 0x00007fa758817837 n/a (libefence.so.0.0)
#5 0x00007fa758817f30 memalign (libefence.so.0.0)
#6 0x00005595da26610f PyList_New (python2.7)
#7 0x00005595da28cb8e PyEval_EvalFrameEx (python2.7)
#8 0x00005595da29142f fast_function (python2.7)
#9 0x00005595da29142f fast_function (python2.7)
#10 0x00005595da289d45 PyEval_EvalCodeEx (python2.7)
...
After implementing the efence stuff above, we came in to find more coredumps the next day. On a cursory inspection of the coredumps, we noted that they all showed completely different stack traces. This is highly unusual and pathological, and prompted Jonathan to question the integrity of the physical RAM itself. We swapped out the RAM with a new 16G ECC stick and let it run for another 24 hours.
When we next checked, we discovered only two efence core dumps, indicating an approximate factor-of-three increase in the mean time to failure (MTTF). However, unlike the previous scattershot of stack traces, these all showed identical "mprotect" failures, which seemed to point to a side effect of efence itself running into limits on per-process memory map areas. We increased "max_map_count" (/proc/sys/vm/max_map_count) by a factor of 4, again left it running overnight, and came back to no more coredumps. We cautiously declared victory.
I then started moving the remaining guardian nodes over to the new machine. I completed the new setup by removing the efence, and rebooting the new machine a couple of times to work out the kinks. Everything seemed to be running ok...
Until more segfault/coredumps appeared. A couple of hours after the last reboot of the new h1guardian1 machine, there were three segfaults, all with completely different stack traces. I'm now wondering if efence was somehow masking the problem. My best guess is that efence was slowing down the processes quite a bit (by increasing system call times), which increased the MTTF by a similar factor. Or the slower processes were less likely to run into some memory-corruption race condition.
I'm currently running memtest on h1guardian1 to see if anything shows up, but it's passed all tests so far...
16 seg faults overnight, after rebooting the new guardian machine at about 9pm yesterday. I'll be reverting guardian to the previous configuration today.
Interestingly, though, almost all of the stack traces are of the same type, unlike before, when they were all different. Here's the trace we're seeing in 80% of the instances:
#0 0x00007ffb9bfe4218 malloc_consolidate (libc.so.6)
#1 0x00007ffb9bfe4ea8 _int_free (libc.so.6)
#2 0x000055d2caca7bc5 list_dealloc.lto_priv.1797 (python2.7)
#3 0x000055d2cacdb127 frame_dealloc.lto_priv.291 (python2.7)
#4 0x000055d2caccb450 fast_function (python2.7)
#5 0x000055d2caccb42f fast_function (python2.7)
#6 0x000055d2caccb42f fast_function (python2.7)
#7 0x000055d2caccb42f fast_function (python2.7)
#8 0x000055d2cacc3d45 PyEval_EvalCodeEx (python2.7)
#9 0x000055d2cace0a7e function_call.lto_priv.296 (python2.7)
#10 0x000055d2cacb2413 PyObject_Call (python2.7)
#11 0x000055d2cacf735e instancemethod_call.lto_priv.215 (python2.7)
#12 0x000055d2cacb2413 PyObject_Call (python2.7)
#13 0x000055d2cad69c7a call_method.lto_priv.2801 (python2.7)
#14 0x000055d2cad69deb slot_mp_ass_subscript.lto_priv.1204 (python2.7)
#15 0x000055d2cacc6c5b PyEval_EvalFrameEx (python2.7)
#16 0x000055d2cacc3d45 PyEval_EvalCodeEx (python2.7)
#17 0x000055d2cace0a7e function_call.lto_priv.296 (python2.7)
#18 0x000055d2cacb2413 PyObject_Call (python2.7)
#19 0x000055d2cacf735e instancemethod_call.lto_priv.215 (python2.7)
#20 0x000055d2cacb2413 PyObject_Call (python2.7)
#21 0x000055d2cad69c7a call_method.lto_priv.2801 (python2.7)
#22 0x000055d2cad69deb slot_mp_ass_subscript.lto_priv.1204 (python2.7)
#23 0x000055d2cacc6c5b PyEval_EvalFrameEx (python2.7)
#24 0x000055d2cacc3d45 PyEval_EvalCodeEx (python2.7)
#25 0x000055d2cace0a7e function_call.lto_priv.296 (python2.7)
#26 0x000055d2cacb2413 PyObject_Call (python2.7)
#27 0x000055d2cacf735e instancemethod_call.lto_priv.215 (python2.7)
#28 0x000055d2cacb2413 PyObject_Call (python2.7)
#29 0x000055d2cad69c7a call_method.lto_priv.2801 (python2.7)
#30 0x000055d2cad69deb slot_mp_ass_subscript.lto_priv.1204 (python2.7)
Here's the second most common trace:
#0 0x00007f7bf5c32218 malloc_consolidate (libc.so.6)
#1 0x00007f7bf5c32ea8 _int_free (libc.so.6)
#2 0x00007f7bf5c350e4 _int_realloc (libc.so.6)
#3 0x00007f7bf5c366e9 __GI___libc_realloc (libc.so.6)
#4 0x000055f7eaad766f list_resize.lto_priv.1795 (python2.7)
#5 0x000055f7eaad6e55 app1 (python2.7)
#6 0x000055f7eaafd48b PyEval_EvalFrameEx (python2.7)
#7 0x000055f7eab0142f fast_function (python2.7)
#8 0x000055f7eab0142f fast_function (python2.7)
#9 0x000055f7eab0142f fast_function (python2.7)
#10 0x000055f7eab0142f fast_function (python2.7)
#11 0x000055f7eaaf9d45 PyEval_EvalCodeEx (python2.7)
#12 0x000055f7eab16a7e function_call.lto_priv.296 (python2.7)
#13 0x000055f7eaae8413 PyObject_Call (python2.7)
#14 0x000055f7eab2d35e instancemethod_call.lto_priv.215 (python2.7)
#15 0x000055f7eaae8413 PyObject_Call (python2.7)
#16 0x000055f7eab9fc7a call_method.lto_priv.2801 (python2.7)
#17 0x000055f7eab9fdeb slot_mp_ass_subscript.lto_priv.1204 (python2.7)
#18 0x000055f7eaafcc5b PyEval_EvalFrameEx (python2.7)
#19 0x000055f7eaaf9d45 PyEval_EvalCodeEx (python2.7)
#20 0x000055f7eab16a7e function_call.lto_priv.296 (python2.7)
#21 0x000055f7eaae8413 PyObject_Call (python2.7)
#22 0x000055f7eab2d35e instancemethod_call.lto_priv.215 (python2.7)
#23 0x000055f7eaae8413 PyObject_Call (python2.7)
#24 0x000055f7eab9fc7a call_method.lto_priv.2801 (python2.7)
#25 0x000055f7eab9fdeb slot_mp_ass_subscript.lto_priv.1204 (python2.7)
#26 0x000055f7eaafcc5b PyEval_EvalFrameEx (python2.7)
#27 0x000055f7eaaf9d45 PyEval_EvalCodeEx (python2.7)
#28 0x000055f7eab16a7e function_call.lto_priv.296 (python2.7)
#29 0x000055f7eaae8413 PyObject_Call (python2.7)
#30 0x000055f7eab2d35e instancemethod_call.lto_priv.215 (python2.7)