WP 7504
Field cabling for SQZT6 has been disconnected. The fibers were pulled back on top of HAM5. RF and power cables are by the HAM5 door. PZT and fast shutter high voltage power supplies have been powered off. The interlock cable for the HV power supplies was reconnected.
PeterF pointed out that it would be good to see what beam jitter the IMC WFS are seeing now that we have the new 70W laser operating with the lower water flow.
I attach 2 screenshots: one of all the WFS, and one with just the WFS_B traces, since that has historically been, and still is, the sensor that sees the jitter motion best. You can see that the coherence between the WFS and the PEM accelerometer on the PSL periscope has decreased, as has the overall level of the spectra. There is a small new feature at about 583 Hz, but otherwise the spectra above 100 Hz are all notably better. I haven't confirmed the source of the extra low-frequency noise in the WFS right now, but there's a lot going on in the LVEA, and the comparison time is Observation mode.
Perhaps I'll ask one of our Fellows who is working on the noise budget to use the old coupling TF to try to project what this noise would mean for our O2 DARM, but hopefully we'll also have significantly less coupling now that we've replaced ITMX, so that projection would be an upper limit.
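For reference, the projection itself is just a product of spectra; a minimal sketch, assuming the jitter ASD and the old coupling TF magnitude are available as two-column text files (the file names below are placeholders, not real files):

import numpy as np

# Placeholder file names: measured WFS_B jitter ASD and the old jitter->DARM
# coupling TF magnitude, each stored as (frequency, value) columns.
freq = np.logspace(1, 3, 500)                              # [Hz]
wfs_asd = np.interp(freq, *np.loadtxt('wfs_b_jitter_asd.txt', unpack=True))
coupling_mag = np.interp(freq, *np.loadtxt('old_jitter_to_darm_tf.txt', unpack=True))

# Projected DARM contribution; an upper limit if the coupling has dropped
# since the ITMX replacement.
darm_projection = wfs_asd * coupling_mag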
EDIT: Note that these are the RF channels (I just realized that I forgot to include that information in my DTT-froze-on-me / redo things fiasco). I'll soon post a version with the calibrated WFS DC channels.
Really what we want to see is the WFS DC spectra, in calibrated units, so that we can see the ratio of the 1,0 modes to the 0,0 mode. However, the recent times that the IMC has been locked have either been at such low power (0.9 W or less), or with the beam so far off center on the WFS, that the data isn't great.
I have some data from a lock on April 4th with the new 70W amplifier but before the rotation stage was locked out at low power (and before the PMC and EOM were swapped), so the IMC was locked with 5.2W injected into the vacuum. Comparing with alog 34845 from March 2017, some of the peaks look perhaps a little better, but I need to retake the data with the IMC locked at higher input power to have better SNR. I don't have the unlocked version of WFS at this time - we went straight from locked to laser safe.
J. Kissel
Grabbed a few closeout measurements of H1 SUS OMC. All clear; good to go for door closure. Transfer functions show that the dynamics are virtually identical to what they were before. See
- Individual measurement: 2018-04-25_2324_H1SUSOMC_M1_ALL_TFs.pdf
- Comparison with others: allomcss_2018-04-25_2324_H1SUSOMC_M1_Phase3a_ALL_ZOOMED_TFs.pdf
The high-frequency OSEM sensor noise ASD (2018-04-26_1831_H1SUSOMC_M1_OSEM_Noise_ASDs.png) has some spikes, but nothing egregious like a grounded BOSEM (e.g. LHO aLOG 40787). If I had to complain, I would complain about the T1 OSEM being a little higher in noise starting around 500 Hz. But I don't have to complain, so I won't. #WORKSFORME
TF Data Templates:
2018-04-25_2324_H1SUSOMC_M1_WhiteNoise_L_0p02to50Hz.xml
2018-04-25_2324_H1SUSOMC_M1_WhiteNoise_P_0p02to50Hz.xml
2018-04-25_2324_H1SUSOMC_M1_WhiteNoise_R_0p02to50Hz.xml
2018-04-25_2324_H1SUSOMC_M1_WhiteNoise_T_0p02to50Hz.xml
2018-04-25_2324_H1SUSOMC_M1_WhiteNoise_V_0p02to50Hz.xml
2018-04-25_2324_H1SUSOMC_M1_WhiteNoise_Y_0p02to50Hz.xml
ASD Template:
2018-04-26_1831_H1SUSOMC_M1_OSEM_ASDs.xml
I'm bypassing LX alarms to cell phones while the IOC is being worked on. Alarm bypass will expire 15:28 PDT.
bypass has been removed, h0velx is back.
As part of the HAM6 closeout and prep for pumping, I have disabled the picomotor drivers for HAM6/ISCT6 as well as Squeezer (those are the labels on the picomotor screen). All other picomotor drivers were already disabled.
TVo, Sheila, Dan Brown
Summary: Astigmatism in the OPO beam (that seems to be happening in HAM5 somewhere) is limiting the mode matching to the OMC. If the astigmatism can be fixed we could get 95% or better matching.
Along with the OMC mode scans taken the other night, we also took beam profile measurements in HAM6 with the Nanoscan. On the OMC side of the table we took profiles between HAM5 and OM1, between OM1 and OM2, and on OMC REFL; on the OPO side we took them after ZM1 and then propagated off of the beam diverter onto SQZT6. Using the as-built Finesse model for the HAM5-to-OMC path, I fit the input x- and y-plane beam parameters to the data. This is compared to the as-built OMC mode propagated to the SRM AR surface (beam parameter values in the legend).
The x-plane beam has an overlap of 83%, the y-plane 95%. Taking the ratio of the 2nd- to 0th-order peaks in the OMC scan averages the two planes, so we measure 89% matching, which agrees pretty well with what we measured the other night. The OMC scan ratio should in general underestimate the mismatch because of astigmatism in the OMC itself; on the other hand, the non-zero 1st-order peak from misalignment also couples a bit into the 2nd-order modes, which overestimates the mismatch by making 02/20 larger. These two effects just seem to happen to cancel each other out in these scans.
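For reference, the per-plane numbers come from the usual fundamental-mode overlap integral; here is a minimal sketch of that arithmetic, with placeholder q values (the actual fitted values are in the figure legends):

import numpy as np

def overlap(q1, q2):
    # power overlap of two fundamental Gaussian modes with complex beam
    # parameters q1, q2 (q = z + i*zR) evaluated at the same plane
    return 4 * abs(q1.imag * q2.imag) / abs(q1 - np.conj(q2))**2

q_omc = -1.0 + 2.2j    # target OMC mode at the reference plane (placeholder)
q_x = -0.8 + 1.6j      # fitted x-plane beam (placeholder)
q_y = -1.0 + 2.0j      # fitted y-plane beam (placeholder)

ox, oy = overlap(q_x, q_omc), overlap(q_y, q_omc)
# the 2nd/0th order peak ratio in a scan measures roughly the mean mismatch
# of the two planes, e.g. (0.83 + 0.95)/2 ~ 0.89 here
print(ox, oy, (ox + oy) / 2)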
I also compared our measurements to what I originally predicted we should have gotten. As can be seen, the y-plane prediction vs. fit isn't that far off, but the x-plane astigmatism causes the x-plane to be significantly different.
Lastly, I fit the beam profiles to a mode propagating away from ZM1. If there were no astigmatism (and the model parameters are correct) this beam propagated to the OMC would have ~98% matching.
BRS-X excursion is due to clean-rooms and work being done at the location. BRS-Y needs to be addressed. This downward trend is typical.
TITLE: 04/26 Day Shift: 15:00-23:00 UTC (08:00-16:00 PST), all times posted in UTC
STATE of H1: Planned Engineering
OUTGOING OPERATOR: None
CURRENT ENVIRONMENT:
Wind: 5mph Gusts, 3mph 5min avg
Primary useism: 0.03 μm/s
Secondary useism: 0.18 μm/s
QUICK SUMMARY:
All Quiet on the Hanford Front.
I have locked PRX using the ALIGN_IFO guardian node, after a bit of alignment tweaking by hand. In particular, ITMX is quite far from where it was. PRX is not yet aligned well, since I got distracted trying to close the REFL DC centering loops (which are now closed).
Next up:
I think there are some sign-flip shenanigans going on, which I think I have now fixed so that they make sense. But the DC2 centering loops are still unstable at their old nominal gains. Right now the gains are lower by a factor of 4; increasing them causes the loops to go unstable.
Recall that in alog 40853 JeffK flipped some signs so that the suspensions were matching the proper sign conventions. TVo, Sheila, and others found that this meant they needed to flip the feedback signs in the AS DC centering loops, as noted in alog 41436. In that alog, Sheila preemptively flipped the signs also in the REFL DC centering loops. However, as Jeff noted in his alog, RM2 didn't need the sign flip that the other *Ms did. So, just flipping the DC centering loop sign isn't quite the right thing. In the end, I put the REFL DC centering loops back to their previously nominal negative signs, and then flipped the sign of the RM1 elements in the ASC pitch and yaw output matrices, leaving the RM2 elements alone. I think this achieves the correct sign-flippage. But, the DC2 loops are still unstable when I go to their full gains (-10 for pit and -12 for yaw). So, for now they're set at -3 for both pitch and yaw. Tomorrow I'll measure the loops and see what's going on.
Attached is the alignment slider screenshot of where I have things right now.
Related entries:
https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=29416
https://services.ligo-la.caltech.edu/FRS/show_bug.cgi?id=6446
Summary:
For a very long time we've been limping along with some of the TMSX BOSEMs (namely F1, LF and RT) somehow seemingly less sensitive than they used to be (see the above alog and FRS).
By merely pushing the BOSEMs closer to the magnets using the adjustment nuts and roughly setting them to half of their open values, we restored the response of these BOSEMs.
We also swapped the RT BOSEM (SN 164) for the one Betsy gave us (SN 083) because we could, but in retrospect it was unnecessary. We won't change it back because it's a pain.
Details:
We measured the fully open OSEMINF_INMON for the suspect BOSEMs by pulling them away from the mass using the adjustment nuts. (In the case of RT we removed the BOSEM plate from the cage and didn't see a large change.)
We found that all of them were already very close to their open values, and that the open values were smaller than they used to be, judging from the offsets that were set a long time ago. The former means that the BOSEM bodies are much farther away from the magnets than they used to be. Apparently TMSX sagged and rolled.
| BOSEM | OSEMINF_INMON before (counts) | Open count | 2*|Offset| before |
| F1 | 19.1k | 21.1k | 25.44k |
| LF | 18.1k | 18.7k | 23.16k |
| RT | 21.4k | 22.1k (26.95k after the swap) | 26.172k |
We swapped RT (SN 164) with the known good one (SN 083). There was no particular reason to choose RT rather than LF; it's just that Corey was working on RT at the time, so it was convenient to test.
We set the offsets to half of the corresponding open counts, and adjusted the BOSEM depths so that the readings come close to the offsets.
The attachment shows the coil to coil transfer coefficient (OSEMINF/COILOUTF_EXC) at about 0.02 Hz before/after the change for F1, LF and RT (it also shows F2 and F3 but there's no "after" for these). You want to compare pink open circle with red solid disk, or blue open triangle with blue solid triangle.
As you can see the sensitivity increased by a factor of 1.9 for F1, 2.75 for LF and 3.3 for RT.
| BOSEM | Measurement before | Measurement after | Sensitivity increase (after/before) |
| F1 | 8.1e-4 | 1.53e-3 | 1.9 |
| LF | 2.28e-4 | 6.27e-4 | 2.8 |
| RT | 2.54e-4 | 8.34e-4 | 3.3 (=1.2*2.8) |
As for RT, a factor of 1.2 came from the increase in the LED power (26.95k/22.1k=1.22), so the change caused by the depth of OSEM is 3.3/1.2=2.8, almost the same as LF.
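The arithmetic behind the table and the LED-power correction, as a quick check (just the numbers quoted above):

f1 = 1.53e-3 / 8.1e-4        # ~1.9
lf = 6.27e-4 / 2.28e-4       # ~2.8
rt = 8.34e-4 / 2.54e-4       # ~3.3
led = 26.95 / 22.1           # ~1.2, RT open-count increase from the BOSEM swap
print(f1, lf, rt, rt / led)  # rt/led is comparable to LF once the LED power change is removed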
I removed a factor of 3 that was added to the PIT damping gain at some point, now all damping gains are 1 (except SD).
I'm also somewhat worried about TMSY LF and RT just because the OSEMINF_INMON is much larger than the offset (second attachment). However, in a similar measurement for TMSY (third attachment; left half is TMSX, right half is TMSY), F3 and LF don't look terrible.
It seems as if the TMS sank and the LEDs lost power at the same time, so it would have been difficult to judge what was going on just by looking at the data.
J. Kissel
I've completed the update to the open light current compensation in the OSEMINF filters by installing the normalization gains (H1:SUS-TMSX_M1_OSEMINF_${DOF}_GAIN) for the above updated OSEMs (F1, LF and RT), and then accepted the new offsets and gains in the SDF system. See attached screenshots, but for the sake of future searchability:
| OSEM | Open Light Current (ADC counts) | OFFSET (OLC / -2) | GAIN (30000/OLC) |
| F1 | 21030 | -10515 | 1.4265 |
| LF | 18700 | -9350 | 1.6043 |
| RT | 26950 | -13475 | 1.1132 |
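A sketch of the compensation arithmetic used for those values (OFFSET = OLC / -2 recenters the signal, GAIN = 30000 / OLC normalizes the open-light value to 30000 counts):

open_light_counts = {'F1': 21030, 'LF': 18700, 'RT': 26950}
for osem, olc in open_light_counts.items():
    offset = -olc / 2.0
    gain = 30000.0 / olc
    print("%s  OFFSET=%.0f  GAIN=%.4f" % (osem, offset, gain))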
Thomas Vo, Sheila, TJ, Nutsinee
This morning Thomas TJ and I set the alignments of the OMs back to the alignments we found using the single bounce beam (41540), and steered the seed beam on to the AS WFS. This meant that ZM1 was almost railed, as we had seen yesterday, so we moved it in yaw. We were then able to center on the AS WFS for a couple of different pico alignments.
We decided that this is the last thing we need to do in HAM6 for the squeezer before putting doors on, so after lunch Nutsinee TVo and I went back and removed our apertures and tools, wiped the table surfaces off, used the flashlight array to look at the VIP and other optics (didn't see any real problems), and Nutsinee took many photos. We also spent some time clearing some of our equipment out of the area in preparation for moving the tables tomorrow. Nutsinee checked that everything is secured inside both SQZT6 and ISCT6, so after removing the cables and carefully stowing the fibers we should be ready to move the tables tomorrow.
Daniel, Nutsinee
Last week we went out to SQZT6 with a halogen light bulb and measured the shot noise of the OPO REFL PD (aLIGO broadband PD). An input of 0.5 mA (0.7 V / 1400 Ohm transimpedance) gives ~200 nVrms/sqrt(Hz). Using a responsivity of 0.3 A/W, I calculated what the shot noise would be for a green input power of 7 mW (right below the threshold power of 7.4-8 mW, see alog41150 for details) and 3.5 mW (~half the threshold power, a possible operating point according to Sheila). At 7 mW input to the OPO, 6.02 mW is expected to hit the PD (14% loss between the fiber and SQZT6, see alog41623); we expect about 40% coming back when we're locked (alog41045), and we are about a factor of 2 above dark noise at this operating power. At half power we are just a factor of 1.2 above the dark noise. The shot-noise-limited power is about 1 mW.
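As a cross-check of the scaling only (not of the measured numbers, since any whitening/readout gain is ignored here), a minimal shot-noise sketch using the responsivity, transimpedance, loss, and locked-reflection figures quoted above:

import math

Q = 1.602e-19                # electron charge [C]
RESPONSIVITY = 0.3           # [A/W], as quoted above
TRANSIMPEDANCE = 1400.0      # [Ohm], from 0.7 V / 0.5 mA

def refl_pd_shot_noise(p_opo_in_w, path_efficiency=0.86, locked_refl=0.40):
    # shot-noise voltage ASD [V/rtHz] at the PD for green power sent toward the OPO
    i_dc = RESPONSIVITY * p_opo_in_w * path_efficiency * locked_refl
    return TRANSIMPEDANCE * math.sqrt(2 * Q * i_dc)

for p_mw in (7.0, 3.5):
    print("%.1f mW -> ~%.0f nV/rtHz shot noise" % (p_mw, 1e9 * refl_pd_shot_noise(p_mw * 1e-3)))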
Note: OPO locking signal comes from this PD.
We also measured shotnoise of the homodyne. More alog to come.
I have yet to test whether everything still works with these new cables. I just wanted them in before the table is moved.
TITLE: 04/25 Day Shift: 15:00-23:00 UTC (08:00-16:00 PST), all times posted in UTC
STATE of H1: Planned Engineering
INCOMING OPERATOR: None
LOG:
14:59 (7:59) Ken to various parts of CS -- Installing cameras
15:00 (8:00) Start of shift
16:00 (9:00) Sheila to HAM6 -- Split cover
16:01 (9:01) Hugh, Corey to LVEA -- Mark table location, lock HAM4 HEPI
16:05 (9:05) Chandra to MY
16:15 (9:15) Sheila back from HAM6
16:49 (9:49) Nutsinee to SQZ rack -- Plugging in cables
17:03 (10:03) Nutsinee back from SQZ rack
17:06 (10:06) Sheila to LVEA -- take lock off PSL light pipe
17:07 (10:07) Cheryl to LVEA
17:08 (10:08) Corey back from LVEA
17:08 (10:08) Hugh back from LVEA
17:10 (10:10) Sheila back from LVEA
17:15 (10:15) Cheryl back from LVEA
17:27 (10:27) Chandra back from MY
17:34 (10:34) Keita, Corey to EX -- TMSX work
17:40 (10:40) Peter to PSL enclosure
17:48 (10:48) Hugh to HAM4 -- finish lockup
17:59 (10:59) Sheila, TJ, TVo to HAM6 -- check SQZ/IFO beam alignments
18:20 (11:20) Hugh back from HAM4
18:37 (11:37) Rick to PSL enclosure
18:41 (11:41) Ken to LVEA
19:08 (12:08) Corey, Keita heading back from EX for lunch
19:20 (12:20) TJ, TVo,Sheila back from HAM6
20:09 (13:09) Peter, Rick out of PSL enclosure
20:10 (13:10) Keita, Corey to EX
20:20 (13:20) Ed to SQZ table -- Label connectors
20:48 (13:48) Sheila, TVo to HAM6 -- Take close-out photos
20:50 (13:50) Nutsinee to HAM6 -- assist Sheila and TVo
21:05 (14:05) Gerardo to HAM6
21:22 (14:22) Chandra to EY, MY
21:46 (14:46) TJ to Optics Lab
21:50 (14:50) Travis to EX -- Deliver BOSEMS
21:53 (14:53) TJ out of Optics Lab
22:00 (15:00) Gerardo back from LVEA
22:10 (15:10) Travis back from EX
22:19 (15:19) TVo, Sheila back from HAM6
22:32 (15:32) Nutsinee back from HAM6
22:35 (15:35) Corey, Keita back from EX
23:00 (16:00) End of shift
By Jameson Rollins and Jonathan Hanks
Soon after the new guardian machine (h1guardian1 [0]) was moved into production (i.e. after all guardian nodes were moved to the new machine), we started seeing spontaneous segfaults in the guardian node processes. Running nodes under efence and valgrind never produced a crash in either case, presumably either because the MTTF was increased significantly or because the crashes were circumvented entirely by serialized system calls (valgrind).
Adding to the confusion was the inability to reproduce the crashes in a test environment. 50 test guardian nodes running under the new production environment, subscribing to hundreds of front end channels (but with no writes), and with test clients subscribed to all control channels, failed to show any crashes in two weeks of straight running. (See below for the later discovered reason why this was.)
The following steps were taken to help diagnose the problem:
Inspection of the coredumps generally turned up no useful information other than that the problem was a memory corruption error, likely in the EPICS CAS (or in the pcaspy python wrapping of it). The relatively long MTTF pointed to a threading race condition.
[0] Configuration of the new h1guardian1 machine:
[1] systemd-coredump is a very useful package. All core dump files are logged and archived, and the coredumpctl command provides access to those logs and an easy means for viewing them with gdb. Unfortunately the log files are cleared out by default after 3 days, and there doesn't seem to be a way to increase the expiration time. So be sure to back up the coredump files from /var/lib/systemd/coredump/ for later inspection.
In an attempt to get more informative error reporting with less impact on performance, Jonathan compiled python2.7 and pcaspy with libasan, the address sanitizer. libasan is similar to valgrind in that it wraps all memory allocation calls to detect memory errors that commonly lead to seg faults. But it's much faster and doesn't serialize the code, thereby leaving in place the threads that were likely triggering the crashes.
(As an aside, libtsan, the thread sanitizer, is basically impossible to use with python, since the python core itself seems to not be particularly thread safe. Running guardian with python libtsan caused guardian to crash immediately after launch with ~20k lines of tsan log output (yes, really). So this was abandoned as an avenue of investigation.)
Once we finally got guardian running under libasan [0], we started to observe libasan-triggered aborts. The libasan abort logs were consistent:
==20277==ERROR: AddressSanitizer: heap-use-after-free on address 0x602001074d90 at pc 0x7fa95a7ec0f1 bp 0x7fff99bc1660 sp 0x7fff99bc0e10
WRITE of size 8 at 0x602001074d90 thread T0
    #0 0x7fa95a7ec0f0 in __interceptor_strncpy (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x6f0f0)
    #1 0x7fa94f6acd78 in aitString::copy(char const*, unsigned int, unsigned int) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x2bd78)
    #2 0x7fa94f6a8fd3 (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x27fd3)
    #3 0x7fa94f69a1e0 in gdd::putConvert(aitString const&) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x191e0)
    #4 0x7fa95001bcc3 in gdd_putConvertString pcaspy/casdef_wrap.cpp:4136
    #5 0x7fa95003320d in _wrap_gdd_putConvertString pcaspy/casdef_wrap.cpp:7977
    #6 0x564b9ecccda4 in call_function ../Python/ceval.c:4352
    ...
    #67 0x564b9eb3fce9 in _start (/opt/python/python-2.7.13-asan/bin/python2.7+0xd9ce9)

0x602001074d90 is located 0 bytes inside of 8-byte region [0x602001074d90,0x602001074d98)
freed by thread T3 here:
    #0 0x7fa95a840370 in operator delete[](void*) (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc3370)
    #1 0x7fa94f6996de in gdd::setPrimType(aitEnum) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x186de)

previously allocated by thread T0 here:
    #0 0x7fa95a83fd70 in operator new[](unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.3+0xc2d70)
    #1 0x7fa94f6acd2c in aitString::copy(char const*, unsigned int, unsigned int) (/usr/lib/x86_64-linux-gnu/libgdd.so.3.15.3+0x2bd2c)

Thread T3 created by T0 here:
    #0 0x7fa95a7adf59 in __interceptor_pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x30f59)
    #1 0x564b9ed5e942 in PyThread_start_new_thread ../Python/thread_pthread.h:194

SUMMARY: AddressSanitizer: heap-use-after-free (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x6f0f0) in __interceptor_strncpy

Shadow bytes around the buggy address:
  0x0c0480206960: fa fa fd fa fa fa 00 fa fa fa fd fd fa fa fa fa
  0x0c0480206970: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c0480206980: fa fa fd fd fa fa fd fa fa fa fa fa fa fa fd fd
  0x0c0480206990: fa fa fa fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c04802069a0: fa fa fa fa fa fa 00 fa fa fa fa fa fa fa fd fa
=>0x0c04802069b0: fa fa[fd]fa fa fa fd fd fa fa fd fd fa fa 00 fa
  0x0c04802069c0: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c04802069d0: fa fa 00 fa fa fa fd fa fa fa fd fd fa fa fd fa
  0x0c04802069e0: fa fa fa fa fa fa fd fa fa fa 00 fa fa fa fd fd
  0x0c04802069f0: fa fa 00 fa fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c0480206a00: fa fa fd fd fa fa fd fd fa fa fd fd fa fa 00 fa

Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==20277==ABORTING
Here's a stack trace from a similar crash (couldn't find the trace from the exact same process, but the libasan aborts are all identical):
Stack trace of thread 18347:
#0  0x00007fe5c173efff __GI_raise (libc.so.6)
#1  0x00007fe5c174042a __GI_abort (libc.so.6)
#2  0x00007fe5c24ae329 n/a (libasan.so.3)
#3  0x00007fe5c24a39ab n/a (libasan.so.3)
#4  0x00007fe5c249db57 n/a (libasan.so.3)
#5  0x00007fe5c2442113 __interceptor_strncpy (libasan.so.3)
#6  0x00007fe5b7bf2d79 strncpy (libgdd.so.3.15.3)
#7  0x00007fe5b7beefd4 _ZN9aitString4copyEPKcj (libgdd.so.3.15.3)
#8  0x00007fe5b7be01e1 _Z10aitConvert7aitEnumPvS_PKvjPK18gddEnumStringTable (libgdd.so.3.15.3)
#9  0x00007fe5b8561cc4 gdd_putConvertString (_cas.so)
#10 0x00007fe5b857920e _wrap_gdd_putConvertString (_cas.so)
#11 0x000055f7f0c86da5 call_function (python2.7)
"strncpy" is the known-problematic string copy function, in this case used to copy strings into the EPICS GDD type used by the channel access server.
GDB backtraces of the core files show that the string being copied was always "seconds". The only place the string "seconds" is used in guardian is as the value of the "units" sub-record given to pcaspy for the EXECTIME and EXECTIME_LAST channels.
[0] systemd drop-in file used to run guardian under libasan python/pcaspy (~guardian/.config/systemd/user/guardian@.service.d/instrumented.conf):
[Service]
Type=simple
WatchdogSec=0
Environment=PYTHONPATH=/opt/python/pcaspy-0.7.1-asan/build/lib.linux-x86_64-2.7:/home/guardian/guardian/lib:/usr/lib/python2.7/dist-packages
Environment=ASAN_OPTIONS=abort_on_error=1:disable_coredump=0
ExecStartPre=
ExecStart=
ExecStart=/opt/python/python-2.7.13-asan/bin/python -u -t -m guardian %i
The discovery that the crash was caused by copying the string "seconds" in CAS led to the revelation about why the test setup had not been reproducing the crashes. The "units" sub-record is part of the EPICS DBR_CTRL_* records and is the only sub-record being used of string type. The test clients were only subscribing to the base records of all the guardian channels, not the DBR_CTRL records. MEDM, on the other hand, subscribes to the CTRL records. Guardian overview screens are open all over the control room, subscribing to all the CTRL records of all the production guardian nodes.
CTRL record subscriptions involve copying the "units" string sub-record, and therefore trigger the crashes. No CTRL record subscriptions, no crashes.
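For example, any CA client that asks for the CTRL form of a guardian channel pulls the units string along with it; here's a quick sketch with pyepics (my choice of client for illustration; MEDM does the equivalent internally, and the PV name below is a placeholder):

import epics

# Requesting the DBR_CTRL form (as MEDM does) fetches the "units" sub-record,
# which is the string copy that was triggering the use-after-free.
pv = epics.PV('H1:GRD-TEST_NODE_EXECTIME', form='ctrl')
print(pv.get(), pv.units)    # units would be 'seconds' for the EXECTIME channels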
So this all led us to take a closer look at how exactly guardian was using pcaspy.
The pcaspy documentation implies that pcaspy is thread safe. The package even provides a helper function that runs the server in a separate thread for you. The implication here is that running the server in a separate thread and pushing/pulling channel updates from/to a main thread into/out of the cas thread is safe to do. Guardian was originally written to run the pcaspy.Server in a separate thread explicitly because of this implication in the documentation.
The main surface for threading issues in the guardian usage of pcaspy was between client writes, which trigger pcaspy.Driver.setParams() and pcaspy.Driver.updatePVs() calls inside of pcaspy.Driver.write(), and status channel updates pushed from the main daemon thread into the driver, which also trigger updatePVs. At Jonathan's suggestion, all guardian interaction with the core pcaspy cas functions (Driver.setParams(), Driver.updatePVs()) was wrapped with locks. We were skeptical that this would actually solve the problem, though, since pcaspy itself provides no means to lock its internal reading of the updated PVs for shipment out over the EPICS CTRL records (initiated during pcaspy.Server.process()). And in fact this turned out to be correct; crashes persisted even after the locks were in place.
We then started looking into ways to get rid of the separate pcaspy thread altogether. The main daemon loop runs at 16 Hz. The main logic in the daemon loop takes only about 5 ms to run. This leaves ~57 ms to run Server.process(), which should be plenty of time to process the cas without slowing things down noticeably. Moving the CAS select processing into the dead time of the main loop forces the main loop to keep track of its own timing. This has the added benefit of allowing us to drop the separate clock thread that had been keeping track of timing, eliminating two separate threads instead of just one.
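To make the pattern concrete, here is a minimal single-threaded sketch assuming the standard pcaspy API (SimpleServer/Driver); the PV prefix and channel database are placeholders, not the actual guardian channels:

import time
from pcaspy import SimpleServer, Driver

CYCLE = 1.0 / 16        # 16 Hz main loop, as in the guardian daemon

# placeholder channel database; the real guardian pvdb is much larger
pvdb = {
    'EXECTIME': {'prec': 4, 'unit': 'seconds'},
}

class LoopDriver(Driver):
    def __init__(self):
        super(LoopDriver, self).__init__()

server = SimpleServer()
server.createPV('X1:GRD-TEST_', pvdb)
driver = LoopDriver()

while True:
    t0 = time.time()
    # ... ~5 ms of main loop logic would go here ...
    driver.setParam('EXECTIME', time.time() - t0)
    driver.updatePVs()
    # spend the remaining dead time servicing channel access in this same
    # thread, instead of handing it to a separate CAS thread
    remaining = CYCLE - (time.time() - t0)
    server.process(max(remaining, 0.001))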
So a patch was prepared to eliminate the separate CAS thread from guardian, and it was tested on about a half dozen nodes. No crashes were observed after a day of running (far exceeding the previous MTTF).
A new version of guardian was wrapped up and put into production, and we have seen no spontaneous segfaults in nearly a week. We will continue to monitor the system to confirm that the behavior has not been adversely affected in any way by the elimination of the CAS thread (full lock recovery would be the best test of that), but we're fairly confident the issue has been resolved.
pcaspy and CAS are not thread safe. This is the main takeaway. It's possible that guardian is the most intensive user of this library out there, which is why this has not been seen previously. I will report the issue to the EPICS community. We should be more aware of how we use this library in the future, and avoid running the server in a separate thread.
Multi-threading is tricky.
A quick followup about systemd-coredump. It's possible to control the expiration of the coredump files by dropping the following file into the file system:
root@h1guardian1:~# cat /etc/tmpfiles.d/00_coredump.conf
d /var/lib/systemd/coredump 0755 root root -
root@h1guardian1:~#
The final "-" tells systemd-tmpfiles to put no expiration on files in this directory. The "00" prefix is needed to make sure it always sorts lexically before any config that would set the expiration on this path (the system default is in /usr/lib/tmpfiles.d/systemd.conf).
The GN2 temp was running low at around 120C. I raised the variac from 60% to 64%. As we run the Dewar dry, we will try to maintain a temperature around 180C. Dewar is currently 43% full, with a ~7%/day consumption. Flow is 40-50 scfhx100.
I valved out the UHP GN2 from the turbo foreline; foreline pressure is back to 7.1e-3 Torr.
GN2 temps rising over 200C this afternoon, so I a) lowered the variac from 64% to 60% and b) opened the pressure build valve by another 1/8 turn + 1/4 turn (I believe it's open 1/2 turn total now). The Dewar head pressure has fallen today from 17 to 16 psig. The late afternoon sun adds a variable. Flow was low ~20 scfhx100.
Spring enabled the EE shop to work on setting up power for the LEMIs, and I had a look at the new signals. The top plot in the figure shows that we can see Schumann Resonances quite well, up to quite close to 60 Hz. The bottom two plots show some transient signals that might interfere with a feed-forward system.
It looks like the signals are degraded by wind. I am not surprised, because we see wind noise in buried seismometers. I think we would have this vibration problem even on a perfectly flat site because of the variation in Bernoulli forces associated with gusts. It may be that a LEMI signal is generated by the wind because of slight motions of the magnetometers in the earth's huge DC magnetic field. We buried the LEMIs about 18 inches deep (https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=29096). I think we might be able to mitigate the noise somewhat by going much deeper. Once we have the vault seismometer working, it would be a good project to test the wind vibration hypothesis by comparing the LEMI and seismic signals.
There also seem to be some transients, some long and some short, possibly self inflicted by our system. It would be good to look into which transients would be a problem, and for those, details such as whether they are correlated with time of day, the average time between transients, etc., in order to help determine their source.
Finally, I would like to get the full system calibrated by comparing to a battery powered fluxgate magnetometer.
[Pat Meyers, Andrew Matas] We attach a few additional plots studying the Schumann resonances. Figures 1,2 show spectrograms using 16 hours of data from April 18, where the Schumann resonances are clearly visible. There are also a few glitches. We also show coherence (Figure 3) and cross power (Figure 4) between the Hanford and Livingston LEMIs. The first two Schumann resonances at about 8 Hz and 14 Hz are coherent between the sites.
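For anyone who wants to reproduce the coherence/cross-power estimate, here is a rough sketch of the calculation; the sample rate, segment length, and the placeholder arrays standing in for the two LEMI time series are assumptions, and fetching the actual data is omitted:

import numpy as np
from scipy.signal import coherence, csd

fs = 128.0                                      # assumed common sample rate [Hz]
h_lemi = np.random.randn(int(16 * 3600 * fs))   # placeholder for the Hanford LEMI data
l_lemi = np.random.randn(int(16 * 3600 * fs))   # placeholder for the Livingston LEMI data

nperseg = int(100 * fs)                         # 100 s segments -> 0.01 Hz resolution, many averages
f, coh = coherence(h_lemi, l_lemi, fs=fs, nperseg=nperseg)
f, cross = csd(h_lemi, l_lemi, fs=fs, nperseg=nperseg)
# With real data the first two Schumann resonances (~8 Hz and ~14 Hz)
# show up as coherent peaks in coh and cross.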
We disabled the vault power on April 20th to upgrade the power supply; it will remain down until this afternoon.