h1build had stopped running (208.5 day bug?). I rebooted it by pressing the front panel reset button and started all the Beckhoff SDF systems.
[Rich A, Calum T, Luis S, Craig C, Koji A, Georgia M]
We have begun testing and characterising the electric field meter (EFM) brought over from Caltech. The first photo shows the EFM with the sensing plates shorted. All of this activity took place in the optics lab.
We used the SR785 to take noise spectra with the plates grounded, and of the ambient electric field (or acoustic environmental noise) for the X and Y plates. These spectra are shown in the first plot; the Y data have 25 averages while the X data have one. The grounded spectra are in agreement (the 60 Hz harmonics are also present in the blue X trace), which makes sense; however, I'm not sure why there is a significant difference between the X and Y ambient spectra.
We then tuned the digital potentiometers, using the Arduino and Luis' code, to optimise the common mode rejection between the X+ and X- plates and between the Y+ and Y- plates. To do this we attached the calibration plates over the sensor plates (but electrically isolated) and drove the common mode with a 1 V signal. This configuration is shown in the second photo. The transfer functions to the common mode, and to a single plate, are attached for the Y plates. (The X data did not save properly and we did not have time to go back and get it. If I remember correctly we had ~47 dB of common mode rejection on the X plates, but please correct me if this is wrong.) We achieved roughly 60 dB of common mode rejection between the Y plates at 1 kHz, though this gets worse at higher frequencies.
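For the record, the ~60 dB figure is just the ratio of the single-plate transfer function to the common-mode transfer function expressed in dB. A minimal Python sketch of that calculation is below; the frequencies and transfer function values are placeholders, not the measured data.

import numpy as np

# Placeholder SR785 transfer function magnitudes at a few frequencies:
# response to a single-plate drive vs. response to the common-mode drive.
freq = np.array([100.0, 300.0, 1e3, 3e3, 1e4])          # Hz (placeholder)
tf_single = np.array([1.0, 1.0, 1.0, 0.9, 0.8])          # single-plate response (placeholder)
tf_common = np.array([2e-3, 1.5e-3, 1e-3, 3e-3, 8e-3])   # common-mode response (placeholder)

# Common-mode rejection in dB at each frequency.
cmr_db = 20.0 * np.log10(np.abs(tf_single) / np.abs(tf_common))

for f, c in zip(freq, cmr_db):
    print(f"{f:8.0f} Hz : {c:5.1f} dB")   # ~60 dB at 1 kHz for these placeholder values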
Good that you got the device together so fast. The noise curves for X should be lower and the ones for Y do not make sense. I am wondering about the calibration for the electric field. You will need to take account of the copper button. The field between the calibration plate and the button will be higher than in the region between the plates outside of the button; the ratio of the fields in the two regions will be the inverse of the ratio of the gap spacings. One will need to estimate the induced charge on the sense plate by summing the surface charge density times area over the two regions with different gap spacing. If this is not done, the field sensitivity of the device will be estimated as too high.
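To make the point concrete, here is a small Python sketch of the suggested correction, using a simple parallel-plate approximation; all of the dimensions below are made-up placeholders, not the actual EFM geometry.

EPS0 = 8.854e-12   # F/m

# Placeholder geometry (not measured values).
V_cal     = 1.0      # V, drive applied to the calibration plate
d_button  = 1.0e-3   # m, gap between calibration plate and copper button
d_outside = 4.0e-3   # m, gap between calibration plate and sense plate outside the button
A_button  = 2.0e-4   # m^2, area of the copper button
A_sense   = 2.0e-3   # m^2, total sense plate area

# Parallel-plate fields: E = V/d, so the two fields are in the inverse ratio
# of the gap spacings, as noted above.
E_button  = V_cal / d_button
E_outside = V_cal / d_outside
A_outside = A_sense - A_button

# Induced charge = sum of (surface charge density x area) over the two regions.
Q_induced = EPS0 * (E_button * A_button + E_outside * A_outside)

# Naive estimate that ignores the button (uniform gap d_outside everywhere).
Q_naive = EPS0 * (V_cal / d_outside) * A_sense

print(f"correction factor Q_induced / Q_naive = {Q_induced / Q_naive:.2f}")
# A factor > 1 means a uniform-gap calibration would overestimate the field
# sensitivity of the device, as warned above.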
We retook the noise spectra: the ambient Y trace is fixed, the SR785 noise has been added, and the number of averages is now consistent (15).
FAMIS 6945

The following signals appear high from visual inspection:
BS_ST1_CPSINF_H1_I
BS_ST2_CPSINF_H1_I
BS_ST2_CPSINF_V1_I
ITMY_ST1_CPSINF_V2
ITMY_ST2_CPSINF_V2
HAM3_CPSINF_V1

The script reports:
ETMX_ST2_CPSINF_H1 high freq noise is high!
ETMX_ST2_CPSINF_H2 high freq noise is high!
ETMX_ST2_CPSINF_H3 high freq noise is high!
ETMX_ST2_CPSINF_V1 high freq noise is high!
ETMX_ST2_CPSINF_V2 high freq noise is high!
ETMX_ST2_CPSINF_V3 high freq noise is high!
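For context, a minimal Python sketch of the kind of band-limited check the script performs is below. This is only an illustration: the channel name, sample rate, band and threshold are placeholders, and this is not the actual FAMIS script.

import numpy as np
from scipy.signal import welch

FS        = 512.0          # Hz, assumed CPS channel sample rate (placeholder)
HF_BAND   = (10.0, 100.0)  # Hz, "high frequency" band to check (placeholder)
THRESHOLD = 1e-2           # placeholder reference ASD level

def hf_noise_is_high(timeseries, fs=FS, band=HF_BAND, threshold=THRESHOLD):
    """Return True if the median ASD in `band` exceeds `threshold`."""
    freqs, psd = welch(timeseries, fs=fs, nperseg=int(fs * 8))
    asd = np.sqrt(psd)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return np.median(asd[in_band]) > threshold

# Example with synthetic white noise:
data = np.random.normal(scale=0.5, size=int(FS * 600))
if hf_noise_is_high(data):
    print("ETMX_ST2_CPSINF_H1 high freq noise is high!")   # message format from the script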
TVo, Sheila, Terra
In prep for HAM6 work tomorrow (for which we'll need single bounce X and Y), we relocked IMC without a problem and then aligned single bounce(s) onto AS_C. We should be set for tomorrow.
- - - - -
We started by trying slider values from here. Once we had some light on AS_C, we found the following:
This looked like we were aligned to the misaligned ITMX state, so we stepped down the ITMX M0 test offsets while compensating with PR2, effectively aligning to the ITMX aligned state. At that point we had flashing power at AS_C; this turned out to be flashing MICH, since the ITMY misaligned state was not misaligned enough. We then increased the M0 offsets in the respective misaligned state until there was no light on AS_C. In short, the nominal offsets and those we found for full misalignment in single bounce are:
|       | ITMX nominal | ITMX MICH | ITMY nominal | ITMY MICH |
| Pitch | 50           | 200       | 50           | 600       |
| Yaw   | -80          | -200      | 600          | -400      |
Final alignment values here. Note we did not change ITMX, BS, PR3 slider values.
We spent a bit of time trying to align REFL; we could get some power onto REFL_B, but not _A, before we decided to move on.
- - - - -
Side note: we tried to go to PRX_LOCKED in IFO_ALIGN but a user code error was thrown. We tried poking around, but it seems like the error is deeper in the set_sus_config function. Since we didn't really need guardian tonight or tomorrow, we're leaving it for later. There was some recent work done on this node, as mentioned here.
Re: the guardian error, it just looks like the SRM node wasn't managed, and so INIT in the ALIGN_IFO node needed to be run. I don't want to test to confirm this right now, however, since HAM6 alignment work is beginning. I'll try to test it when they come out for lunch break, or something.
(Mark, Tyler, Gerardo, Kyle, Chandra, Bubba)
Ion pump 11 and chevron baffle assembly were removed from the beam tube and inspected.
The copper gasket removed from the gate valve/chevron baffle joint was visually inspected; it showed a good seal.
The copper gasket removed from the chevron baffle/ion pump joint was also visually inspected; it too showed a good seal.
Then the baffle housing (SN#001) was visually inspected. Right away Mark found a hole in one of the welds (see photo); Bubba, Tyler and I confirmed the spot. Something to note are the stains on the metal from the isopropanol, which was sprayed on the exterior of the chevron baffle housing and seeped towards the inside of the housing. We sprayed the alcohol while trying to determine the location of the leak.
Chandra found a suitable replacement baffle housing (SN002). The chevron baffle assembly was transferred to its new housing and assembled per procedure. All items were re-assembled and, using a blank instead of the gate valve, the entire assembly was bench leak tested. We did a coarse helium spray; no detection was noted. Then a bag was used such that each CFF joint was exposed while keeping the baffle housing's vacuum welds within the bagged volume. The bag volume was purged with bottled N2 while spraying helium around each CFF joint; there was no significant response. Then the procedure was reversed and helium was used to saturate the bagged volume, which yielded no response. No leaks were detected on this assembly.
To do next, vent IP11 volume with nitrogen to get it ready for install on the beam tube.
Cheryl, TVo, Daniel
The TCSX and TCSY CO2 laser verbal alarms were triggered even though the lasers were off, because the controller box was powered down, causing the flowmeter readback to reach negative values. Both controller boxes were turned on and the flow rate reached nominal values, but the verbal alarms still persisted. To silence the verbal alarms we raised the alarm threshold for the CO2 laser powerhead PD from 5 watts to 10 watts. This cut out most of the alarms, but the attached figure shows that the laser powerhead PD channels still glitch above the threshold.
ISI code change to ramp the DAC drive down when the watchdog trips, rather than going instantaneously to zero volts. Dave composed this and I added comments in red text (reproduced in [square brackets] here). Hugh and Dave, 10 April 2018.

WP7462 to implement ECR E1800026, addressing II 9889, with details explained by Stocks & Lantz in T1800031.

Summary: we attempted to test the new ISO_RAMP code, but were unable to reach that point due to filter modules outputting very large signals. We abandoned the test and restored the code and hardware back to their original configuration.

----------------------------------------------------------------------------

Hugh elected to test the new code on h1isietmy. Using the latest isi/common/models/isi2stagemaster.mdl (r17117), which uses the new isi/common/models/cushion_DOF.mdl (r17116) and the associated C code isi/common/src/RAMP_ISO_OFF.c (r17084), we compiled and installed h1isietmy. Following a C code review by Jonathan, it was expected that the model would have problems of a divide-by-zero nature if the ramp time is zero. Our first test was to execute the if-conditional block which would run the divide-by-zero lines.

While Hugh was manually causing watchdog trips, he noticed that the ISI DAC channels were railing negatively to -10.0 Volts. This was tracked to the ST1 and ST2 OUTF filters having a railed output of 1.0e+20 with a zero input. These filter modules have two filters installed: Comp and Sym. We found that the Sym filter (a simple gain stage) was causing the output to shoot to the rail on a WD trip. It was not obvious to us why this was happening. The rail could be removed by clearing the FM history, and could be prevented by turning off the Sym filter. [Are you sure? Hugh's not sure if we ever prevented it...] After further setup work, Hugh discovered that the ST[1,2] RZ T240SUBTRACT_Z filter modules were also railing at 1.0e+20.

Due to the violent motion caused by the ISI DAC outputs instantaneously driving to -10.0 V each time the watchdog was tripped, we needed to stop the DAC output from driving the DAC cards. Our first attempt was to SWWD trip the output of the h1iopseiey model. However, the ISI code and/or guardian monitors this, which prevented our test from proceeding. [Could not get the guardian to run with the IOP (software WD) tripped.] We eventually had to go to EY and power off the anti-imaging (AI) chassis for the ISI DAC. We could then green up all the software watchdogs and still be sure that no sudden motion was being driven into the seismic stack.

Hugh spent a further hour trying to get the isolation ramp code to execute, but was thwarted by the misbehaving filter modules. [Same drive to 1e+20 counts & -10 V.] We decided to abandon the tests as maintenance time was running out, and to revert the code back to what it was.

I reverted the svn version of isi/common/models/isi2stagemaster.mdl from r17117 (05apr2018) to r16398 (31oct2017), and isi/common/models/isihammaster.mdl from r17118 (05apr2018) to r16422 (06nov2017). [Thank you Dave!] I rebuilt h1isietmy and we were happy to see the DAQ INI file return to its original content. We verified that the checksum of the INI file matched the original. When the model was restarted, the DAQ-ini-mismatch error was cleared, so no DAQ restart was needed. Hugh then damped ISI-ETMY with no re-occurrence of the 1.0e+20 FM outputs. Finally, using the ISI Guardian (no SEI Manager), the stages isolated stably.
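For reference, here is a minimal Python sketch of the kind of ramp-down logic the ISO_RAMP change is meant to provide, including a guard against the zero-ramp-time divide-by-zero the code review flagged. This is only an illustration; the deployed implementation is the C code in isi/common/src/RAMP_ISO_OFF.c, and the function and values below are made up for the example.

def ramp_to_zero(current_output, elapsed, ramp_time):
    """Scale the drive toward zero over `ramp_time` seconds after a watchdog trip.

    Illustration only -- the deployed logic lives in isi/common/src/RAMP_ISO_OFF.c.
    The guard below avoids dividing by zero when the requested ramp time is zero:
    in that case the output simply drops to zero at once.
    """
    if ramp_time <= 0.0:
        return 0.0                               # instantaneous shutoff, no division
    scale = max(0.0, 1.0 - elapsed / ramp_time)  # linear ramp from 1 to 0
    return current_output * scale

# Example: a 0.1 s ramp sampled at a few times after the trip.
for t in (0.0, 0.05, 0.1, 0.2):
    print(t, ramp_to_zero(100.0, t, 0.1))        # 100.0, 50.0, 0.0, 0.0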
As far as the code problem goes -- I don't have it to look at now since Dave reverted it, but I suspect some X, Y, Z, etc. DOF wires got crossed. I don't know if that explains the railing FMs, but it would not help.
I'm trying to reproduce the issue on our test-stand but I can not.
The first test shows the behavior I expect: I set the filter module for stage 1 X to just be a constant (100), with gain = 1 and the filters turned off, so that the output of the filter module is 100.
Then I trip the watchdog and compare the signal at the filter output to the signal at the output of the ramp stage. I do this twice: once with a ramp time of 0, and once with a ramp time of 0.1 sec. In the two attached plots one can see that the red filter module output drops immediately to 0, while the ramp output (blue triangles) either drops immediately to 0 or ramps to 0 in 0.1 seconds. I do not see a big glitch. I wonder what the heck is going on - will follow up with Hugh ASAP.
-Brian
Hey Brian,
We never even got to try the smooth ramp as we could not get the system isolated. We did not think to try a fake isolation state as I think you are indicating.
Hi Hugh, I did a quick test just now. In the isi2stagemaster model, I went through both the stage 1 and stage 2 paths to ensure that there were no "crossed wires." That is, I went through each DOF in the isolation bank and ensured that an input signal appeared as it should in the final output of each stage (a 1-count offset in ST1 X appeared solely as a 1-count output in ST1_OUTF H1, etc.). (Note - we set the cart2act matrix to be the identity matrix for our convenience in this test, so X goes to H1, etc.) For the stage 1 path I ensured this also held true for the ST1_ST2_DRIVE_COMP. This indeed held true in the updated model, the one with the changes Brian and I made. Thus we think that the problem lies elsewhere, possibly in the way that the model was started on the H1 machines; it is my understanding that the IOP model was not restarted after grabbing the updated model from the svn. However, if you have any further suggestions for us to investigate, please send them our way.
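For what it's worth, the check described above can be written down compactly: with cart2act set to the identity, drive one Cartesian DOF at a time and confirm the signal shows up only in the corresponding actuator output. Below is a minimal Python/numpy sketch of that test logic; the DOF/actuator ordering and the stand-in propagation function are assumptions for the example, not the actual model path.

import numpy as np

def check_no_crossed_wires(propagate, dofs, acts, tol=1e-9):
    """Apply a 1-count offset to one DOF at a time and confirm it appears
    only in the corresponding actuator output.

    `propagate` stands in for the end-to-end model path (isolation bank ->
    cart2act -> OUTF). With cart2act set to the identity, X should map to H1,
    Y to H2, and so on.
    """
    for i, dof in enumerate(dofs):
        offset = np.zeros(len(dofs))
        offset[i] = 1.0                     # 1-count offset on this DOF only
        out = propagate(offset)
        hot = [acts[j] for j, v in enumerate(out) if abs(v) > tol]
        assert hot == [acts[i]], f"{dof} leaked into {hot}"
        print(f"{dof:>2} -> {acts[i]} only: OK")

# Example with a stand-in path (identity matrix, i.e. no crossed wires).
DOFS = ["X", "Y", "RZ", "Z", "RX", "RY"]   # ordering assumed for illustration
ACTS = ["H1", "H2", "H3", "V1", "V2", "V3"]
check_no_crossed_wires(lambda v: np.eye(6) @ v, DOFS, ACTS)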
Summary by building:
Detailed Activity List:
Today, Travis, Mark and I worked on installing the ETMX lower structure into the chamber. After it was in, while reassembling cables, we noticed that the test mass was really jammed into its lower stops. Upon further inspection, we discovered that both fibers on the left side (viewed from the back of the suspension) were gone. We started looking for the debris and found it just outside of the chamber on the floor and on the trolley that the unit had been sitting on overnight and this morning. We did not find anything like a bug nearby, but the investigation is ongoing. Meanwhile, we're revising the schedule to back up a few steps and start the rebuild. History of the lower structure after it was welded:
- The welded main chain unit sat on its trolley in the weld room for about a week before we were ready to install it.
- Yesterday, we rolled the covered main chain on its trolley ~60 ft from the weld room cleanroom to the staging cleanroom. We then uncovered it, lifted it with the Genie duct jack, placed it on the reaction chain trolley, and mated the 2 units together. The fibers were intact at this point. We then covered it again and rolled the whole unit ~10 ft from the staging cleanroom to the chamber cleanroom and parked it near the door. We left the cover on it and the Genie duct jack parked around it with the forks loose around the structure to aid in protection.
- This morning we uncovered the unit, lifted it off the trolley with the Genie lift, and set it up on the install arm elevator. We then continued with the installation of the unit into the chamber with the arm. A few hours later we discovered the broken fibers.
All of the above is standard install procedure, with the same equipment (trolley, cover, Genie lift, install arm) used in each of the previous QUAD installations.
Since we found the fiber debris all over the trolley and on the floor outside of the chamber, the break must have happened between late yesterday afternoon, when we rolled it there, and this morning, when we picked it up and set it into the install arm.
We believe that the fibers broke in the morning during the lift out of the LSAT with the Genie duct jack, since we recall specifically NOT seeing debris under the trolley before that. There is a fair amount of jostling that happens to the suspension during this, as the suspension needs to pull free from 8 supporting legs (with 8 nut bars and 8 screws) which are a tight fit around the structure, just above the upper fiber joint. The masses were locked into position; however, the test mass was locked with most of the load still on the fibers. Since the lower EQ stops are viton tipped, they must have compressed more during this maneuver (or the compression added up over the course of the previous maneuvers) and the load on the fibers became more than 100%. We will go back to utilizing the rail stops under the mass (D060446), a tooling piece abandoned early on due to interferences and difficulty of use (the tooling for the stops we adapted instead has proved sufficient numerous times since, but it only takes one failure).
Attached are pictures of the ear and horns on the PUM and test mass. The PUM horns appear to be very short now, so we will take the opportunity to replace the PUM, especially since this is the one with the crack behind the prism, which may not survive a round of fiber welding.
I'm prepping PUM-ITM03 in the bonding lab. Yesterday the magnet/flag inserts and one prism were glued into place. Today the second prism will be glued, followed by an overnight low-temperature out-gassing air bake.
WP 7464
Removed the CPS satellite box for inspection of the PCB DB25 connector, alog 41265. The pads on the PCB are not being stressed. The only thing to note is that the connector is not sitting flush to the board. The unit was reinstalled.
F. Clara
And the platform isolated with no issues.
WP 7465
Daniel, Dave, Filiberto, Patrick, Richard
I stopped the EPICS IOC that was running on h0epics and moved the DB9 serial cable from the Comtrol to Port 1 of the serial concentrator. This port connects to channel 1 of the Corner MSR L12_13 EL6002. I added the code for the readout of the weather station to PLC1 on h1ecatc1. No channel names were changed. Channels have been added to report errors; these should be added to the MEDM screens. The code will automatically reconnect if communication is interrupted, for example if the weather station is power-cycled. The alarm levels for the wind speed have been put in the database, so a burtrestore is no longer required to set these either. I also took the opportunity to update the TCS chiller code. It seems to be working without issue.
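As an aside, the reconnect behavior described above amounts to a retry loop around the serial readout. The deployed code is TwinCAT PLC code on h1ecatc1, not a script, but a rough Python illustration of the idea is below; the port name and serial settings are placeholders.

import time
import serial   # pyserial, used here only to illustrate the pattern

PORT = "/dev/ttyUSB0"   # placeholder -- the real readout goes through the EL6002 terminal
BAUD = 9600             # placeholder serial settings

def read_weather_forever():
    """Keep reading the weather station, reopening the link if it drops."""
    while True:
        try:
            with serial.Serial(PORT, BAUD, timeout=2.0) as link:
                while True:
                    line = link.readline()
                    if not line:
                        raise serial.SerialException("no data (station power-cycled?)")
                    print(line.decode(errors="replace").strip())
        except serial.SerialException as err:
            # The PLC code would instead flag this on the new error channels.
            print(f"link lost ({err}); retrying in 5 s")
            time.sleep(5.0)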
Kyle R., Chris S.
Today Chris S. and I took an FTIR sample from the closed gate of the 12.8" GV that isolates IP11 (VAT s/n 0002, see also https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=41353). Additionally, we opened a new 12.8" GV (in clean room, VAT s/n 0005) and took a comparable swab from the same side of its closed gate. In both cases, the swiped area was ~20% of the exposed gate surface area.
Spun up vertex turbo right before lunch at ~0.6 Torr. Still need to close purge/vent isolation valves (OMC and IMC) and IP3 GV and little right angle valve.
Kyle closed the IMC & OMC purge valves and I closed the right angle metal valve on top of IP3 GV. Leak checker is connected to vertex turbo cart, valved out, and warming up for leak checking tomorrow or whenever He bottle is freed up from EY.
At ~2256 UTC on 3 April (precisely 1206831398 GPS), all the CPS sensor readouts on both stages jumped to different values. Eight of the twelve spiked to a number around 33k before settling somewhere else. A few on Stage 2 are not at crazy numbers but have still shifted. Most on Stage 1 are extreme, around positive 8k. See attached.
Another item to be investigated is the Coil Driver Thermal trip that hit at 1557 UTC on the 5th...
Wasted a bit of time power cycling the CPS power and the sensor interface chassis. Found the CPS Timing Sync fanout chassis power supply unplugged; I guess it got swept up in the VE group's house cleaning that was found to be necessary. While the sensors don't read exactly what they did before, they are at least much more reasonable. And the platform will isolate.
Today, Travis and I finished assembling the ETMX lower reaction chain. We assembled the PenRe mass in the structure, attaching the balance of parts to make up the weight from last week, including AOSEMs, cables, and cable routing brackets. We then installed the new annular AERM07 mass below the PenRe in the lower structure and adjusted it in its 6 DOFs by eye before suspending it. We then reclamped everything and rolled the main lower chain up to it. A quick lift via the Genie duct jack and we set the main lower structure and newly fiber'ed masses into the trolley with the reaction set. We added the UIM and PUM magnets/flags (set with polarities per the QUAD controls poster), and then shoved the 2 structures together and fastened them.
We wrapped the now complete lower unit and rolled it to the chamber side. After attaching some of the LSAT lifting blocks, we staged the Genie duct jack so that we are ready to install it on the arm and into the chamber tomorrow.
Note - Travis noticed a door bolt sitting in the BSC9 flange in the unused hole at 3 o'clock. The washer from the bolt (which held nothing) looked to be rubbing oddly on the o-ring, so he removed it. The vacuum crew will need to inspect this after we remove the install arm. (Note: the install arm and stiffening flanges cover most of the right side of the holes and o-ring on this chamber, except for a ~6" gap at 3 o'clock. There are o-ring covers everywhere else. This bolt should not have been in this hole, since it served no purpose.)
From Chandra and Kyle who inspected this o-ring portion yesterday late afternoon:
Kyle and I inspected the o-ring and deemed it ok to reuse. Kyle peeled away a sliver of viton material that was hanging off of the o-ring. To the naked eye the surface looks ok. We will take note of this when we install the door and pump down the annulus volume.
I have moved a subset of guardian nodes to the new configuration on h1guardian1. This is to try to catch more of the segfaults we were seeing during the last upgrade attempt, which we have not been able to reproduce in testing.
The nodes should function normally on the new system, but given what we saw before, we expect to see segfaults with a mean time to failure of about 100 hours. I will be babysitting the nodes on the new setup, and will restart them as soon as they crash.
The nodes that have been moved to the new system are all the SUS and SEI nodes in the input chambers, BS, and the arms. No nodes from HAM4, HAM5, or HAM6 were moved. Full list of nodes now running on h1guardian1:
jameson.rollins@opsws12:~ 0$ ssh guardian@h1guardian1 list HPI_BS HPI_ETMX HPI_ETMY HPI_HAM1 HPI_HAM2 HPI_HAM3 HPI_ITMX HPI_ITMY ISI_BS_ST1 ISI_BS_ST1_BLND ISI_BS_ST1_SC ISI_BS_ST2 ISI_BS_ST2_BLND ISI_BS_ST2_SC ISI_ETMX_ST1 ISI_ETMX_ST1_BLND ISI_ETMX_ST1_SC ISI_ETMX_ST2 ISI_ETMX_ST2_BLND ISI_ETMX_ST2_SC ISI_ETMY_ST1 ISI_ETMY_ST1_BLND ISI_ETMY_ST1_SC ISI_ETMY_ST2 ISI_ETMY_ST2_BLND ISI_ETMY_ST2_SC ISI_HAM2 ISI_HAM2_SC ISI_HAM3 ISI_HAM3_SC ISI_ITMX_ST1 ISI_ITMX_ST1_BLND ISI_ITMX_ST1_SC ISI_ITMX_ST2 ISI_ITMX_ST2_BLND ISI_ITMX_ST2_SC ISI_ITMY_ST1 ISI_ITMY_ST1_BLND ISI_ITMY_ST1_SC ISI_ITMY_ST2 ISI_ITMY_ST2_BLND ISI_ITMY_ST2_SC SEI_BS SEI_ETMX SEI_ETMY SEI_HAM2 SEI_HAM3 SEI_ITMX SEI_ITMY SUS_BS SUS_ETMX SUS_ETMY SUS_IM1 SUS_IM2 SUS_IM3 SUS_IM4 SUS_ITMX SUS_ITMY SUS_MC1 SUS_MC2 SUS_MC3 SUS_PR2 SUS_PR3 SUS_PRM SUS_RM1 SUS_RM2 SUS_TMSX SUS_TMSY jameson.rollins@opsws12:~ 0$
NOTE: Until the new system has been put fully into production, "guardctrl" interaction with these nodes on h1guardian1 is a bit different. To start/stop the nodes, or get status or view the logs, you will need to send the appropriate guardctrl command to guardian@h1guardian1 over ssh, e.g.:
jameson.rollins@opsws12:~ 0$ ssh guardian@h1guardian1 status SUS_BS ● guardian@SUS_BS.service - Advanced LIGO Guardian service: SUS_BS Loaded: loaded (/usr/lib/systemd/user/guardian@.service; enabled; vendor preset: enabled) Drop-In: /home/guardian/.config/systemd/user/guardian@.service.d └─timeout.conf Active: active (running) since Sun 2018-04-08 14:48:47 PDT; 1h 53min ago Main PID: 24724 (guardian SUS_BS) CGroup: /user.slice/user-1010.slice/user@1010.service/guardian.slice/guardian@SUS_BS.service ├─24724 guardian SUS_BS /opt/rtcds/userapps/release/sus/common/guardian/SUS_BS.py └─24745 guardian-worker SUS_BS /opt/rtcds/userapps/release/sus/common/guardian/SUS_BS.py Apr 08 14:48:50 h1guardian1 guardian[24724]: SUS_BS executing state: ALIGNED (100) Apr 08 14:48:50 h1guardian1 guardian[24724]: SUS_BS [ALIGNED.enter] Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS REQUEST: ALIGNED Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS calculating path: ALIGNED->ALIGNED Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS same state request redirect Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS REDIRECT requested, timeout in 1.000 seconds Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS REDIRECT caught Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS [ALIGNED.redirect] Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS executing state: ALIGNED (100) Apr 08 16:01:45 h1guardian1 guardian[24724]: SUS_BS [ALIGNED.enter] jameson.rollins@opsws12:~ 0$
A couple of the SEI systems did not come back up to the same states they were in before the move. This caused a trip on ETMY HPI, and ETMX ISI_ST1. I eventually recovered everything back to the states they were in at the beginning of the day.
The main problem I've been having is with the ISI_*_SC nodes. They all are supposed to be in the SC_OFF state, but a couple of the nodes are cycling between TURNING_OFF_SC and SC_OFF. For instance, ISI_ITMY_ST2_SC is showing the following:
2018-04-09_00:00:39.728236Z ISI_ITMY_ST2_SC new target: SC_OFF
2018-04-09_00:00:39.729272Z ISI_ITMY_ST2_SC executing state: TURNING_OFF_SC (-14)
2018-04-09_00:00:39.729667Z ISI_ITMY_ST2_SC [TURNING_OFF_SC.enter]
2018-04-09_00:00:39.730468Z ISI_ITMY_ST2_SC [TURNING_OFF_SC.main] timer['ramping gains'] = 5
2018-04-09_00:00:39.790070Z ISI_ITMY_ST2_SC [TURNING_OFF_SC.run] USERMSG 0: Waiting for gains to ramp
2018-04-09_00:00:44.730863Z ISI_ITMY_ST2_SC [TURNING_OFF_SC.run] timer['ramping gains'] done
2018-04-09_00:00:44.863962Z ISI_ITMY_ST2_SC EDGE: TURNING_OFF_SC->SC_OFF
2018-04-09_00:00:44.864457Z ISI_ITMY_ST2_SC calculating path: SC_OFF->SC_OFF
2018-04-09_00:00:44.865347Z ISI_ITMY_ST2_SC executing state: SC_OFF (10)
2018-04-09_00:00:44.865730Z ISI_ITMY_ST2_SC [SC_OFF.enter]
2018-04-09_00:00:44.866689Z ISI_ITMY_ST2_SC [SC_OFF.main] SENSCOR_Y_IIRHP FMs:[4] is not in the correct configuration
2018-04-09_00:00:44.866988Z ISI_ITMY_ST2_SC [SC_OFF.main] USERMSG 0: SENSCOR_Y_IIRHP FMs:[4] is not in the correct configuration
2018-04-09_00:00:44.927099Z ISI_ITMY_ST2_SC JUMP target: TURNING_OFF_SC
2018-04-09_00:00:44.927619Z ISI_ITMY_ST2_SC [SC_OFF.exit]
2018-04-09_00:00:44.989053Z ISI_ITMY_ST2_SC JUMP: SC_OFF->TURNING_OFF_SC
2018-04-09_00:00:44.989577Z ISI_ITMY_ST2_SC calculating path: TURNING_OFF_SC->SC_OFF
2018-04-09_00:00:44.989968Z ISI_ITMY_ST2_SC new target: SC_OFF
2018-04-09_00:00:44.991117Z ISI_ITMY_ST2_SC executing state: TURNING_OFF_SC (-14)
2018-04-09_00:00:44.991513Z ISI_ITMY_ST2_SC [TURNING_OFF_SC.enter]
2018-04-09_00:00:44.993546Z ISI_ITMY_ST2_SC [TURNING_OFF_SC.main] timer['ramping gains'] = 5
2018-04-09_00:00:45.053773Z ISI_ITMY_ST2_SC [TURNING_OFF_SC.run] USERMSG 0: Waiting for gains to ramp
Note that the problem seems to be that it's failing a check for the SENSCOR filter banks being in the correct state once SC_OFF has been achieved. Here are the nodes that are having problems, and the messages they're throwing:
ISI_HAM2_SC [SC_OFF.main] SENSCOR_GND_STS_Y_FIR FMs:[1] is not in the correct configuration
ISI_HAM3_SC [SC_OFF.main] SENSCOR_GND_STS_Y_FIR FMs:[1] is not in the correct configuration
ISI_BS_ST2_SC [SC_OFF.main] SENSCOR_Y_IIRHP FMs:[4] is not in the correct configuration
ISI_BS_ST1_SC [SC_OFF.main] SENSCOR_GND_STS_Y_WNR FMs:[6] is not in the correct configuration
ISI_ITMX_ST2_SC [SC_OFF.main] SENSCOR_Y_IIRHP FMs:[4] is not in the correct configuration
ISI_ITMY_ST2_SC [SC_OFF.main] SENSCOR_Y_IIRHP FMs:[4] is not in the correct configuration
ISI_ETMY_ST1_SC [SC_OFF.main] SENSCOR_GND_STS_Y_WNR FMs:[6] is not in the correct configuration
I've tried to track down where exactly the problem is coming from, but haven't been able to figure it out yet. It looks like the expected configuration just does not match how the banks are currently set. I will need to consult with the SEI folks tomorrow to sort this out. In the meantime, I'm leaving all of the above nodes paused.
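For reference, the check that is failing is (conceptually) a comparison between the filter modules currently engaged in a bank and the list expected from the configuration file. A rough Python sketch of that kind of comparison is below; the bitmask decoding and names are illustrative assumptions, not the actual SEI guardian code.

def engaged_fms(swstat):
    """Decode a SWSTAT-style bitmask into a list of engaged FM numbers (1-10).

    Illustration only: the bit layout here is an assumption for the example.
    """
    return [fm for fm in range(1, 11) if swstat & (1 << (fm - 1))]

def check_configuration(bank_name, swstat, expected_fms):
    """Report a mismatch in the same style as the SC node messages above."""
    current = engaged_fms(swstat)
    if sorted(current) != sorted(expected_fms):
        print(f"{bank_name} FMs:{expected_fms} is not in the correct configuration "
              f"(currently engaged: {current})")
        return False
    return True

# Example: the configuration expects only FM4 engaged, but FM4 and FM5 are on.
check_configuration("SENSCOR_Y_IIRHP", swstat=0b11000, expected_fms=[4])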
A note on the SC nodes:
Since these new SC nodes are still in a bit of a testing phase, I don't think all of the filters that will be used are in the configuration file. One way we could get around this, until the config file is set exactly how the SEI group wants it, is to remove the check temporarily. I'm hesitant to remove it entirely, but that might be best, since as it stands the check doesn't allow for any testing of new filters.
As of 7:50 am this morning (after I restarted 5 nodes last night):
HPI_BS              enabled  active  2018-04-08 14:48:46-07:00
HPI_ETMX            enabled  active  2018-04-08 14:48:05-07:00
HPI_ETMY            enabled  active  2018-04-08 15:55:18-07:00
HPI_HAM1            enabled  active  2018-04-08 14:42:32-07:00
HPI_HAM2            enabled  failed  2018-04-08 14:40:54-07:00
HPI_HAM3            enabled  active  2018-04-08 14:41:09-07:00
HPI_ITMX            enabled  active  2018-04-08 14:48:05-07:00
HPI_ITMY            enabled  failed  2018-04-08 14:48:05-07:00
ISI_BS_ST1          enabled  active  2018-04-08 14:49:54-07:00
ISI_BS_ST1_BLND     enabled  active  2018-04-08 15:38:51-07:00
ISI_BS_ST1_SC       enabled  active  2018-04-08 16:16:57-07:00
ISI_BS_ST2          enabled  failed  2018-04-08 14:49:54-07:00
ISI_BS_ST2_BLND     enabled  active  2018-04-08 15:38:51-07:00
ISI_BS_ST2_SC       enabled  active  2018-04-08 16:16:56-07:00
ISI_ETMX_ST1        enabled  failed  2018-04-08 14:47:10-07:00
ISI_ETMX_ST1_BLND   enabled  active  2018-04-08 15:34:22-07:00
ISI_ETMX_ST1_SC     enabled  active  2018-04-08 16:16:57-07:00
ISI_ETMX_ST2        enabled  active  2018-04-08 14:47:10-07:00
ISI_ETMX_ST2_BLND   enabled  active  2018-04-08 15:34:21-07:00
ISI_ETMX_ST2_SC     enabled  active  2018-04-08 16:16:56-07:00
ISI_ETMY_ST1        enabled  active  2018-04-08 14:47:10-07:00
ISI_ETMY_ST1_BLND   enabled  active  2018-04-08 15:37:45-07:00
ISI_ETMY_ST1_SC     enabled  active  2018-04-08 16:16:57-07:00
ISI_ETMY_ST2        enabled  active  2018-04-08 14:47:10-07:00
ISI_ETMY_ST2_BLND   enabled  failed  2018-04-08 15:37:45-07:00
ISI_ETMY_ST2_SC     enabled  active  2018-04-08 16:16:57-07:00
ISI_HAM2            enabled  active  2018-04-08 14:40:54-07:00
ISI_HAM2_SC         enabled  failed  2018-04-08 16:16:57-07:00
ISI_HAM3            enabled  failed  2018-04-08 14:41:09-07:00
ISI_HAM3_SC         enabled  active  2018-04-08 16:16:56-07:00
ISI_ITMX_ST1        enabled  active  2018-04-08 14:47:10-07:00
ISI_ITMX_ST1_BLND   enabled  active  2018-04-08 23:17:14-07:00
ISI_ITMX_ST1_SC     enabled  active  2018-04-08 16:16:57-07:00
ISI_ITMX_ST2        enabled  active  2018-04-08 23:17:14-07:00
ISI_ITMX_ST2_BLND   enabled  failed  2018-04-08 15:33:53-07:00
ISI_ITMX_ST2_SC     enabled  active  2018-04-08 16:16:57-07:00
ISI_ITMY_ST1        enabled  active  2018-04-08 14:47:10-07:00
ISI_ITMY_ST1_BLND   enabled  active  2018-04-08 15:33:12-07:00
ISI_ITMY_ST1_SC     enabled  failed  2018-04-08 16:16:57-07:00
ISI_ITMY_ST2        enabled  active  2018-04-08 14:47:10-07:00
ISI_ITMY_ST2_BLND   enabled  failed  2018-04-08 15:33:37-07:00
ISI_ITMY_ST2_SC     enabled  active  2018-04-08 16:16:57-07:00
SEI_BS              enabled  active  2018-04-08 14:48:45-07:00
SEI_ETMX            enabled  active  2018-04-08 14:47:26-07:00
SEI_ETMY            enabled  active  2018-04-08 23:17:13-07:00
SEI_HAM2            enabled  active  2018-04-08 14:40:53-07:00
SEI_HAM3            enabled  active  2018-04-08 14:41:08-07:00
SEI_ITMX            enabled  failed  2018-04-08 14:47:26-07:00
SEI_ITMY            enabled  active  2018-04-08 14:47:26-07:00
SUS_BS              enabled  active  2018-04-08 23:17:13-07:00
SUS_ETMX            enabled  active  2018-04-08 14:47:27-07:00
SUS_ETMY            enabled  active  2018-04-08 14:47:27-07:00
SUS_IM1             enabled  active  2018-04-08 14:05:38-07:00
SUS_IM2             enabled  failed  2018-04-08 14:05:38-07:00
SUS_IM3             enabled  failed  2018-04-08 14:05:38-07:00
SUS_IM4             enabled  failed  2018-04-08 14:05:38-07:00
SUS_ITMX            enabled  active  2018-04-08 14:47:27-07:00
SUS_ITMY            enabled  active  2018-04-08 23:17:13-07:00
SUS_MC1             enabled  active  2018-04-08 23:17:14-07:00
SUS_MC2             enabled  active  2018-04-08 14:40:11-07:00
SUS_MC3             enabled  failed  2018-04-08 14:40:11-07:00
SUS_PR2             enabled  failed  2018-04-08 14:40:11-07:00
SUS_PR3             enabled  failed  2018-04-08 14:40:11-07:00
SUS_PRM             enabled  active  2018-04-08 14:40:11-07:00
SUS_RM1             enabled  failed  2018-04-08 13:45:41-07:00
SUS_RM2             enabled  active  2018-04-08 13:45:48-07:00
SUS_TMSX            enabled  active  2018-04-08 14:53:45-07:00
SUS_TMSY            enabled  active  2018-04-08 14:53:46-07:00
Including the five nodes I restarted last night, that's 23 segfaults out of 68 nodes in roughly 18 hours, or about a 6-hour MTTF. That's a much higher failure rate than we saw previously. I'm reverting all nodes back to h1guardian0.
All nodes have been reverted back to h1guardian0
I have attached a pdf with a breakdown of stack traces and pids, so that we can see what the causes of the failures were.