T.J., Ed, Jim, Patrick, Dave:
The ALS_XARM node was reporting no connection to the EPICS channels H1:ALS-X_LOCK_ERROR_FLAG and H1:SYS-MOTION_Y_SHUTTER_A_OPEN. This was preventing any further IFO locking. Logging into h1guardian0 as user controls and issuing caget commands, we were able to get the value of H1:ALS-X_LOCK_ERROR_FLAG, but not of H1:SYS-MOTION_Y_SHUTTER_A_OPEN (caget reported the data as "not complete"). Both cagets complained that two IOCs were serving the data: the h1slow-h1fe EPICS gateway (which spans the slow controls 10.105 network and the front end 10.101 network) and the IOC itself, h1ecatx1. On the other hand, caget on a control room workstation could get both channels.
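To help reproduce this kind of host-dependent connection problem, a minimal check along the following lines could be run from both h1guardian0 and a control room workstation and the results compared. This is only a sketch using pyepics; the channels are the two named above, and the 5-second timeout is an arbitrary choice.

# Sketch: check CA connectivity for the channels involved in this problem.
# Run from each host (h1guardian0, a workstation) and compare the output.
import epics

channels = [
    "H1:ALS-X_LOCK_ERROR_FLAG",
    "H1:SYS-MOTION_Y_SHUTTER_A_OPEN",
]

for name in channels:
    pv = epics.PV(name)
    connected = pv.wait_for_connection(timeout=5.0)
    value = pv.get(timeout=5.0) if connected else None
    server = pv.host if connected else "n/a"   # which IOC/gateway answered
    print(f"{name}: connected={connected} value={value} server={server}")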
Since restarting the end-X Beckhoff slow controls system seemed the most impactful option, we followed this sequence:
reload ALS_XARM node
restart ALS_XARM node
restart h1slow-h1fe gateway
reboot h1guardian (then we remembered that we had decided that power cycles were preferred)
power-cycle h1guardian0
restart h1ecatx1 epics IOC*
* Patrick reminded us that the Beckhoff IOC could be restarted without the PLCs needing a restart, so this was far less impactful than anticipated. This ultimately fixed the problem.
h1guardian0 was scheduled for a reboot tomorrow to load patches, so this reboot was brought forward one day.
We still don't know why the h1ecatx1 IOC was answering all caget commands from the control room network, only some cagets from h1guardian0, and none from the guardian nodes.
Postscript: when we power-cycled h1guardian0 we kept the EPICS gateway off to ensure CA clients would connect directly to h1ecatx1 rather than through the gateway. The nodes did not start up the same way as after the previous reboot; this time more nodes failed to connect to their channels. The h1ecatx1 channels were in the list, but so were channels on h1hwsmsr (a 10.105 machine) and h1seib3 (on the same front end network). After about 15 minutes TJ reported that these channels had reconnected; the only thing we had done in the meantime was restart the EPICS gateway, or perhaps we simply waited long enough.
J. Kissel, B. Weaver
Work Permit: #5880
FRS Ticket: #5489
Integration Issue: FRS #5506
ECR: E1500045

As a band-aid solution to the ganged coil-driver state-switching lock losses during acquisition, recently identified to be caused by PRM (see LHO aLOG 27158), we're going to modify the M3 stage of the HSTS's library part to have individual control over coil switching, similar to what was done with the M2 stage of the Beam Splitter some time ago (see E1500045).

Because of the breakdown of suspension-type library parts, we made changes to
/opt/rtcds/userapps/release/sus/common/models/
    HSTS_MASTER.mdl
    MC_MASTER.mdl
and, instead of copying the un-library-parted BSFM_MASTER M2 COILOUTF bank, we created a new library part to be copied into the M3 stages of HSTS_MASTER and MC_MASTER, called
/opt/rtcds/userapps/release/sus/common/models/FOUROSEM_COILOUTF_MASTER_INDIVIDUAL_CTRL.mdl

We've confirmed that these updated models compile, and they've been committed to the userapps repo. We'll work on MEDM screens tomorrow morning, once we've compiled the updated models and we have the new channels. Some settings (namely the coil driver state settings) will be lost in the channel name switch, so we'll make sure to replace those in the SDF system accordingly. We'll also make our best attempt EVER at preserving the alignment of these SUS after the restart.
Here are some screen captures of the HSTS BIO and EULER2OSEM matrices, which will be incorporated into new MEDM screens tomorrow.
J. Kissel, B. Weaver, [J. Betzwiser remotely]

As usual, we only found bugs in the above SUS model changes for individual coil driver state switching after we'd finished making the MEDM screen modifications. Checking the coil driver compensation functionality after the model changes, we were immediately reminded that the STATE MACHINE logic for those HSTS stages which have modified TACQ drivers is different from that for unmodified stages. The Beam Splitter M2 stage -- which we'd used as the example for the new infrastructure -- uses an unmodified TACQ driver, whereas the recycling-cavity HSTSs (PRM, PR2, SRM, and SR2 -- at H1 at least) have both their M2 and M3 TACQ drivers modified for extra actuation range.

Once we reminded ourselves which suspensions have which drivers modified, we found it best to create a new library part in the /opt/rtcds/userapps/release/sus/common/models/STATE_BIO_MASTER.mdl library that is capable of individual coil state switching and uses the correct compensation switching code for the modified driver. It's obscure, but it's as simple as copying over the existing INDIVIDUAL_CTRL block, then changing
    inline TACQ $SUS_SRC/CD_STATE_MACHINE.c
to
    inline TACQ_M2 $SUS_SRC/CD_STATE_MACHINE.c
in each of the cdsFunctionCall blocks.

I've attached screenshots of the updated STATE_BIO_MASTER library and the innards of the parts inside. Note that I've
- changed the names of the library blocks associated with the TACQ drivers at the top level to better differentiate them.
- made the inline function call visible under the cdsFunctionCall block (right-click > Properties ... > Block Annotation tab > double-click on the "Description" block property token to add it to the list of things for annotation > hit OK) so that the difference in the code call is obvious without digging into the properties.

These new library blocks (and the renamed existing blocks for the unmodified driver) were hooked up to the HSTS master models according to their arrangement of TACQ driver modifications [for H1 ONLY; see E1400369 and E1200931]:
    HSTS_MASTER   MC1, MC3             No TACQ driver mods
    MC_MASTER     MC2                  M2 stage driver modified
    RC_MASTER     PRM, PR2, SRM, SR2   Both M2 and M3 stages modified

Also, for convenience, remember that you can find details of the modification in L1200226. In summary, the modified driver is a factor of 10 stronger than an unmodified driver (and there's no longer a frequency response in the output impedance network, which is why the digital compensation has to change). Also, the state machine diagrams that outline how the digital compensation works with the analog driver state can be found in T1100507.

These updated models have been committed to the sus/common/models/ directory of the userapps repo, so one need not do anything different from the above update instructions to receive the bug fix. Since the distinction between whether none, one, or two of an HSTS's TACQ drivers are modified is made at the top level of the model, i.e. by which of the HSTS_, MC_, or RC_MASTER blocks is used, there shouldn't be a need to change any top-level stuff at LLO to receive this model update. Just update that sus/common/models/ corner of the repo, then recompile, re-install, and re-start.
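As a rough illustration of that last "update, recompile, re-install" step, something like the following could be scripted per site. This is only a sketch: the userapps path, build directory, model names, and the svn/make invocations are assumptions about the usual RCG workflow and should be checked locally before use.

# Sketch only: update the repo and rebuild a list of suspension models.
# Paths, model names, and build commands are assumed, not a documented procedure.
import subprocess

USERAPPS_SUS_MODELS = "/opt/rtcds/userapps/release/sus/common/models"  # assumed path
BUILD_DIR = "/opt/rtcds/lho/h1/rtbuild"    # assumed H1 build directory
MODELS = ["h1susprm", "h1suspr2", "h1sussrm", "h1sussr2"]  # hypothetical model list

def run(cmd, cwd=None):
    print("+", " ".join(cmd))
    subprocess.run(cmd, cwd=cwd, check=True)

# 1. Pull the updated library parts from the userapps repo.
run(["svn", "update"], cwd=USERAPPS_SUS_MODELS)

# 2. Recompile and re-install each model (model restarts are done by hand).
for model in MODELS:
    run(["make", model], cwd=BUILD_DIR)
    run(["make", f"install-{model}"], cwd=BUILD_DIR)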
Tega, Ross, Jim, Dave:
We installed new code for the models h1susitmpi, h1susetmxpi and h1susetmypi. The new code required a DAQ restart, for which TEAM-PI obtained permission from TEAM-COMMISSIONING this afternoon.
The purpose of the change was to make the code more efficient and claw back some CPU time on the h1susitmpi model, which was running long (15-16 us against the roughly 15.3 us cycle budget of a 64k model). This was successful; it is now running in the 9-10 us range. ETMX is unchanged at 7 us, and there is a hint that ETMY is running one microsecond longer, from 3 us to 4 us.
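For context, the headroom numbers follow directly from the model rate; a quick back-of-the-envelope check looks like this, assuming all three PI models run at the 64k rate quoted for h1susitmpi.

# Quick arithmetic check of the CPU-time headroom quoted above.
# A "64k" model runs at 65536 Hz, so its cycle budget is 1/65536 s (~15.26 us).
budget_us = 1e6 / 65536

for label, runtime_us in [("h1susitmpi (before)", 16.0),
                          ("h1susitmpi (after)", 10.0),
                          ("h1susetmxpi", 7.0),
                          ("h1susetmypi", 4.0)]:
    frac = runtime_us / budget_us
    print(f"{label}: {runtime_us:.0f} us of {budget_us:.2f} us budget "
          f"({100*frac:.0f}% used)")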
Tega, Ross, Jim and Dave.

Detailed changes made to the PI models.

PI_MASTER:
1. Added the OMC_PI_MODE library part.
2. Replicated the functionality of the "SUS_PI_DAMP" block in "SUS_PI_COMPUTE" and "ETM_DRIVER".
3. Removed the down-conversion blocks from SUS_PI_DAMP to avoid unnecessary computation in the h1susitmpi model.
4. Renamed OMC_PI as OMC_PI_DOWNCONV to better reflect its functionality.
5. Rearranged the library parts so that Simulink blocks related to the OMC_DCPD are on the right, while blocks that process the QPD data are on the left.

h1susetmxpi:
1. Replaced the ETMX_PI_DAMP block with the new library parts: SUS_PI_COMPUTE (block name: ETMX_PI_DAMP) and ETM_DRIVER (block name: ETMX_PI_ESD_DRIVER).
2. Moved the down-conversion blocks out of ETMX_PI_DAMP into a single block at the top of the model.
3. Added OMC_DCPD data into the PI control path using a switch that takes either the processed signals from the QPDs (ETMX_PI_DAMP) or the processed signals from the OMC_DCPDs (ETMX_PI_OMC_DAMP).

h1susetmypi:
1. Replaced the ETMY_PI_DAMP block with the new library parts: SUS_PI_COMPUTE (block name: ETMY_PI_DAMP) and ETM_DRIVER (block name: ETMY_PI_ESD_DRIVER).
2. Moved the down-conversion blocks out of ETMY_PI_DAMP into a single block at the top of the model.
3. Changes needed to process OMC data are on hold for now.

h1susitmpi:
1. Updated the links for the ITMX_PI_DAMP and ITMY_PI_DAMP blocks to the new library part: SUS_PI_COMPUTE.

The attached images show the before & after snapshots for each model.
Now restored. Restarting the h1hwsmsr computer did the trick.
Here's more detail on what happened:
I came in this morning and noticed a connection error on DIAG_MAIN. I opened up the HWS ITMs code and saw that every channel had gone white. There was no restart in the morning that could have affected TCS. I power-cycled the MSR computer and was able to rerun the code.
Dave also mentioned that various computer crashes between 2 and 4 AM local time were normal during ER6. He didn't see this problem during O1. We also looked at the memory and CPU usage; nothing is overloaded.
We had some system problems over the weekend; here is a summary of the timeline. (These would appear to be unrelated events.)
The last issue caused Guardian problems with the ALS_XARM node. We did the following to try to fix it:
The power-up of h1guardian0 without the EPICS gateway gave even more CA connection errors, this time involving the HWS IOC and the h1seib3 FEC. These were very confusing, and they seemed to go away when the h1slow-h1fe EPICS gateway was restarted, which added to the confusion. We need to reproduce this error.
After Patrick restarted the h1ecatx1 IOC the guardian errors went away.
Rebooting the entire guardian machine just because one node was having a problem seems like extreme overkill to me. I would not recommend that as a solution, since it obviously kills all other guardian processes, causing them to lose their state and current channel connections. I don't see any reason to disturb the other nodes because one is having trouble. Any problem that would supposedly be fixed by rebooting the machine should also be fixed by simply killing and restarting the affected node process.
The actual problem with the node is not specified, but the only issue I know of that would cause a node to become unresponsive and immune to a simple "guardctrl restart" is the EPICS mutex thread lock issue, which has been reported at both LLO and LHO, in both cases with solutions that don't require rebooting the entire machine. Presumably the issue being reported here is somehow different? It would be good to have a better description of what exactly the problem was.
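For the record, restarting just the affected node rather than the whole machine is a one-liner; a sketch of scripting it is below. It assumes this runs on the guardian machine with guardctrl on the PATH, and that ALS_XARM is the node in question.

# Sketch: restart a single guardian node instead of rebooting the machine.
# Assumes guardctrl is available on the guardian machine.
import subprocess

node = "ALS_XARM"   # the affected node

# Kill and restart only this node; other nodes keep their state and connections.
subprocess.run(["guardctrl", "restart", node], check=True)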
See attached screenshot.
Restarted the IOC.
Forced PT100 Cold Cathode on in Beckhoff on h0vacly per Chandra's request (see attached). It is now reading ~1.01e-07.
Attached are HEPI Pump Trends for the past 45 days. To my untrained eye, I don't see any egregious excursions in pump pressures. SEI folks should review.
This completes FAMIS Request 4520.
The labeled HAM11 annulus IP (physically on HAM12) fell out of scale at around noon yesterday. The labeled HAM12 annulus IP (physically on HAM11) needs to be replaced, i.e. the HAM11 annulus IP is working hard. https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=26804
9am local 1/2 turn on LLCV bypass --> took 22 seconds to overfill. Lowered LLCV back to 20% (from 21%). Hot weather last week likely cause of long overfill times last Wed. and Fri. *watch exhaust pressure after tomorrow's Dewar fill
SEI - No major maintenance plans scheduled. Ongoing tweaking with BRS.
SUS - model changes for HSTS scheduled for tomorrow.
VAC - GV measurements and manual CP3 overfill scheduled for tomorrow.
CDS - Cables to be pulled for ITM ESD tomorrow. Auto restart of workstations to take place tomorrow morning.
PSL - the team sees no pressing reason to go into the enclosure at this time, other than possibly to do some DBB aligning tomorrow.
*CP3 Dewar fill from Norco truck tomorrow
The laser was off this morning. The chiller indicated that there was a "Flow sensor 1" error.
Looking at the data from the various flow sensors, maybe, just maybe, the problem is with the flow sensor attached to the auxiliary circuit (which monitors flows to the power meters ...). The attached plot seems to imply that the flow to the power meters dropped before the crystal chiller flow declined. We would need to check that the power meters are really attached to this part of the cooling circuit, because the diode chiller was running this morning. For reference, the cooling system information is under https://dcc.ligo.org/DocDB/0067/T1100373/002/LIGO-T1100373-v2%20Coolant%20system%20operating%20and%20maintenance%20manual.pdf
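If it helps with the next occurrence, a trend comparison like the attached plot could be pulled programmatically. The sketch below uses the nds2 client; the flow-sensor channel names and the GPS time are placeholders (I don't know the exact PSL chiller channel names), so they would need to be replaced with the real ones.

# Sketch: fetch and compare flow-sensor trends around the time of the trip.
# Channel names and GPS times below are placeholders, not the real channels.
import nds2

start = 1145000000          # placeholder GPS start time
stop = start + 4 * 3600     # four hours of data

channels = [
    "H1:PSL-AUX_FLOW_SENSOR",       # hypothetical: auxiliary (power meter) circuit flow
    "H1:PSL-CRYSTAL_CHILLER_FLOW",  # hypothetical: crystal chiller flow
]

conn = nds2.connection("nds.ligo-wa.caltech.edu", 31200)
buffers = conn.fetch(start, stop, channels)

for name, buf in zip(channels, buffers):
    print(name, "min:", buf.data.min(), "max:", buf.data.max())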
1:30pm local 1/2 turn open on LLCV bypass --> took 33:17 min. to overfill CP3. Raised LLCV from 20% to 21%.
It looks like we need to do something about the triple coil drivers that we switch in lock, especially PRM M3. We have lost lock a good fraction of the times that we have tried to switch in the last month or so. Screenshot is attached, I also filed an FRS ticket hoping that someone might make an attempt to tune the delays while people are in the PSL tomorrow morning. FRS ticket #5489
Is the DAC spectrum post-switch using up a large fraction of the range? If the noise in the PRC loop has changed a little bit, it could make this transition less risky.
Here's the PRM drive from last night's lock, in which we just skipped the PRM M3 BIO switching, leaving the low pass off (BIO state 1). It seems like we should have plenty of room to turn the low pass on without saturating the DAC.
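One way to quantify "plenty of room" the next time this comes up would be to compare the peak drive counts against the DAC limit directly. This is only a sketch: the channel name is a guess at a PRM M3 master output, the GPS time is a placeholder, and the +/- 2^17 count limit assumes an 18-bit DAC on this driver.

# Sketch: estimate what fraction of the DAC range the PRM M3 drive is using.
# Channel name, GPS time, and the 18-bit DAC limit are all assumptions.
import nds2

DAC_LIMIT = 2**17            # assumed 18-bit DAC: +/- 131072 counts
channel = "H1:SUS-PRM_M3_MASTER_OUT_UL_DQ"   # hypothetical channel name

start = 1145000000           # placeholder GPS time during last night's lock
stop = start + 600

conn = nds2.connection("nds.ligo-wa.caltech.edu", 31200)
buf = conn.fetch(start, stop, [channel])[0]

peak = abs(buf.data).max()
print(f"{channel}: peak {peak:.0f} counts, {100*peak/DAC_LIMIT:.1f}% of DAC range")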
FRS Ticket #5489 closed in favor of a long-term fix Integration Issue #5506 (which is now also in the FRS system) and for an eventual ECR. Work permit #5880 indicates we'll install the fix tomorrow. See LHO aLOG 27223.