Ryan C, Dave:
Today's reboot of opslogin0 gave us an opportunity to stop Ryan's test version, which had been running on opslogin0, and start the production IOC on cdsioc0.
WP11210 h1hwinj1 configuration and reboot test
Keith R, Keith T, Erik, Dave:
We have put h1hwinj1 under puppet configuration control. Monit is being used to monitor the status of psinject (CW hardware injector) and will restart it if it is missing. Systemd is used to start/stop the processes, via scripts which ramp the GAIN of the CAL_INJ_CW_EXC filter module over 10 seconds. To test that the system works when h1hwinj1 is rebooted, this morning I rebooted using "systemctl reboot". Systemd correctly ramped the CW gain to zero before shutting down psinject. After the reboot, monit restarted psinject correctly.
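The gain-ramp idea above can be sketched as a simple linear ramp generator. This is an illustrative stand-in, not the actual shutdown script; the function name, step size, and start/stop values are assumptions (only the ~10 s duration and the CAL_INJ_CW_EXC GAIN target come from the log).

```python
def ramp_gain(start, stop, duration=10.0, step=0.1):
    """Yield (time, gain) pairs for a linear ramp.

    Mirrors the idea of ramping the CAL_INJ_CW_EXC filter module
    GAIN to zero over ~10 s before stopping psinject. In the real
    system each value would be written to the EPICS channel; here
    we just generate the schedule.
    """
    n = max(1, int(round(duration / step)))
    for i in range(n + 1):
        t = i * step
        gain = start + (stop - start) * i / n
        yield t, gain

# Example: ramp the CW excitation gain from 1.0 down to 0.0
ramp = list(ramp_gain(1.0, 0.0))
```

A shutdown script would walk this schedule, sleeping `step` seconds between writes, and only stop psinject once the final zero-gain point has been applied.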
WP11211 Replace failing ADC h1seih16
Fil, TJ, Jim, Erik, Dave:
Fil replaced the second ADC in h1seih16. It had been raising ADC errors with increasing frequency (lately several times per day).
Procedure was: Jim safed the system; Dave fenced the node from Dolphin, stopped the models, and powered down the computer. Fil powered down the AI chassis, then removed the old ADC and installed the new one. Fil powered the IO Chassis up, Dave powered the computer up, and Fil then powered the AI chassis up.
| Old card (removed - faulty) | 110124-06 |
| New card (installed - good) | 211109-31 |
WP11205 Restart all digital video camera servers via monit web page
Jonathan, Patrick, Dave:
To test recent monit configuration changes, all the cameras on h1digivideo[0,1,2] were restarted using their monit web page controls.
Procedure was: open the camera image, via monit press the RESTART button, verify camera image goes blue-screen, then returns to original image.
The test was successful for all cameras on h1digivideo[0,1,2].
The camera control web page for the new camera servers on h1digivideo3 was added to the MEDM (see attachment). It was noted that the camera order on the web page did not match that on the MEDM, and that the STOP button did not work.
Jonathan made several fixes and improvements: Camera order matches, STOP/START buttons work, web page shows camera uptime (similar to monit web page).
WP11208 h1daqframes-0 ZPOOL Scrub
Dave:
I started a ZPOOL SCRUB on h1daqframes-0 at the start of maintenance. So far, so good. It is scheduled to complete early Sat.
scan: scrub in progress since Tue May 23 09:58:19 2023
18.7T scanned at 1.46G/s, 6.88T issued at 550M/s, 184T total
0B repaired, 3.75% done, 3 days 21:32:05 to go
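As a sanity check, the "3 days 21:32:05 to go" estimate can be reproduced from the issued rate in the status output above. This is just back-of-envelope arithmetic on the quoted numbers, assuming zpool's usual binary units (TiB / MiB):

```python
# Numbers quoted by `zpool status` during the scrub
TOTAL_TIB = 184.0          # total pool data to scrub
ISSUED_TIB = 6.88          # data issued so far
ISSUED_RATE_MIB_S = 550.0  # current issue rate

# Remaining work divided by the issue rate gives the ETA
remaining_mib = (TOTAL_TIB - ISSUED_TIB) * 1024 * 1024
eta_s = remaining_mib / ISSUED_RATE_MIB_S
eta_days = eta_s / 86400
# eta_days comes out just under 3.95 days, consistent with the
# reported "3 days 21:32:05 to go" (zpool rounds its inputs)
```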
WP11219 Replace Dolphin IX Adapter card h1cdsrfm
EJ, Erik, Jonathan, Dave, TJ, Jim, Jeff, Control Room:
Last night we had to reboot several corner-station systems because h1cdsrfm glitched the corner Dolphin fabric while reporting a problem with the IX card installed in its local chassis (i.e. not in an end-station Adnaco mini-chassis). This had happened before, in August 2022. We elected to replace this card as preventative maintenance before O4.
Because the card's serial number had changed, the Dolphin network manager had to be restarted on h1vmboot1. Unfortunately, as seen before, this caused h1susex, h1susey, h1seiey and h1iscey to hard-lock, requiring IPMI resets to reboot them. These systems then had to be recovered from their reboots by CDS and OPS.
WP11193 Upgrade DC power supplies in CER
Marc, Fil, Dave:
Marc and Fil continued the DC power supply upgrades in the CER. This required the power down of h1omc0, h1asc0 and h1lsc0 front ends because their IO Chassis were powered down.
The procedure was: fence FE from Dolphin, stop all models, power down computer.
Prior to this I had installed the new h1ascsqzifo and h1sqz models, so when ASC and LSC came back they were running the new code.
WP11206 Add ADS block to SQZ
Daniel, Vicky, Dave:
New h1ascsqzifo and h1sqz models were installed on h1asc0 and h1lsc0 respectively. DAQ restart was required to add slow channels.
DAQ Restart
Dave:
The DAQ was restarted because of the sqz model changes detailed above.
fw0 restarted itself after running for 6 minutes. We think this was a "standard" fw0 restart and not caused by the ongoing scrub.
Both gds0 and gds1 needed second restarts; this appears to be their modus operandi these days.
For context and consequences of the WP11219 h1cdsrfm Dolphin IX adapter card replacement:
- Prep for potential Y-end station front-end crash: LHO:69826
- Report of crash: LHO:69832
- SDF system loophole that this reboot exposed with the ISI system: LHO:69835
Tue23May2023
LOC TIME HOSTNAME MODEL/REBOOT
08:33:32 h1seih16 ***REBOOT*** <<< replace 2nd ADC
08:35:03 h1seih16 h1iopseih16
08:35:16 h1seih16 h1hpiham1
08:35:29 h1seih16 h1hpiham6
08:35:42 h1seih16 h1isiham6
11:13:36 h1asc0 ***REBOOT*** <<< IO Chassis power supply work
11:13:57 h1lsc0 ***REBOOT***
11:15:11 h1asc0 h1iopasc0
11:15:24 h1asc0 h1asc
11:15:33 h1lsc0 h1ioplsc0
11:15:37 h1asc0 h1ascimc
11:15:42 h1omc0 ***REBOOT***
11:15:46 h1lsc0 h1lsc
11:15:50 h1asc0 h1ascsqzifo <<< New model
11:15:59 h1lsc0 h1lscaux
11:16:12 h1lsc0 h1sqz <<< New model
11:16:25 h1lsc0 h1ascsqzfc
11:17:09 h1omc0 h1iopomc0
11:17:22 h1omc0 h1omc
11:17:35 h1omc0 h1omcpi
11:24:24 h1daqdc0 [DAQ] <<< DAQ-0 Restart for models
11:24:35 h1daqfw0 [DAQ]
11:24:35 h1daqtw0 [DAQ]
11:24:36 h1daqnds0 [DAQ]
11:24:44 h1daqgds0 [DAQ]
11:25:14 h1daqgds0 [DAQ] <<< 2nd GDS0
11:26:08 h1cdsrfm ***REBOOT*** <<< cdsrfm Dolphin card replacement
11:29:59 h1daqfw0 [DAQ] <<< FW0 restarted itself
11:37:16 h1susey ***REBOOT*** <<< Endstation recovery following Dolphin crash
11:37:59 h1seiey ***REBOOT***
11:38:15 h1susex ***REBOOT***
11:38:40 h1iscey ***REBOOT***
11:39:20 h1susey h1iopsusey
11:39:33 h1seiey h1iopseiey
11:39:33 h1susey h1susetmy
11:39:46 h1seiey h1hpietmy
11:39:46 h1susey h1sustmsy
11:39:59 h1seiey h1isietmy
11:39:59 h1susey h1susetmypi
11:40:24 h1iscey h1iopiscey
11:40:25 h1susex h1iopsusex
11:40:37 h1iscey h1pemey
11:40:38 h1susex h1susetmx
11:40:50 h1iscey h1iscey
11:40:51 h1susex h1sustmsx
11:41:03 h1iscey h1caley
11:41:04 h1susex h1susetmxpi
11:41:16 h1iscey h1alsey
11:43:31 h1daqdc1 [DAQ] <<< DAQ-1 restart for models
11:43:39 h1daqfw1 [DAQ]
11:43:39 h1daqtw1 [DAQ]
11:43:41 h1daqnds1 [DAQ]
11:43:49 h1daqgds1 [DAQ]
11:44:52 h1daqgds1 [DAQ] <<< 2nd GDS1 restart
202 slow channels were added to the DAQ frame; full list in attached file.
We should look at mitigations for the end-station crash. There are at least two that we could try:
- turning off the node-manager process on the end-station machines
- disabling the Dolphin switch ports for the end-station machines
We should try one or both of these next time we do something like this.
DQ Shifter: Brennan Hughey
Summary:
At this point all maintenance activities have wrapped up. The last lingering task: as soon as SEI gives the OK, we will start alignment.
As part of the opslogin0 update, I updated the NoMachine (remote desktop) software on opslogin. You will see a notice about a certificate fingerprint change when you log in; this is a one-time thing.
Operating systems packages were updated on Opslogin0 and it was rebooted.
CDS and other LIGO packages including python libraries and control room tools that are distributed through conda were *not* updated.
J. Kissel, T. Shaffer, J. Warner:
Eight seconds after re-isolating the platform following today's Dolphin crash, the H1ISIETMY ST2 watchdog tripped. Currently identifying and debugging the problem...
As a result of the end-station Dolphin crash (LHO:69832), it became apparent that all of the settings from the new-ish blend filter fader system (LHO:69000) were lost, because those newish channels were not initialized in that model's safe.snap. Once we restored those settings by hand, we:
- went to FULLY_ISOLATED to confirm that the platform was functional again
- went to ISI_OFFLINE_HEPI_ON in order to bring the platform to its "just restarted the computer" safe state
- initialized, monitored, and accepted all the new settings.
We're not sure how this got missed, and Jim swears he did initialize them a few weeks ago. We'll investigate the history further, but the platform is now functional.
The card replacement that caused this front-end crash is documented in LHO:69843.
Using DTT with 0.001 Hz BW and 10 averages (.xml template attached), chi_XY (see LIGO-P2000113 for details) was calculated by hand during several quiet lock stretches of at least 2 hours during the past few days.
| Date | Start Time | chi_XY |
|---|---|---|
| 5/20/2023 | 08:00 | 1.00056 |
| 5/20/2023 | 09:00 | 1.00079 |
| 5/22/2023 | 11:10 | 1.00057 |
| 5/23/2023 | 21:00 | 1.00012 |
Our calculation of chi_XY in the front-end code needs some tuning of the demodulation and filtering parameters as can be seen in the attached plot. This will get some attention soon; it is the focus of a planned summer SURF student investigation here at LHO.
chi_XY compares the calibrations of the fiducial Pcal displacements induced at both end stations. Correction factors have been applied at each end station (H1:CAL-PCALX_XY_COMPARE_CORR_FACT = 0.9988 and H1:CAL-PCALY_XY_COMPARE_CORR_FACT = 1.0015) such that chi_XY should be close to unity. It appears that both Pcal systems are stable and operating as expected.
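For a quick summary of the table above, the hand-calculated chi_XY values can be averaged to quantify how close to unity the X/Y comparison sits. This is an illustrative recomputation using only the numbers quoted in this entry; the demodulation and filtering details live in the front-end model and are not reproduced here.

```python
# Correction factors applied at each end station (from this entry)
CORR_X = 0.9988  # H1:CAL-PCALX_XY_COMPARE_CORR_FACT
CORR_Y = 1.0015  # H1:CAL-PCALY_XY_COMPARE_CORR_FACT

# Hand-calculated chi_XY values from the quiet-lock table above
chi_xy_values = [1.00056, 1.00079, 1.00057, 1.00012]

mean_chi = sum(chi_xy_values) / len(chi_xy_values)
deviation_ppm = (mean_chi - 1.0) * 1e6
# mean_chi is ~1.00051, i.e. roughly 500 ppm above unity,
# small compared to the ~0.3% spread of the correction factors
```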
F. Mera, M. Pirello
Continuing with WP11193
7 - Installed segregated +24V run from VDC-C2 U34 RHS to SQZ-C1 for the OMC IO Chassis. This conductively isolates the OMC IO Chassis +24V power from the Beckhoff SQZ +24V power.
8 - Replaced VDC-C2 U24 RHS -18V Kepco which supplies -18V to ISC-C1 & ISC-C2.
H1 VDC Rack Drawing D2300167
This concludes WP11193
Tagging DetChar and CAL, for the improved electrical isolation of the h1omc0 IO chassis (which houses the isolated 524 kHz ADC card that's reading out the OMC DCPDs; the gravitational wave PDs) that comes from Marc / Fernando's execution of:
7 - Installed segregated +24V run from VDC-C2 U34 RHS to SQZ-C1 for the OMC IO Chassis. This conductively isolates the OMC IO Chassis +24V power from the Beckhoff SQZ +24V power.
We don't have "smoking gun" evidence from before the change, but hopefully after this day the number and amplitude of lines in the detector sensitivity will be reduced -- so, be aware, CW group!
Note that this is one of the last official parts of segregating the OMC IO chassis, a la (H1 only, thus far) ECR E2200441 and IIET Ticket 25756.
Some of the motivating history is in that FRS ticket, as well as its predecessor IIET Ticket 17846, where we identified mixing of unsynchronized FPGA clocks on all the ADC and DAC cards in the h1lsc0 chassis -- which cites Robert's initial findings in LHO:58313.
J. Kissel, E. von Reis, T. Shaffer, WP 11219
In prep for the suddenly needed replacement of the corner-station to end-station "CDS RFM" Dolphin card, which has a non-zero risk of crashing all end-station models, we:
- accepted the ETMX/TMSX and ETMY/TMSY alignment sliders in the SDF system
- brought the ETMX/TMSX and ETMY/TMSY SUS guardians to DAMPED
- brought the EX and EY SEI guardian managers to ISI_DAMPED_HEPI_OFFLINE
Good thing, too: as feared, the EX and EY SEI, SUS, and ISC end-station computers crashed as a result of this work.
The OMC is having trouble locking; the log keeps giving the same error message: "Didn't find enough peaks, resetting and trying again". The OMC TRANS camera flashes appear very off-center as well (bottom right). I was unsuccessful in trying to scan and lock by hand. I also tried a graceful clear-history, as suggested by the troubleshooting doc, since the OMC-ASC values were all stuck at high values; clearing the history reset them, but they went right back to being high a few minutes later.
It looks like the SRM came back up after the computer crash with some old SDF-saved alignment (as they all might, since we do not monitor, and therefore do not "Accept", in SDF very often). Resetting it back via large-ish slider moves restored it to a more recent locking alignment last night.
I have accepted these values in SDF for better luck next time, but we're chewing on how to ensure we are at the best alignment starting place after this type of computer reboot.
Patrick Thomas wrote a script that operators can use to restore sliders to a specific time in the past; there is a link to it and its GUI on the IFO align screen. We should probably add this to any instructions we have for operators on how to recover from a computer problem.
TJ, Rahul, Austin
Today we decided to remove the MONITOR_PAUSE state in the VIOLIN_DAMPING guardian and fold its code into the main damping state, DAMP_VIOLINS_FULL_POWER. The motivation for this was to avoid waiting in the monitor state and instead go directly into the damping state, where the damping settings are applied.
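The shape of that change can be sketched as follows. This is a minimal stand-in, not the actual h1 VIOLIN_DAMPING guardian code: the `GuardState` base class here is a local stub (the real Guardian framework's API differs), and the readiness check and settings method are hypothetical placeholders. It only illustrates folding the monitoring check into the damping state instead of pausing in a separate state first.

```python
class GuardState:
    """Local stub of a guardian-style state: main() runs once on
    entry, run() is polled until it returns True (state complete)."""
    def main(self):
        pass
    def run(self):
        return True

class DAMP_VIOLINS_FULL_POWER(GuardState):
    """Combined state: the check formerly done while waiting in
    MONITOR_PAUSE now happens on entry, immediately followed by
    applying the damping settings."""
    def __init__(self, ready=True):
        self.ready = ready            # placeholder readiness flag
        self.settings_applied = False
    def main(self):
        # formerly MONITOR_PAUSE: verify it is safe to damp
        if self.ready:
            self.apply_damping_settings()
    def apply_damping_settings(self):
        # placeholder for setting the violin-mode damping gains/filters
        self.settings_applied = True
    def run(self):
        # state is complete once the damping settings are in place
        return self.settings_applied
```

With the two states merged, a manager requesting DAMP_VIOLINS_FULL_POWER lands directly in damping rather than idling in a monitor state first.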
I have committed the above changes to the SVN.