Reports until 13:33, Tuesday 23 May 2023
H1 CDS
david.barker@LIGO.ORG - posted 13:33, Tuesday 23 May 2023 - last comment - 14:16, Tuesday 23 May 2023(69843)
CDS Maintenance Summary: Tuesday 23rd May 2023

WP11210 h1hwinj1 configuration and reboot test

Keith R, Keith T, Erik, Dave:

We have put h1hwinj1 under puppet configuration control. Monit is being used to monitor the status of psinject (CW hardware injector), and will restart it if it is missing. Systemd is used to start/stop the processes, via scripts which ramp the GAIN of the CAL_INJ_CW_EXC filter module over 10 seconds. To test the system works when h1h1inj1 is rebooted, this morning I rebooted using "systemctl reboot". Systemd correctly ramped the CW gain to zero before shutting down psinject. After the reboot monit started psinject correctly.

WP11211 Replace failing ADC h1seih16

Fil, TJ, Jim, Erik, Dave:

FRS27187

Fil replaced the second ADC in h1seih16. It had been raising ADC errors with increasing frequency (lately several times per day).

Procedure was: Jim safed the system, Dave fenced from Dolphin, stopped the models and powered down the computer. Fil powered down the AI chassis, then removed the old ADC and installed the new. Fil powered the IO Chassis up, Dave powered the computer up, Fil then powered the AI chassis up.

Old card (removed -  faulty) 110124-06
New card (installed - good) 211109-31

WP11205 Restart all digital video camera servers via monit web page

Jonathan, Patrick, Dave:

To test recent monit configuration changes, all the cameras on h1digivideo[0,1,2] were restarted using their monit web page controls.

Procedure was: open the camera image, via monit press the RESTART button, verify camera image goes blue-screen, then returns to original image.

Test was successful for all cameras on h1digivideo[0,1,2]

The camera control web page for the new camera servers on h1digivideo3 was added to the MEDM (see attachment). It was noted that the camera order on the web page did not match that on the MEDM, and that the STOP button did not work.

Jonathan made several fixes and improvements: Camera order matches, STOP/START buttons work, web page shows camera uptime (similar to monit web page).

WP11208 h1daqframes-0 ZPOOL Scrub

Dave:

I started a ZPOOL SCRUB on h1daqframes-0 at the start of maintenance. So far, so good. It is scheduled to complete early Sat.

  scan: scrub in progress since Tue May 23 09:58:19 2023
        18.7T scanned at 1.46G/s, 6.88T issued at 550M/s, 184T total
        0B repaired, 3.75% done, 3 days 21:32:05 to go

WP11219 Replace Dolphin IX Adapter card h1cdsrfm

EJ, Erik, Jonathan, Dave, TJ, Jim, Jeff, Control Room:

Last night we had to reboot several corner station systems because h1cdsrfm glitched the corner Dolphin fabric when reporting a problem with the IX card installed in its local chassis (i.e. not in End Station Adnaco mini-chassis). This had happened before in August 2022. We elected to replace this card as preventative maintenance before O4.

Because the card's serial number had changed, the Dolphin network manager had to be restarted on h1vmboot1. Unfortunately, as seen before, this caused h1susex, h1susey, h1seiey and h1iscey to hard lockup, requiring IPMI resets to reboot them. These system had to be recovered from their reboots by CDS and OPS.

WP11193 Upgrade DC power supplies in CER

Marc, Fil, Dave:

Marc and Fil continued the DC power supply upgrades in the CER. This required the power down of h1omc0, h1asc0 and h1lsc0 front ends because their IO Chassis were powered down.

The procedure was: fence FE from Dolphin, stop all models, power down computer. 

Prior to this I had installed the new h1ascsqzifo and h1sqz models, so then ASC and LSC came back they were running the new code. 

WP11206 ADD ADS block to SQZ

Daniel, Vicky, Dave:

New h1ascsqzifo and h1sqz models were installed on h1asc0 and h1lsc0 respectively. DAQ restart was required to add slow channels.

DAQ Restart

Dave:

The DAQ was restarted because of the sqz model changes detailed above.

fw0 restarted itself after running 6 minutes. We think this was a "standard" fw0 restart and not because of the ongoing scrub.

Both gds0 and gds1 needed restarts, this appears to be modus operandi these days.

 

Images attached to this report
Comments related to this report
jeffrey.kissel@LIGO.ORG - 13:54, Tuesday 23 May 2023 (69845)CAL, CDS, DetChar, ISC, OpsInfo, PEM, SEI, SUS, SYS
For context and consequences of the WP11219 Replace Dolphin IX Adapter card h1cdsrfm replacement:

Prep for potential Y-end station front end crashing: LHO:69826.
Report of crash: LHO:69832
SDF system loop hole that this reboot exposed with the ISI system: LHO:69835
david.barker@LIGO.ORG - 13:56, Tuesday 23 May 2023 (69847)

Tue23May2023
LOC TIME HOSTNAME     MODEL/REBOOT
08:33:32 h1seih16     ***REBOOT*** <<< replace 2nd ADC
08:35:03 h1seih16     h1iopseih16 
08:35:16 h1seih16     h1hpiham1   
08:35:29 h1seih16     h1hpiham6   
08:35:42 h1seih16     h1isiham6   


11:13:36 h1asc0       ***REBOOT*** <<< IO Chassis power supply work
11:13:57 h1lsc0       ***REBOOT***
11:15:11 h1asc0       h1iopasc0   
11:15:24 h1asc0       h1asc       
11:15:33 h1lsc0       h1ioplsc0   
11:15:37 h1asc0       h1ascimc    
11:15:42 h1omc0       ***REBOOT***
11:15:46 h1lsc0       h1lsc       
11:15:50 h1asc0       h1ascsqzifo <<< New model
11:15:59 h1lsc0       h1lscaux    
11:16:12 h1lsc0       h1sqz       <<< New model
11:16:25 h1lsc0       h1ascsqzfc  
11:17:09 h1omc0       h1iopomc0   
11:17:22 h1omc0       h1omc       
11:17:35 h1omc0       h1omcpi     


11:24:24 h1daqdc0     [DAQ] <<< DAQ-0 Restart for models
11:24:35 h1daqfw0     [DAQ]
11:24:35 h1daqtw0     [DAQ]
11:24:36 h1daqnds0    [DAQ]
11:24:44 h1daqgds0    [DAQ]
11:25:14 h1daqgds0    [DAQ] <<< 2nd GDS0


11:26:08 h1cdsrfm     ***REBOOT*** <<< cdsrfm Dolphin card replacement


11:29:59 h1daqfw0     [DAQ] <<< FW0 restarted itself


11:37:16 h1susey      ***REBOOT***  <<< Endstation recovery following Dolphin crash
11:37:59 h1seiey      ***REBOOT***
11:38:15 h1susex      ***REBOOT***
11:38:40 h1iscey      ***REBOOT***
11:39:20 h1susey      h1iopsusey  
11:39:33 h1seiey      h1iopseiey  
11:39:33 h1susey      h1susetmy   
11:39:46 h1seiey      h1hpietmy   
11:39:46 h1susey      h1sustmsy   
11:39:59 h1seiey      h1isietmy   
11:39:59 h1susey      h1susetmypi 
11:40:24 h1iscey      h1iopiscey  
11:40:25 h1susex      h1iopsusex  
11:40:37 h1iscey      h1pemey     
11:40:38 h1susex      h1susetmx   
11:40:50 h1iscey      h1iscey     
11:40:51 h1susex      h1sustmsx   
11:41:03 h1iscey      h1caley     
11:41:04 h1susex      h1susetmxpi 
11:41:16 h1iscey      h1alsey     


11:43:31 h1daqdc1     [DAQ] <<< DAQ-1 restart for models
11:43:39 h1daqfw1     [DAQ]
11:43:39 h1daqtw1     [DAQ]
11:43:41 h1daqnds1    [DAQ]
11:43:49 h1daqgds1    [DAQ]
11:44:52 h1daqgds1    [DAQ] <<< 2nd GDS1 restart
 
 

david.barker@LIGO.ORG - 14:02, Tuesday 23 May 2023 (69848)

202 slow channels add the DAQ frame, full list in attached file

 

Non-image files attached to this comment
jonathan.hanks@LIGO.ORG - 14:16, Tuesday 23 May 2023 (69851)
We should look at mitigations for the end-station crash.  There are at least two that we could try:

 * turning off the node-manager process on the end station machines
 * disabling the dolphin switch ports for the end station machines

We should try one or both of these next time we do something.