Displaying report 1-1 of 1.
Reports until 16:31, Sunday 11 May 2025
H1 CDS
david.barker@LIGO.ORG - posted 16:31, Sunday 11 May 2025 - last comment - 09:21, Monday 12 May 2025(84348)
h1susb123 crash, SWWD shutdown of h1seib1,2,3

At 15:36:55 Sun 11may2025 PDT all models on h1susb123 stopped due to an ADC/Timing error.

h1susb123 DMESG output:

[Sun May 11 15:36:49 2025] rts_cpu_isolator: LIGO code is done, calling regular shutdown code
[Sun May 11 15:36:49 2025] h1iopsusb123: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susitmpi: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susitmy: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susbs: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susitmx: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
(diskless)controls@h1susb123:~$ 
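
For a quick summary of which models reported the fault, the kernel-log lines quoted above can be filtered with a short script. This is only a sketch: the regular expression is an assumption fitted to the line layout shown in this report, and the function name `models_with_adc_timeout` is hypothetical rather than part of any RTS tooling.

```python
import re

# Assumed layout, fitted to the lines quoted in this report:
#   [timestamp] source: ERROR - message
LINE_RE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\s+(?P<source>\w+):\s+ERROR - (?P<msg>.*)$"
)

def models_with_adc_timeout(dmesg_text):
    """Return the sources that logged an ADC timeout, in order of appearance."""
    models = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and "ADC timeout" in match.group("msg"):
            models.append(match.group("source"))
    return models

# The dmesg lines copied from this report.
SAMPLE = """\
[Sun May 11 15:36:49 2025] rts_cpu_isolator: LIGO code is done, calling regular shutdown code
[Sun May 11 15:36:49 2025] h1iopsusb123: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susitmpi: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susitmy: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susbs: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
[Sun May 11 15:36:49 2025] h1susitmx: ERROR - An ADC timeout error has been detected, waiting for an exit signal.
"""

print(models_with_adc_timeout(SAMPLE))
```

The rts_cpu_isolator line is skipped because it carries no ERROR tag, so the result lists the IOP model and the four user models.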
 

Images attached to this report
Comments related to this report
david.barker@LIGO.ORG - 16:45, Sunday 11 May 2025 (84349)

If I run "sudo lspci", the command freezes for about 30 seconds before eventually returning. While it is frozen, the EDC disconnects from the h1susb123 EPICS channels.

This is an 8-core W2245 machine with 5 models, so 3 general purpose cores. The IOC disconnect suggests the lspci command is taking over all 3 cores.
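
To put a number on the freeze rather than estimating it by eye, the command can be wrapped in a timer. The sketch below is illustrative only: `time_command` and `general_purpose_cores` are hypothetical helper names, the core count assumes one isolated core per real-time model (as the arithmetic above implies), and the demo runs a harmless Python no-op instead of the PCI listing command.

```python
import subprocess
import sys
import time

def time_command(cmd, timeout_s=120.0):
    """Run cmd with output discarded and return (elapsed_seconds, return_code)."""
    start = time.monotonic()
    result = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout_s,
    )
    return time.monotonic() - start, result.returncode

def general_purpose_cores(total_cores, realtime_models):
    """Cores left for general use, assuming one isolated core per model."""
    return total_cores - realtime_models

if __name__ == "__main__":
    # Harmless stand-in; on the front end this would be ["sudo", "lspci"].
    elapsed, rc = time_command([sys.executable, "-c", "pass"])
    print(f"command took {elapsed:.2f} s (return code {rc})")
    print(f"general purpose cores: {general_purpose_cores(8, 5)}")
```

Running the same wrapper before and after the power cycle would give comparable numbers for the two cases.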

Interestingly, h1iopsusb123 is not in a DACKILL state: its DAC drives are non-zero, frozen at their last values (see attached).

The IPC senders have stopped updating, causing IPC receive errors downstream and SEI SWWD trips.

Images attached to this comment
david.barker@LIGO.ORG - 17:02, Sunday 11 May 2025 (84350)

The system has been recovered by power cycling the front-end computer; no IO Chassis restart was needed.

Procedure:

  • Fence h1susb123 from Dolphin fabric
  • Stop all models
  • Power down h1susb123, wait a few minutes
  • Power h1susb123 back up using IPMI

"rtcds showcards" sees all the ADC/DAC/BIO cards in the IO Chassis. 

"sudo lspci" completes immediately, with no 30 second freeze.

david.barker@LIGO.ORG - 17:03, Sunday 11 May 2025 (84351)

I have untripped the SWWDs (SUS and SEI) for BSC1,2,3.

I have not untripped any user model watchdogs; I will hand over to the control room for that.

thomas.shaffer@LIGO.ORG - 19:14, Sunday 11 May 2025 (84352)

I've brought the BS, ITMX, and ITMY suspensions back to Aligned after untripping the user WDs. I also untripped the SEI for those and watched them go back to nominal. All looks good.

david.barker@LIGO.ORG - 08:12, Monday 12 May 2025 (84355)
david.barker@LIGO.ORG - 09:21, Monday 12 May 2025 (84358)

Sun11May2025
LOC TIME HOSTNAME     MODEL/REBOOT
16:52:48 h1susb123    ***REBOOT***
16:55:12 h1susb123    h1iopsusb123
16:55:25 h1susb123    h1susitmy   
16:55:38 h1susb123    h1susbs     
16:55:51 h1susb123    h1susitmx   
16:56:04 h1susb123    h1susitmpi 
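
The timings in the restart log can be turned into offsets with a few lines of Python. This is a sketch under stated assumptions: the crash time is the 15:36:55 figure quoted at the top of this report, all times are local on 11 May 2025, and the variable names are illustrative only.

```python
from datetime import datetime

DAY = "2025-05-11"
CRASH = "15:36:55"  # model stop time quoted at the top of the report

# (local time, model or reboot marker), copied from the restart log above
RESTART_LOG = [
    ("16:52:48", "***REBOOT***"),
    ("16:55:12", "h1iopsusb123"),
    ("16:55:25", "h1susitmy"),
    ("16:55:38", "h1susbs"),
    ("16:55:51", "h1susitmx"),
    ("16:56:04", "h1susitmpi"),
]

def _parse(hms):
    """Parse an HH:MM:SS string as a datetime on DAY."""
    return datetime.strptime(f"{DAY} {hms}", "%Y-%m-%d %H:%M:%S")

def seconds_between(start_hms, end_hms):
    """Whole seconds from start_hms to end_hms on the same day."""
    return int((_parse(end_hms) - _parse(start_hms)).total_seconds())

reboot_time = RESTART_LOG[0][0]

# Seconds from the reboot entry to each model start.
offsets = {name: seconds_between(reboot_time, t) for t, name in RESTART_LOG[1:]}

# Seconds from the crash to the last model start.
outage_s = seconds_between(CRASH, RESTART_LOG[-1][0])

for name, offset in offsets.items():
    print(f"{name:<14} +{offset} s after reboot")
print(f"crash to last model start: {outage_s} s")
```

With these inputs the IOP model starts 144 s after the reboot entry, the user models follow at 13 s intervals, and the span from the crash to the last model start is 4749 s, about 1 h 19 min.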
