WP11088 RCG5.1.4 Upgrade
Erik, Jonathan, EJ, Dave
All the front end systems were rebooted against h1vmboot1, which had been upgraded in place to RCG5.1.4. The systems which had been pre-upgraded last Friday (h1susaux[h2,h34,h56,ex,ey], h1pem[mx,my], h1build) were not rebooted today.
First we booted h1susauxb123 to upgrade the EDC, followed by a DAQ restart to sync h1susauxb123's channels. At this point all models had bad DAQ data except for the pre-upgraded systems and susauxb123/EDC.
We powered down h1cdsrfm.
During the switch upgrade work the EY Dolphin'ed front ends and h1susex had crashed, so we next rebooted EY, followed by EX.
Moving on to the corner station, we rebooted PSL first while the end stations were coming back up, then SUS, then SEI and then ISC. Surprisingly, soon after h1susb123 was running it was Dolphin glitched again while SEI was being rebooted. We tracked this down to an IX Dolphin port disable which failed because of a missing network connection. This was quickly fixed, and tested by rebooting h1oaf0 after h1susb123 had been recovered.
Next h1cdsrfm was rebooted, followed by h1ecatmon0.
DAQ RCG5.1.4 Upgrade
Jonathan
After the front end upgrade to RCG5.1.4 was complete, Jonathan upgraded the DAQ.
An interesting point: the DAQ was restarted twice today, and none of the GDS machines needed a second restart to sync their channel lists.
EY HEPI Pump Controller Recovery
Jim, Fil, Dave:
The EY HEPI pump controller network port went offline Friday evening. This morning the EY purple-box controller was power cycled and came back up in a fully working state. It had most probably been running continuously since 30 Mar 2021, when the original EY box was replaced with the EX box (surplused after EX was upgraded to Beckhoff on 30 Jan 2018). Note that we did not change the IP address of the EX box when it was moved, so the EY HEPI pump controller is still called h1hpipumpctrlex (10.105.0.62).
WP11063, 11093 MSR Network Switch Upgrade
Jonathan, Tony, Erik, Dave:
The last few aLIGO era Netgear switches were replaced with a Brocade 3-unit stack.
One of the /ligo file cluster nodes had its IPMI connected to one of the old switches, and so the cluster was put into maintenance mode (so it would not attempt to fence) while this IPMI cable was moved.
The new switches were temporarily installed adjacent to the old switch, which greatly reduced the disconnect time when ethernet cables were moved from the old to the new switches. Most hosts did not see any network loss.
The main issue found was that when h1hypervisor0 (which hosts h1vmboot1, the boot server) was moved, there was a familiar Dolphin crash of h1susey, h1seiey, h1iscey and h1susex. These machines were going to be rebooted with the RCG upgrade so this was not a problem.
After the RCG upgrade, the /opt/rtcds file system exceeded 90% usage. This is due to the creation of target_archive directories. I have compressed these directories to free up space, but the change will not take effect until the ZFS snapshots referencing the original files are deleted in 6 days' time.
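For reference, here is a minimal sketch of how one could watch which ZFS snapshots are still pinning the old, uncompressed copies. The dataset name is hypothetical and this is not the actual cleanup procedure; it only illustrates why the freed space does not show up until the snapshots expire.

#!/usr/bin/env python3
"""Sketch: list ZFS snapshots still holding space on a dataset.

The dataset name below is a placeholder -- substitute the real one.
Space freed by compressing or deleting files only becomes available
once the snapshots that still reference the old copies are destroyed.
"""
import subprocess
import time

DATASET = "h1fs/opt-rtcds"   # hypothetical dataset name

def snapshots_pinning_space(dataset):
    """Return (name, bytes_used, age_days) for each snapshot of the dataset."""
    out = subprocess.run(
        ["zfs", "list", "-Hp", "-r", "-t", "snapshot",
         "-o", "name,used,creation", dataset],
        capture_output=True, text=True, check=True).stdout
    now = time.time()
    rows = []
    for line in out.splitlines():
        name, used, creation = line.split("\t")
        rows.append((name, int(used), (now - int(creation)) / 86400.0))
    return rows

if __name__ == "__main__":
    for name, used, age in snapshots_pinning_space(DATASET):
        print(f"{name}: holding {used / 1e9:.1f} GB, {age:.1f} days old")

Once the snapshots that are older than the retention window are gone, the "used" column for the dataset should drop by roughly the amount held in the old target_archive copies.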
RCG Upgrade: Unexpected problems, gotchas, lessons learnt
The automatic RCG install got stuck with file-permission problems on h1isibs and had to be cancelled. Unknown to us, the target directory for h1isiham8 was left in a bad state, with its SDF snap file symbolic links removed. When we started this model, we found that its safe.snap was the default RCG-generated one.
The network connection to the first two Dolphin IX switches was down following the switch upgrade which preceded the RCG upgrade. This meant that the fencing commands were subtly failing for FEs attached to these two switches, and so SUS front ends were being killed as SEI front ends were rebooted. Perhaps the fencing script should be more verbose if it cannot complete the fence operation.
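As an illustration of the kind of louder failure we would like, here is a minimal sketch (hypothetical switch hostname, not the production fencing script) of a fence step that refuses to proceed quietly when the switch management interface is unreachable:

#!/usr/bin/env python3
"""Sketch of a noisier fence step (illustration only, not the real script).

Before asking a Dolphin IX switch to disable a node's port, check that the
switch management interface is reachable, and exit with an unmissable error
if it is not, rather than silently skipping the fence.
"""
import socket
import sys

def fence_port(switch_host, mgmt_port=22, timeout=5.0):
    """Return True only if the switch answered; complain loudly otherwise."""
    try:
        with socket.create_connection((switch_host, mgmt_port), timeout=timeout):
            pass
    except OSError as err:
        print(f"FENCE FAILED: cannot reach {switch_host}:{mgmt_port} ({err}); "
              "node is NOT fenced -- do not reboot its Dolphin neighbours",
              file=sys.stderr)
        return False
    # ... the real script would issue the port-disable command here ...
    print(f"fence ok: {switch_host} reachable, port disable can proceed")
    return True

if __name__ == "__main__":
    if not fence_port("sw-ix-dolphin-0"):   # hypothetical switch hostname
        sys.exit(1)

The point is simply that an unreachable switch should produce a clear error and a non-zero exit, rather than letting the SEI reboot proceed against an unfenced SUS front end.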