WP11088 RCG5.1.4 Upgrade
Erik, Jonathan, EJ, Dave
All the front end systems were rebooted against h1vmboot1, which had been upgraded in place to RCG5.1.4. The systems which had been pre-upgraded last Friday (h1susaux[h2,h34,h56,ex,ey], h1pem[mx,my], h1build) were not rebooted today.
First we booted h1susauxb123 to upgrade the EDC, followed by a DAQ restart to sync h1susauxb123's channels. At this point all models had bad DAQ data except for the pre-upgraded systems and susauxb123/EDC.
We powered down h1cdsrfm.
During the switch upgrade work the EY Dolphin'ed front ends and h1susex had crashed, so we next rebooted EY, followed by EX.
Moving on to the corner station, we rebooted PSL first while the end stations were coming back up, then SUS, then SEI and then ISC. Surprisingly, soon after h1susb123 was running it was Dolphin glitched again while SEI was being rebooted. We tracked this down to an IX Dolphin port disable which failed because of a missing network connection. This was quickly fixed, and tested by rebooting h1oaf0 after h1susb123 had been recovered.
Next h1cdsrfm was rebooted, followed by h1ecatmon0.
DAQ RCG5.1.4 Upgrade
Jonathan
After the front end upgrade to RCG5.1.4 was complete, Jonathan upgraded the DAQ.
An interesting point: the DAQ was restarted twice today, and none of the GDS machines needed a second restart to sync their channel lists.
EY HEPI Pump Controller Recovery
Jim, Fil, Dave:
The EY HEPI pump controller network port went offline Friday evening. This morning the EY purple-box controller was power cycled and came back up in a fully working state. It had most probably been running continuously since 30 Mar 2021, when the original EY box was replaced with the EX box (surplused after EX was upgraded to Beckhoff on 30 Jan 2018). Note that we did not change the IP address of the EX box when it was moved, so the EY HEPI pump controller is still called h1hpipumpctrlex (10.105.0.62).
WP11063, 11093 MSR Network Switch Upgrade
Jonathan, Tony, Erik, Dave:
The last few aLIGO era Netgear switches were replaced with a Brocade 3-unit stack.
One of the /ligo file cluster nodes had its IPMI connected to one of the old switches, and so the cluster was put into maintenance mode (so it would not attempt to fence) while this IPMI cable was moved.
The new switches were temporarily installed adjacent to the old switch, which greatly reduced the disconnect time when ethernet cables were moved from the old to the new switches. Most hosts did not see any network loss.
The main issue found was that when h1hypervisor0 (which hosts h1vmboot1, the boot server) was moved, there was a familiar Dolphin crash of h1susey, h1seiey, h1iscey and h1susex. These machines were going to be rebooted with the RCG upgrade so this was not a problem.
After the RCG upgrade, the /opt/rtcds file system exceeded 90% usage. This is due to the creation of target_archive directories. I have compressed these directories to free up space, but the change will not take effect until the ZFS snapshots referencing the original files are deleted in 6 days' time.
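For reference, here is a minimal sketch of how one could watch which ZFS snapshots are still pinning the old, uncompressed copies. The dataset name is hypothetical and this is not the actual cleanup procedure; it only illustrates why the freed space does not show up until the snapshots expire.

#!/usr/bin/env python3
"""Sketch: list ZFS snapshots still holding space on a dataset.

The dataset name below is a placeholder -- substitute the real one.
Space freed by compressing or deleting files only becomes available
once the snapshots that still reference the old copies are destroyed.
"""
import subprocess
import time

DATASET = "h1fs/opt-rtcds"   # hypothetical dataset name

def snapshots_pinning_space(dataset):
    """Return (name, bytes_used, age_days) for each snapshot of the dataset."""
    out = subprocess.run(
        ["zfs", "list", "-Hp", "-r", "-t", "snapshot",
         "-o", "name,used,creation", dataset],
        capture_output=True, text=True, check=True).stdout
    now = time.time()
    rows = []
    for line in out.splitlines():
        name, used, creation = line.split("\t")
        rows.append((name, int(used), (now - int(creation)) / 86400.0))
    return rows

if __name__ == "__main__":
    for name, used, age in snapshots_pinning_space(DATASET):
        print(f"{name}: holding {used / 1e9:.1f} GB, {age:.1f} days old")

Once the snapshots that are older than the retention window are gone, the "used" column for the dataset should drop by roughly the amount held in the old target_archive copies.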
RCG Upgrade: Unexpected problems, gotchas, lessons learnt
The automatic RCG install got stuck with file-permission problems on h1isibs and had to be cancelled. Unknown to us, the target directory for h1isiham8 was left in a bad state, with its SDF snap file symbolic links removed. When we started this model, we found that its safe.snap was the default RCG-generated one.
The network connection to the first two Dolphin IX switches was down following the switch upgrade which preceded the RCG upgrade. This meant that the fencing commands were subtly failing for FEs attached to these two switches, and so SUS front ends were being killed as SEI front ends were rebooted. Perhaps the fencing script should be more verbose if it cannot complete the fence operation.
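As an illustration of the kind of louder failure we would like, here is a minimal sketch (hypothetical switch hostname, not the production fencing script) of a fence step that refuses to proceed quietly when the switch management interface is unreachable:

#!/usr/bin/env python3
"""Sketch of a noisier fence step (illustration only, not the real script).

Before asking a Dolphin IX switch to disable a node's port, check that the
switch management interface is reachable, and exit with an unmissable error
if it is not, rather than silently skipping the fence.
"""
import socket
import sys

def fence_port(switch_host, mgmt_port=22, timeout=5.0):
    """Return True only if the switch answered; complain loudly otherwise."""
    try:
        with socket.create_connection((switch_host, mgmt_port), timeout=timeout):
            pass
    except OSError as err:
        print(f"FENCE FAILED: cannot reach {switch_host}:{mgmt_port} ({err}); "
              "node is NOT fenced -- do not reboot its Dolphin neighbours",
              file=sys.stderr)
        return False
    # ... the real script would issue the port-disable command here ...
    print(f"fence ok: {switch_host} reachable, port disable can proceed")
    return True

if __name__ == "__main__":
    if not fence_port("sw-ix-dolphin-0"):   # hypothetical switch hostname
        sys.exit(1)

The point is simply that an unreachable switch should produce a clear error and a non-zero exit, rather than letting the SEI reboot proceed against an unfenced SUS front end.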