aLIGO LHO Logbook

H1 General

jonathan.hanks@LIGO.ORG - posted 16:31, Tuesday 03 December 2024 (81608)

Network/GC issues today

This is a follow-up to https://alog.ligo-wa.caltech.edu/aLOG/index.php?callRep=81595 with some more information.

Today many of the site general computing (GC) services were down.

While preparing for WP 12215, Nyath was migrating systems on the GC hypervisor cluster. This was to make a configuration change on the nodes while they were not running anything. At the end of moving one of the VMs, the network switches connecting the hypervisors and storage went into a bad state. We saw large packet loss on the systems connected to the switches. This manifested itself as the systems seeing disk I/O errors (due to timeouts when trying to read or write data), and it had wide-ranging impacts on GC. It also caused issues on the GC to CDS switch, setting a key link to a blocking state so that no traffic flowed between GC and CDS (which points to the issues being related to a spanning-tree problem). I will note that migrating systems is part of the designed feature set of the system and part of the normal procedure for doing maintenance on hypervisor nodes.

The first steps of work were to get access to the hypervisors and storage and to make sure those items were in a good state. Later, after working through restarts on various components and consulting with Dan and Erik, the main switch stack for the VM system was rebooted, and that seems to have cleared up the issues.

Work in the control room continued using the local controls account, though we did have to make a change to the system config. This needs to be looked at. We have several KDCs configured so that authentication can go to multiple locations and does not need to rely on DNS, but the setup caused us issues. To get things working we commented out the KDC lines in the krb5.conf file. This essentially stopped the krb5 authentication (LIGO.ORG) but allowed local auth to go forward (which is what we had designed the setup to allow, so we will re-check the configs).
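
For reference, the krb5.conf workaround described above could be scripted roughly as follows. This is only a sketch and not the procedure that was actually used: the function name comment_out_kdcs and the KDC host names in the sample are placeholders, and it assumes the KDCs are listed as plain "kdc = host" lines inside the realm block.

```python
import re

# Matches an active (uncommented) "kdc = host" line and captures its indentation.
# Lines such as "admin_server = ..." or already commented "# kdc = ..." do not match.
_KDC_LINE = re.compile(r"^(\s*)(kdc\s*=.*)$")


def comment_out_kdcs(conf_text: str) -> str:
    """Return krb5.conf text with every active 'kdc = ...' line commented out."""
    out = []
    for line in conf_text.splitlines():
        m = _KDC_LINE.match(line)
        out.append(f"{m.group(1)}# {m.group(2)}" if m else line)
    return "\n".join(out) + "\n"


# Hypothetical sample; the host names are placeholders, not the real site KDCs.
SAMPLE = """[realms]
    LIGO.ORG = {
        kdc = kdc1.example.org
        kdc = kdc2.example.org
        admin_server = kdc1.example.org
    }
"""

if __name__ == "__main__":
    print(comment_out_kdcs(SAMPLE), end="")
```

Running the function on its own output changes nothing further, so it is safe to apply more than once. Restoring Kerberos authentication afterwards means removing the added "# " prefixes again.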