Reports until 17:46, Wednesday 21 February 2024
H1 CDS
jonathan.hanks@LIGO.ORG - posted 17:46, Wednesday 21 February 2024 - last comment - 18:22, Wednesday 21 February 2024(75921)
WP 11698 - CDS network upgrade (Day 2) MSR and core gear
Jonathan, Erik, Dave, Tony, Ryan, Patrick, Dan

Today we transitioned to all the new switches in the MSR.

The order was:

sw-msr-ops - this is the workstation and FOM switch
sw-msr-server3 - auxiliary servers, epics gateways, WAP controller
sw-msr-server2 - auxiliary servers, back channel connections from the file servers
sw-msr-core - the main core network

Dan Moraru came up to help us make sure the file servers were in a safe state before we touched the switches related to them.

On the network side we ran into a few configuration issues but have been able to fix those.  A few ports were not put on the right VLAN; that was fixed.  We need to enable directed broadcasts on a number of networks to allow the EPICS channel access broadcasts to go through.  The fmcs EPICS IOC machine had a bad default gateway, which we had to correct before it would work (we do not know why or how it worked yesterday).  Loose cables were also causing connectivity issues for fmcs.
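For reference, the directed broadcast and default gateway items can be sanity checked with a short script.  This is only a sketch: the subnets and addresses below are made-up placeholders, not the actual CDS networks, and the helper names are hypothetical.  It computes the directed broadcast address of each subnet (the kind of address that would go into EPICS_CA_ADDR_LIST on a client that has to reach IOCs on another network) and checks that a host's default gateway actually sits on the host's own subnet.

```python
import ipaddress


def directed_broadcast(cidr: str) -> str:
    """Return the directed broadcast address of an IPv4 subnet given in CIDR form."""
    return str(ipaddress.ip_network(cidr, strict=False).broadcast_address)


def gateway_is_on_link(host_cidr: str, gateway: str) -> bool:
    """Return True if the gateway address lies inside the host's own subnet."""
    iface = ipaddress.ip_interface(host_cidr)
    return ipaddress.ip_address(gateway) in iface.network


# Placeholder subnets, NOT the real CDS networks.
subnets = ["10.0.20.0/24", "10.0.32.0/22"]

# Channel Access clients take a space-separated list of search addresses.
addr_list = " ".join(directed_broadcast(s) for s in subnets)
print(f"EPICS_CA_ADDR_LIST={addr_list}")

# Placeholder host and gateway: a gateway outside the host subnet is unreachable.
print(gateway_is_on_link("10.0.20.15/24", "10.0.20.1"))
print(gateway_is_on_link("10.0.20.15/24", "10.0.21.1"))
```

A gateway that fails the second check is one way a machine can still talk to its local subnet while being unreachable from everywhere else, though we have not confirmed that this was the exact failure on the fmcs machine.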

Tony, Ryan, and Dave did some work on cleaning out old items from the racks.  We removed an old Fujitsu switch, a broken QLogic switch, the old (dead) router, some retired servers, and the old sw-msr-ops, sw-msr-spare, sw-msr-server1, and sw-msr-server3 switches.

Tomorrow we will remove the old core switch and do more cleanup.  We may have short outages of some systems, but we do not expect a general outage.

The mx, my, and mechanical room switches are passing data, but we are not able to log into them.  They were working on the test bench last week.  I have racked up a test switch and connected it to the core (via a temporary link on port 48) to try to replicate the setup of these switches and get this issue figured out.

We went through the configuration of all the icx switches and made sure that we had a consistent spanning tree setup.  This should help clear up some odd network issues we have seen in the past.

We also moved the connection to the file servers directly onto the core switch.

We had front end system crashes today.  Dave or Erik will add a comment detailing those issues.  Dave restarted the EDCU while we were doing recovery of the front end systems to force it to reconnect to EPICS channels while we worked on issues with the fmcs ioc.
Comments related to this report
erik.vonreis@LIGO.ORG - 18:22, Wednesday 21 February 2024 (75923)

h1seib1, h1seib3, h1seih7, h1asc0, h1lsc0, and h1oaf0 all crashed at approximately 20:00 UTC while switches were being swapped out in the MSR.

h1susey, h1seiey, and h1susex, historically touchy front ends, crashed while we were recovering the others.  h1seiey got into a bad state, continuously logging dolphin errors, which disrupted the timing of the models, and it had to be restarted.

After the front ends were recovered, a dolphin timing glitch was triggered when "unfencing" dolphin ports.  The glitch affected models across the corner station.  Models on h1susb123, h1sush2a, h1sush34, and h1sush56 had to be restarted.