Reports until 22:14, Monday 21 May 2018
H1 CDS (DAQ)
david.barker@LIGO.ORG - posted 22:14, Monday 21 May 2018 - last comment - 07:45, Tuesday 22 May 2018(42114)
RCG-3.4.2 upgrade summary

WP7578. Hugh, Jenne, Jonathan, Dave:

After upgrading the H1 models to RCG-3.4.2 and the front ends to Gentoo 3.0.8, we ran into a Dolphin network manager issue at 3pm PDT. Given the lateness of the hour, we decided to revert to RCG-3.2/Gentoo 2.6.34. All systems are now back, with the exception of the HAM Seismic systems, which are crashing on start-up. We will resolve this in the morning.

During the reversion process all three of the Gen-2 front end systems (h1suse[x,y], h1lsc0) lost connection with their IO Chassis and required a full system power cycle. This is unusual: recently only h1susex has required this type of reboot, and only infrequently.

To revert, I restored the archived target directories, the IPC file, the original INI and PAR files for those systems whose code had changed, and the DHCP service on h1boot.

 

Comments related to this report
keith.thorne@LIGO.ORG - 07:13, Tuesday 22 May 2018 (42115)
This is consistent with what was seen at LLO; see aLOG entries 38382, 28392, and 38401.

This issue snuck up on us because we had not rebooted the front-ends for many, many months. It is related to PCIe expansion fiber aging and newer front-ends (10-core Intel V2 from 2015, 6-core Intel V4 from 2018). Not all fibers have the same degradation (e.g. l1susex was fine). Newer fibers from 2015 also seem OK (e.g. l1susey is fine).

For diagnostics, I would swap the fiber on h1susex or h1susey with a newer fiber, then take the old fiber to the DAQ test stand so we can try to reproduce the problem there and see if some BIOS settings on the Intel V4 machines can get around it.
david.barker@LIGO.ORG - 07:45, Tuesday 22 May 2018 (42116)

It looks like the HAM-ISI kernel objects from the latest target_archive do not match the latest in the RCG-3.2 build area. This suggests a clean make-install and restart should fix this problem. For a first try, we should think about removing the front end from the Dolphin fabric; otherwise, if it were to fail, it would take the corner station models down (as was happening yesterday).
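
For anyone wanting to repeat the archive-versus-build comparison, here is a minimal sketch of one way to do it by checksum. This is not the procedure actually used here. The function names, the `*.ko` file pattern and the directory arguments are all assumptions for illustration; substitute the real target_archive and build-area paths.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_kernel_objects(archive_dir, build_dir, pattern="*.ko"):
    """Compare same-named files in two directories (hypothetical layout).

    Returns a dict mapping each file name found in archive_dir to
    "match", "differ", or "missing in build".
    """
    results = {}
    for archived in sorted(Path(archive_dir).glob(pattern)):
        built = Path(build_dir) / archived.name
        if not built.is_file():
            results[archived.name] = "missing in build"
        elif file_sha256(archived) == file_sha256(built):
            results[archived.name] = "match"
        else:
            results[archived.name] = "differ"
    return results
```

Any entry reported as "differ" or "missing in build" would point at a model whose archived kernel object is stale relative to the build area, which is the situation described above for the HAM-ISI models.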