Reports until 11:43, Thursday 21 May 2015
H1 DAQ (CDS, DCS)
david.barker@LIGO.ORG - posted 11:43, Thursday 21 May 2015 (18553)
summary of work to mitigate h1fw0 instability since tape library move

Stuart, Greg, Dan, Jim, Dave:

After the LDAS tape library was moved from the OSB Computer Users Room (CUR) to the Warehouse (WH) on Tuesday 12th May, h1fw0 became unstable. This came as a surprise, and we have been working over the past week and a half to understand the problem and implement a solution.

When the Tape Library was in the CUR, it was connected by fiber to LDAS in the LSB via the multimode patch panel in the CUR. In the WH, the tape library was added to the H1 DAQ Q-Logic Fiber Channel Switches (FCS). In this configuration the tape traffic and h1fw0's traffic are both sent over single mode fiber to the LSB. We think this additional data traffic on these switches is what triggers h1fw0's instability.

Upon further investigation, data errors were seen on one of the FCS in the LSB. These correlate with frame errors on the LDAS SATABOY, which in turn correlate with h1fw0 restarts.
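
A minimal sketch of how this kind of correlation could be checked from timestamps, assuming the three sets of event times (FCS data errors, SATABOY frame errors, h1fw0 restarts) have already been pulled out of their respective logs as seconds. The function name, the 10 second window and the sample numbers are all made up for illustration; they are not taken from the real logs.

```python
from bisect import bisect_left

def fraction_preceded(events, causes, window_s):
    """Fraction of `events` with at least one `causes` timestamp in the
    `window_s` seconds before them (boundaries inclusive)."""
    if not events:
        return 0.0
    causes = sorted(causes)
    hits = 0
    for t in events:
        # Index of the first cause at or after the start of the window.
        i = bisect_left(causes, t - window_s)
        # It counts only if it also falls at or before the event itself.
        if i < len(causes) and causes[i] <= t:
            hits += 1
    return hits / len(events)

# Illustrative timestamps only (seconds), NOT real log data.
fcs_errors = [100, 500, 900]
sataboy_errors = [102, 503, 700]
fw_restarts = [110, 510]

sataboy_after_fcs = fraction_preceded(sataboy_errors, fcs_errors, 10)
restarts_after_sataboy = fraction_preceded(fw_restarts, sataboy_errors, 10)
print(sataboy_after_fcs, restarts_after_sataboy)
```

A fraction close to 1 at each step would support the chain FCS error, then SATABOY frame error, then h1fw0 restart. A low fraction would suggest the events are unrelated or the window is too short.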

For the pair of FCS (one in MSR, one in LSB) which are showing the error, we tried using a different single mode link between the buildings and a different patch cable in the LSB. The error rate was unchanged.

When LDAS stops all access to the Tape Library, h1fw0 becomes stable again. So we know it's the additional traffic which is exposing the underlying problem.

So where do we stand now? Dan has several single mode SFPs on order which will be here for next Tuesday's maintenance. We plan on replacing the SFPs in the FCS pair to get back to where we wanted to be on 5/12. To get through this weekend, Dan has done the following:

For the FCS pair with the glitchy link, the single mode SFPs have been turned off. To permit packets from this FCS to get to the LSB, a multimode fiber optics link was made between the MSR FCS.

This change was made at 15:40 PDT Wed 5/20 and the tape library was ramped up to full operation. h1fw0 has been stable since then (20 hours), so it looks like a good short term solution for this upcoming long weekend.