H1 CDS
cyrus.reed@LIGO.ORG - posted 10:44, Monday 18 August 2014 - last comment - 13:31, Monday 18 August 2014(13464)
SataBoy Controller 0 Failures
So, apparently, the loss of a management network link causes the SataBoy controllers to fail, as evidenced by three simultaneous alarms when I powered off the management switch in preparation to swap it out.  Yay.  I have silenced the alarms, and so far controller C1 on each array appears to be working.  The message on the serial console was 'Controller C0 has a firmware fault', and the status LED was still green until I booted the switch, at which point the LED went red and the serial console message changed to 'RAID controller 0 failed'.  More details to follow.
Comments related to this report
cyrus.reed@LIGO.ORG - 13:29, Monday 18 August 2014 (13465)
Dan and I returned the failed controllers to operation by pulling them and then re-seating them in the chassis.  (It turns out you can do this from the GUI too, if it is accessible, which is probably preferable; I assume it can also be done from the serial console, if you know where to look.)  We tested using the (idle) second h1fw1 array and were able to reproduce the problem.  Looking at the logs, I noticed that the SMTP process generated an exception when the ethernet link went down; when the link came back up, the NSVC process logged an exception (and caused a controller failure).  We decided to turn off the 'E-Alerts' for network-related events and tested again - the problem did not recur.  So apparently, the arrays get very upset when they can't e-mail you that the network is down.  We updated the configuration on the other two arrays and tested them to make sure the change works around this issue.

h1fw1 logged a reboot sometime during the initial instance of the problem, but no other problems were recorded during recovery and testing, so that work was non-intrusive.
cyrus.reed@LIGO.ORG - 13:31, Monday 18 August 2014 (13466)
Note: The affected arrays were raid-msr-h1a, raid-msr-h1b, and raid-msr-dmt.  (So, all of them.)  All should now be recovered and operating normally.  The redundant controller appears to have taken over in all cases while the primary controller was down.
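
For future reference, below is a minimal sketch of a reachability check over the three arrays' management interfaces, which could be run after this kind of recovery.  It assumes the hostnames above resolve on the CDS network, that the management ports answer ICMP ping, and that it runs on a Linux host (the ping flags are Linux-specific); it is an illustrative sketch, not a vendor-supported check.

#!/usr/bin/env python3
# Quick reachability check for the MSR RAID arrays after controller recovery.
# Hostnames and ICMP reachability are assumptions; adjust for the actual network.
import subprocess

ARRAYS = ["raid-msr-h1a", "raid-msr-h1b", "raid-msr-dmt"]

def is_reachable(host, timeout_s=2):
    """Return True if a single ping to `host` succeeds within the timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for host in ARRAYS:
        print("%s: %s" % (host, "up" if is_reachable(host) else "DOWN"))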