Reports until 00:37, Saturday 12 December 2015
H1 General (CDS)
travis.sadecki@LIGO.ORG - posted 00:37, Saturday 12 December 2015 - last comment - 08:46, Saturday 12 December 2015(24139)
nfs server cdsfs0:/ligo errors

I noticed a couple of errors that showed up in the VerbalAlarms window running on the Alarm Handler computer.  These were not announced verbally by VerbalAlarms.  See attached screenshot.

Comments related to this report
travis.sadecki@LIGO.ORG - 02:33, Saturday 12 December 2015 (24140)

There were two more of these errors (one "not responding" / "alive again" pair) at ~10:00 UTC.

david.barker@LIGO.ORG - 08:36, Saturday 12 December 2015 (24144)

I see three errors on cdsfs0 this morning, each of which resulted in the RAID card being reset.

Dec 12 00:17:51 cdsfs0 kernel: [306069.208085] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.

Dec 12 01:23:42 cdsfs0 kernel: [310009.835370] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.

Dec 12 02:57:34 cdsfs0 kernel: [315627.297842] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.

Here is a full set of logs for the 01:23 event:

Dec 12 01:23:42 cdsfs0 kernel: [310009.835370] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.

Dec 12 01:23:45 cdsfs0 snmpd[1716]: Connection from UDP: [10.20.0.85]:53320->[10.20.0.11]

Dec 12 01:24:13 cdsfs0 kernel: [310040.313376] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.

Dec 12 01:24:13 cdsfs0 kernel: [310040.433071] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.

Dec 12 01:24:45 cdsfs0 snmpd[1716]: Connection from UDP: [10.20.0.85]:54550->[10.20.0.11]

Dec 12 01:25:01 cdsfs0 CRON[13730]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

Dec 12 01:25:03 cdsfs0 kernel: [310090.356268] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.

Dec 12 01:25:03 cdsfs0 kernel: [310090.385280] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=1.

Dec 12 01:25:03 cdsfs0 kernel: [310090.385714] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=2.
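
To pull these reset events out of a longer syslog, something like the following Python sketch could be used. It assumes the kernel lines keep exactly the format shown above. The function name `find_card_resets`, the regular expression, and the hard-coded year are illustrative assumptions (syslog timestamps carry no year), not an existing CDS tool.

```python
import re
from datetime import datetime

# Matches kernel lines of the form shown above, e.g.
# "Dec 12 00:17:51 cdsfs0 kernel: [306069.208085] sd 0:0:0:0: WARNING: ... Command (0x8a) timed out, resetting card."
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+sd\s+\S+\s+WARNING:.*"
    r"Command \((?P<opcode>0x[0-9a-fA-F]+)\) timed out, resetting card"
)


def find_card_resets(lines, year=2015):
    """Return (timestamp, host, opcode, uptime_seconds) for each reset line.

    `year` is an assumption supplied by the caller, because the syslog
    timestamp format does not include one.
    """
    events = []
    for line in lines:
        m = RESET_RE.search(line)
        if m is None:
            continue  # not a "timed out, resetting card" line
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        events.append((when, m["host"], m["opcode"], float(m["uptime"])))
    return events


# Sample input copied from the log excerpts in this report.
sample = [
    "Dec 12 00:17:51 cdsfs0 kernel: [306069.208085] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.",
    "Dec 12 01:23:42 cdsfs0 kernel: [310009.835370] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.",
    "Dec 12 01:23:45 cdsfs0 snmpd[1716]: Connection from UDP: [10.20.0.85]:53320->[10.20.0.11]",
    "Dec 12 02:57:34 cdsfs0 kernel: [315627.297842] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.",
]

for when, host, opcode, uptime in find_card_resets(sample):
    print(when.isoformat(), host, opcode)
```

The opcodes in the three events, 0x8a and 0x2a, are the SCSI WRITE(16) and WRITE(10) commands, so all three timeouts occurred on writes.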

david.barker@LIGO.ORG - 08:46, Saturday 12 December 2015 (24145)

Note that this problem is not related to the recent "/ligo is 100% full" error; /ligo currently has about 220GB of disk free.

It looks like most NFS clients rode through these restarts without logging any system errors. Presumably the outages were short, and perhaps only computers that were actively trying to access the file system at the time reported the error.

Please contact me over the weekend if the errors show up again.