I noticed a couple of errors that showed up in the VerbalAlarms window running on the Alarm Handler computer. These were not announced verbally by VerbalAlarms. See attached screenshot.
There were two more of these errors (one "not responding" / "alive again" pair) at ~10:00 UTC.
I see three errors on cdsfs0 this morning, each of which resulted in the RAID card being reset.
Dec 12 00:17:51 cdsfs0 kernel: [306069.208085] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.
Dec 12 01:23:42 cdsfs0 kernel: [310009.835370] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
Dec 12 02:57:34 cdsfs0 kernel: [315627.297842] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.
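If these keep recurring, a small scan of the syslog can list every reset along with the command that timed out. The following is a minimal sketch, not an existing site tool: the function name find_resets and the SAMPLE list are made up for illustration, and it assumes the value in "Command (0x..)" is the SCSI opcode of the stalled command (0x2a and 0x8a are the standard WRITE(10) and WRITE(16) opcodes).

```python
import re

# Matches the "timed out, resetting card" warning in the format shown above.
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+.*Command \((?P<opcode>0x[0-9a-fA-F]+)\) "
    r"timed out, resetting card\."
)

# Standard SCSI opcodes. That the driver prints the SCSI opcode here is an assumption.
SCSI_OPCODES = {0x2A: "WRITE(10)", 0x8A: "WRITE(16)"}


def find_resets(lines):
    """Return (timestamp, host, uptime_seconds, opcode, opcode_name) for each reset line."""
    events = []
    for line in lines:
        m = RESET_RE.search(line)
        if m is None:
            continue
        opcode = int(m.group("opcode"), 16)
        events.append((
            m.group("stamp"),
            m.group("host"),
            float(m.group("uptime")),
            opcode,
            SCSI_OPCODES.get(opcode, "unknown"),
        ))
    return events


# The three lines quoted above, used here as sample input.
SAMPLE = [
    "Dec 12 00:17:51 cdsfs0 kernel: [306069.208085] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.",
    "Dec 12 01:23:42 cdsfs0 kernel: [310009.835370] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.",
    "Dec 12 02:57:34 cdsfs0 kernel: [315627.297842] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x8a) timed out, resetting card.",
]

if __name__ == "__main__":
    for stamp, host, uptime, opcode, name in find_resets(SAMPLE):
        print(f"{stamp} {host} reset after {name} (0x{opcode:02x}) timed out")
```

In practice the lines would come from the log file on cdsfs0 rather than from SAMPLE; the path to that file is not given here, so it is left out.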
Here is a full set of logs for the 01:23 event:
Dec 12 01:23:42 cdsfs0 kernel: [310009.835370] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.
Dec 12 01:23:45 cdsfs0 snmpd[1716]: Connection from UDP: [10.20.0.85]:53320->[10.20.0.11]
Dec 12 01:24:13 cdsfs0 kernel: [310040.313376] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.
Dec 12 01:24:13 cdsfs0 kernel: [310040.433071] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.
Dec 12 01:24:45 cdsfs0 snmpd[1716]: Connection from UDP: [10.20.0.85]:54550->[10.20.0.11]
Dec 12 01:25:01 cdsfs0 CRON[13730]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec 12 01:25:03 cdsfs0 kernel: [310090.356268] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=0.
Dec 12 01:25:03 cdsfs0 kernel: [310090.385280] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=1.
Dec 12 01:25:03 cdsfs0 kernel: [310090.385714] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=2.
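The kernel uptime stamps in square brackets give a rough handle on how long the controller was unavailable. The sketch below measures the gap between the reset warning and the first 3w-9xxx AEN message that follows it. Treating that first AEN as the point where the card is back is an assumption, and the helper names uptime and reset_to_first_aen are hypothetical.

```python
import re

# Kernel uptime stamp, e.g. "kernel: [310009.835370]".
UPTIME_RE = re.compile(r"kernel:\s+\[\s*(\d+\.\d+)\]")


def uptime(line):
    """Return the kernel uptime in seconds from a syslog line, or None if absent."""
    m = UPTIME_RE.search(line)
    return float(m.group(1)) if m else None


def reset_to_first_aen(lines):
    """Seconds between a 'resetting card' warning and the next 3w-9xxx AEN message.

    Assumption: the first AEN after the reset marks the controller responding again.
    Returns None if no such pair is found.
    """
    start = None
    for line in lines:
        t = uptime(line)
        if t is None:
            continue  # non-kernel lines (snmpd, CRON) carry no uptime stamp
        if "timed out, resetting card" in line:
            start = t
        elif start is not None and "3w-9xxx" in line and "AEN" in line:
            return t - start
    return None


# Subset of the 01:23 log lines quoted above.
EVENT_0123 = [
    "Dec 12 01:23:42 cdsfs0 kernel: [310009.835370] sd 0:0:0:0: WARNING: (0x06:0x002C): Command (0x2a) timed out, resetting card.",
    "Dec 12 01:23:45 cdsfs0 snmpd[1716]: Connection from UDP: [10.20.0.85]:53320->[10.20.0.11]",
    "Dec 12 01:24:13 cdsfs0 kernel: [310040.313376] 3w-9xxx: scsi0: AEN: INFO (0x04:0x005E): Cache synchronization completed:unit=0.",
    "Dec 12 01:24:13 cdsfs0 kernel: [310040.433071] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0063): Enclosure added:encl=0.",
]

if __name__ == "__main__":
    gap = reset_to_first_aen(EVENT_0123)
    print(f"reset to first AEN: {gap:.1f} s")
```

For the 01:23 event the two stamps are 310009.835370 and 310040.313376, a gap of about 30.5 seconds. The timed-out command had already been stalled for some time before the warning was logged, so the interruption seen by clients may have been longer than this figure.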
Note that this problem is not related to the recent "/ligo is 100% full" error (/ligo currently has about 220 GB free).
It looks like most NFS clients rode through these restarts without logging any system errors. Presumably the outages were short, and perhaps only computers that were actively trying to access the file system at the time reported the error.
Please contact me over the weekend if the errors show up again.