Displaying report 1-1 of 1.
Reports until 15:18, Wednesday 18 September 2013
H1 CDS
cyrus.reed@LIGO.ORG - posted 15:18, Wednesday 18 September 2013 - last comment - 09:41, Friday 20 September 2013 (7793)
cdsfs1 RAID Card Replaced
I replaced the RAID card in cdsfs1 with a new one.  While I had the chassis open, I took the opportunity to vacuum out all the bugs and check the interior cable connections.  Even so, on booting with the new card installed, I was greeted by an I2C bus error from the RAID controller.  I powered off the server and found a loose connection to the disk backplane, which I either missed earlier or knocked loose while addressing another loose cable.  It may also never have been connected properly: once it was reconnected, the RAID controller showed temperature sensors that the old controller had not shown.  On reboot the RAID controller was happy, but once the server booted, the root drive mounted read-only due to EXT4 filesystem journal errors.  I rebooted once again, which forced a full fsck, after which the root filesystem was happy.  The RAID controller is now verifying the RAID; this is a slow process that will take at least a day.  So far the system looks healthier, and the RAID verification should provide a good burn-in period.
Comments related to this report
cyrus.reed@LIGO.ORG - 10:36, Thursday 19 September 2013 (7805)
The RAID verification process was complete when I arrived this morning.  I then unmounted the /raid filesystem* so I could force an fsck on it, to verify the integrity of the file structure itself.  The fsck passed, so I remounted /raid.  It should now be ready to run rsyncs/backups again.  I also started the battery backup test on the RAID controller; this takes 'up to 24 hours' to complete.  During this time, if the entire system loses power without a clean shutdown, any data in flight in the RAID cache will be lost; the system is on UPS power, so this is a low risk.

* I had to modify /etc/exports first to remove /raid, then run exportfs -ra to update NFS; otherwise the unmount fails with 'filesystem busy' messages.  I reversed these steps once the fsck was finished.
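
For reference, the unexport/fsck/re-export sequence can be sketched roughly as below. This is a dry-run outline that only prints the commands, not the exact session I ran. RAID_DEV is a hypothetical device name (the real device is not recorded here), `mount /raid` assumes an /etc/fstab entry for /raid, and `fsck -f` assumes the filesystem-specific checker accepts -f to force a check, as e2fsck does.

```shell
#!/bin/sh
# Dry-run sketch: prints the command sequence instead of executing it.
# RAID_DEV is a HYPOTHETICAL placeholder; substitute the real device.
RAID_MOUNT="${RAID_MOUNT:-/raid}"
RAID_DEV="${RAID_DEV:-/dev/sdX1}"

raid_fsck_plan() {
    cat <<EOF
# 1. remove (or comment out) the $RAID_MOUNT line in /etc/exports by hand
exportfs -ra
umount $RAID_MOUNT
fsck -f $RAID_DEV
mount $RAID_MOUNT
# 2. restore the $RAID_MOUNT line in /etc/exports by hand
exportfs -ra
EOF
}

raid_fsck_plan
```

Printing the plan first makes it easy to review the order: NFS has to stop exporting the filesystem before the unmount, and the export is only restored after the remount.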
cyrus.reed@LIGO.ORG - 09:41, Friday 20 September 2013 (7819)
The battery backup test passed with an estimated capacity of 255 hours, so the controller can maintain data in the cache for roughly 10 days without external power.  I also checked the controller logs, and so far they are clean with no errors.
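
As a quick check of the hours-to-days figure, a minimal sketch using shell integer arithmetic (the function name hours_to_days is just an illustrative helper, not something on the server):

```shell
#!/bin/sh
# Convert a whole number of hours into days plus leftover hours.
hours_to_days() {
    hours="$1"
    echo "$((hours / 24)) days $((hours % 24)) hours"
}

hours_to_days 255   # prints: 10 days 15 hours
```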