at approximately 07:50 PDT this morning the /opt/rtcds file system (served by h1fs0) became full. This caused some front end epics processes to segfault (example dmesg output for h1susb123 shown below). Presumably these models epics processes were trying to do some file access at this time. The CDS overview is attached showing which specific models had problems. At this point guardian stopped running because it could not connect to critical frontends. Lockloss_shutter_check also reported an NDS error at this time (log shown below), further investigation is warrented since h1nds0 was running at the time.
On trying to restart h1susitmx, the errors showed that /opt/rtcds was full. This is a ZFS file system, served by h1fs0. I first attempted to delete some old target_archive directories, but ran into file-system-full errors when running the 'rm' command. As root, I manually destroyed all the ZFS Snapshots for the month of May 2016. This freed up 22.3GB of disk which permitted me to start the failed models.
Note that only the model EPICS processes had failed, the front end cores were still running. However in order to cleanly restart the models I first issued a 'killh1modelname' and then ran 'starth1modelname'. Restarting h1psliss did not trip any shutters and the PSL was operational at all times.
I've handed the front ends over to Patrick and Ed for IFO locking, I'll work on file system cleanup in the background.
I've opened FRS6488 to prevent a re-occurance of this
[1989275.036661] h1susitmxepics[25707]: segfault at 0 ip 00007fd13403c894 sp 00007fffb426b9a0 error 4 in libc-2.10.1.so[7fd133fda000+14c000]
[1989275.045095] h1susitmxepics used greatest stack depth: 2984 bytes left
[1989275.086076] h1susbsepics[25384]: segfault at 0 ip 00007f2a5348e894 sp 00007fff908c88e0 error 4 in libc-2.10.1.so[7f2a5342c000+14c000]
[1989275.127643] h1susitmyepics[25166]: segfault at 0 ip 00007f5905a59894 sp 00007fff20f878d0 error 4 in libc-2.10.1.so[7f59059f7000+14c000]
2016-10-23T14:51:50.62907 LOCKLOSS_SHUTTER_CHECK W: Traceback (most recent call last):
2016-10-23T14:51:50.62909 File "/ligo/apps/linux-x86_64/guardian-1.0.2/lib/python2.7/site-packages/guardian/worker.py", line 461, in run
2016-10-23T14:51:50.62910 retval = statefunc()
2016-10-23T14:51:50.62910 File "/opt/rtcds/userapps/release/isc/h1/guardian/LOCKLOSS_SHUTTER_CHECK.py", line 50, in run
2016-10-23T14:51:50.62911 gs13data = cdu.getdata(['H1:ISI-HAM6_BLND_GS13Z_IN1_DQ','H1:SYS-MOTION_C_SHUTTER_G_TRIGGER_VOLTS'],12,self.timenow-10)
2016-10-23T14:51:50.62911 File "/ligo/apps/linux-x86_64/cdsutils/lib/python2.7/site-packages/cdsutils/getdata.py", line 78, in getdata
2016-10-23T14:51:50.62912 for buf in conn.iterate(*args):
2016-10-23T14:51:50.62912 RuntimeError: Requested data were not found.
2016-10-23T14:51:50.62913