Reports until 10:14, Sunday 23 October 2016
H1 CDS (DAQ)
david.barker@LIGO.ORG - posted 10:14, Sunday 23 October 2016 (30766)
/opt/rtcds file system full, caused some front end EPICS processes to stop

At approximately 07:50 PDT this morning the /opt/rtcds file system (served by h1fs0) became full. This caused some front end EPICS processes to segfault (example dmesg output for h1susb123 shown below). Presumably these models' EPICS processes were trying to do some file access at this time. The CDS overview is attached showing which specific models had problems. At this point guardian stopped running because it could not connect to critical front ends. Lockloss_shutter_check also reported an NDS error at this time (log shown below); further investigation is warranted since h1nds0 was running at the time.

On trying to restart h1susitmx, the errors showed that /opt/rtcds was full. This is a ZFS file system, served by h1fs0. I first attempted to delete some old target_archive directories, but ran into file-system-full errors when running the 'rm' command. As root, I manually destroyed all the ZFS snapshots for the month of May 2016. This freed up 22.3 GB of disk space, which permitted me to start the failed models.
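For reference, the snapshot selection step could be scripted roughly as follows. This is only a sketch, not the procedure actually used: the dataset name and the assumption that snapshot labels contain a date string such as "2016-05" are hypothetical, and the function names are made up for illustration. It only builds the `zfs destroy` command lines so they can be reviewed before anything is run as root.

```python
import subprocess


def list_snapshots(dataset):
    """Return the names of all snapshots under `dataset` using `zfs list`."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name", "-r", dataset],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def select_snapshots(names, tag):
    """Keep snapshots whose label (the part after '@') contains `tag`.

    The label format is an assumption; adjust `tag` to the local naming scheme.
    """
    return [n for n in names if "@" in n and tag in n.split("@", 1)[1]]


def destroy_commands(names):
    """Build (but do not run) one `zfs destroy` command per snapshot."""
    return [["zfs", "destroy", n] for n in names]
```

A dry run would print `destroy_commands(select_snapshots(list_snapshots(dataset), "2016-05"))` for inspection; only after checking the list would each command be passed to `subprocess.run`.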

Note that only the model EPICS processes had failed; the front end cores were still running. However, in order to cleanly restart the models, I first issued a 'killh1modelname' and then ran 'starth1modelname'. Restarting h1psliss did not trip any shutters and the PSL was operational at all times.

I've handed the front ends over to Patrick and Ed for IFO locking; I'll work on file system cleanup in the background.

I've opened FRS6488 to prevent a recurrence of this.
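One possible way to catch a filling file system earlier is a periodic usage check. The sketch below is an illustration only and not necessarily what the FRS ticket will lead to; the 90% threshold and the function names are assumptions.

```python
import shutil


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the file system holding `path`."""
    total, used, _free = shutil.disk_usage(path)
    return used / total


def check_usage(path, threshold=0.9, usage=usage_fraction):
    """Return (over_threshold, fraction) for `path`.

    `usage` can be swapped for a stub so the logic is testable without a real mount.
    """
    frac = usage(path)
    return frac >= threshold, frac
```

Run from cron against /opt/rtcds, a True first element could trigger an email or an alarm before the file system fills completely.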

 

[1989275.036661] h1susitmxepics[25707]: segfault at 0 ip 00007fd13403c894 sp 00007fffb426b9a0 error 4 in libc-2.10.1.so[7fd133fda000+14c000]

[1989275.045095] h1susitmxepics used greatest stack depth: 2984 bytes left

[1989275.086076] h1susbsepics[25384]: segfault at 0 ip 00007f2a5348e894 sp 00007fff908c88e0 error 4 in libc-2.10.1.so[7f2a5342c000+14c000]

[1989275.127643] h1susitmyepics[25166]: segfault at 0 ip 00007f5905a59894 sp 00007fff20f878d0 error 4 in libc-2.10.1.so[7f59059f7000+14c000]
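To list which processes segfaulted on a given front end, dmesg lines in the format shown above can be parsed with a short script. This is a sketch that assumes the kernel message format seen in this entry; the function name is made up for illustration.

```python
import re

# Matches e.g. "[1989275.036661] h1susitmxepics[25707]: segfault at 0 ip ..."
SEGFAULT_RE = re.compile(r"^\[\s*[\d.]+\]\s+(\S+?)\[(\d+)\]: segfault")


def segfaulted_processes(dmesg_text):
    """Return (process_name, pid) for every segfault line in `dmesg_text`."""
    found = []
    for line in dmesg_text.splitlines():
        match = SEGFAULT_RE.match(line.strip())
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found
```

The text to parse could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout` on each front end.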


2016-10-23T14:51:50.62907 LOCKLOSS_SHUTTER_CHECK W: Traceback (most recent call last):

2016-10-23T14:51:50.62909   File "/ligo/apps/linux-x86_64/guardian-1.0.2/lib/python2.7/site-packages/guardian/worker.py", line 461, in run

2016-10-23T14:51:50.62910     retval = statefunc()

2016-10-23T14:51:50.62910   File "/opt/rtcds/userapps/release/isc/h1/guardian/LOCKLOSS_SHUTTER_CHECK.py", line 50, in run

2016-10-23T14:51:50.62911     gs13data = cdu.getdata(['H1:ISI-HAM6_BLND_GS13Z_IN1_DQ','H1:SYS-MOTION_C_SHUTTER_G_TRIGGER_VOLTS'],12,self.timenow-10)

2016-10-23T14:51:50.62911   File "/ligo/apps/linux-x86_64/cdsutils/lib/python2.7/site-packages/cdsutils/getdata.py", line 78, in getdata

2016-10-23T14:51:50.62912     for buf in conn.iterate(*args):

2016-10-23T14:51:50.62912 RuntimeError: Requested data were not found.

2016-10-23T14:51:50.62913

 

Images attached to this report