Dan, Ryan, Dave:
We worked on h1fw0 today. This writer was chosen because the nds machines have been offloaded to the cds-h1-frames file system (via h1ldasgw2), and because h1fw0 is significantly more unstable than h1fw1.
Executive Summary: no matter what we did, h1fw0 remains very unstable when it tries to write all the frame files; it is stable when it writes only the science frames. Further tests are more intrusive and will wait until later in the week.
During all the tests I monitored the status of the threads which write the frame files to disk. There are four threads: dqscifr, dqfulfr, dqstrfr, dqmtrfr (science frame, full frame, second trend frame and minute trend frame, respectively). These threads are normally in the sleep state (S), transitioning to the running state (R) when writing their frames and also spending some time in the disk-sleep state (D) during the write process.
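For anyone repeating this kind of monitoring, the thread states can be read from /proc on Linux. The following is only a minimal sketch, not the method used during these tests: it assumes the four thread names appear as the per-thread names under /proc/<pid>/task/<tid>/stat, and the function names parse_stat_line and frame_thread_states are hypothetical.

```python
import os

# Thread names taken from the log entry; that they show up verbatim as the
# per-thread "comm" name in /proc is an assumption.
FRAME_THREADS = {"dqscifr", "dqfulfr", "dqstrfr", "dqmtrfr"}


def parse_stat_line(line):
    """Return (thread_name, state) from the contents of a /proc stat file."""
    # The name is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    name = line[start + 1:end]
    state = line[end + 1:].split()[0]  # e.g. 'S', 'R' or 'D'
    return name, state


def frame_thread_states(pid):
    """Map each frame writer thread name of process `pid` to its state."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                name, state = parse_stat_line(f.read())
        except OSError:
            continue  # thread exited between listing and reading
        if name in FRAME_THREADS:
            states[name] = state
    return states
```

Calling frame_thread_states with the daqd process ID once a second and logging the result would show when all active writer threads sit in D at the same time.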
I am interacting with the QFS file system in three ways: daqd writing the frame files, manually creating a 1GB file using the dd command, and performing a directory long listing (ls -al). When everything is running correctly, the dd creation of a 1GB file takes about 5 seconds and the "ls -al" is fast. The symptom of the problem is that all the active frame writer threads get stuck in the uninterruptible D state, forever waiting for a disk write completion signal. At this point the dd copy is still able to create a 1GB file, but it takes much longer (10-20 seconds). The long listing fails, also going into the D disk-io-sleep state. Data is still going into daqd but none is coming out, so the internal buffers fill and the process dies. Once daqd has died, the long listing completes and the dd copy goes back to taking 5 seconds.
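The two manual probes described above (the timed 1GB write and the long listing) could be scripted so that the timings are logged consistently. This is a rough sketch, not the commands used today: the function names timed_write and timed_listing are hypothetical, and the fsync call is my assumption so that the timing includes the flush to the server, which a plain dd without sync options may not do. Note that if the file system hangs, these calls will block in the same way the ls did.

```python
import os
import time


def timed_write(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write `total_bytes` of zeros to `path`; return elapsed seconds."""
    chunk = b"\0" * chunk_bytes
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # assumption: include the flush to disk in the timing
    return time.monotonic() - start


def timed_listing(directory):
    """Stat every entry in `directory`; return (entry_count, elapsed seconds)."""
    start = time.monotonic()
    # Fetching the size forces a stat of each entry, roughly as "ls -al" does.
    entries = [(e.name, e.stat(follow_symlinks=False).st_size)
               for e in os.scandir(directory)]
    return len(entries), time.monotonic() - start
```

For a probe comparable to the one in this entry, total_bytes would be 1024**3 and path would point at a scratch file on the frame file system.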
Things tried today:
Monitored NFS traffic on the private network between the frame writer and the Solaris NFS server; no errors or unusual packets seen.
Changed the version of NFS for this mount from vers3 to vers4; no change in stability. One point of interest: to make the change on h1fw0 I rebooted the computer, and the daqd process on h1fw1 died at this time.
Dan stopped the rsync process which is backing up the raw minute trends from ldas-h1-frames. He also stopped all LDAS access to ldas-h1-frames (disk2disk copy, frame file checksumming). At this point h1fw0 had ldas-h1-frames all to itself, and it still could not write all 4 frame files; it died on the hour, when the minute, second and full frames were all being written at once.
As I have left the system, h1fw0 is only writing science frames. Dan is comparing these files with those from h1fw1 and will use them if h1fw1 restarts. Dan has restarted the rsync process to complete the raw minute trend backup (prior to his reconfiguring the SATABOY raids).
Later in the week Dan will reconfigure the Sataboy raid to make a more efficient file system, whose file access times should be halved.