Yesterday fw0 was very unstable for a few hours. We noticed that during this period file access on the QFS server (h1ldasgw0) was normal, but even simple file operations from h1fw0, such as a long listing (ls -al), could take many seconds. Rebooting and power cycling both machines did not fix this. We then configured h1fw0 to write only the science frame, which did make it more stable. In this configuration the CRC EPICS channels for fw0 and fw1 differed, but the two science frames written were byte identical. It looks like the CRC is the checksum of the data block the frame writer needs: if it is not writing the commissioning frame, then it is the CRC of the science data only. We also did a full DAQ restart (17:37 PDT) to resync everything. Because we felt that having the CRCs match is an important diagnostic, h1fw0 was then reverted to write both frames again, and this time it was stable.
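For reference, a byte-identical check between two frame files, and the reason the CRCs can differ while the science frames match, can be sketched in Python as below. This is only an illustration: the function names are made up here, the file paths are whatever the caller supplies, and zlib's CRC-32 is an assumed stand-in for whatever checksum the DAQ actually computes.

```python
import hashlib
import zlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def frames_identical(path_a, path_b):
    """True if the two files have identical contents (compared by digest)."""
    return file_digest(path_a) == file_digest(path_b)


def block_crc(data):
    """CRC-32 of a data block (assumed stand-in for the DAQ's CRC)."""
    return zlib.crc32(data) & 0xFFFFFFFF


if __name__ == "__main__":
    # Toy data: one writer needs only the science channels, the other
    # needs the science channels plus the commissioning channels.
    science = b"science-channel-data"
    full = science + b"commissioning-channel-data"
    print("CRC, science block only:", hex(block_crc(science)))
    print("CRC, full block:        ", hex(block_crc(full)))
```

The CRC is taken over the whole block a writer needs, so two writers given different blocks report different CRCs even when the science subset they both write is the same.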
h1fw0 has been stable since the 17:37 PDT DAQ restart (we don't know why). h1fw1 has restarted 3 times since then (01:04, 01:32, 07:03 PDT Friday 7/1). I think this level of fw1 instability is acceptable if fw0 continues to be 100% stable.
Our investigation supports Keith's suggestion that upgrading the NFS link between the frame writer and the QFS-NFS server from 1GE to 10GE would be a good test. Dan has some fiber-optic 10GE cards we could borrow. We will schedule this test after ER9, or sooner if the instability reappears.