After being super reliable for all of O2, h1fw0 daqd stopped running at 15:42 PDT. The error log suggests it was unable to write to its QFS disk system, and its internal buffers filled up. Monit restarted daqd, and it has been running for 20 minutes so far, but second-trend frame file writing looks too slow. Error log excerpt:
[Fri May 26 15:41:55 2017] main profiler warning: 1 empty blocks in the buffer
[Fri May 26 15:41:56 2017] main profiler warning: 0 empty blocks in the buffer
[Fri May 26 15:41:57 2017] main profiler warning: 0 empty blocks in the buffer
[Fri May 26 15:41:58 2017] main profiler warning: 0 empty blocks in the buffer
[Fri May 26 15:41:59 2017] main profiler warning: 0 empty blocks in the buffer
....
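For future occurrences, the point at which the buffer ran out can be pulled from the log automatically. The following is only a minimal sketch, assuming the warning lines always have the format shown in the excerpt above. The names parse_warnings and first_exhaustion are hypothetical helpers for illustration; they are not part of daqd or any existing tool.

```python
import re
from datetime import datetime

# Assumed line format, taken from the excerpt above:
# [Fri May 26 15:41:55 2017] main profiler warning: 1 empty blocks in the buffer
WARNING_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] main profiler warning: "
    r"(?P<empty>\d+) empty blocks in the buffer"
)

def parse_warnings(lines):
    """Return (timestamp, empty_block_count) for each matching warning line."""
    found = []
    for line in lines:
        match = WARNING_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%a %b %d %H:%M:%S %Y")
            found.append((ts, int(match.group("empty"))))
    return found

def first_exhaustion(warnings):
    """Return the timestamp of the first warning reporting 0 empty blocks, or None."""
    for ts, empty in warnings:
        if empty == 0:
            return ts
    return None

# Sample input copied from the log excerpt above.
SAMPLE = [
    "[Fri May 26 15:41:55 2017] main profiler warning: 1 empty blocks in the buffer",
    "[Fri May 26 15:41:56 2017] main profiler warning: 0 empty blocks in the buffer",
    "[Fri May 26 15:41:57 2017] main profiler warning: 0 empty blocks in the buffer",
]

if __name__ == "__main__":
    print(first_exhaustion(parse_warnings(SAMPLE)))
```

Lines that do not match the assumed format (for example the "...." elision) are skipped, so the same functions can be fed a whole log file read line by line.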
h1fw0 daqd crashed again at 16:13. This time it was not so clearly a file issue; daqd went into a retransmission storm.
Dan says nothing is changing on the LDAS system. In the bad old days when the DAQ was unstable, power cycling the Solaris QFS/NFS machine sometimes helped. So I power cycled h1ldasgw0 and got h1fw0 writing again; it recovered at 16:36 and has been running for 900 seconds so far.
For the record, the procedure followed for power cycling h1ldasgw0 was:
(root on h1fw0) stop monit
(user on workstation) kill daqd on h1fw0 via telnet
(root on h1fw0) umount /ldas-h1-frames
(root on h1ldasgw0) power off
Power h1ldasgw0 back up with front panel power button
(root on h1ldasgw0) manually mount the QFS file system, manually export it via NFS
(root on h1fw0) manually mount /ldas-h1-frames, start monit. At this point monit starts daqd.
When Dan goes online again, he says he will check the QFS and SATABOY logs.