To see whether reducing h1fw1's disk load would make it more stable, Thursday at 11:30PDT we changed h1fw1's daqdrc file so that it would stop writing science frames on the next daqd restart (the edit is sketched after the timeline). Despite having restarted itself six times on Thursday morning, h1fw1 then stayed stable from 09:53 until 23:15, at which point it restarted and stopped writing science frames. What happened next was interesting; here is the timeline:
Thu 23:15PDT h1fw1 writes last science frame
Thu 23:20PDT h1fw1 restarts daqd
Thu 23:30PDT h1nds1 restarts
Thu 23:39PDT h1nds0 stops working, but its process still exists, so monit does not restart it
Fri 05:02PDT h1fw1 restarts; the test has failed at this point
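For reference, the daqdrc edit was of this form. This is a minimal sketch, assuming a daqdrc of the usual frame-writer form; the exact directive names and file path vary between daqd versions:

    # daqdrc on h1fw1 (path and surrounding lines are illustrative)
    start frame-saver;            # full frames: left enabled
    sync frame-saver;
    #start science-frame-saver;   # commented out: no more science frames
    #sync science-frame-saver;    # takes effect on the next daqd restart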
The interesting points are: h1nds1 restarted once, 10 minutes after h1fw1 came back with the new configuration. That is perhaps not surprising, because it uses h1fw1's frames. At 23:39PDT h1nds0 stopped serving data. This is totally surprising: there is no link between h1nds0 and h1fw1 that we know of.
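One lesson from the h1nds0 hang (see the 23:39PDT timeline entry): monit was only checking that the process exists, so a hung-but-alive server never triggers a restart. A sketch of a monit stanza that would also catch the frozen state by testing the NDS TCP port; the service name, pidfile, init script and port (8088 is the conventional NDS1 port) are assumptions, not our actual monitrc:

    check process daqd with pidfile /var/run/daqd.pid
        start program = "/etc/init.d/daqd start"
        stop program  = "/etc/init.d/daqd stop"
        # existence alone misses a hung process; also require the NDS port
        # to answer, and restart after three consecutive failed cycles
        if failed host localhost port 8088 type tcp for 3 cycles then restart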
Since Guardian is the sole NDS client for h1nds0, several Guardian nodes reported NDS problems while h1nds0 was in its frozen state. DIAG_MAIN, for example, reported NDS failures from Thu 23:40PDT through Fri 10:54PDT.
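While chasing this it was useful to distinguish "server process dead" from "server up but not serving". A client-side probe that actually fetches data does that; a sketch using the nds2-client Python bindings, where the host, port, channel name and GPS window are all placeholders, not an excerpt from Guardian:

    #!/usr/bin/env python
    # Probe an NDS server by fetching data, not just connecting.
    import sys
    import nds2

    HOST = 'h1nds0'                    # server to test
    CHANNEL = 'H1:DAQ-DC0_GPS'         # any channel the server carries
    GPS_START, GPS_STOP = 1100000000, 1100000001   # placeholder 1 s window

    try:
        conn = nds2.connection(HOST, 8088)         # raises if unreachable
        bufs = conn.fetch(GPS_START, GPS_STOP, [CHANNEL])
        print('%s OK: %d samples' % (HOST, len(bufs[0].data)))
    except RuntimeError as exc:
        # nds2-client raises RuntimeError for refused or failed requests
        print('%s NOT SERVING: %s' % (HOST, exc))
        sys.exit(1)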
Trouble getting h1fw1 to write science frames again.
The 05:02PDT restart of h1fw1 meant the test had failed, so I reverted the daqdrc file to once again write science frames. In light of the h1nds0 issues from last night, I decided to restart h1fw1 manually. Unfortunately h1fw1 became very unstable, sometimes restarting before a single frame could be written. Here is the escalation sequence I went through (a rough command-level sketch follows the list):
wait for monit to restart daqd several times before intervening
manually restart daqd
stop daqd and reboot h1fw1
stop daqd and power cycle h1fw1
finally, the nuclear option: power down h1fw1, power cycle h1ldasgw1, then power up h1fw1
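In command terms the ladder was roughly as follows. This is a sketch, not a transcript: the monit service name, BMC hostnames and IPMI credentials are all placeholders, and the Solaris box may want ILOM rather than plain IPMI:

    # step 2: manual restart of the monit-managed daqd, run on h1fw1
    monit restart daqd

    # step 3: stop daqd, then a clean reboot of the host
    monit stop daqd
    sudo reboot

    # step 4: stop daqd, then a hard power cycle out-of-band
    ipmitool -H h1fw1-bmc -U admin chassis power cycle

    # step 5, the nuclear option: writer down, file server cycled, writer up
    ipmitool -H h1fw1-bmc -U admin chassis power off
    ipmitool -H h1ldasgw1-bmc -U admin chassis power cycle
    ipmitool -H h1fw1-bmc -U admin chassis power on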
At the time of writing, the last restart seems to have made h1fw1 stable; it has been running for 30 minutes.
In the past we have noticed that power cycling the Solaris QFS/NFS server helps in situations like this.
h1fw1 is stable again; presumably the power cycle of the Solaris QFS server h1ldasgw1 was the fix. It has been running for 18+ hours.