Camilla, Erik, Dave:
h1hwsmsr (HWS ITMX and /data RAID) computer froze at 22:14 Thu 04 Jan 2024 PST. The EDC disconnect count went to 88 at this time.
Erik and Camilla have just viewed h1hwsmsr's console, which indicated a HWS driver issue at the time of the freeze. They rebooted the computer to get the /data RAID NFS shared to h1hwsex and h1hwsmsr1 again. The ITMX HWS code is currently not running; we will start it during this afternoon's commissioning break.
One theory for the recent instabilities is that they are caused by the camera_control code I started just before the break to ensure the HWS cameras are inactive (in external trigger mode) when H1 is locked. Every minute the camera_control code gets the status of the camera, which along with the status of H1 lets it decide if the camera needs to be turned ON or OFF. Perhaps, with the main HWS code getting frames from the camera and the control code getting the camera status, the two sets of calls can collide.
To test this, we will turn the camera_control code off at noon. I will rework the code to reduce the number of camera operations to the bare minimum.
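One possible way to cut down the camera operations is to touch the camera only when the H1 lock state changes, instead of querying it every minute. The sketch below is illustrative only and is not the actual camera_control code: the class name, the `set_camera_on` callback and the `h1_locked` flag are all hypothetical stand-ins for whatever the real code uses.

```python
from typing import Callable, Optional


class CameraController:
    """Command the camera only on H1 lock-state transitions (hypothetical sketch)."""

    def __init__(self, set_camera_on: Callable[[bool], None]) -> None:
        # set_camera_on is an assumed callback: True = camera ON,
        # False = camera inactive (external trigger mode).
        self._set_camera_on = set_camera_on
        self._last_locked: Optional[bool] = None  # lock state unknown at startup
        self.camera_calls = 0  # number of times the camera was actually commanded

    def update(self, h1_locked: bool) -> bool:
        """Process one poll of the H1 state; return True if the camera was commanded."""
        if h1_locked == self._last_locked:
            return False  # no transition, so no camera operation at all
        # The camera should be inactive while H1 is locked, and ON otherwise.
        self._set_camera_on(not h1_locked)
        self._last_locked = h1_locked
        self.camera_calls += 1
        return True
```

With this approach the per-minute poll only reads the H1 state; the camera itself sees one call at startup and one call per lock or lock-loss transition.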
At ~20:00 UTC we left the HWS code running (restarted ITMX) but stopped Dave's camera control code 74951 on ITMX, ITMY, ETMY, leaving the cameras off. They'll be left off over the weekend until Tuesday. ETMX is still down from yesterday 75176.
If the computers remain up over the weekend we'll look at incorporating the camera control into the hws code to avoid crashes.
Erik swapped h1hwsex to a new v1 machine. We restarted the HWS code and turned the camera to external trigger mode so it too should remain off over the weekend.
I've commented the HWS test out of DIAG_MAIN entirely (only ITMY was being checked), since no HWS cameras are capturing data. Tagging OpsInfo.
Trace from h1hwsmsr crash attached.
All 4 computers remained up and running over the weekend, with the camera on/off code paused. We'll look into either making Dave's code smarter or incorporating the camera on/off switching into the hws-server code so that we don't send multiple calls to the camera at the same time. Simultaneous calls are our leading theory as to why these HWS computers have been crashing.
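If the on/off switching does move into the hws-server code, one way to guarantee that only one call reaches the camera at a time is to route every camera operation through a single lock. This is a minimal sketch under that assumption, not the hws-server implementation; `SerializedCamera` and the operations passed to it are hypothetical names.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class SerializedCamera:
    """Run camera operations one at a time (hypothetical sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def call(self, operation: Callable[[], T]) -> T:
        """Run one camera operation while holding the lock and return its result."""
        # Frame grabs and status/trigger-mode changes both go through here,
        # so they can never overlap within this process.
        with self._lock:
            return operation()
```

Note that a threading.Lock only serializes calls made within one process. It would not protect against a separate camera control process talking to the camera while the HWS code is grabbing frames, which is one argument for folding the switching into the HWS code itself.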