On Friday 16th May 2025, several cameras whose processes run on one of the new AMD EPYC 9124 servers stopped working. Their MEDMs went "white-box" and their camera images went "blue-screen". The cameras and their last good data points are:
FC2 04:02
FC_TRANS_B 09:00
ETMX 10:03
ETMY 11:01
Note that ETMX (h1cam25) was not pingable; all the others were.
Yesterday I went to EX and EY to investigate. I power cycled both end station cameras by disconnecting their Ethernet cables from the PoE injector and reconnecting them. After I restarted their processes, they came back online.
On the way back I did the same for FC_TRANS_B from the CDS mezzanine in the FCES.
This morning I found that FC2 had also stopped (more about that later) and got it running again by just restarting its process; no camera power cycle was needed.
Looking at the logs, we found errors of the type
"WaitObject duplicate failed (0): Too many open files. Reached open files limit"
around these times. Debian has a default soft limit of 1024 open file descriptors per process, and counting the entries in /proc/<pid>/fd/ for h1digivideo5 showed that FC2 had 1024 open, which is how we found that this camera had stopped working at 04:02 on Friday.
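For reference, the per-process descriptor check described above can be sketched in Python. This is only an illustration, not the procedure actually used here: the function names and the 90% threshold are hypothetical, and it assumes a Linux /proc layout in which each /proc/<pid>/fd/ entry is one open descriptor and /proc/<pid>/limits carries a "Max open files" line. Reading another user's /proc/<pid>/fd usually requires root.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count entries in <proc_root>/<pid>/fd; None if it cannot be read."""
    try:
        return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
    except OSError:
        # Process exited, or we lack permission to inspect it.
        return None


def soft_nofile_limit(pid, proc_root="/proc"):
    """Read the soft 'Max open files' limit from <proc_root>/<pid>/limits."""
    try:
        with open(os.path.join(proc_root, str(pid), "limits")) as f:
            for line in f:
                if line.startswith("Max open files"):
                    # Fields: Max open files <soft> <hard> files
                    return int(line.split()[3])
    except (OSError, ValueError):
        # Unreadable file, or a non-numeric limit such as "unlimited".
        pass
    return None


def processes_near_limit(threshold=0.9, proc_root="/proc"):
    """List (pid, open_fds, soft_limit) for processes at or above
    threshold * soft_limit, sorted by open descriptor count (descending)."""
    results = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        n_open = count_open_fds(name, proc_root)
        limit = soft_nofile_limit(name, proc_root)
        if n_open is None or not limit:
            continue
        if n_open >= threshold * limit:
            results.append((int(name), n_open, limit))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Calling processes_near_limit() on the camera server would list any process close to its limit; a process with 1024 open descriptors against a soft limit of 1024 would appear at the top. Run periodically, a check like this could flag a descriptor leak before the camera process stops.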
Two immediate action items: