Jonathan, Dave:
Around 21:30 PDT last night (Sunday) h1boot froze up, presumably with the 208.5+day bug, although the console message looks different. Every front end model eventually stopped processing. The front ends failed at various times, so some front ends showed IPC errors while others froze "green".
We reset h1boot via the front panel reset button. An fsck file system check was forced since the system had been running in excess of 232 days (boot servers can run far in excess of 208 days before having a problem).
All frozen models came back with no restarts needed.
We are back to the weekend status: the h1seib3 and h1susex computers are down due to the 208.5+day bug.
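For context on where the 208.5-day figure comes from: it matches the widely reported Linux kernel sched_clock() overflow, where the TSC-to-nanoseconds conversion wraps after roughly 2**54 ns. This is my assumption about the mechanism (the log above only presumes the bug, and the console message differed); a quick sanity check of the arithmetic:

```python
# Sketch, assuming the "208.5+day bug" is the Linux sched_clock() wrap
# at ~2**54 nanoseconds; this is an inference, not confirmed from the
# h1boot console output.

NS_PER_DAY = 24 * 60 * 60 * 10**9  # nanoseconds in one day

wrap_days = 2**54 / NS_PER_DAY
print(f"2**54 ns = {wrap_days:.1f} days")  # -> 208.5 days
```

This is consistent with h1boot surviving past 232 days: the boot server's workload apparently tolerates the wrap longer than the front end computers do.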
CDS front end model status:
h1suspr2: awgtpman process has failed
h1susomc: DAC-KILL active (user watchdog?)
h1iopseiex: SWWD trip due to loss of h1iopsusex, DAC-KILLs active
h1susex: h1iopsusex, h1susetmx, h1sustmsx, h1susetmxpi: all down due to cpu lockup
h1seib3: h1iopseib3, h1isiitmx, h1hpiitmx: all down due to cpu lockup
h1pemmy: h1ioppemmy, h1pemmy: computer powered down for overtemp protection due to CP4 bake-out
I've restarted the awgtpman process for h1suspr2.
I have verified that all the front end models continued to run overnight while h1boot was down, and their DAQ data was good. Only EPICS channel access to the front end IOCs was unavailable. In the attached dataviewer minute trend plot for the past 24 hours, two LSC channels are shown: one comes directly from the front end over MX, the other is acquired via EPICS Channel Access using the EDCU. The EDCU signal is zero during the down time; the direct signal is unaffected.