All of the DAQ machines are unresponsive except for h1daqtw0. Both frame writers were writing the 00:20 frame at the time of the crash.
We brought up the data concentrators' consoles on SMCIPMI. They do not report any errors, only the regular login prompt, which does not respond to the keyboard. Jonathan is in the process of rebooting the DAQ, starting with the DCs.
h1daqnds0 console shown as a reference
Opened FRS15602
While recovering the daqd system, the 10Gb port on h1daqdc0 went down after it had been up and receiving traffic. dmesg reported the following errors:

[ 698.537347] sfc 0000:02:00.0 ens2f0np0: RX DMA error (event: c0011004:00111001)
[ 698.537550] sfc 0000:02:00.0 ens2f0np0: resetting (RECOVER_OR_ALL)
[ 698.580015] sfc 0000:02:00.0 ens2f0np0: efx_ef10_rx_push_exclusive_rss_config: failed rc=-1
[ 698.580224] sfc 0000:02:00.0 ens2f0np0: MC command 0x80 inlen 164 failed rc=-22 (raw=22) arg=789
[ 698.580944] sfc 0000:02:00.0 ens2f0np0: has been disabled

I rebooted the system to bring the interface up after it failed to come up with ifup.
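For spotting this failure in future logs, here is a minimal Python sketch that parses kernel log lines of the form quoted above and lists the interfaces the driver reports as disabled. The function names (`parse_dmesg`, `disabled_interfaces`) and the regular expression are illustrative assumptions, not an existing tool on the DAQ machines. The sample text is the dmesg excerpt from this entry.

```python
import re

# Assumed line layout, taken from the excerpt in this entry:
# [ <seconds>] <driver> <pci address> <interface>: <message>
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"(?P<driver>\S+)\s+"
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])\s+"
    r"(?P<iface>\S+):\s+"
    r"(?P<msg>.*)"
)

def parse_dmesg(text):
    """Return one dict per kernel log line that matches LINE_RE."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            event = match.groupdict()
            event["ts"] = float(event["ts"])
            events.append(event)
    return events

def disabled_interfaces(events, driver="sfc"):
    """Interfaces for which the given driver logged 'has been disabled'."""
    return sorted(
        {
            e["iface"]
            for e in events
            if e["driver"] == driver and e["msg"].strip() == "has been disabled"
        }
    )

# Sample input: the five lines reported on h1daqdc0.
SAMPLE = """\
[ 698.537347] sfc 0000:02:00.0 ens2f0np0: RX DMA error (event: c0011004:00111001)
[ 698.537550] sfc 0000:02:00.0 ens2f0np0: resetting (RECOVER_OR_ALL)
[ 698.580015] sfc 0000:02:00.0 ens2f0np0: efx_ef10_rx_push_exclusive_rss_config: failed rc=-1
[ 698.580224] sfc 0000:02:00.0 ens2f0np0: MC command 0x80 inlen 164 failed rc=-22 (raw=22) arg=789
[ 698.580944] sfc 0000:02:00.0 ens2f0np0: has been disabled
"""

events = parse_dmesg(SAMPLE)
print(disabled_interfaces(events))
```

On a live system the same functions could be fed the captured output of `dmesg` instead of `SAMPLE`.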
During the recovery I brought up the h1daq*1 systems first and then moved to the h1daq*0 systems. While I was working on the h1daq*0 systems, h1daqdc1 locked up again with the same symptoms: no error messages on the console, and no response to console or network input.
All the daqd systems are up and running now.
The firmware and driver versions of the 10Gb cards on the dc machines:

h1daqdc0
  Part Number: SFN7x02F
  driver: sfc
  version: 4.1
  firmware-version: 6.2.5.1000 rx0 tx0

h1daqdc1
  Part Number: SFN7x02F
  driver: sfc
  version: 4.1
  firmware-version: 6.2.7.1000 rx0 tx0
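To make the difference between the two cards easy to see, the following Python sketch compares the reported fields and prints only those that differ. The dictionary layout and the helper name `differing_fields` are assumptions for illustration. The values are copied from the listing above. It only reports the mismatch and does not by itself say whether that mismatch is related to the lockups.

```python
# Values copied from the listing in this entry; the structure is illustrative.
CARDS = {
    "h1daqdc0": {
        "part_number": "SFN7x02F",
        "driver": "sfc",
        "version": "4.1",
        "firmware_version": "6.2.5.1000",
    },
    "h1daqdc1": {
        "part_number": "SFN7x02F",
        "driver": "sfc",
        "version": "4.1",
        "firmware_version": "6.2.7.1000",
    },
}

def differing_fields(cards):
    """Return {field: {host: value}} for fields that are not identical on all hosts."""
    fields = sorted({name for info in cards.values() for name in info})
    diffs = {}
    for name in fields:
        values = {host: info.get(name) for host, info in cards.items()}
        if len(set(values.values())) > 1:
            diffs[name] = values
    return diffs

diffs = differing_fields(CARDS)
for field, values in diffs.items():
    for host, value in sorted(values.items()):
        print(f"{field}: {host} = {value}")
```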