aLIGO LHO Logbook

H1 SEI (CDS)

jeffrey.kissel@LIGO.ORG - posted 11:29, Monday 11 August 2014 (13329)

More loose wires in the software -- h1seih23 this time...

H. Radkins, J. Kissel

Similar to what we've seen before on front-end computers for HAM2 -- both SEI and SUS -- (see LHO aLOGs 11481, 10849, 10375, 8964, 8424, 7385), we found the h1seih23 computer inexplicably unable to drive past its IOP model this morning. As with each of the other 6 times, we had to kill the user front-end processes, restart the IOP process, restart the user processes, and reset all watchdogs to regain actuation. A good guess at the source this time would be the problems with h1sush2b this weekend, but a trend of the CDS State Word DACKILL bit shows the indicative status changed at Aug 9 2014 22:30 UTC, 15:30 PDT Saturday afternoon (after Dave finished his bootfest with h1sush2b). So ... still no smoking gun for this problem, nor any better solution than the time-honored, sledge-hammer fix: restart all models on the computer.
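As a quick cross-check of the two quoted times, here is a minimal sketch of the UTC-to-local conversion. It assumes a fixed UTC-7 offset for Pacific Daylight Time (valid in August); the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Pacific Daylight Time as a fixed UTC-7 offset (valid for August).
PDT = timezone(timedelta(hours=-7), "PDT")

# Time at which the DACKILL bit trend changed, as quoted in this entry.
utc_time = datetime(2014, 8, 9, 22, 30, tzinfo=timezone.utc)
local_time = utc_time.astimezone(PDT)

print(local_time.strftime("%A %H:%M %Z"))  # Saturday 15:30 PDT
```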

We STILL need a better indication of this problem. CDS admins often ignore the DK bit in the CDS state word, because they assume it's the user or IOP DACKILLs that have tripped after their reboots (as expected). SEI / SUS commissioners -- though the CDS state word is now on everyone's overview screen -- lose track of which bits mean what; the only non-invasive power they have is to reset all of their watchdogs; and there are often things like IPC errors (the bit right next to DK) that don't affect the performance of the platform, so the word cries wolf often. Presumably, to better indicate the problem, we have to identify the problem to begin with...

Details:
-------------------
- Tried turning on the ISI and saw the actuators trip. A plot of the trip shows a clear, unstable ramp-up of the output -- but no sign of movement from the sensors.

- Found no DAC outputs reported by the IOP, even though the user model has requested them.

- "DK" bit shows red, even though all MEDM USER and IOP DACKILL buttons have been reset. The bit went red at Aug 9 2014 22:30 UTC, 15:30 Saturday afternoon (after Dave finished his bootfest with h1sush2b).

- Checked h1seih23 dmesg,
controls@opsws4$ ssh h1seih23
controls@h1seih23$ dmesg
this reports some node errors, but unfortunately the time stamp is meaningless. In the "recent" history, it reports things like
"Session for node 52 is disabled - Status = 0x5"
"Heartbeat alive-check for node=52 failed (cnt=6614 state=0x1 deb=0 val=0)."
"Session for node 20 is disabled - Status = 0x5"
"Session for node 20 is disabled - Status = 0xf"
though, according to T1400026, no front-end process is assigned to DCUID 52, and h1ascimc.mdl is using DCUID 20, which should be totally unrelated.

- Checked /proc/h1iopseih23/status file,
controls@opsws4$ ssh h1seih23
controls@h1seih23$ vi /proc/h1iopseih23/status
which reports that the DAC FIFO status for both DACs is OK:
DAC #0 16-bit fifo_status=2 (OK)
DAC #1 16-bit fifo_status=2 (OK)

- Every one of the previous aLOGs of this problem says "kill user models, restart iop process, restart user models."

- Killed all user models, restarted the IOP model, restarted the user models, and cleared all watchdogs; the problem cleared up.
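For future occurrences, the dmesg and /proc status checks above could be summarized with a short script. The following is only a sketch: the sample text is copied from the lines quoted in this entry, the function names are hypothetical, and the regular expressions assume the message formats stay exactly as shown here. A real check would read the live dmesg output and /proc/h1iopseih23/status on the front end instead of the embedded samples.

```python
import re
from collections import Counter

# Sample lines copied from this entry (not read from a live machine).
DMESG_SAMPLE = """\
Session for node 52 is disabled - Status = 0x5
Heartbeat alive-check for node=52 failed (cnt=6614 state=0x1 deb=0 val=0).
Session for node 20 is disabled - Status = 0x5
Session for node 20 is disabled - Status = 0xf
"""

STATUS_SAMPLE = """\
DAC #0 16-bit fifo_status=2 (OK)
DAC #1 16-bit fifo_status=2 (OK)
"""

# Assumption: node IDs appear as "node 52" or "node=52", as in the quoted lines.
NODE_RE = re.compile(r"\bnode[ =](\d+)")
# Assumption: each DAC line ends with a parenthesized label such as "(OK)".
FIFO_RE = re.compile(r"DAC #(\d+) .*fifo_status=(\d+) \((\w+)\)")


def count_node_messages(text):
    """Count how many kernel messages mention each node ID."""
    counts = Counter()
    for line in text.splitlines():
        match = NODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


def dac_fifo_labels(text):
    """Map each DAC number to the status label reported on its line."""
    labels = {}
    for line in text.splitlines():
        match = FIFO_RE.search(line)
        if match:
            labels[int(match.group(1))] = match.group(3)
    return labels


if __name__ == "__main__":
    print(count_node_messages(DMESG_SAMPLE))
    print(dac_fifo_labels(STATUS_SAMPLE))
```

The node IDs returned by the first function can then be compared by hand against the DCUID assignments in T1400026, as was done above for nodes 52 and 20.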