Reports until 18:50, Tuesday 29 October 2013
H1 SUS (CDS)
jeffrey.kissel@LIGO.ORG - posted 18:50, Tuesday 29 October 2013 - last comment - 20:04, Tuesday 29 October 2013(8306)
h1susb123 Computer Woes
D. Barker, J. Batch, J. Kissel, A. Pele

Arnaud has been having trouble this week taking transfer functions on ITMX. After a lot of chasing our tails, and finding a few bugs in the infrastructure work that I've been doing (see LHO aLOG 8247, and  [sorry for the lack of aLOGging, slaps own wrist, was on a deadline for Stuart]), we finally discovered that the analog "keep-alive" signal that is sent from the I/O chassis to the AI chassis for all the SUS on the h1susb123 front end (H1SUSITMX, H1SUSITMY, H1SUSBS) was failing. It had apparently failed on Sunday at 7pm PT, when I was here fixing an unrelated bug in the *library parts* for the QUAD (which means, if it *was* me, it would have affected all QUAD models, and Arnaud has successfully driven / damped H1SUSETMX since Sunday). It's now fixed, with a hard power cycle of the front end and IO chassis. (Some rather upsetting) details below.

---------

The symptoms we identified:
- The IOP Model output on the SUS OVERVIEW screen was showing zeros when we expected an output signal
- The IOP's GDS_TP screen showed the 5th bit was red -- explained on pg 18 of T1100625 to be 
     "Anti-imaging (AI) chassis watchdog (18bit DAC modules only / [on the] IOP [screen] only): For 18 bit DAC modules, the IOP [front end model] sends a 1 [Hz] heartbeat to the connected AI chassis via a [16 bit] binary output [card] on these modules. The AI chassis, in turn, returns a bit via the DAC binary input register to indicate receipt."
(Don't worry, even with my [edits], it still doesn't even make sense, even to *me*.)
- The LED near the input of the AI chassis was OFF (not red, just off)
- The switch that flips the DAC duotone signal onto the 31st channel of ADC0 in the IO chassis, which is controlled by the same 16 bit BIO card, was malfunctioning: when watching the 31st channel of ADC0 in dataviewer, we saw the signal flip from noise to *zero* instead of from noise to the typical pretty duotone sinewaves.
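
A status word like the one behind the GDS_TP bits can be checked from a script with a plain shift-and-mask. A minimal sketch, with the assumptions flagged: `bit_is_set` is a made-up helper name, the example word `16` is invented, and whether "the 5th bit" means bit 4 (counting from 0) or bit 5 needs checking against T1100625. How the word is read out, and whether a set bit means healthy or faulted, is not covered here.

```shell
#!/bin/sh
# bit_is_set WORD BIT
# Exit status 0 if bit number BIT of the integer WORD is set, 1 otherwise.
# BIT counts from 0 at the least-significant end.
bit_is_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

# Hypothetical usage: 16 is binary 10000, so only bit 4 is set.
if bit_is_set 16 4; then
    echo "bit 4 set"
else
    echo "bit 4 clear"
fi
```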

Welp. Looks like we need to add yet ANOTHER watchdog layer to the overview screen. 

Dave's conjecture is that somehow this 16 bit BIO card got into a bad state on Sunday. It's unclear how, though, since I was just stopping and restarting the user models, and was not playing with the IOP, nor was I turning on and off the front end or the IO chassis.

Anyways. How do we solve any problem with computers? Power down and power back up. *sigh*.

There's now a reasonably successful procedure for gracefully bringing a front end / IO chassis up and down, without affecting other front ends. Here's what I got from picking Dave's brain, and watching over his shoulder:
(1) Kill the user model processes running on that front end.
     ]$ ssh h1susb123
     ]$ killh1susitmy
     ]$ killh1susitmx
     ]$ killh1susbs
(2) Kill the IOP process running on the front end
     ]$ killh1iopsusb123
(3) Remove the front end from the IPC / Dolphin network, so you don't crash every other front end. Note, you should only do this step if you're powering down the front end and IO chassis. It's not necessary when just stopping and starting front-end processes.
     ]$ sudo -s
     ]$ /opt/DIS/sbin/dxtool prepare-shutdown 0
(4) Turn off the front-end gracefully* (still as super user)
     ]$ poweroff
* This didn't work for us. The front end powered down, and then immediately began rebooting itself and bringing all the models back up. So, we had to 
   - wait for it to finish rebooting and bringing up the models
   - kill the front end processes again
   - Go into the Mass Storage Room (MSR) and hold the power button until it powered down.
(5) Turn off the IO chassis by going to the CDS Highbay, and flicking the rocker switch on the front of the chassis.**
** This doesn't work FOR ANY IO CHASSIS. Jim informs me that the rocker switch is wired to the wrong pins on the motherboard. For every IO chassis. Yeah. So, one has to disconnect the DC power in the back of the rack by unscrewing properly secured cables, risking powering down the chassis unevenly. Similarly on power up. TOTAL BADNESS. Apparently at LLO, they've installed lamp-style rocker switches right on the cable to work around this problem. (a) Why don't we have this already at LHO? (b) Was this an accepted, global, CDS fix? (c) Why can't we just re-wire the IO chassis?
(6) Turn on the IO chassis via the same rocker switch in the front (assuming you've reconnected the DC power and, like I did, flipped the rocker to the off position beforehand, expecting it to work).
(7) Use monit (the remote controller of front ends' power that I still know too little about) to gracefully turn on the front end.
Upon power up, the front end is gracefully inserted back into the Dolphin network, the IOP front end process is restarted, and then the user front end processes***.
*** Because I've been making a bunch of changes to the EPICS variables in these models, and haven't yet had the chance to update the safe.snaps, the start-up process takes much longer to restore the snap (trying to reconcile the differences, I presume). That means the $(IFO):FEC-$(DCUID)_BURT_RESTORE flag doesn't get set before the process looks for its timing synchronization signal, and it just hangs there red and dead claiming no sync. You then have to hit the button (when the EPICS gateway catches up, some time later) and restart the front-end process (which captures that this bit is now set); then it happily picks up the IOP timing sync and springs back to life.
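
The *** workaround boils down to: wait until the BURT_RESTORE flag reads as set, then restart the model. A minimal sketch of that wait loop, under assumptions that are not from this entry: that the flag reads back as `1` once set, and that something like the EPICS `caget -t` client is how it would be read. `wait_for_flag` is a made-up helper name, and the restart command is left as a placeholder since none is named above.

```shell
#!/bin/sh
# wait_for_flag READ_CMD TRIES DELAY
# Runs READ_CMD (any command that prints the flag value) up to TRIES times,
# sleeping DELAY seconds after each miss. Returns 0 as soon as READ_CMD
# prints exactly "1", or 1 if it never does.
wait_for_flag() {
    read_cmd=$1
    tries=$2
    delay=$3
    i=0
    while [ "$i" -lt "$tries" ]; do
        # $read_cmd is left unquoted on purpose so "cmd arg arg" splits
        if [ "$($read_cmd 2>/dev/null)" = "1" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical usage (DCUID must be filled in for the model in question):
#   wait_for_flag "caget -t H1:FEC-${DCUID}_BURT_RESTORE" 60 5 \
#       && echo "flag set, now restart the front-end process"
```

Passing the read command in as an argument keeps the loop independent of how the flag is actually fetched, so it can be tried out with a stub before pointing it at a live channel.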


That's the process. Don't you feel better? 
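
Steps (1) through (4) are the part that can be scripted. Below is a dry-run sketch that only prints the commands in order, so nothing gets killed or powered off by accident. `print_shutdown_plan` is a made-up name; the spellings `prepare-shutdown` and `poweroff` are assumed, and `sudo` is prefixed per command in place of the `sudo -s` shell used in the steps.

```shell
#!/bin/sh
# print_shutdown_plan IOP_MODEL USER_MODEL...
# Dry run only: prints the commands for steps (1)-(4), executes nothing.
print_shutdown_plan() {
    iop=$1
    shift
    # (1) user models first
    for model in "$@"; do
        echo "kill${model}"
    done
    # (2) then the IOP process
    echo "kill${iop}"
    # (3) leave the Dolphin network (only when powering down the hardware)
    echo "sudo /opt/DIS/sbin/dxtool prepare-shutdown 0"
    # (4) power off the front end
    echo "sudo poweroff"
}

# The h1susb123 case from this entry:
print_shutdown_plan h1iopsusb123 h1susitmy h1susitmx h1susbs
```

The printed list is meant to be read and then typed (or pasted) on the front end itself, after the `ssh h1susb123` in step (1).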
Comments related to this report
keith.thorne@LIGO.ORG - 20:04, Tuesday 29 October 2013 (8309)
If the front-end is not locked up, you can simply shut it down with:
sudo shutdown -hP now
   (This command will shut down the Dolphin client as well)

If you want to shut down all models on a front-end:
sudo /etc/kill_models.sh

We have a lot of power outages at LLO, hence the invention of in-line power kill switches as it is a long way to the DC power room.  David K. may have them already fabbed - we will check.