aLIGO LHO Logbook

H1 CDS
jeffrey.kissel@LIGO.ORG - posted 17:18, Thursday 14 March 2013 (5801)
Several awgtpmans running on the same frontend causes intermittent excitations
J. Batch, J. Kissel

After the plug got kicked out of the wall (see LHO aLOG 5794), we had to turn off all front-end computers and power cycle the IO chassis. Upon restart and restoration of the front-end computers, I launched a DTT session, hoping to resume transfer functions on the TMS. However, as soon as I started, I noticed that my excitation would drop out intermittently. UH OH. All lights on the GDS_TP screen showed green.

Jim to the rescue!! He immediately recognized the problem (from my verbal story only!) to be that there are too many awgtpman processes running on the front end. Like it's happened before or something.

He confirmed the problem by logging into the h1susb6 front end (on which h1sustmsy lives), and running
controls@h1susb6 ~ 0$ ps aux | grep awgtpman
This revealed duplicated invocations of awgtpman *for each model* (not just TMS).
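
For future checks, the by-eye inspection of the listing can be sketched as a small filter. This is only a sketch, not a tool that exists on the front ends: the function name find_duplicate_awgtpman is made up here, and it assumes the model name always follows the -s flag, as it does in the listings in this entry.

```sh
#!/bin/sh
# Hypothetical helper (not part of any installed tool set).
# Reads a process listing on stdin and prints every model that has more
# than one awgtpman process, followed by the number of processes found.
find_duplicate_awgtpman() {
    awk '
        {
            seen = 0
            for (i = 1; i <= NF; i++) {
                # Command field: "awgtpman" or a path ending in "/awgtpman".
                if ($i ~ /(^|\/)awgtpman$/) { seen = 1; continue }
                # Assumption: the model name is the argument of the -s flag.
                if (seen && $i == "-s" && i < NF) { count[$(i + 1)]++; break }
            }
        }
        END {
            for (m in count) if (count[m] > 1) print m, count[m]
        }
    ' | sort
}
```

Used as `ps aux | find_duplicate_awgtpman`, it prints nothing on a healthy front end. The grep line that normally pollutes the listing is ignored because it has no -s flag after the awgtpman field.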

This was quickly and easily resolved, with a 
sudo pkill awgtpman
which killed all of the awgtpman processes. Monit -- the program running on every front end which monitors various important processes, ensuring they're up and running -- then properly restarted exactly one process per model.
This was confirmed by another grep of the ps aux output,
controls@h1susb6 ~ 0$ ps aux | grep awgtpman
root     28813  0.2  3.1 279784 190428 ?       Ssl 16:04   0:04 /opt/rtcds/lho/h1/target/gds/bin/awgtpman -s h1susetmy -1 -l /opt/rtcds/lho/h1/target/gds/awgtpman_logs/h1susetmy.log
root     28823  0.3  3.1 279540 190132 ?       Ssl 16:04   0:05 /opt/rtcds/lho/h1/target/gds/bin/awgtpman -s h1sustmsy -1 -l /opt/rtcds/lho/h1/target/gds/awgtpman_logs/h1sustmsy.log
root     28835  0.2  3.1 279408 189904 ?       Ssl 16:04   0:04 /opt/rtcds/lho/h1/target/gds/bin/awgtpman -s h1iopsusb6 -4 -l /opt/rtcds/lho/h1/target/gds/awgtpman_logs/h1iopsusb6.log
controls 31462  0.0  0.0   6156   412 pts/0    S+   16:32   0:00 grep --colour=auto awgtpman
controls@h1susb6 ~ 0$

We logged into h1seib6, h1pemey, and h1susauxb6, which also showed the same symptoms, and we rectified the problem there in the same way.
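
Rather than logging into each machine in turn, the same check could be looped over a list of front ends. What follows is a sketch under assumptions: the function name and the REMOTE variable are hypothetical, and it presumes password-less ssh from the workstation to each front end.

```sh
#!/bin/sh
# REMOTE is a hypothetical hook naming the command used to reach a host.
# It defaults to ssh and can be swapped for a stub when testing.
REMOTE=${REMOTE:-ssh}

# Print the number of awgtpman processes found on each host given as an argument.
count_awgtpman_on_hosts() {
    for host in "$@"; do
        # The [a]wgtpman pattern keeps grep from counting its own command line.
        n=$($REMOTE "$host" "ps aux | grep -c '[a]wgtpman'")
        echo "$host: $n awgtpman process(es)"
    done
}
```

For example, `count_awgtpman_on_hosts h1susb6 h1seib6 h1pemey h1susauxb6`. A healthy front end should report one process per model it hosts (three on h1susb6, going by the listing above); a larger number points to duplicates.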

WHY DID THIS HAPPEN?

/etc/rc.local is a local start-up script that's run only when the computer is hard-booted/power-cycled, like this afternoon -- which rarely happens, believe it or not (typically it's just the front-end process that gets restarted). This very low-level shell script is hosted on the h1boot server, so a change to it immediately gets propagated to every front end. This script had recently been modified to invoke the front-end-process startup scripts of all models on that given front end BEFORE Monit is turned on. The problem is that both the startup scripts and Monit start awgtpman processes, but they do it in *different*, *independent* ways. Regardless of the order in which Monit or the model start scripts are invoked, two awgtpman processes would get started for each model.

The change to the /etc/rc.local script is a temporary fix. The motivation for the fix is unknown to Jim. *COUGH*. What needs to happen is a permanent change to the model start scripts, such that they call awgtpman in the same way that Monit does. Then, because Monit checks for the existence of an awgtpman started by its own method, it will not fire off a new process. This requires a change to the RCG code generator, which writes the front-end model startup scripts. Such a change should then be tested extensively on the DAQ Test Stand (or some other non-observatory location), then released to the sites as a tagged version of the RCG code, which is then installed at a well-determined time that is known not to interfere with current activities.
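
To illustrate why two independent starters collide, here is a sketch of a "start only if absent" guard. To be clear, this is NOT the fix proposed above (making the start scripts call awgtpman the way Monit does), and it is not how the RCG-generated scripts or Monit actually work. It is a simpler stopgap idea: all function names and the PS_CMD variable are hypothetical, and two starters firing at the same instant could still race past the check.

```sh
#!/bin/sh
# PS_CMD is a hypothetical hook: the command that lists running command lines.
PS_CMD=${PS_CMD:-"ps -eo args"}

# Succeed (exit status 0) if an awgtpman started with "-s <model>" is listed.
awgtpman_running() {
    $PS_CMD | awk -v model="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /(^|\/)awgtpman$/ && $(i + 1) == "-s" && $(i + 2) == model)
                    found = 1
        }
        END { exit !found }
    '
}

# Run the given start command only if the model has no awgtpman yet.
# Usage: start_awgtpman_once <model> <start command...>
start_awgtpman_once() {
    model=$1
    shift
    if awgtpman_running "$model"; then
        echo "awgtpman for $model already running; not starting another"
    else
        "$@"
    fi
}
```

Because the check keys on the process's command line rather than on who started it, it does not care whether Monit or a start script got there first. The permanent fix described above is still preferable, since it leaves Monit as the single authority on whether awgtpman is up.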