aLIGO LHO Logbook

H1 SYS

jameson.rollins@LIGO.ORG - posted 18:14, Thursday 26 February 2015 (16778)

guardian upgrade overview

All guardian systems have been successfully upgraded to the latest release

This is a long overdue update on the guardian upgrade performed last week. The currently installed versions are:

guardian: r1373
cdsutils: r441

A "final" bug fix patch was applied during the Tuesday 2/24 maintenance period, after which the guardian machine (h1guardian0) was rebooted. All nodes recovered without issue. There have been a couple of small issues that I'll note below.

RELEASE NOTES

OP/MODE split

The functionality of the MODE switch has been split into three independent interfaces:

OP: The operational code execution state, can be set to one of three values:
- EXEC: primary execution mode during which usercode is continually executed
- PAUSE: stop executing user code. Any currently running state method will continue until completion
- STOP: kills the worker process immediately

MODE: The execution model
- EXEC: primary graph traversal model. Node continually tries to reach REQUEST state
- MANAGED: node "STALLS" after jumps and waits for a new REQUEST in order to continue. Used during intra-node management.
- MANUAL: direct state execution, graph is ignored (see below)

LOAD: momentary switch set to 'True' to prompt a user code reload
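As a rough illustration of how the three interfaces are independent of each other, here is a minimal Python sketch. The class names (Op, Mode, NodeControl) and the take_load() helper are hypothetical and are not part of guardian; the sketch also assumes that "momentary" means the LOAD switch resets once the reload request has been picked up.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    """Operational code execution state (values taken from the list above)."""
    EXEC = "EXEC"
    PAUSE = "PAUSE"
    STOP = "STOP"


class Mode(Enum):
    """Execution model (values taken from the list above)."""
    EXEC = "EXEC"
    MANAGED = "MANAGED"
    MANUAL = "MANUAL"


@dataclass
class NodeControl:
    """Hypothetical container for the three independent settings."""
    op: Op = Op.EXEC
    mode: Mode = Mode.EXEC
    load: bool = False

    def take_load(self):
        """Report a pending reload request once, then reset the switch.

        Assumption: a momentary switch clears itself after being read.
        """
        requested = self.load
        self.load = False
        return requested
```

Changing one setting leaves the others alone: setting mode to Mode.MANAGED does not touch op, and a LOAD request is reported by take_load() only once.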

RELOAD improvements

Code reload now happens seamlessly in the background, triggered by setting the LOAD momentary switch to 'True'. The current state's code execution is no longer interrupted or restarted.

The only known limitation occurs when certain changes to the currently running state are loaded. If the node is currently in the RUN method of a state and the new code references an attribute or variable that was expected to have been set in MAIN, you will encounter an exception. You should be able to bypass this problem by re-requesting the current state, which causes the current state to be re-executed from the beginning (i.e. MAIN).
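The limitation above can be reproduced with a small stand-alone sketch. This is not guardian code: StateV1 and StateV2 are hypothetical stand-ins for the old and new versions of a state, and swapping the class of a live object is only an assumed stand-in for whatever the real background reload does.

```python
class StateV1:
    """Old state code: run() needs nothing from main()."""

    def main(self):
        pass

    def run(self):
        return True


class StateV2:
    """New state code: run() expects main() to have set self.counter."""

    def main(self):
        self.counter = 0

    def run(self):
        self.counter += 1
        return self.counter >= 3


# The node has already executed main() and is cycling in run() on old code.
state = StateV1()
state.main()
state.run()

# Stand-in for the reload: new code is swapped in under the live state,
# without main() being executed again.
state.__class__ = StateV2
try:
    state.run()
    reload_error = None
except AttributeError as exc:
    # self.counter was never set because the new main() has not run.
    reload_error = exc

# Stand-in for re-requesting the state: execution restarts from main().
state.main()
done = state.run()
```

After the swap the first run() raises AttributeError, and re-running main() clears the problem, which mirrors the re-request workaround described above.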

request any STATE in the graph

The REQUEST interface now allows for requesting any state in the system. "Requestable" states are now only used to populate the REQUEST drop-down menu on the guardian MEDM interfaces.

A new STATES MEDM screen, accessed via the "all" button next to the REQUEST drop-down or via "guardmedm --states ...", allows for selecting any state in the system.

The buttons are colored the same as states in the graph. The "targets" to the right indicate the current STATE (inner), REQUEST (middle), NOMINAL (outer).

The system handles state requests exactly the same, regardless of whether the state is requestable or not. In EXEC mode, guardian will follow the graph to the requested state, and hold there once it arrives.

NOTE: this is intended only as an aid to commissioning/debugging, so that intermediate states can be requested without having to modify the code to add/remove states from the request list. However, we should continue to pursue the same philosophy of only making "requestable" the states in which the system is intended to come to rest. The REQUEST drop-down menu is still intended to be the primary request interface. This way it will always be clear which states are the intended "final" states of the system.

MANAGER registration and overhaul

Managers now register themselves with their subordinates. The current manager is recorded in the MANAGER channel, and a new display on the main MEDM control panel displays the current manager.

NOTE: if the user manually overrides this by selecting a different mode, or if another manager steals the node, the managing node will need to be told to go back through a state where it runs set_managed() to re-acquire the subordinates.

MANUAL MODE

When in MANUAL mode, the graph is completely ignored and the REQUEST state is immediately executed, dropping whatever else the node was doing at the time.

NOTE: This mode should only be used with caution by those who understand the system they're controlling. The graph is there to purposely constrain the dynamics of the system. Ignoring these constraints can easily put the system into a bad state if you're not careful.

This mode can be accessed via the MANUAL button in the STATES MEDM screen.

"protected" states

A new "redirect = False" flag can be used on states that should never be left until they return True. This is useful for FAULT states that should not be exited until the fault clears, even if another goto state is selected (e.g. DOWN).
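A minimal sketch of the decision this flag implies, written as stand-alone Python rather than real guardian code. The Fault and Aligning classes, the fault_present attribute, and the may_redirect() helper are all hypothetical; only the "redirect = False" flag itself and the "stay until the state returns True" behavior come from the description above.

```python
class Fault:
    """Protected state: must not be left until run() returns True."""
    redirect = False

    def __init__(self):
        self.fault_present = True  # hypothetical fault condition

    def run(self):
        # Returns True only once the fault has cleared.
        return not self.fault_present


class Aligning:
    """Ordinary state: a goto request may preempt it at any time."""
    redirect = True

    def run(self):
        return False


def may_redirect(state, state_done):
    """Hypothetical check: may a pending goto request preempt this state?

    An unprotected state can always be redirected; a protected state
    can only be left after it has returned True.
    """
    return state.redirect or state_done
```

With these definitions, a goto request to e.g. DOWN is held off while Fault.run() returns False, and is honored on the first cycle after the fault clears.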

weighted edges

Edges can now have weights, which can be used to break path degeneracies. Guardian always chooses paths with the lowest total edge weight sum.
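To illustrate how a weight breaks a path degeneracy, here is a small Dijkstra-style search in plain Python. It is only a sketch of the "lowest total edge weight" rule, not guardian's actual path calculation, and the state names and weights in the example graph are made up.

```python
import heapq


def lowest_weight_path(edges, start, goal):
    """Return (total_weight, path) for the lowest-weight path, or None.

    edges is a list of (source, destination, weight) tuples.
    """
    graph = {}
    for src, dst, weight in edges:
        graph.setdefault(src, []).append((dst, weight))

    queue = [(0, start, [start])]  # (cost so far, node, path so far)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, weight in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None


# Hypothetical graph: two routes from DOWN to LOCKED with the same number
# of edges. The extra weight on the LOCKING_B route breaks the tie.
edges = [
    ("DOWN", "LOCKING_A", 1),
    ("DOWN", "LOCKING_B", 1),
    ("LOCKING_A", "LOCKED", 1),
    ("LOCKING_B", "LOCKED", 2),
]
best = lowest_weight_path(edges, "DOWN", "LOCKED")
```

Here the route through LOCKING_A has total weight 2 and the route through LOCKING_B has total weight 3, so the search picks LOCKING_A.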

execution time recorded

The execution time of user code is now recorded in the EXECTIME (current execution time) and EXECTIME_LAST (execution time of last cycle) records. Indicators of these values are now on the main control screen.

automatic code archiving

All usercode is committed to a per-node git code archive upon every restart or reload. This gives us a complete record of exactly what code was running at any given point in time.

The archive root directory is:

/ligo/cds/lho/h1/guardian/archive/

An integer representation of the archive git SHA1 commit id is recorded in the new ARCHIVE_ID channel, which is also displayed on the main and compact control screens.
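This entry does not say how the integer is derived from the commit id. One plausible scheme, shown here purely as an assumption, is to read a leading prefix of the hex SHA1 as a base-16 integer, which can then be converted back to an abbreviated commit id for lookup in the archive. The function names, the 7-digit prefix length, and the example SHA1 are all hypothetical.

```python
def sha_to_int(sha_hex, digits=7):
    """Interpret the first `digits` hex characters of a SHA1 as an integer.

    Assumption: the prefix length is illustrative, not guardian's choice.
    """
    return int(sha_hex[:digits], 16)


def int_to_sha_prefix(value, digits=7):
    """Recover the zero-padded hex prefix from the integer."""
    return format(value, "0{}x".format(digits))


# Made-up commit id, for illustration only.
example_sha = "0a1b2c3d4e5f60718293a4b5c6d7e8f901234567"
archive_id = sha_to_int(example_sha)
prefix = int_to_sha_prefix(archive_id)
```

Under this assumed scheme, the recovered prefix could be passed to git (for example "git show" run inside the node's archive directory) to find the code that was running.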

compact MEDM control screen

A new compact control screen can be accessed via e.g. "guardmedm --compact SUS_ETMX".

setpoint monitoring

A new setpoint monitoring system has been added. The ezca object now records all EPICS writes performed by the usercode. If the "ca_monitor=True" flag is set in the module, guardian checks the current value of all setpoint channels to determine if they differ from where they were set by guardian.

If any differences are detected, the SPM box on the control screen turns yellow.

Clicking the "SPM DIFFS" button opens a screen showing the current list of differences.

For filter modules, just the SWSTAT value is recorded FOR THE ENTIRE MODULE, even if not all buttons are touched by guardian. This allows guardian to cover the full state of filter modules, since the front end SDF monitoring cannot be told to watch individual states only. The SPM DIFF screen shows differences in which filters are engaged.

NOTE: this feature is still experimental, and there are likely kinks that need to be worked out. In particular, anything that "legitimately" sets values that have been touched by guardian outside of guardian, e.g. a subprocess script or a BURT restore, will cause the SPM to report differences. This is kind of unavoidable, since there's no way for guardian to know if changes that occur outside of its purview are legitimate or not.
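The basic idea, and the caveat about outside writes, can be sketched in a few lines of stand-alone Python. This is not the ezca implementation: RecordingWriter is a hypothetical class, a plain dict stands in for live EPICS channel values, and the channel names are made up.

```python
class RecordingWriter:
    """Hypothetical stand-in for a writer that remembers its setpoints."""

    def __init__(self, channels):
        self.channels = channels  # dict standing in for live channel values
        self.setpoints = {}       # last value written to each channel

    def write(self, name, value):
        """Write a value and record it as the setpoint for that channel."""
        self.channels[name] = value
        self.setpoints[name] = value

    def diffs(self):
        """Return {channel: (setpoint, current)} for channels that differ."""
        return {
            name: (setpoint, self.channels[name])
            for name, setpoint in self.setpoints.items()
            if self.channels[name] != setpoint
        }


# Made-up channels for illustration.
live = {"SUS-ETMX_GAIN": 0.0, "SUS-ETMX_OFFSET": 0.0}
writer = RecordingWriter(live)
writer.write("SUS-ETMX_GAIN", 1.5)

# Something outside the writer (e.g. a script or a restore) changes the value.
live["SUS-ETMX_GAIN"] = 2.0
current_diffs = writer.diffs()
```

Only channels that were written through the recorder are monitored, so SUS-ETMX_OFFSET never appears in the differences, while the outside change to SUS-ETMX_GAIN is reported regardless of whether it was legitimate.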

Subsystem commissioners should experiment with this feature and report any issues to me.

improved notifications

Notifications (USERMSGs) now cycle through the USERMSG display on the main control window. There's also a separate USERMSG MEDM screen where each individual message can be viewed.
