aLIGO LHO Logbook

H1 CDS (CDS)

craig.cahillane@LIGO.ORG - posted 16:46, Wednesday 22 August 2018 - last comment - 19:49, Wednesday 22 August 2018(43599)

Guardian Redirect Bug

Sheila, Craig, Jaime

We've been having issues where our ASC output matrix is not being set.  This happens in ISC_DRMI guardian state PREP_DRMI_ASC.
I found a lock from last night where this occurred and caused a lockloss.  Plotting around this time showed that nothing was changing in either the ASC intrix or outrix: PREP_DRMI_ASC was not working.

We dove in the guardlogs for ISC_LOCK and ISC_DRMI.  From them, we discovered that the ISC_LOCK guardian was accidentally requesting PREP_DRMI_ASC from the ISC_DRMI guardian twice, first in ACQUIRE_DRMI_1F and second in the next state, DRMI_LOCKED_PREP_ASC.  When we make this second request, it triggers a redirect from PREP_DRMI_ASC to PREP_DRMI_ASC.
Jaime told us that a redirect will kill the original worker after a second, and spawn a new worker at the top of the state.  So in principle, PREP_DRMI_ASC should have still been run.  However, we were not observing this.  In the middle of the redirect, a new request was sent for DRMI_1F_LOCKED_W_ASC to the ISC_DRMI guardian.  We believe this may have forced the new worker to be spawned thinking it had finished PREP_DRMI_ASC.

In any case, this is clear abuse of guardian by us the users.  We removed the second request for PREP_DRMI_ASC from DRMI_LOCKED_PREP_ASC so no redirect will be triggered, and we hope this will solve the issue.


ISC_DRMI guardlog

2018-08-22_06:52:09.460465Z ISC_DRMI EDGE: DRMI_WFS_CENTERING->PREP_DRMI_ASC
2018-08-22_06:52:09.461225Z ISC_DRMI calculating path: PREP_DRMI_ASC->PREP_DRMI_ASC
2018-08-22_06:52:09.480981Z ISC_DRMI executing state: PREP_DRMI_ASC (78)
2018-08-22_06:52:09.480981Z ISC_DRMI [PREP_DRMI_ASC.enter]
2018-08-22_06:52:09.779847Z ISC_DRMI REQUEST: PREP_DRMI_ASC
2018-08-22_06:52:09.780606Z ISC_DRMI calculating path: PREP_DRMI_ASC->PREP_DRMI_ASC           # REDIRECT begun here
2018-08-22_06:52:09.781037Z ISC_DRMI same state request redirect
2018-08-22_06:52:09.781037Z ISC_DRMI REDIRECT requested, timeout in 1.000 seconds
2018-08-22_06:52:09.781796Z ISC_DRMI REDIRECT wait for worker completion...
...
2018-08-22_06:52:10.097441Z ISC_DRMI REDIRECT wait for worker completion...
2018-08-22_06:52:10.146787Z ISC_DRMI REQUEST: DRMI_1F_LOCKED_W_ASC                            # NEW REQUEST for DRMI_1F_LOCKED_W_ASC begun here
2018-08-22_06:52:10.147456Z ISC_DRMI calculating path: PREP_DRMI_ASC->DRMI_1F_LOCKED_W_ASC
2018-08-22_06:52:10.147456Z ISC_DRMI new target: ENGAGE_DRMI_ASC
2018-08-22_06:52:10.148189Z ISC_DRMI REDIRECT wait for worker completion...
2018-08-22_06:52:10.197836Z ISC_DRMI REQUEST: DRMI_1F_LOCKED_W_ASC
...
2018-08-22_06:52:10.783208Z ISC_DRMI REDIRECT wait for worker completion...
2018-08-22_06:52:10.783208Z ISC_DRMI REDIRECT timeout reached. worker terminate and reset...
2018-08-22_06:52:10.800447Z ISC_DRMI worker terminated
2018-08-22_06:52:10.815603Z ISC_DRMI W: initialized
2018-08-22_06:52:10.873188Z ISC_DRMI W: EZCA v1.2.0
2018-08-22_06:52:10.873685Z ISC_DRMI W: EZCA CA prefix: H1:
2018-08-22_06:52:10.873685Z ISC_DRMI W: ready
2018-08-22_06:52:10.874327Z ISC_DRMI worker ready
2018-08-22_06:52:10.878507Z ISC_DRMI EDGE: PREP_DRMI_ASC->ENGAGE_DRMI_ASC
2018-08-22_06:52:10.879330Z ISC_DRMI calculating path: ENGAGE_DRMI_ASC->DRMI_1F_LOCKED_W_ASC
2018-08-22_06:52:10.880327Z ISC_DRMI new target: DRMI_1F_LOCKED_W_ASC
2018-08-22_06:52:10.885337Z ISC_DRMI executing state: ENGAGE_DRMI_ASC (80)
2018-08-22_06:52:10.885986Z ISC_DRMI [ENGAGE_DRMI_ASC.enter]

Comments related to this report

sheila.dwyer@LIGO.ORG - 19:49, Wednesday 22 August 2018 (43608)

Link

A few things to note:

1)We should try to avoid making requests in the run state without some sort of if statement to make sure we don't continuously make requests of managed guardian nodes.

2)When we see same state redirect in the gaurdian log, it means that we have requested it to run the state that is currently running, and the guardian responds to this by waiting one second, (REDIRECT waiting for worker completion) and after this starting the state over from the beginning. (This is usually not what we want, so we should use care not to request things that are already happening).

3) nodes[ISC_DRMI].arrived returns true when a guardian starts the requested state, not when it finishes it. There are some places in the guardian where we use this for states that are not just place holder states, and it seems the code is not doing what we expected in some places.