There were two locklosses during the 09/17 OWL shift, 07:45UTC and 11:30UTC.
09/17 07:45UTC Lockloss
PI24 started ringing up and could not be damped, and we lost lock 15 minutes after the ringup started (attachment1). This differs from the last time we lost lock due to the PIs (72636): that time the phase was cycling through too fast to see whether anything was helping, and it was updated to wait longer before changing phase. This time the phase stopped cycling altogether; it just stopped trying after a few tries (SUS_PI_logtxt). It actually looks like if it had stayed at the first phase it tried, 340 degrees, it might have been able to damp the mode instead of ringing it up more (attachment2). Does this mean that the checker time should be extended again? Or alternatively, could we make it so that if the slope of the mode increases when the phase is changed, it changes the phase in the opposite direction?
Mode24 started ringing up almost immediately after the 'super_checker' timer completed (which turns off damping every 1/2 hour), and surpassed a value of 3 about 15 s later, triggering the damping to turn back on.
It seems to be a relatively common occurrence that the ESDs need to turn back on at max damping within a minute of turning off in order to damp mode24 or mode31.
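The slope-based reversal suggested above could look roughly like the sketch below. It is not the SUS_PI guardian code: the function names, the 45 degree step size, and the simple end-point slope estimate are all hypothetical choices made for illustration.

```python
def slope(samples, dt):
    """Crude slope estimate: end-point difference over elapsed time.

    samples: RMS readings taken every dt seconds (hypothetical input).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a slope")
    return (samples[-1] - samples[0]) / ((len(samples) - 1) * dt)


def next_phase(phase_deg, step_deg, direction, slope_before, slope_after):
    """Pick the next damping phase and stepping direction.

    direction is +1 or -1. If the mode grew faster after the last phase
    change than before it, reverse the stepping direction so the next
    step moves back toward the phase that was working better.
    """
    if slope_after > slope_before:
        direction = -direction
    new_phase = (phase_deg + direction * step_deg) % 360
    return new_phase, direction


# Example: the last step made the ringup steeper, so step the other way.
phase, direction = next_phase(340, 45, +1, slope_before=0.1, slope_after=0.5)
print(phase, direction)
```

With a scheme like this, a step away from a good phase would be undone on the next check rather than followed by further steps in the same direction.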
Timeline
07:30:00 PI24 starts trending up
07:30:01 'super_checker' timer times out, turns off damping
07:30:17 PI24 exceeds 3
- ETMY PI ESD turns on & reaches its max output in ~30s and continues at max until lockloss
- Phase stops changing once ESD driver is at max output
07:35 PI31 also starts ringing up
07:42 DCPD saturates
07:44 PI24 reached a max of 2212
07:45 Lockloss
09/17 11:30UTC Lockloss
Caused by a 5.6 magnitude earthquake off the coast of western Canada only ~900km away from us. Seismon labeled it as two separate earthquakes from Canada that arrived at the same time, but it was only one.
Since this earthquake was so close and so large, the detector wasn't given any warning, and we lost lock 4 seconds after earthquake mode was activated. We actually received the "Nearby earthquake from Canada" notification two minutes after having lost lock!
Timeline
11:30:23 Earthquake mode activated
11:30:27 Lockloss
Looked into the SUS_PI issue and couldn't see why the phase stopped stepping.
During this time, SUS-PI_PROC_COMPUTE_MODE24_RMSMON was > 3, and the new rms from cdu.avg would have been ~11, which is larger than the old saved value of 7.94. This should have caused SUS_PI to move forward with 'scanning mode 24 phase to damp', but it didn't. Could there have been an issue with cdu.avg()? There were no reported errors with the guardian code.
new_24_rms = cdu.avg(-5,'SUS-PI_PROC_COMPUTE_MODE24_RMSMON')
if new_24_rms > self.old_mode24_rmsmon:
    # if True, would have gone ahead with stepping the phase
Vicky, Oli, Camilla. Commented out the super_timeout code from SUS_PI. PI damping will now remain on.
After talking with Oli and Vicky, it seems like the super_timeout timer isn't working as intended: as soon as damping is turned off, mode 24 rings up and damping is turned back on. This gives more opportunities for the PI damping guardian to fail, as it did for this lockloss.
The super_timeout was added to turn off damping after 30 minutes because LLO saw noise from PI damping (LLO67285), but we don't see that noise (71737).
A common failure mode when using cdsutils.avg is that when it can't get data from NDS for whatever reason, it returns None. In Python 2 you can still use comparison operators with NoneType (i.e. None > 7 evaluates to False), whereas Python 3 raises a TypeError for that comparison. I'd guess that's what happened here, since it wouldn't get into the code under the conditional you posted.
A solution to this is to always check that data was actually returned and, if not, to try again. I'd also recommend using the call_with_timeout function from timeout_utils.py to guard against the data call hanging.
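A minimal sketch of the check-and-retry idea, assuming only that the data call returns None when no data is available. The helper names avg_with_retry and rms_increased, the retry count, and the delay are hypothetical; the data call is passed in as a function so the sketch does not depend on NDS, and call_with_timeout is left out because its signature isn't given here.

```python
import time


def avg_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it returns something other than None.

    fetch: zero-argument callable wrapping the data request, e.g.
           lambda: cdu.avg(-5, 'SUS-PI_PROC_COMPUTE_MODE24_RMSMON')
    Returns the first non-None value, or None if every attempt failed.
    """
    for attempt in range(retries):
        value = fetch()
        if value is not None:
            return value
        if attempt < retries - 1:
            time.sleep(delay)  # brief pause before asking again
    return None


def rms_increased(new_rms, old_rms):
    """Compare RMS values, refusing to compare when no data came back."""
    if new_rms is None:
        raise ValueError("no data returned for RMS average")
    return new_rms > old_rms


# Example with a stand-in data source that fails twice, then succeeds.
responses = iter([None, None, 11.0])
new_24_rms = avg_with_retry(lambda: next(responses), retries=3, delay=0)
print(rms_increased(new_24_rms, 7.94))
```

Raising on a missing value makes the failure visible in the guardian log instead of silently skipping the phase step.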