Over the weekend we ran into a few cases (alog71043, alog71026, alog71008) where an ISC_LOCK guardian state tried to get data via the cdsutils getdata function and it returned nothing. This caused an error in ISC_LOCK, which was fixed by simply reloading the node, since the function just had to try again to get the data. This is not a new thing, but it's definitely another reminder that we have to be prepared for different outcomes anytime we request data.
Some months ago, with Jonathan's help, I made a function wrapper that can be used to handle hung data grabs. While that is not the issue we saw over the weekend, it's still a good idea to use it whenever we try getting data in a Guardian node. The file is (userapps)/sys/h1/guardian/timeout_utils.py, and it provides either a decorator (@timeout) or a wrapper function (call_with_timeout) that can be used.
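For anyone unfamiliar with the idea, here is a minimal sketch of how a timeout wrapper of this kind can work. This is not the code in timeout_utils.py: the name call_with_timeout_sketch, the timeout keyword, the 5 second default, and the choice to return None on timeout are all assumptions made for illustration, and _slow_fetch is a hypothetical stand-in for a hung data request.

```python
import concurrent.futures
import time


def call_with_timeout_sketch(func, *args, timeout=5.0, **kwargs):
    """Run func(*args, **kwargs) in a worker thread.

    Returns the function's result, or None if it has not finished within
    `timeout` seconds. (Hypothetical helper, not the timeout_utils.py code.)
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The call took too long; give up waiting and report "no data".
        return None
    finally:
        # Do not block here waiting for a worker that may still be running.
        executor.shutdown(wait=False)


def _slow_fetch():
    """Hypothetical stand-in for a data request that takes too long."""
    time.sleep(0.5)
    return 'late'


# A fast call returns its normal result; a slow one yields None.
fast_result = call_with_timeout_sketch(lambda x: x + 1, 1)
slow_result = call_with_timeout_sketch(_slow_fetch, timeout=0.1)
```

Note that a thread-based approach like this only stops waiting for the call; the worker itself keeps running in the background until it finishes. Whatever the wrapper returns on timeout, the caller still has to check the result before using it.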
For the specific issue we saw over the weekend, a solution is to do a simple check that the data is actually there before trying to do anything with it (i.e. if data:). Using this situation as a good example:
    # This wrapper should handle hung nds data grabs
    popdata_prmi = call_with_timeout(cdu.getdata, 'LSC-POPAIR_B_RF90_I_ERR_DQ', -60)
    # This conditional handles None data returned
    if popdata_prmi.data:
        if popdata_prmi.data.max() < 20:
            log('no POPAIR RF90 flashes above 20, going to CHECK MICH FRINGES')
            return 'CHECK_MICH_FRINGES'
        else:
            self.timer['PRMI_POPAIR_check'] = 60
I should have added that this fix was loaded into ISC_LOCK by Tony during commissioning today and is ready for our next relock.
This threw the attached error at 2034-07-07 04:14 UTC. I edited the PRMI and DRMI checkers in ISC_LOCK, changing 'if popdata_prmi.data:' to 'if popdata_prmi:'.
This seemed to work, but I'm not sure if it will cover every case. If this goes into error again, I suggest the operator start by reloading ISC_LOCK and, if necessary, the "elif self.timer['PRMI_POPAIR_check']" block of code can be commented out. Tagging OpInfo.
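One way to cover more cases is to check both the returned object and its data explicitly, rather than relying on the truth value of either. The sketch below is only a suggestion under stated assumptions: has_usable_data is a hypothetical helper name that is not in ISC_LOCK, and the SimpleNamespace objects are stand-ins for whatever getdata returns. It avoids 'if data:' because, if .data is a NumPy array (an assumption here), taking the truth value of an array with more than one element raises a ValueError.

```python
from types import SimpleNamespace


def has_usable_data(buf):
    """Return True only if buf exists and carries a non-empty .data.

    Hypothetical helper: covers a None return, a missing or None .data
    attribute, and an empty data array.
    """
    if buf is None:
        return False
    data = getattr(buf, 'data', None)
    if data is None:
        return False
    try:
        # len() works for lists and arrays without evaluating their truth value.
        return len(data) > 0
    except TypeError:
        # .data is something without a length (e.g. a bare number).
        return False


# Stand-ins for the possible outcomes of a data request.
checks = [
    has_usable_data(None),                              # nothing returned
    has_usable_data(SimpleNamespace(data=None)),        # object, but no data
    has_usable_data(SimpleNamespace(data=[])),          # empty data
    has_usable_data(SimpleNamespace(data=[1.0, 25.0])), # real data
]
```

In the checker, the conditional would then read 'if has_usable_data(popdata_prmi):' before calling .data.max().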
After this edit and a reload, the checker seems to work well, logging that there were no RF18 flashes above 120 (true) and moving to PRMI locking before the old 5 minute 'try_PRMI' timer finished.