Ryan has tracked down the core problem. With older kernels, the sched clock overflows in about 208.5 days. My previous alogs showed that the computers had been running since the last power outage (30 Sep 2016 06:45), and the error messages on h1susex and h1seiex suggested a clock zeroing late Wednesday night (26 Apr 2017 22:00). The difference between these two times is 208.6 days.
We are investigating why only two computers have crashed, why both at EX, why both around 6 am, and why one day apart.
If you google 'sched clock overflows in 208 days' you will see many articles on this. One article references kernel 2.6.32; we are running 2.6.34.
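As a quick sanity check of that 208.5-day figure, assuming (as several of those articles describe) that the effective wrap point of the kernel's nanosecond counter is 2^54 ns, the arithmetic works out. A back-of-the-envelope Python calculation, not site code:

wrap_ns = 2**54                      # assumed effective sched_clock wrap point, per the articles above
wrap_days = wrap_ns / 1e9 / 86400.0  # nanoseconds -> seconds -> days
print('%.1f days' % wrap_days)       # -> 208.5 days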
Was it reported by Terramon, USGS, SEISMON? Yes, Yes, Yes
Magnitude (according to Terramon, USGS, SEISMON): 6.8 or 7.2, depending on when you looked
Location: 30 km SW of Burias, Philippines
Starting time of event (ie. when BLRMS started to increase on DMT on the wall): ~20:45 UTC
Lock status? L1 & H1 lost lock at nearly the same time, LLO only a few minutes after us.
EQ reported by Terramon BEFORE it actually arrived? Yes
Seismon reported this ~10 minutes before it arrived and we got the Verbal notification; we lost lock probably on the P-wave arrival. After lockloss, when the 0.03-0.1 Hz BLRMS went over 1 micron, I switched the seismic configuration to the large EQ configuration.
The clean room that will be used for the upcoming vent has been relocated to the beer garden. A little fine tuning will be required on 5/8. The crane is still hooked up to the clean room and the remote control is in my office.
We soft-closed GV5 and GV7 for this exercise. The GVs are open again.
At the time the h1seiex and h1susex computers locked up, the last errors printed to the console were photographed and attached to this alog. The timestamps shown (nominally the running seconds since the last boot) are not consistent with the actual time since the last boot.
Looking at the boot logs on h1boot, here are the times of the last two boots of these computers (including this week's boots; all times are local):
| computer | previous boot | this week's boot |
| h1seiex | 30 Sep 2016 06:45 | 27 Apr 2017 07:05 |
| h1susex | 30 Sep 2016 06:45 | 28 Apr 2017 06:56 |
A computer running since 30 Sep 2016 should have a running clock of about 18 million seconds, but the times shown in the photographs are much smaller. Since we know at what time the computers froze, we can extrapolate back to the apparent zero time (all times local):
| computer | system time on console [crash time] | t=0 datetime |
| h1seiex | 30096 seconds (08 hrs 21 mins) [06:15 Thu 4/27] | 21:53 Wed 4/26 |
| h1susex | 108493 seconds (01 day, 06 hrs, 08 min) [05:43 Fri 4/28] | 23:34 Wed 4/26 |
the "pseudo" boot time when the clock was zeroed are within two hours of each other Wednesday night.
As a sanity check, I ran dmesg on h1iscex to show the logging of this morning's model restarts. This computer was last rebooted on 01 Nov 2016 at 12:03 PST. Here is the dmesg output:
[15352542.295438] h1alsex: Synched 302425
Doing the math on the number of seconds between 01 Nov 2016 12:03 PST and 28 Apr 2017 07:16 PDT gives 15,361,981, which is close to the dmesg timestamp.
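The same arithmetic in Python for this cross-check; treating both timestamps as naive local times (ignoring the one-hour PST/PDT difference) reproduces the quoted figure to within a second:

from datetime import datetime

boot    = datetime(2016, 11, 1, 12, 3)   # last h1iscex reboot (local)
restart = datetime(2017, 4, 28, 7, 16)   # this morning's model restarts (local)
print(int((restart - boot).total_seconds()))   # -> 15361980, vs the 15,361,981 quoted above
# compare with the 15352542-second dmesg timestamp logged at the h1alsex restart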
[JeffK, JimW, Jenne]
In hopes of making some debugging a bit easier, we have updated the safe.snap files in SDF just after a lockloss from NomLowNoise.
We knew that an earthquake was incoming (yay seismon!), so as soon as the IFO broke lock, we requested Down so that it wouldn't advance any further. Then we accepted most of the differences, so that everything but ISIBS, CS ECAT PLC2, EX ECAT PLC2 and EY ECAT PLC2 (which don't switch between OBSERVE.snap and safe.snap) was green.
Jeff is looking at making it so that the ECAT models switch between safe and observe snap files like many of the other models, so that ISIBS will be the only model that has diffs (21 of them).
Note that if the IFO loses lock from any state other than NLN, we shouldn't expect SDF to all be green. But, since this is the state of things when we lose lock from NLN, it should be safe to revert to these values, in hopes of helping to debug.
After talking with Jeff and Sheila, I have made a few of the OBSERVE.snap files in the target directory a link to the OBSERVE.snap in userapps.
This list includes:
I have also updated the switch_SDF_source_files.py script that is called by ISC_LOCK on DOWN and on NOMINAL_LOW_NOISE. I changed the exclude list to only exclude the h1sysecatplc[1or3] "front ends". The SEI models will always stay in OBSERVE, just as before. This was tested in DOWN and in NLN, and has been loaded into the Guardian.
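For reference, a sketch of what re-pointing one of these OBSERVE.snap files looks like. This is a minimal illustration only: the userapps path is assumed by analogy with the safe.snap -> down.snap links shown further down in this log, not copied from the actual change.

import os

# Hypothetical paths: the target-area OBSERVE.snap and an assumed version-controlled
# copy in userapps (pattern borrowed from the safe.snap links listed below)
target   = '/opt/rtcds/lho/h1/target/h1alsex/h1alsexepics/burt/OBSERVE.snap'
userapps = '/opt/rtcds/userapps/release/als/h1/burtfiles/h1alsex_OBSERVE.snap'

if os.path.lexists(target):
    os.remove(target)          # remove the old regular file (or stale link)
os.symlink(userapps, target)   # OBSERVE.snap now tracks the userapps copy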
Starting CP3 fill. LLCV enabled. LLCV set to manual control. LLCV set to 50% open. Fill completed in 1596 seconds. LLCV set back to 17.0% open.
Starting CP4 fill. LLCV enabled. LLCV set to manual control. LLCV set to 70% open. Fill completed in 812 seconds. TC A did not register fill. LLCV set back to 41.0% open.
Raised both by 1%:
CP3 now 18% open
CP4 now 42% open
H1 went into observation mode at 16:58:18 UTC; the startup of the CW calibration on h1calex completed at 16:59:48 UTC.
model restarts logged for Thu 27/Apr/2017
2017_04_27 07:07 h1hpietmx
2017_04_27 07:07 h1iopseiex
2017_04_27 07:07 h1isietmx
2017_04_27 07:21 h1iopsusex
2017_04_27 07:21 h1susetmx
2017_04_27 07:23 h1hpietmx
2017_04_27 07:23 h1iopseiex
2017_04_27 07:23 h1isietmx
2017_04_27 07:23 h1susetmxpi
2017_04_27 07:23 h1sustmsx
2017_04_27 07:25 h1alsex
2017_04_27 07:25 h1calex
2017_04_27 07:25 h1iopiscex
2017_04_27 07:25 h1iscex
2017_04_27 07:25 h1pemex
Unexpected crash of h1seiex, also restarts of models on h1susex and h1iscex.
model restarts logged for Mon 24/Apr/2017 - Wed 26/Apr/2017
No restarts reported.
J. Kissel, K. Izumi, J. Warner, S. Dwyer

After another ETMX front-end failure this morning (see LHO aLOGs 35861, 35857, etc.), the recovery of the IFO was much easier because of yesterday morning's lessons learned about not running initial alignment scripts that suffer from bit rot (see LHO aLOG 35839). However, after completing recovery, the SDF system's OBSERVE.snap let us know that some of the same critical initial alignment references had been changed at 14:17 UTC, namely
- the green ITM camera reference points:
  H1:ALS-X_CAM_ITM_PIT_OFS
  H1:ALS-X_CAM_ITM_YAW_OFS
and
- the transmission monitor red QPDs:
  H1:LSC-X_TR_A_LF_OFFSET

After discussing with Jim, he'd heard that Corey (a little surprisingly) didn't have too much trouble turning on the green ASC system; but if these ITM camera offsets are large, then the error signals are large, and we'd have had the same trouble closing those loops as yesterday.

We traced the change down to when Dave had to reboot h1alsex & h1iscex this morning at around 14:15 UTC -- see LHO aLOG 35862 -- at which point those two models' out-of-date safe.snap files were restored. Recall that the safe.snaps for these computers are soft linked to the down.snaps in the userapps repo:
/opt/rtcds/lho/h1/target/h1alsex/h1alsexepics/burt ]$ ls -l safe.snap
lrwxrwxrwx 1 controls controls 62 Mar 29 2016 safe.snap -> /opt/rtcds/userapps/release/als/h1/burtfiles/h1alsex_down.snap
/opt/rtcds/lho/h1/target/h1iscex/h1iscexepics/burt ]$ ls -l safe.snap
lrwxrwxrwx 1 controls controls 62 Mar 29 2016 safe.snap -> /opt/rtcds/userapps/release/isc/h1/burtfiles/h1iscex_down.snap
The "safe.snap" in the local "target" directory is what the front end uses to restore its EPICS records (which is why we've intentionally commandeered the file with a soft link to a version-controlled file in the userapps repo).

We've since reverted the above offsets to their OBSERVE values, and I've accepted those OBSERVE values into the safe.snap / down.snap and committed the updated snap to the userapps repo. In the attached screenshots, the "EPICS VALUE" is the correct OBSERVE value and the "SETPOINT" is the errant safe.snap value, so they show what I've accepted as the current correct value.
The fundamental problem here is our attempt to maintain 2 files with nearly duplicate information (safe and observe are mostly the same settings; realistically, only one file is ever going to be well maintained).
I've added a test to DIAG_MAIN to check whether the ITM camera references change. It's not a terribly clever test: it just checks whether the camera offset is within a small range around a hard-coded value for pitch and yaw for each ITM. These values will need to be adjusted if the cameras are moved or if the reference spots are moved, meaning there are now 3 places where these values need to be updated (the OBSERVE and safe.snap files and, now, DIAG_MAIN), but hopefully this will help keep us from getting bitten by changed references again. The code is attached below.
@SYSDIAG.register_test
def ALS_CAM_CHECK():
    """Check that the ALS CAM OFS references haven't changed. Will need to be
    updated if the cameras or the reference spots are moved.
    """
    # nominal ITM camera offsets for each arm, with the allowed +/- range
    nominal_dict = {
        'X' : {'PIT':285.850, 'YAW':299.060, 'range':5},
        'Y' : {'PIT':309.982, 'YAW':367.952, 'range':5},
    }
    for opt, vals in nominal_dict.iteritems():
        for dof in ['PIT', 'YAW']:
            cam = ezca['ALS-{}_CAM_ITM_{}_OFS'.format(opt, dof)]
            # notify if the current offset has drifted outside the allowed range
            if not (vals[dof] + vals['range']) > cam > (vals[dof] - vals['range']):
                yield 'ALS {} CAM {} OFS changed from {}'.format(opt, dof, vals[dof])
Summary: no apparent change in the induced wavefront from the point absorber.
After a request from Sheila and Kiwamu, I checked the status of the ITMX point absorber with the HWS.
If I look at the wavefront approximately 13 minutes after lock acquisition, I see the same magnitude of optical path distortion across the wavefront (approximately 60 nm change over 20 mm). This is the same scale of OPD that was seen around 17-March-2017.
Note that the whole pattern has shifted slightly because of some on-table work in which a pick-off beam-splitter was placed in front of the HWS.

Thanks Aidan.
We were wondering about this because of the reappearance of the broad noise lump from 300-800 Hz in the last week, which is clearly visible on the summary pages (links in this alog). We also now have broad coherence between DARM and IMC WFS B pit DC, which I do not think we have had today. We didn't see any obvious alignment shift that could have caused this. It also seems to be getting better or going away if you look at today's summary page.
Here is a bruco for the time when the jitter noise was high: https://ldas-jobs.ligo-wa.caltech.edu/~sheila.dwyer/bruco_April27/
TITLE: 04/28 Owl Shift: 07:00-15:00 UTC (00:00-08:00 PST), all times posted in UTC
STATE of H1: LOCKING by Jim, but still in CORRECTIVE MAINTENANCE (will write an FRS unless someone else beats me to it again!)
INCOMING OPERATOR: Jim
SHIFT SUMMARY:
Groundhog Day shift, with H1 fine for the first 6 hrs and then EX going down again (but this time SUS...see earlier alog). This time I kept away from even breathing on the TMSx dither scripts & simply restored ETMx & TMSx to their values from before all of the front end hubbub of this morning. I was able to get ALSx aligned & this is where I'm handing off to Jim (he had to tweak ALSy & I see that he is already tweaking up a locked PRMI). Much better outlook than yesterday at this time, for sure!
LOG:
FRS Assigned & CLOSED/RESOLVED for h1susex frontend crash:
https://services.ligo-la.caltech.edu/FRS/show_bug.cgi?id=7995
H1SUSEX front end computer and IO chassis were rebooted this morning to deal with the issue posted by Corey. Richard / Peter
Same failure mode on h1susex today as h1seiex had yesterday. Therefore we were not able to take h1susex out of the Dolphin fabric, and so all Dolphin-connected models were glitched after the reboot of h1susex.
Richard power cycled h1susex and its IO chassis. I killed all models on h1seiex and h1iscex, and then started all models on these computers. No IRIG-B timing excursions. I cleared the IPC and CRC errors, and Corey reset the SWWD. IFO recovery has started.
Here are the front end computer uptimes (times since last reboot), taken at 10:02 this morning. The longest any machine has run is 210 days, since the site power outage on 30 Sep 2016.
h1psl0 up 131 days, 18:38, 0 users, load average: 0.37, 0.13, 0.10
h1seih16 up 210 days, 3:00, 0 users, load average: 0.11, 0.14, 0.05
h1seih23 up 210 days, 3:00, 0 users, load average: 0.62, 1.59, 1.37
h1seih45 up 210 days, 3:00, 0 users, load average: 0.38, 1.31, 1.17
h1seib1 up 210 days, 3:00, 0 users, load average: 0.02, 0.04, 0.01
h1seib2 up 210 days, 3:00, 0 users, load average: 0.02, 0.08, 0.04
h1seib3 up 210 days, 3:00, 0 users, load average: 0.00, 0.05, 0.06
h1sush2a up 210 days, 3:00, 0 users, load average: 1.64, 0.59, 0.56
h1sush2b up 210 days, 3:00, 0 users, load average: 0.00, 0.00, 0.00
h1sush34 up 210 days, 3:00, 0 users, load average: 0.00, 0.03, 0.00
h1sush56 up 210 days, 3:00, 0 users, load average: 0.00, 0.00, 0.00
h1susb123 up 210 days, 3:00, 0 users, load average: 0.17, 1.07, 1.10
h1susauxh2 up 210 days, 3:00, 0 users, load average: 0.00, 0.00, 0.00
h1susauxh34 up 117 days, 17:07, 0 users, load average: 0.08, 0.02, 0.01
h1susauxh56 up 210 days, 3:00, 0 users, load average: 0.00, 0.00, 0.00
h1susauxb123 up 210 days, 2:07, 0 users, load average: 0.00, 0.00, 0.00
h1oaf0 up 164 days, 20:16, 0 users, load average: 0.10, 0.24, 0.23
h1lsc0 up 207 days, 41 min, 0 users, load average: 0.06, 0.57, 0.65
h1asc0 up 210 days, 3:00, 0 users, load average: 1.03, 1.82, 1.80
h1pemmx up 210 days, 3:53, 0 users, load average: 0.05, 0.02, 0.00
h1pemmy up 210 days, 3:53, 0 users, load average: 0.00, 0.00, 0.00
h1susauxey up 205 days, 23:41, 0 users, load average: 0.07, 0.02, 0.00
h1susey up 210 days, 3:06, 0 users, load average: 0.14, 0.04, 0.01
h1seiey up 206 days, 51 min, 0 users, load average: 0.00, 0.03, 0.00
h1iscey up 210 days, 3:07, 0 users, load average: 0.04, 0.21, 0.20
h1susauxex up 210 days, 3:16, 0 users, load average: 0.00, 0.00, 0.00
h1susex up 3:06, 0 users, load average: 0.00, 0.00, 0.00
h1seiex up 1 day, 2:57, 0 users, load average: 0.00, 0.00, 0.00
h1iscex up 177 days, 21:59, 0 users, load average: 0.08, 0.33, 0.24
Here is the list of free RAM on the front end computers in kB:
h1psl0 4130900
h1seih16 4404868
h1seih23 4023316
h1seih45 4024452
h1seib1 4754280
h1seib2 4763256
h1seib3 4753960
h1sush2a 4009216
h1sush2b 5160476
h1sush34 4389108
h1sush56 4400720
h1susb123 4013144
h1susauxh2 5338804
h1susauxh34 5351172
h1susauxh56 5350144
h1susauxb123 5339900
h1oaf0 9102096*
h1lsc0 4065228
h1asc0 3988536
h1pemmx 5358464
h1pemmy 5358352
h1susauxey 5352196
h1susey 64277012~
h1seiey 4758336
h1iscey 4117788
h1susauxex 5349644
h1susex 64301796~
h1seiex 4769840
h1iscex 4138204
* oaf has 12GB
~ end station sus have 66GB
At 12:44 UTC (5:44am PDT):
(attached is a screenshot of all the WHITE screens we have for EX.)
Tally of activities taken to recover:
I saw a posting saying the bug was fixed in kernel 2.6.38.