Reports until 15:33, Friday 28 April 2017
H1 CDS
david.barker@LIGO.ORG - posted 15:33, Friday 28 April 2017 - last comment - 15:41, Friday 28 April 2017(35880)
front end computer crashes may be related to sched_clock overflow after 208.5 days

Ryan has tracked down the core problem. With older kernels, the sched clock overflows after about 208.5 days. My previous alogs showed that the computers had been running since the last power outage (30 Sep 2016 06:45), and the error messages on h1susex and h1seiex suggested a clock zeroing late Wednesday night (26 Apr 2017 22:00). The difference between these two times is 208.6 days.
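
For what it's worth, the 208.5-day figure is consistent with the commonly reported explanation that the scaled TSC product in these older kernels wraps once 2^54 nanoseconds have elapsed (a rough check, not a kernel-source citation):

# Rough sanity check of the 208.5-day figure, assuming the commonly
# reported cause: the 64-bit cyc2ns product wraps after 2**54 ns
# (10-bit fixed-point scale factor in older kernels).
wrap_ns = 2**54
print(wrap_ns / 1e9 / 86400.0)   # -> ~208.5 days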

We are investigating why only two computers have crashed, why both at EX, why both around 6am and why one day apart.

If you google 'sched clock overflows in 208 days' you will find many articles on this. One article references kernel 2.6.32; we are running 2.6.34.

Comments related to this report
david.barker@LIGO.ORG - 15:41, Friday 28 April 2017 (35882)

I saw a posting saying the bug was fixed in kernel 2.6.38.

H1 SEI
jim.warner@LIGO.ORG - posted 15:22, Friday 28 April 2017 (35879)
Earthquake report

Was it reported by Terramon, USGS, SEISMON? Yes, Yes, Yes

Magnitude (according to Terramon, USGS, SEISMON): 6.8 or 7.2, depending on when you looked

Location: 30 km SW of Burias, Philippines

Starting time of event (ie. when BLRMS started to increase on DMT on the wall): ~20:45 UTC

Lock status? H1 and L1 lost lock at nearly the same time; LLO went down only a few minutes after us.

EQ reported by Terramon BEFORE it actually arrived? Yes
 

SEISMON reported this ~10 minutes before it arrived and we got the Verbal notification; we lost lock probably on the P-wave arrival. After the lockloss, when the 0.03-0.1 Hz BLRMS went over 1 micron, I switched the seismic configuration to the large earthquake configuration.

Images attached to this report
LHO FMCS
bubba.gateley@LIGO.ORG - posted 15:18, Friday 28 April 2017 - last comment - 15:38, Friday 28 April 2017(35878)
Clean room is in the beer garden
The clean room that will be used for the upcoming vent has been relocated to the beer garden. A little fine tuning will be required on 5/8.
The crane is still hooked up to the clean room and the remote control is in my office.
Comments related to this report
chandra.romel@LIGO.ORG - 15:38, Friday 28 April 2017 (35881)

We soft-closed GV5 and GV7 for this exercise. The GVs are open again.

H1 CDS
david.barker@LIGO.ORG - posted 14:16, Friday 28 April 2017 - last comment - 14:23, Friday 28 April 2017(35875)
h1seiex and h1susex console error message timestamps look fishy

At the time the h1seiex and h1susex computers locked up, the last errors printed to the console were photographed and attached to the alog. The timestamps shown (running seconds since boot) are far too small to be consistent with the actual time since the last boot.

Looking at the boot logs on h1boot, here are the times of the last two boots of these computers (including this week's boots, all times are local)

computer   boot 1               boot 2
h1seiex    30 Sep 2016 06:45    27 Apr 2017 07:05
h1susex    30 Sep 2016 06:45    28 Apr 2017 06:56

A computer running since 9/30/2016 should have a running clock of about 18 million seconds. The times shown in the photographs are much smaller. Since we know the times at which the computers froze, we can extrapolate back to the apparent zero time (all times local):

computer   seconds on console                  [crash time]      t=0 datetime
h1seiex     30096 s (08 hrs, 21 min)           [06:15 Thu 4/27]  21:53 Wed 4/26
h1susex    108493 s (01 day, 06 hrs, 08 min)   [05:43 Fri 4/28]  23:34 Wed 4/26

the "pseudo" boot time when the clock was zeroed are within two hours of each other Wednesday night.

Comments related to this report
david.barker@LIGO.ORG - 14:23, Friday 28 April 2017 (35876)

As a sanity check, I ran dmesg on h1iscex to show the logging of this morning's model restarts. This computer was last rebooted 01 Nov 2016 12:03 PST. Here is the dmesg output:

[15352542.295438] h1alsex: Synched 302425

Doing the math on the number of seconds between 01 Nov 2016 12:03 PST and 28 Apr 2017 07:16 PDT gives 15,361,981, which is close to the dmesg timestamp.
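
The same arithmetic in a couple of lines (naive local-time subtraction, so the one-hour PST-to-PDT change is ignored):

from datetime import datetime

boot    = datetime(2016, 11,  1, 12,  3)   # 01 Nov 2016 12:03 PST
restart = datetime(2017,  4, 28,  7, 16)   # 28 Apr 2017 07:16 PDT
print((restart - boot).total_seconds())    # ~1.536e7 s, close to the dmesg stamp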

H1 ISC
jenne.driggers@LIGO.ORG - posted 14:01, Friday 28 April 2017 - last comment - 17:32, Friday 28 April 2017(35874)
SDF made green for Down right after lockloss from NomLowNoise

[JeffK, JimW, Jenne]

In hopes of making some debugging a bit easier, we have updated the safe.snap files in SDF just after a lockloss from NomLowNoise. 

We knew that an earthquake was incoming (yay seismon!), so as soon as the IFO broke lock, we requested Down so that it wouldn't advance any farther.  Then, we accepted most of the differences so that everything but ISIBS, CS ECAT PLC2, EX ECAT PLC2 and EY ECAT PLC2 (which don't switch between Observe.snap and safe.snap) was green.

Jeff is looking at making it so that the ECAT models switch between safe and observe snap files like many of the other models, so that ISIBS will be the only model that has diffs (21 of them). 

Note that if the IFO loses lock from any state other than NLN, we shouldn't expect all of SDF to be green.  But since this is the state of things when we lose lock from NLN, it should be safe to revert to these values, in hopes of helping to debug.

 

Comments related to this report
thomas.shaffer@LIGO.ORG - 17:32, Friday 28 April 2017 (35888)

After talking with Jeff and Sheila, I have made a few of the OBSERVE.snap files in the target directory a link to the OBSERVE.snap in userapps.

This list includes:

  • h1omcpi
  • h1calex
  • h1susauxb123
  • h1susauxex
  • h1susauxey
  • h1susauxh2
  • h1susauxh34
  • h1susauxh56

I have also updated the switch_SDF_source_files.py script that is called by ISC_LOCK on DOWN and on NOMINAL_LOW_NOISE. I changed the exclude list to exclude only the h1sysecatplc[1or3] "front ends". The SEI models will always stay in OBSERVE, just as before. This was tested in DOWN and in NLN, and has been loaded into the Guardian.
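
For reference, a minimal sketch of how the target-area OBSERVE.snap links described above could be created. The userapps burtfiles location used here is an assumption for illustration, not the actual per-subsystem layout:

import os

# Hypothetical sketch: point each target-area OBSERVE.snap at a
# version-controlled file in the userapps repo.
models = ['h1omcpi', 'h1calex', 'h1susauxb123', 'h1susauxex',
          'h1susauxey', 'h1susauxh2', 'h1susauxh34', 'h1susauxh56']
for m in models:
    target = '/opt/rtcds/lho/h1/target/{0}/{0}epics/burt/OBSERVE.snap'.format(m)
    source = '/opt/rtcds/userapps/release/cds/h1/burtfiles/{0}_OBSERVE.snap'.format(m)  # assumed path
    if os.path.lexists(target):
        os.remove(target)
    os.symlink(source, target)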

LHO VE
logbook/robot/script0.cds.ligo-wa.caltech.edu@LIGO.ORG - posted 12:10, Friday 28 April 2017 - last comment - 12:23, Friday 28 April 2017(35872)
CP3, CP4 Autofill 2017_04_28
Starting CP3 fill. LLCV enabled. LLCV set to manual control. LLCV set to 50% open. Fill completed in 1596 seconds. LLCV set back to 17.0% open.
Starting CP4 fill. LLCV enabled. LLCV set to manual control. LLCV set to 70% open. Fill completed in 812 seconds. TC A did not register fill. LLCV set back to 41.0% open.
Images attached to this report
Comments related to this report
chandra.romel@LIGO.ORG - 12:23, Friday 28 April 2017 (35873)

Raised both by 1%:

CP3 now 18% open

CP4 now 42% open

H1 CDS (CAL)
david.barker@LIGO.ORG - posted 11:21, Friday 28 April 2017 (35871)
First minute of observing mode without CW hardware injections

H1 went into observation mode at 16:58:18 UTC; the startup of the CW hardware injections on h1calex completed at 16:59:48 UTC.

H1 CDS (DAQ)
david.barker@LIGO.ORG - posted 11:14, Friday 28 April 2017 (35869)
CDS O2 restart report, Monday 24th - Thursday 27th April 2017

model restarts logged for Thu 27/Apr/2017
2017_04_27 07:07 h1hpietmx
2017_04_27 07:07 h1iopseiex
2017_04_27 07:07 h1isietmx
2017_04_27 07:21 h1iopsusex
2017_04_27 07:21 h1susetmx
2017_04_27 07:23 h1hpietmx
2017_04_27 07:23 h1iopseiex
2017_04_27 07:23 h1isietmx
2017_04_27 07:23 h1susetmxpi
2017_04_27 07:23 h1sustmsx
2017_04_27 07:25 h1alsex
2017_04_27 07:25 h1calex
2017_04_27 07:25 h1iopiscex
2017_04_27 07:25 h1iscex
2017_04_27 07:25 h1pemex

Unexpected crash of h1seiex, also restarts of models on h1susex and h1iscex.

model restarts logged for Mon 24/Apr/2017 - Wed 26/Apr/2017: no restarts reported

H1 OpsInfo (CDS, GRD, ISC)
jeffrey.kissel@LIGO.ORG - posted 10:49, Friday 28 April 2017 - last comment - 14:55, Friday 28 April 2017(35864)
ITM Camera and 1064 TR QPD Offsets Errantly Changed Again -- H1ALSEX & H1ISCEX Safe.snaps Updated
J. Kissel, K. Izumi, J. Warner, S. Dwyer

After another ETMX front-end failure this morning (see LHO aLOG 35861, 35857 etc.), the recovery of the IFO was much easier, because of yesterday morning's lessons learned about not running initial alignment scripts that suffer from bit rot (see LHO aLOG 35839). 

However, after completing recovery, the SDF system's OBSERVE.snap let us know that some of the same critical initial alignment references were changed at 14:17 UTC, namely 
- the green ITM camera reference points:
     H1:ALS-X_CAM_ITM_PIT_OFS
     H1:ALS-X_CAM_ITM_YAW_OFS
and 
- the transmission monitor's red QPDs:
     H1:LSC-X_TR_A_LF_OFFSET

After discussing with Jim: he'd heard that Corey (a little surprisingly) didn't have too much trouble turning on the green ASC system. If these ITM camera offsets had been far off, the error signals would have been large, and we'd have had the same trouble closing the loops as yesterday.

We traced the change down to Dave's reboot of h1alsex & h1iscex this morning at around 14:15 UTC -- see LHO aLOG 35862 -- which restored out-of-date safe.snap files for those two models. Recall that the safe.snaps for these computers are soft-linked to the down.snaps in the userapps repo:
    /opt/rtcds/lho/h1/target/h1alsex/h1alsexepics/burt ]$ ls -l safe.snap 
    lrwxrwxrwx 1 controls controls 62 Mar 29  2016 safe.snap -> /opt/rtcds/userapps/release/als/h1/burtfiles/h1alsex_down.snap

    /opt/rtcds/lho/h1/target/h1iscex/h1iscexepics/burt ]$ ls -l safe.snap
    lrwxrwxrwx 1 controls controls 62 Mar 29  2016 safe.snap -> /opt/rtcds/userapps/release/isc/h1/burtfiles/h1iscex_down.snap
where the "safe.snap" in the local "target" directories are what the front uses to restore its EPICs records (which is why we've intentionally commandeered the file with a soft link to a version controlled file in the userapps repo).

We've since reverted the above offsets to their OBSERVE values, and I've accepted those OBSERVE values into the safe.snap / down.snap and committed the updated snap to the userapps repo. In the attached screenshots, the "EPICS VALUE" is the correct OBSERVE value, and the "SETPOINT" is the errant safe.snap. So, they show what I've accepted as the current correct value.
Images attached to this report
Comments related to this report
sheila.dwyer@LIGO.ORG - 11:17, Friday 28 April 2017 (35870)

The fundamental problem here is our attempt to maintain two files with nearly duplicate information (safe and observe are mostly the same settings; realistically, only one file is ever going to be well maintained).

jim.warner@LIGO.ORG - 14:55, Friday 28 April 2017 (35877)

I've added a test to DIAG_MAIN to check whether the ITM camera references change. It's not a terribly clever test: it just checks that the camera offset is within a small range around a hard-coded value for pitch and yaw for each ITM. These values will need to be adjusted if the cameras or the reference spots are moved, meaning there are now three places these values need to be updated (the OBSERVE and safe.snap files and, now, DIAG_MAIN), but hopefully this will help keep us from getting bitten by changed references again. The code is attached below.

 

@SYSDIAG.register_test
def ALS_CAM_CHECK():
    """Check that the ALS CAM OFS references haven't changed. Will need to be
    updated if the cameras or reference spots are moved.
    """
    # Nominal camera offset references and the allowed +/- tolerance.
    nominal_dict = {
        'X' : {'PIT':285.850, 'YAW':299.060, 'range':5},
        'Y' : {'PIT':309.982, 'YAW':367.952, 'range':5},
        }

    # Warn if any camera offset has drifted outside the tolerance band.
    for opt, vals in nominal_dict.iteritems():
        for dof in ['PIT','YAW']:
            cam = ezca['ALS-{}_CAM_ITM_{}_OFS'.format(opt,dof)]
            if not (vals[dof] + vals['range']) > cam > (vals[dof] - vals['range']):
                yield 'ALS {} CAM {} OFS changed from {}'.format(opt,dof,vals[dof])
 

H1 TCS (TCS)
aidan.brooks@LIGO.ORG - posted 09:57, Friday 28 April 2017 - last comment - 16:18, Friday 28 April 2017(35863)
Status of point absorber on ITMX

Summary: no apparent change in the induced wavefront from the point absorber.

After a request from Sheila and Kiwamu, I checked the status of the ITMX point absorber with the HWS.

If I look at the wavefront approximately 13 minutes after lock acquisition, I see the same magnitude of optical path distortion across the wavefront (approximately 60 nm change over 20 mm). This is the same scale of OPD that was seen around 17-March-2017.

Note that the whole pattern has shifted slightly because of some on-table work in which a pick-off beamsplitter was placed in front of the HWS.

Images attached to this report
Comments related to this report
sheila.dwyer@LIGO.ORG - 16:18, Friday 28 April 2017 (35884)

Thanks Aidan.  

We were wondering about this because of the reappearance of the broad noise lump from 300-800 Hz in the last week, which is clearly visible on the summary pages (links in this alog). We also now have broad coherence between DARM and IMC WFS B pitch DC, which I do not think we had before. We didn't see any obvious alignment shift that could have caused this. It also seems to be getting better or going away if you look at today's summary page.

Here is a bruco for the time when the jitter noise was high:  https://ldas-jobs.ligo-wa.caltech.edu/~sheila.dwyer/bruco_April27/

LHO General
corey.gray@LIGO.ORG - posted 08:10, Friday 28 April 2017 - last comment - 03:59, Saturday 29 April 2017(35855)
OWL Operator Summary

TITLE: 04/28 Owl Shift: 07:00-15:00 UTC (00:00-08:00 PST), all times posted in UTC
STATE of H1: LOCKING by Jim, but still in CORRECTIVE MAINTENANCE  (will write an FRS unless someone else beats me to it again!)
INCOMING OPERATOR: Jim
SHIFT SUMMARY:

Groundhog Day shift, with H1 fine for the first 6 hrs and then EX going down again (but this time SUS...see earlier alog).  This time I kept away from even breathing on the TMSx dither scripts & simply restored ETMx & TMSx to their values from before all of the front end hubbub of this morning.  I was able to get ALSx aligned & this is where I'm handing off to Jim (he had to tweak ALSy, & I see that he is already tweaking up a locked PRMI).  Much better outlook than yesterday at this time for sure!
LOG:

Comments related to this report
corey.gray@LIGO.ORG - 03:59, Saturday 29 April 2017 (35894)CDS

FRS Assigned & CLOSED/RESOLVED for h1susex frontend crash:

https://services.ligo-la.caltech.edu/FRS/show_bug.cgi?id=7995

H1 SUS (SUS)
peter.king@LIGO.ORG - posted 07:12, Friday 28 April 2017 - last comment - 11:10, Friday 28 April 2017(35861)
H1SUSEX reboot
H1SUSEX front end computer and IO chassis were rebooted this morning to deal with the issue posted by Corey.



Richard / Peter
Images attached to this report
Comments related to this report
david.barker@LIGO.ORG - 07:24, Friday 28 April 2017 (35862)

Same failure mode on h1susex today as h1seiex had yesterday. Therefore we were not able to take h1susex out of the Dolphin fabric, and so all Dolphin-connected models were glitched after the reboot of h1susex.

Richard power cycled h1susex and its IO Chassis. I killed all models on h1seiex and h1iscex, and then started all models on these computers. No IRIG-B timing excursions. Cleared IPC and CRC errors, Corey reset the SWWD. IFO recovery has started.

david.barker@LIGO.ORG - 10:32, Friday 28 April 2017 (35866)

Here are the front end computer uptimes (times since last reboot), taken at 10:02 this morning. The longest any machine has run is 210 days, since the site power outage on 30 Sep 2016. (A quick check against the 208.5-day wrap point is sketched after the list.)

h1psl0          up 131 days, 18:38,  0 users,  load average: 0.37, 0.13, 0.10
h1seih16        up 210 days,  3:00,  0 users,  load average: 0.11, 0.14, 0.05
h1seih23        up 210 days,  3:00,  0 users,  load average: 0.62, 1.59, 1.37
h1seih45        up 210 days,  3:00,  0 users,  load average: 0.38, 1.31, 1.17
h1seib1         up 210 days,  3:00,  0 users,  load average: 0.02, 0.04, 0.01
h1seib2         up 210 days,  3:00,  0 users,  load average: 0.02, 0.08, 0.04
h1seib3         up 210 days,  3:00,  0 users,  load average: 0.00, 0.05, 0.06
h1sush2a        up 210 days,  3:00,  0 users,  load average: 1.64, 0.59, 0.56
h1sush2b        up 210 days,  3:00,  0 users,  load average: 0.00, 0.00, 0.00
h1sush34        up 210 days,  3:00,  0 users,  load average: 0.00, 0.03, 0.00
h1sush56        up 210 days,  3:00,  0 users,  load average: 0.00, 0.00, 0.00
h1susb123       up 210 days,  3:00,  0 users,  load average: 0.17, 1.07, 1.10
h1susauxh2      up 210 days,  3:00,  0 users,  load average: 0.00, 0.00, 0.00
h1susauxh34     up 117 days, 17:07,  0 users,  load average: 0.08, 0.02, 0.01
h1susauxh56     up 210 days,  3:00,  0 users,  load average: 0.00, 0.00, 0.00
h1susauxb123    up 210 days,  2:07,  0 users,  load average: 0.00, 0.00, 0.00
h1oaf0          up 164 days, 20:16,  0 users,  load average: 0.10, 0.24, 0.23
h1lsc0          up 207 days, 41 min,  0 users,  load average: 0.06, 0.57, 0.65
h1asc0          up 210 days,  3:00,  0 users,  load average: 1.03, 1.82, 1.80
h1pemmx         up 210 days,  3:53,  0 users,  load average: 0.05, 0.02, 0.00
h1pemmy         up 210 days,  3:53,  0 users,  load average: 0.00, 0.00, 0.00
h1susauxey      up 205 days, 23:41,  0 users,  load average: 0.07, 0.02, 0.00
h1susey         up 210 days,  3:06,  0 users,  load average: 0.14, 0.04, 0.01
h1seiey         up 206 days, 51 min,  0 users,  load average: 0.00, 0.03, 0.00
h1iscey         up 210 days,  3:07,  0 users,  load average: 0.04, 0.21, 0.20
h1susauxex      up 210 days,  3:16,  0 users,  load average: 0.00, 0.00, 0.00
h1susex         up  3:06,  0 users,  load average: 0.00, 0.00, 0.00
h1seiex         up 1 day,  2:57,  0 users,  load average: 0.00, 0.00, 0.00
h1iscex         up 177 days, 21:59,  0 users,  load average: 0.08, 0.33, 0.24
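
As an aside, a quick way to flag hosts that are past the ~208.5-day wrap point discussed above (uptimes copied from the list; note that most of these have not crashed, which is part of the open question):

# Flag hosts whose uptime exceeds the ~208.5-day sched_clock wrap point.
WRAP_DAYS = 2**54 / 1e9 / 86400.0          # ~208.5
uptime_days = {'h1seih16': 210, 'h1seib1': 210, 'h1susb123': 210,
               'h1asc0': 210, 'h1iscey': 210, 'h1iscex': 177, 'h1psl0': 131}
for host, days in sorted(uptime_days.items()):
    if days > WRAP_DAYS:
        print('{} at {} days -- past the wrap point'.format(host, days))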

 

david.barker@LIGO.ORG - 11:10, Friday 28 April 2017 (35868)

Here is the list of free RAM on the front end computers in kB:

h1psl0        4130900
h1seih16    4404868
h1seih23    4023316
h1seih45    4024452
h1seib1        4754280
h1seib2        4763256
h1seib3        4753960
h1sush2a    4009216
h1sush2b    5160476
h1sush34    4389108
h1sush56    4400720
h1susb123    4013144
h1susauxh2    5338804
h1susauxh34    5351172
h1susauxh56    5350144
h1susauxb123    5339900
h1oaf0        9102096*
h1lsc0        4065228
h1asc0        3988536
h1pemmx        5358464
h1pemmy        5358352
h1susauxey    5352196
h1susey        64277012~
h1seiey        4758336
h1iscey        4117788
h1susauxex    5349644
h1susex        64301796~
h1seiex        4769840
h1iscex        4138204

* oaf has 12GB
~ end station sus have 66GB

 

H1 CDS (CDS, SUS)
corey.gray@LIGO.ORG - posted 06:01, Friday 28 April 2017 - last comment - 07:46, Friday 28 April 2017(35857)
12:44 Out Of Observing: EX SUS FrontEnds DOWN!!

(I'm almost cutting & pasting from what happened almost 24 hrs ago!  Yesterday it was EX SEI; today it is EX SUS!)

At 12:44 UTC (5:44am PDT):

(attached is a screenshot of all the WHITE screens we have for EX.)

Images attached to this report
Comments related to this report
corey.gray@LIGO.ORG - 07:46, Friday 28 April 2017 (35860)

Tally of activities taken to recover 

  • Richard (& I) had trouble logging into EX Frontends (and any other ones); this was due to the controls password being changed.
  • Dave called in.  
  • 13:42 Richard heading out to EX
  • 13:42 Since we'll probably be rebooting the EX World, taking any Guardian Nodes I can access to a SAFE state (i.e. SEI, HEPI, etc.)
    • Since I did not see an obvious "SAFE" state on the SEI/HEPI/ISI Guardian Nodes (or anything in the SEI wiki), I called Hugh and he suggested I take the EX SEI Guardian Node to OFFLINE.
  • 13:51 h1susex booted by Richard
  • 14:05 Richard back to corner station (Dave must still be working on getting us back & GREEN at EX).
  • 14:11 Dave restarting models
  • When we lost EX SUS, the SUS & TMS bias sliders shifted.  Used Time Machine to find the values from ~5:05am PDT, while also trending Oplev signals (for ETMx) & M1_DAMP pit & yaw signals (for TMSx).  The sliders shifted during the reboot this morning (and yesterday morning).  
    • 14:35  Started with no light on the ALSx camera---tweaking the sliders to restore the signals above brought back a spot on the ALSx camera!
    • (NO DITHER scripts used this morning!)
    • OK, working on getting things to a decent state for Jim, so that hopefully all he'll have to do is an Initial Alignment.

 
