H1 CDS
sheila.dwyer@LIGO.ORG - posted 09:15, Sunday 03 September 2017 - last comment - 14:02, Sunday 03 September 2017(38498)
EPICS not updating

EPICS variables that originate from the RCG don't seem to be updating this morning, although ones that originate in Beckhoff are updating. I don't see anything wrong on the CDS overview except the timing system errors, which have been there since Tuesday.

On the GDS screens, the GPS times are all stuck at slightly different times around Sep 03 2017 11:42:43 UTC (so far I have seen times within about 10 seconds of each other, with all models on the same IOP stopped at the same time).

We have had what look like many nearby earthquakes (EQs) over the last 16 hours.
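
For reference, one quick way to confirm that an RCG-hosted channel has stopped updating is to poll a record that normally changes every second and compare two reads. Below is a minimal sketch using pyepics; the channel name is only a placeholder, not a verified H1 record.

# Minimal sketch (not site code): poll a front-end EPICS channel that normally
# increments every second (e.g. a model's GPS time record) and report it as
# stuck if the value does not change between two reads.
import time
import epics  # pyepics

CHANNEL = "H1:FEC-XX_TIME_DIAG"  # placeholder; substitute a real RCG channel

def looks_stuck(channel, wait=2.0):
    # Two reads a few seconds apart; a channel that counts up every second
    # should differ between them.  caget returns None if the channel is
    # unreachable, which is reported as "not stuck" here to keep things simple.
    first = epics.caget(channel)
    time.sleep(wait)
    second = epics.caget(channel)
    return first is not None and first == second

if __name__ == "__main__":
    print(f"{CHANNEL} stuck: {looks_stuck(CHANNEL)}")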

Comments related to this report
david.barker@LIGO.ORG - 09:34, Sunday 03 September 2017 (38499)

h1boot locked up around 04:40 PDT. Sheila is rebooting it.

david.barker@LIGO.ORG - 09:41, Sunday 03 September 2017 (38500)

h1boot is back, front ends look good. Sheila will try some testpoints and excitations.

david.barker@LIGO.ORG - 09:45, Sunday 03 September 2017 (38502)

Here are h1boot's system messages from early this morning. The last message before the freeze-up was an ntpd status change at approximately the time of the freeze; the next message is from the reboot at 09:39:08:

Sep  3 01:19:48 h1boot -- MARK --
Sep  3 01:39:48 h1boot -- MARK --
Sep  3 01:59:48 h1boot -- MARK --
Sep  3 02:19:48 h1boot -- MARK --
Sep  3 02:39:48 h1boot -- MARK --
Sep  3 02:59:48 h1boot -- MARK --
Sep  3 03:19:48 h1boot -- MARK --
Sep  3 03:39:48 h1boot -- MARK --
Sep  3 03:59:49 h1boot -- MARK --
Sep  3 04:19:49 h1boot -- MARK --
Sep  3 04:39:49 h1boot -- MARK --
Sep  3 04:41:40 h1boot ntpd[4865]: kernel time sync status change 6001
Sep  3 09:39:08 h1boot syslog-ng[4227]: Syslog connection established; fd='7', server='AF_INET(10.99.0.99:514)', local='AF_INET(0.0.0.0:0)'

david.barker@LIGO.ORG - 10:01, Sunday 03 September 2017 (38503)

Impact of the h1boot freeze-up:

The front-end real-time processes were not affected by the freeze, nor was their data transfer to the DAQ. All EPICS IOCs on the front ends froze, which mainly impacted the Guardian nodes, which were receiving stuck data. MEDM screens were also frozen at their 04:41 PDT values, and conlog did not receive any updates. I suspect testpoint and excitation operations would have been unavailable during the freeze.
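
A staleness watchdog along these lines could flag such a freeze as it happens. The sketch below is only illustrative (plain pyepics polling with a placeholder channel name), not the actual Guardian or conlog mechanism.

# Hedged sketch: warn when a monitored EPICS channel stops changing for
# longer than a threshold.  Channel name is a placeholder.
import time
import epics  # pyepics

CHANNEL = "H1:FEC-XX_STATE_WORD"   # placeholder record name
STALE_AFTER = 10.0                 # seconds without a change before warning

def watch(channel, stale_after=STALE_AFTER):
    last_value = epics.caget(channel)
    last_change = time.time()
    warned = False
    while True:
        time.sleep(1.0)
        value = epics.caget(channel)
        if value != last_value:
            # Value moved: record the change time and clear any warning.
            last_value, last_change = value, time.time()
            warned = False
        elif not warned and time.time() - last_change > stale_after:
            print(f"{channel} static for {time.time() - last_change:.0f} s -- possibly frozen")
            warned = True

if __name__ == "__main__":
    watch(CHANNEL)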

keith.thorne@LIGO.ORG - 14:02, Sunday 03 September 2017 (38505) CDS
Given that both sites had an EtherCAT front-end node lock up in the last week, and now the H1 boot server has locked up, we are likely hitting the known bug in kernel 2.6.34 where a system fails after more than 200 days of uptime (certainly the case for l1ecatc1). We had restarted everything so that uptimes would stay under that limit until the end of O2, but we are now past that point.

Really looking forward to getting the OK to install an updated OS on the IFO front ends.
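
As a rough sanity check on the ">200 days" figure: assuming this is the commonly cited sched_clock nanosecond-counter overflow in older kernels (an assumption on my part; the entry above only states "> 200 days"), the limit works out to roughly 208.5 days:

# Back-of-the-envelope check, assuming the failure is the sched_clock
# counter wrapping after 2**54 nanoseconds of uptime.
ns_limit = 2 ** 54               # nanoseconds before the counter wraps
days = ns_limit / 1e9 / 86400    # ns -> s -> days
print(f"{days:.1f} days")        # ~208.5 days, consistent with "> 200 days"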