Displaying report 1-1 of 1.
Reports until 17:30, Wednesday 17 November 2021
LHO General (OpsInfo)
thomas.shaffer@LIGO.ORG - posted 17:30, Wednesday 17 November 2021 - last comment - 11:25, Thursday 18 November 2021(60683)
Power outage log

At 1:27 PDT we experienced a power outage that affected the corner station, bringing down everything that was not connected to backup power. The fire alarm went off briefly after the power went out, and we had to send in a party to get the people in the LVEA. Once the fire alarm had stopped and we had confirmed that the alarm was due to the power outage, we went back into the control room to recover. The first major obstacle was finding a power outage recovery document that we knew we had made before.

At this point the facilities team and vacuum team had started their own, separate recovery efforts. (M080334, the extended recovery procedure, might be their document to follow, but it is out of date.)

In the control room we mainly followed the T1500545 document, as well as a CDS wiki and the LLO power outage documents:

The notes below follow the T1500545 doc, whose steps we did mostly in sequence. Some tasks were done in parallel or were circled back to. Starting at 6.2 in T1500545:

Not included in the T1500545 doc, but addressed:

 

To be done:

 

Summary - The corner station recovered from the power outage in roughly 2.5 hours. It was a fantastic effort by many people in the lab; I really appreciate how fast everyone jumped on the systems they knew and kept the operator (me) informed the entire time. While this may not have been a full power outage recovery, which would normally include relocking the IFO, it came at a great (read: surprising) time to test ourselves and our documentation for this event. The lessons learned and areas to improve will be addressed, and we will be better off next time around.

Comments related to this report
filiberto.clara@LIGO.ORG - 10:51, Thursday 18 November 2021 (60696)

Timing error was cleared by power cycling the Timing Comparator Chassis.

david.barker@LIGO.ORG - 11:25, Thursday 18 November 2021 (60697)

Details on CDS recovery:

Fil, Jonathan, Erik, Dave

All the corner station front end machines had power cycled with the exception of h1psl0 [this is a surprise, somehow this is on UPS power]. The end station computers had not power cycled, but those on the Dolphin fabric had stopped running their models due to a large time glitch [with the exception of the SEI systems at both ends, we don't know why this is]. The list of systems we did not restart/reboot is shorter than the list of those we did:

h1pemmx, h1pemmy, h1susauxex, h1susauxey, h1seiex, h1seiey.

After Fil verified that all the IO Chassis in the CER were operational, we rebooted h1psl0 as a test [we were suspicious of this system].

We started powering up the corner station front ends, SUS first. At that point we remembered that we should have disabled all the Dolphin IX switch ports in the MSR, which we did before going on with the machine recovery.

As each of these computers came back online, their DAQ status was 0xbad. Neither DAQ DC was discovering these computers, but both had discovered h1psl0 after its reboot (and the DAQ status for the End and Mid machines continued to be good). The general rule was that any computer which was powered down in the outage was not being discovered by the DAQ. Jonathan restarted DC1, which fixed its FE discovery of the systems we had already brought back and those we were bringing back. Later he restarted DC0 to restore all the FEs into the DAQ.
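
As a rough illustration of the rule above, here is a minimal Python sketch that picks out the front ends reporting a DAQ status of 0xbad from a snapshot of status words. This is not the tool we used during the recovery. The snapshot dictionary, the helper name undiscovered_front_ends, the placeholder hosts corner-fe-a and corner-fe-b, and the use of 0x0 as a "good" status are all assumptions made for the example; how the status words would actually be read out is left open.

```python
# Status value reported for the rebooted front ends in the log above.
BAD_DAQ_STATUS = 0xBAD


def undiscovered_front_ends(daq_status):
    """Return the hosts whose DAQ status word equals 0xbad, sorted by name.

    daq_status is a hypothetical mapping of host name -> integer status word.
    """
    return sorted(
        host for host, status in daq_status.items() if status == BAD_DAQ_STATUS
    )


# Illustrative snapshot only: 0x0 as "good" is an assumption, and the
# corner-fe-* names are placeholders, not real front end names.
snapshot = {
    "h1psl0": 0x0,         # rediscovered after its reboot
    "h1seiex": 0x0,        # end station machine, never lost DAQ
    "corner-fe-a": 0xBAD,  # placeholder for a power-cycled front end
    "corner-fe-b": 0xBAD,  # placeholder for a power-cycled front end
}

print(undiscovered_front_ends(snapshot))
```

A list like this would make it easy to see when a DC restart has picked up every front end: the returned list should be empty.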

Minor issue: the V1 SEI front ends started with a negative IRIG-B, which delayed their DAQ recovery by about 10 minutes. All the upgraded systems with the new timing card did not see any delay.

At the end stations the models were restarted on both SUS and ISC systems. We debated whether to power cycle these, but decided against it since nothing had been powered down.

The three non-FEC diskless systems (h1cdsrfm, h1build, h1ecatmon0) are UPS'ed and did not power cycle.

 

 
