At 14:27 PDT (2127 UTC) we experienced a power outage that affected the corner station, bringing down everything that was not connected to backup power. The fire alarm went off briefly after the power went out, and we had to send in a party to get the people out of the LVEA. Once the fire alarm had stopped and we confirmed that it was due to the power outage, we went back into the control room to begin recovery. The first major obstacle was finding the power outage recovery document that we knew had been made before.
- 2127 UTC Power outage
- ~2135 UTC Everyone out of the LVEA and PSL enclosure
- Clean rooms reported running and purge air back on
- End stations unaffected
- Chiller yard quickly recovered
- See Betsy's alog for initial actions: alog60679
At this point the facilities team and vacuum team had started their own, separate recovery efforts. (M080334, the extended recovery procedure, might be their document to follow, but it is out of date.)
In the control room we mainly followed the T1500545 document, as well as a CDS wiki and the LLO power outage documents.
The following notes follow the T1500545 doc mostly in the sequence we did them; some tasks were done in parallel or were circled back to. Starting at section 6.2 of T1500545:
- 6.2 Informed necessary personnel
- 6.3 Network switches - Confirmed already up
- 6.4 Work stations - Confirmed back up. Operator station remained functional during outage
- 6.5 Wiki, alog, CDS Overview - All confirmed functional
- 7.1 Fil went to the CER and confirmed all LV DC power supplies are ON. All HV supplies were off, and we will keep them that way until we fully recover. One caveat: the SQZ TTFSS HV was on.
- 7.2 Timing master - confirmed okay
- U38 Port 41 had a yellow light
- 7.3 IRIG-B fanout - confirmed okay
- 7.4 RFM - This section is out of date; RFM is now wrapped into the Dolphin system. It was confirmed as okay as it could be without the FEs up.
- 7.6 Jim started the HEPI controllers, but would not ramp up until we had front ends
- 7.5 Dave, Erik, Jonathan, Patrick brought back FEs and Beckhoff (see their alog when written)
- Even though the end station FEs were not affected by the power outage, the SUS and ISC end station models needed to be restarted to catch them up in the Dolphin network
- HWS computers are on, but the HWS table is still not hooked up, so we will leave it there
- Once the SUS and SEI FEs were back up, the software watchdogs were recovered by the whole control room and Dave (a scripted version of this sweep is sketched after this list)
- All SUS and SEI recovered (2307 UTC, 1507 PDT)
- 7.6 Jim spun up the HEPI pump stations. Found a fluid level trip; he adjusted the level by ~1/4", which he attributes to the large changes in temperature lately.
- 7.7 Work stations - all okay
- 7.8 Guardian - All nodes running okay
- 7.9 Control room wall monitors - recovered as best we could
- 8 & 9
- EY lasers were already down, so these will need to be recovered later.
- EX lasers will need to be checked as well
- End station SUS and SEI recovered after model restarts
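As an aside, the software watchdog sweep mentioned above is scriptable. Below is a minimal sketch using pyepics; the FEC STATE_WORD pattern is a standard front-end diagnostic, but the DCU ID map and the *_WD_RESET channel pattern are assumptions for illustration, not the exact channels we used.

    # Sketch only: survey front-end state words, then reset tripped
    # software watchdogs. PV names marked "assumed" are illustrative
    # stand-ins for the site's actual channels.
    from epics import caget, caput

    MODELS = {21: "h1susitmy", 22: "h1susbs"}  # hypothetical DCU ID map

    for dcuid, model in MODELS.items():
        state = caget(f"H1:FEC-{dcuid}_STATE_WORD")
        if state is None:
            print(f"{model}: no response (FE likely still down)")
        else:
            print(f"{model}: STATE_WORD = {int(state):#x}")

    for optic in ["ITMY", "BS"]:              # illustrative subset
        caput(f"H1:SUS-{optic}_WD_RESET", 1)  # assumed reset PV name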
Not included in the T1500545 doc, but addressed:
- End station SUS and SEI were found tripped due to the Dolphin network being down.
- Interlock system was also tripped. This was brought back by Patrick and others
- TCS chillers brought back by Camilla
- The dust monitor vacuum pump was found running loudly with its carbon filter ejected. Gerardo screwed the filter back in place and all was okay
- PCAL is not mentioned in the doc; an expert will need to check on it
- Optical levers were recovered
- PSL was placed into Science mode
- High voltage for HAM7 was turned on, but others will not be turned back on until we are back under vacuum
- Cluster was not impacted by the outage
- The LSC timing comparator failed (port 13 at fanout A); this will be replaced later
To be done:
- Recover lasers at EY, check on lasers at EX
- Replace LSC timing comparator in Fanout A port 13
- The vacuum team plans to make or update a document describing their recovery process
- The Facilities team plans to also update their doc
- Rahul has a list of items to add to the T1500545 doc, and we will update it collectively
- Add more breadcrumbs leading to the recovery document. Jeff added links in the wikis, but some of the aforementioned DCC docs also need references to each other
Summary - The corner station recovered from the power outage in roughly 2.5 hours. It was a fantastic effort by many people in the lab; I really appreciate how fast everyone jumped on the systems they knew and kept the operator (me) informed the entire time. While this was not a full power outage recovery, which would normally include relocking the IFO, it came at a great (read: surprising) time to test ourselves and our documentation for this event. The lessons learned and areas to improve will be addressed, and we will be better off next time around.
Timing error was cleared by power cycling the Timing Comparator Chassis.
Details on CDS recovery:
Fil, Jonathan, Erik, Dave
All the corner station front end machines had power cycled, with the exception of h1psl0 [this is a surprise; somehow it is on UPS power]. The end station computers had not power cycled, but those on the Dolphin fabric had stopped running their models due to a large time glitch [with the exception of the SEI systems at both ends; we don't know why this is]. The list of systems we did not restart/reboot is smaller than the list of those we did (a sketch of this triage follows the list); it is:
h1pemmx, h1pemmy, h1susauxex, h1susauxey, h1seiex, h1seiey.
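Written out, the triage amounted to the following sketch. In practice we worked from the CDS overview rather than a script, and the GPS time below is a placeholder, not the actual event time.

    # Sketch of the post-outage triage described above. Inputs would come
    # from whatever monitoring is available (CDS overview, EPICS, ssh).
    OUTAGE_GPS = 1_400_000_000  # placeholder, not the actual event time

    def recovery_action(booted_gps, models_running):
        """Classify what a front end needed after the outage (sketch)."""
        if booted_gps > OUTAGE_GPS:
            # corner station case: host lost power and rebooted
            return "full recovery; DAQ must rediscover it"
        if not models_running:
            # end station SUS/ISC case: Dolphin time glitch stopped models
            return "restart models only"
        # end station SEI case (h1seiex/h1seiey): nothing needed
        return "leave running"

    print(recovery_action(1_400_000_500, True))   # rebooted host
    print(recovery_action(1_399_999_000, False))  # models stopped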
After Fil verified that all the IO Chassis in the CER were operational, we rebooted h1psl0 as a test [we were suspicious of this system].
We started powering up the corner station front ends, SUS first. At that point we remembered that we should have disabled all the Dolphin IX switch ports in the MSR, which we did before going on with the machine recovery.
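The ordering lesson here, as a sketch; all the helper functions are hypothetical stubs standing in for the actual Dolphin IX switch management and power control interfaces, which differ.

    # Sketch of the intended order of operations when bringing FEs back
    # onto the Dolphin fabric. All helpers are hypothetical stubs.
    def disable_ix_port(switch, host): print(f"{switch}: fence port for {host}")
    def enable_ix_port(switch, host): print(f"{switch}: unfence port for {host}")
    def power_on(host): print(f"power on {host}")

    def recover_dolphin_hosts(switch, hosts):
        for h in hosts:
            disable_ix_port(switch, h)  # fence ports FIRST so booting FEs
                                        # cannot glitch the live fabric
        for h in hosts:
            power_on(h)                 # then bring FEs and their models up
        for h in hosts:
            enable_ix_port(switch, h)   # rejoin the fabric once stable

    recover_dolphin_hosts("MSR-IX", ["h1sush2a", "h1lsc0"])  # illustrative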
As each of these computers came back online, its DAQ status was 0xbad. Neither DAQ DC was discovering these computers, though both had discovered h1psl0 after its reboot (and the DAQ status for the end and mid station machines continued to be good). The general rule was that any computer which had powered down in the outage was not being discovered by the DAQ. Jonathan restarted DC1, which fixed its FE discovery of the systems we had already brought back and those we were bringing back. Later he restarted DC0 to restore all the FEs into the DAQ.
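For the record, scanning for the 0xbad pattern can be scripted channel by channel; a minimal pyepics sketch, where the data concentrator status PV pattern is an assumption for illustration:

    # Sketch: scan per-model DAQ status for the 0xbad value noted above.
    # The H1:DAQ-DC0_*_STATUS pattern is an assumed PV name.
    from epics import caget

    for model in ["H1SUSBS", "H1SUSITMY", "H1LSC"]:   # illustrative subset
        status = caget(f"H1:DAQ-DC0_{model}_STATUS")  # assumed PV pattern
        if status is not None and int(status) == 0xbad:
            print(f"{model}: 0xbad, not discovered by the data concentrator")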
Minor issue: the V1 SEI front ends started with a negative IRIG-B value, which delayed their DAQ recovery by about 10 minutes. All the upgraded systems with the new timing card saw no delay.
At the end stations the models were restarted on both SUS and ISC systems. We debated whether to power cycle these, but decided against it since nothing had been powered down.
The three non-FEC diskless systems (h1cdsrfm, h1build, h1ecatmon0) are UPS'ed and did not power cycle.