Reports until 15:29, Monday 01 May 2017
H1 CDS
david.barker@LIGO.ORG - posted 15:29, Monday 01 May 2017 - last comment - 20:44, Monday 01 May 2017(35947)
rebooted all gentoo computers with uptimes exceeding 208.5 days

Ed, Jeff K, Richard, Dave:

h1seib2 locked up at 13:05 PDT this afternoon after running for 215 days. Its console reported an uptime of 399,422 seconds, and the error was the same as that seen on the other lockup consoles (fixing recursive fault but reboot is needed!).
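For scale, the console's uptime figure can be converted to days. This is only arithmetic on the number quoted above; the function name is illustrative and not part of any CDS tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def seconds_to_days(seconds):
    """Convert an uptime in seconds to days."""
    return seconds / SECONDS_PER_DAY


# Uptime reported on the h1seib2 console at lockup.
console_uptime_days = seconds_to_days(399_422)
print(f"{console_uptime_days:.2f} days")  # prints "4.62 days"
```

The console counter therefore corresponds to about 4.6 days, far less than the 215 days the machine had actually been running.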

I rebooted all of the front end computers with run-times exceeding 208 days. The front ends which were not rebooted are (followed by their uptimes): h1oaf0 (168 days), h1psl0 (134 days), h1seiex (4 days), h1susex (3 days).

I also rebooted the DAQ computer h1nds0 which had been running 209 days. The following DAQ computers were not rebooted: h1dc0 (194 days), h1nds1 (91 days), h1tw1 (111 days), h1broadcast0 (57 days), h1build (102 days), h1boot (90 days).
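The selection rule used here (reboot anything past the 208.5 day mark) can be sketched as a simple filter. The uptimes are the pre-reboot values quoted in this report; the function names and the dictionary layout are assumptions made for illustration.

```python
REBOOT_THRESHOLD_DAYS = 208.5  # threshold taken from the report title

# Pre-reboot uptimes in days, as quoted in this report.
uptime_days = {
    "h1oaf0": 168, "h1psl0": 134, "h1seiex": 4, "h1susex": 3,
    "h1nds0": 209, "h1dc0": 194, "h1nds1": 91, "h1tw1": 111,
    "h1broadcast0": 57, "h1build": 102, "h1boot": 90,
}


def needs_reboot(uptimes, threshold=REBOOT_THRESHOLD_DAYS):
    """Return the hosts whose uptime exceeds the threshold, sorted by name."""
    return sorted(host for host, days in uptimes.items() if days > threshold)


def closest_to_threshold(uptimes, threshold=REBOOT_THRESHOLD_DAYS, count=3):
    """Return the hosts still under the threshold, longest uptime first."""
    remaining = {h: d for h, d in uptimes.items() if d <= threshold}
    return sorted(remaining, key=remaining.get, reverse=True)[:count]


print(needs_reboot(uptime_days))          # ['h1nds0']
print(closest_to_threshold(uptime_days))  # ['h1dc0', 'h1oaf0', 'h1psl0']
```

Of the machines listed with an uptime above, only h1nds0 crosses the threshold; the second call ranks the remaining machines by how soon they would reach it.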

Notes:

h1psl0 and h1oaf0 could be scheduled for reboots during the May vent event.

h1dc0 is scheduled to be rebooted during 5/2 maintenance, to clear this before the May vent event.

Comments related to this report
david.barker@LIGO.ORG - 15:38, Monday 01 May 2017 (35948)

reboot details:

I was able to take h1seib2 out of the Dolphin fabric before it was reset. I found that I was able to manage the computer via the IPMI management port (so only the standard gigabit ethernet ports were disabled) and reset the computer using this method. The computer came back correctly.

For every dolphin'ed corner station machine the procedure was: kill all models, become root user, take local host out of dolphin fabric, issue 'reboot' command. This worked on every machine, and the h1psl models were never glitched.

For every non-dolphin'ed corner station machine the procedure was the same sans the dolphin removal.
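The two variants of the procedure differ by a single step, which can be written down as an ordered checklist. This is a sketch only: the step strings are descriptive labels copied from the text above, not actual commands (apart from 'reboot'), and the function name is hypothetical.

```python
def reboot_steps(on_dolphin):
    """Return the ordered reboot checklist for one front end computer.

    on_dolphin: True if the machine is part of the Dolphin fabric.
    """
    steps = ["kill all models", "become root user"]
    if on_dolphin:
        # Only Dolphin'ed machines need to leave the fabric first.
        steps.append("take local host out of dolphin fabric")
    steps.append("issue 'reboot' command")
    return steps


for number, step in enumerate(reboot_steps(on_dolphin=True), start=1):
    print(f"{number}. {step}")
```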

At EX, h1iscex was the only dolphin'ed machine that needed a reboot. But the procedure failed, and the machine did not come back from the soft reboot. When it was reset via IPMI, all other EX models were dolphin glitched and needed code restarts (not computer reboots). The sus-aux rebooted with no issues.

At EY things were worse. Following the procedure, none of the dolphin'ed machines restarted correctly after the soft reboot. IPMI resets got h1seiey and h1iscey going, but not h1susey (faster computer model). We ended up IPMI power-cycling h1susey to get the code to run. This is also how h1susex was recovered last Friday morning. The sus-aux rebooted with no issues.
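The recovery at the end stations followed an escalation ladder: soft reboot, then IPMI reset, then IPMI power cycle. A minimal sketch of that pattern is below. Everything in it is a placeholder: the action callables and the health check are hypothetical stand-ins, and no real IPMI or Dolphin commands are issued.

```python
def recover(actions, is_healthy):
    """Try each (name, action) pair in order until the host is healthy.

    Returns the name of the action that worked, or None if none did.
    """
    for name, action in actions:
        action()
        if is_healthy():
            return name
    return None


# Simulated host that, like h1susey, only comes back after a power cycle.
state = {"running": False}


def soft_reboot():
    pass  # placeholder: the simulated host stays down


def ipmi_reset():
    pass  # placeholder: the simulated host stays down


def ipmi_power_cycle():
    state["running"] = True  # placeholder: the simulated host recovers


ladder = [
    ("soft reboot", soft_reboot),
    ("IPMI reset", ipmi_reset),
    ("IPMI power cycle", ipmi_power_cycle),
]

print(recover(ladder, lambda: state["running"]))  # IPMI power cycle
```

Ordering the ladder from least to most disruptive means a machine that recovers early is never power-cycled unnecessarily.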

h1dc0 did not shut down cleanly (I think this is a known daqd issue) and needed a front panel RESET button press. It was a slow restart because the OS had been running in excess of 214 days (like we didn't know this!). It kicked out an error in attempting to NFS mount h1tw0 (absent) and started running.

 

david.barker@LIGO.ORG - 15:43, Monday 01 May 2017 (35949)

after the reboots, here are the current uptimes for the front end computers:

h1psl0 up 134 days
h1seih16 up 59 min
h1seih23 up 57 min
h1seih45 up 55 min
h1seib1 up 1:01
h1seib2 up 1:15
h1seib3 up 1:00
h1sush2a up 1:06
h1sush2b up 1:05
h1sush34 up 1:04
h1sush56 up 1:02
h1susb123 up 1:10
h1susauxh2 up 1 day
h1susauxh34 up 22 min
h1susauxh56 up 19 min
h1susauxb123 up 9:54
h1oaf0 up 168 days
h1lsc0 up 54 min
h1asc0 up 53 min
h1pemmx up 17 min
h1pemmy up 16 min
h1susauxey up 45 min
h1susey up 19 min
h1seiey up 35 min
h1iscey up 35 min
h1susauxex up 41 min
h1susex up 3 days
h1seiex up 4 days
h1iscex up 40 min
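The list above mixes three uptime formats ("N days", "N min" and "H:MM"). A small parser can normalize them to minutes so the machines can be compared or sorted. This is a sketch that assumes "H:MM" means hours and minutes, as in standard uptime output; the function names are illustrative.

```python
import re


def uptime_minutes(text):
    """Convert an uptime string such as '134 days', '59 min' or '1:01' to minutes."""
    text = text.strip()
    match = re.fullmatch(r"(\d+) days?", text)
    if match:
        return int(match.group(1)) * 24 * 60
    match = re.fullmatch(r"(\d+) min", text)
    if match:
        return int(match.group(1))
    match = re.fullmatch(r"(\d+):(\d{2})", text)  # assumed to be hours:minutes
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    raise ValueError(f"unrecognized uptime format: {text!r}")


def parse_line(line):
    """Split a line like 'h1seib1 up 1:01' into (host, minutes)."""
    host, _, rest = line.partition(" up ")
    return host.strip(), uptime_minutes(rest)


sample = ["h1psl0 up 134 days", "h1seih16 up 59 min", "h1seib1 up 1:01"]
for host, minutes in sorted(map(parse_line, sample), key=lambda pair: pair[1]):
    print(host, minutes)
```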

 

david.barker@LIGO.ORG - 20:44, Monday 01 May 2017 (35954)

here is a photo of h1seib2's console after it had locked up

Images attached to this comment