Displaying report 1-1 of 1.
Reports until 17:45, Tuesday 26 March 2013
patrick.thomas@LIGO.ORG - posted 17:45, Tuesday 26 March 2013 - last comment - 17:51, Thursday 11 April 2013(5893)
troubleshooting h1ecatc1
Chris, Daniel, Patrick

In short: Swapping the real-time Ethernet card in the h1ecatc1 computer appears to have resolved its periodic crashes and restarts.

Last week and possibly the week before, the computer running the EtherCAT system at the corner station, h1ecatc1, had been spontaneously crashing and restarting. Each time this occurred there was a message upon login that Windows had recovered from an unexpected error.

My first suspicion was that one of the installed programs, OpenOffice.org, which periodically asked for Java updates, was causing the problem, so I removed it. I also removed some other applications that I cannot recall at this time. However, the problem persisted.

Looking at the Windows minidump file, it appeared that the issue was with one of the TwinCAT processes. The system manager was also reporting CRC errors and lost frames. I attempted to narrow down the problem by systematically disconnecting EtherCAT chassis. I removed the fiber connected to the test setup of end X in the MSR. This fiber was broken on one end, and I destroyed it. I also disconnected the fiber to end Y. I tried various combinations of the remaining chassis in the H1 electronics room. At one point, suspecting the ISC Common Chassis, I swapped it with one from H2. While things may have appeared to run better at one point with fewer chassis attached, the problem eventually returned.

I also tested the fiber connection from the MSR to the H1 electronics room. I removed the fiber connection from the ISC Common Chassis in the H1 electronics room and attached it to a communications tester that Daniel had made. With this setup, the h1ecatc1 computer was connected to the Corner MSR Chassis, which was connected by the existing fiber to the communications tester. This appeared to work. Nevertheless, the problem persisted after reconnecting to the chassis. However, we realized that the orange multimode fiber patches from the Beckhoff chassis in the MSR and H1 electronics room do not match the type of fiber running between these rooms. These patches, along with the fiber in the chassis itself, should be swapped for 50/125 µm (aqua) fiber. This has not been done yet.

Last Friday (March 22) I looked more closely in the system manager at the diagnostics for just the real-time Ethernet card. I had the impression that the card itself may have been reporting errors, and decided to try swapping it. I replaced it with the card from the slow computer in the rack in the computer user's room. This appeared to fix the CRC errors, and I left it running over the weekend. It was still running on Monday morning, so the swap appears to have done the trick.

Caveat: My memory is somewhat fuzzy and the above steps may not have occurred in exactly the order described.
Comments related to this report
patrick.thomas@LIGO.ORG - 17:51, Thursday 11 April 2013 (6043)
My handwritten notes:

1:00 PM lost PSL environment channels, H1ECATC1 restart?
jucheck.exe, Oracle America, Inc.
"C:\Program Files\Common Files\Java\Java Update\jucheck.exe"
 -auto -scheduled
H1:SUSH34 <- not sure what this means, if even related
restarted again trying to browse to EPICS target directory
C:\Windows\Minidump\030513-30139-01.dmp
C:\Users\controls\AppData\Local\Temp\WER-55941-0.sysdata.xml
 removing OpenOffice.org 3.3
 remove Java 7 Update 11
 remove Adobe Flash Player 11 ActiveX

burtrestored 2013/03/04/00:00
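
To cross-check a dump against the restart times, the date can be read off the minidump file name. Below is a minimal Python sketch. It assumes the name follows the MMDDYY-number-sequence pattern commonly seen for Windows minidumps; this is an assumption and was not verified on h1ecatc1. The helper name parse_minidump_name is hypothetical, and the meaning of the middle field is not interpreted here.

```python
from datetime import datetime


def parse_minidump_name(name):
    """Split a name like '030513-30139-01.dmp' into (date, middle field, sequence).

    Assumes the first field is MMDDYY (an assumption, not confirmed for this machine).
    """
    stem = name.rsplit(".", 1)[0]               # drop the '.dmp' extension
    date_part, middle, seq = stem.split("-")    # three dash-separated fields
    crash_date = datetime.strptime(date_part, "%m%d%y").date()
    return crash_date, int(middle), int(seq)


crash_date, middle, seq = parse_minidump_name("030513-30139-01.dmp")
print(crash_date.isoformat(), middle, seq)
```

Under that assumption, the file in the notes above would date to 5 March 2013, the day after the snapshot used for the burtrestore.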