H1 DAQ
jonathan.hanks@LIGO.ORG - posted 10:58, Friday 19 May 2017 (36287)
Investigation of the H1NDS1 crash of 15 May 2017 (WP 6638)

H1NDS1 locked up on Monday at around 10:51am local time.  Here is what happened, to the best of my understanding.

Background

TJ was testing his operator alog summary generation script with a pre-release version of the nds2-client at my request.  The build of the nds2-client was updated to fix some serious performance regressions in the streaming data interface.  When TJ ran his script it ran for a while, then timed out.  At this point H1NDS1 was unresponsive and required a restart of the nds/daqd services.

Analysis

The logs showed that the same query was being repeated over and over again, followed eventually by a large number of socket-closed events.

Reviewing the code, I found that it ran into an endless loop: if the query failed it would retry and retry and retry ...  This is what locked up the nds1 server.  The reason the script worked with an older nds2-client (0.12.2) and failed with the newer client is that there was a gap in the data.  The 0.12.2 client just skipped over the gap, whereas the 0.14.x client raised an error saying there was a gap.  This error triggered a retry, which raised an error, which triggered a retry, which ...
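
For illustration only, here is a minimal sketch of that failure pattern; it is not TJ's actual script, and the fetch_data helper and exception type are hypothetical stand-ins.  The point is that an unconditional retry on every error, with no retry limit or backoff, turns a permanent condition like a data gap into the same query being issued against the server forever.

    def fetch_with_retry(fetch_data, start, stop, channels):
        """Keep asking the server for the same span until it succeeds.

        This mirrors the problematic pattern: any error (including a
        legitimate 'gap in data' error from the 0.14.x client) triggers
        an immediate, identical retry.
        """
        while True:
            try:
                return fetch_data(start, stop, channels)  # hypothetical helper
            except RuntimeError:
                # No backoff, no retry cap, and no check of *why* the
                # request failed: a permanent gap loops endlessly and
                # hammers the server with the same request.
                continue

A bounded retry count, or a check that distinguishes transient failures from a reported gap, would break this loop.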

The gap was not 'missing' data; it was data that was temporarily not accessible via H1NDS1.  For background, there are two sources for the nds1 server to pull from: its in-memory buffer (which is fed by a continuous stream of data from the data concentrator) and the files on disk.  It takes about 70-120s (depending on where the data falls in the frame, I/O load on the disk system, ...) from the time the data leaves the data concentrator until it is written to disk.  Each raw frame is 64s long and takes roughly 40s to write, so you need about 100s from the time the frame writer gets the first second's worth of data for a frame until the frame can be read from disk.  The nds1 servers keep a buffer of live data in memory and read from it where they can.  H1NDS1 is configured to keep 50s of data in memory, which leaves a gap of about 55s when you transition from data on disk to data in the in-memory buffer.
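
As a rough back-of-the-envelope check (using the approximate figures quoted above, not measured values), the gap is just the worst-case disk latency minus the in-memory buffer depth:

    # Approximate timing figures from the discussion above.
    frame_length_s  = 64   # each raw frame covers 64 s of data
    frame_write_s   = 40   # roughly 40 s to write a frame to disk
    memory_buffer_s = 50   # H1NDS1 keeps ~50 s of live data in memory

    # The oldest data in a frame waits for the whole frame to be
    # collected and then written before it is readable from disk.
    disk_latency_s = frame_length_s + frame_write_s   # ~104 s

    # Data newer than the oldest frame on disk but older than the
    # in-memory buffer is temporarily unreachable from either source.
    gap_s = disk_latency_s - memory_buffer_s          # ~54 s
    print(f"approximate gap: {gap_s} s")

This is consistent with the ~55s gap quoted above, and with the note below that a 100s buffer covers the disk latency so no gap appears in normal operation.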

As a side note, h1nds0 (which guardian uses) has a 100s buffer and does not see a gap unless there is a glitch in writing the frame.  I believe that this is how the nds1 servers at LLO are configured.

Testing this Morning

I ran some tests on the production system this morning from roughly 7:00-7:30am local time.  I was able to reproduce the endless error loop.  I did not see H1NDS1 lock up, probably because the connections closed right after each request for data (which is what I would expect).  So I am not sure what kept the connections open until TJ's script was terminated on Monday.

Recommendations

I recommend the following: