H1 CDS
david.barker@LIGO.ORG - posted 13:17, Thursday 29 October 2020 - last comment - 16:59, Thursday 29 October 2020(57152)
h1sush2b accidentally crashed while h1sush2a BIO investigation ongoing

I'm holding its recovery off as it appears likely we will be power cycling h1sush2a for BIO card replacement.

Comments related to this report
david.barker@LIGO.ORG - 16:44, Thursday 29 October 2020 (57158)

Keita, Fil, Richard, Patrick, Jonathan, Erik, Dave:

During the investigation into a possible bad binary-input channel, Fil first replaced the binary input chassis for BIO-0. We then powered down h1sush2a and Fil replaced the first Contec 6464 BIO card in the IO Chassis. There was no change.

We found evidence that this is not a recent problem and may have been present for quite some time.

On power-up of h1sush2a, the DAQ restarted itself as expected. This time h1daqdc0 was running Jonathan's new code, which reports the dcu-ids of the models that send bad timing information on startup. We confirmed Rolf's suspicion that all five models on h1oaf1, not the five models on h1sush2a, were responsible for the DAQ restart.

I then restarted the models on h1sush2b and handed everything over to Patrick to untrip the SUS and SEI watchdogs and restore the suspensions.

jonathan.hanks@LIGO.ORG - 16:59, Thursday 29 October 2020 (57160)
Here are some logs from h1daqdc0 which show that h1oaf1 sent data with a wildly bad time, causing the DAQ restart. h1oaf1 takes its timing from Dolphin.

2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: Dropped data from shmem or received 0 dcus; gps now = 1288042371, 0; was = 1288042369, 15; dcu count = 5
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: #011expected gps = 1288042370
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: #011expected cycle = 0
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: #011expected nano = 0
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: first 5 dcuids seen
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: saw dcu 72
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: saw dcu 25
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: saw dcu 117
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: saw dcu 119
2020-10-29T14:32:32-07:00 h1daqdc0.cds.ligo-wa.caltech.edu daqd[7947]: saw dcu 42
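The check behind these messages can be sketched roughly as follows. This is an illustrative Python sketch reconstructed from the log output above, not the actual daqd implementation; the function names, the 16-cycles-per-second assumption, and the report layout are all assumptions.

```python
# Hypothetical sketch of daqd's timestamp continuity check, reconstructed
# from the log messages above. Not the real implementation.

CYCLES_PER_SEC = 16  # assumed number of daqd data cycles per GPS second

def next_expected(gps, cycle):
    """Advance a (gps, cycle) pair by one data cycle."""
    cycle += 1
    if cycle == CYCLES_PER_SEC:
        return gps + 1, 0
    return gps, cycle

def check_block(prev, received, dcu_ids, max_report=5):
    """Return None if the block arrived on time, else report lines
    resembling the log excerpt above (dcu ids capped at max_report)."""
    exp_gps, exp_cycle = next_expected(*prev)
    if received == (exp_gps, exp_cycle):
        return None
    lines = [
        f"Dropped data from shmem or received 0 dcus; "
        f"gps now = {received[0]}, {received[1]}; "
        f"was = {prev[0]}, {prev[1]}; dcu count = {len(dcu_ids)}",
        f"\texpected gps = {exp_gps}",
        f"\texpected cycle = {exp_cycle}",
        f"first {max_report} dcuids seen",
    ]
    lines += [f"saw dcu {d}" for d in dcu_ids[:max_report]]
    return lines
```

With the values from the h1daqdc0 excerpt, the previous block was (1288042369, 15), so the next expected block is (1288042370, 0); receiving (1288042371, 0) instead triggers the report, matching the "expected gps = 1288042370" line above.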

About 10 minutes after this restart, fw1 restarted its daqd.

2020-10-29T14:40:37-07:00 h1daqfw1.cds.ligo-wa.caltech.edu daqd[33175]: Dropped data from shmem or received 0 dcus; gps now = 1288042855, 9; was = 1288042855, 7; dcu count = 109
2020-10-29T14:40:37-07:00 h1daqfw1.cds.ligo-wa.caltech.edu daqd[33175]: #011expected gps = 1288042855
2020-10-29T14:40:37-07:00 h1daqfw1.cds.ligo-wa.caltech.edu daqd[33175]: #011expected cycle = 8
2020-10-29T14:40:37-07:00 h1daqfw1.cds.ligo-wa.caltech.edu daqd[33175]: #011expected nano = 8

Here it missed a cycle, either through running long or for some other unknown reason.  There is logic in daqd to ride through this: if the data for the skipped cycle is still in the mbuf, it just reads that and starts catching up.  However, we recently found a bug in that logic, which has been fixed.  I have a work permit in to install an updated daqd on the fw systems to deal with this.
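The intended ride-through behaviour can be sketched as follows. This is a minimal illustrative sketch, not the daqd code (and not the bug): the ring size, the one-second mbuf depth, and the function names are assumptions.

```python
# Illustrative sketch of "ride through a skipped cycle": if the reader has
# fallen behind the writer but the missed cycle's data is still resident in
# the shared-memory ring buffer (mbuf), read the missed slots and catch up
# rather than treating the gap as fatal. Assumed layout, not the real daqd.

CYCLES_PER_SEC = 16          # assumed cycles per GPS second
RING_SLOTS = CYCLES_PER_SEC  # assumed: mbuf holds one second of cycles

def cycles(gps, cycle):
    """Flatten a (gps, cycle) pair into an absolute cycle count."""
    return gps * CYCLES_PER_SEC + cycle

def ride_through(expected, current):
    """Return the ring-buffer slot indices to read, oldest first, so the
    reader catches up to `current`; None if the missed data is already
    overwritten and a ride-through is impossible."""
    gap = cycles(*current) - cycles(*expected)
    if gap < 0 or gap >= RING_SLOTS:
        return None  # too far behind: the skipped slots are gone
    start = cycles(*expected)
    return [(start + i) % RING_SLOTS for i in range(gap + 1)]
```

For the fw1 event above (expected cycle 8, received cycle 9 in the same GPS second), the reader would read slots 8 and 9 and continue; for the earlier h1oaf1 event, where the GPS time jumped by two seconds, the gap exceeds the assumed ring depth and no ride-through is possible.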