H1 CDS
david.barker@LIGO.ORG - posted 22:59, Tuesday 31 August 2021 (59797)
RCG4.2 Upgrade: summary of today's work

WP9898

Rolf, Jeff, Jenne, Jim, Fil, Daniel, Marc, Camilla, Keith, Jamie, Erik, Jonathan, Dave:

The automatic install of all the models was started before 8am. I modified
the h1iop[sus,sei]h7 models to use Dolphin IPC.

The auto install stopped at 70% complete when we discovered a $bdroot issue with the h1psliss
and h1lsc models. I continued the install by hand while we investigated this issue.
Daniel had resolved the h1sqzwfs issue found yesterday, so this model was installed.
Rolf and Jonathan resolved the $bdroot issue; it was preventing the post_build scan
for scripts, which these models do not have.

At 09:11 we shut down all models and powered down all the front end computers.
Fil, Daniel and Marc started the timing system upgrade: replacing the MSR master
chassis and reprogramming the timing fanouts in the CER and end stations.

Erik installed IX Dolphin cards in h1[sus,sei]h7 and connected them to the switches.

h1isiham[2-6] were reinstalled to correctly run their post-build scripts (creation of payload medm files).

We powered up h1[sus,sei]h7 first, as these get their timing from the MSR master, which was the
first timing system back online. At this point we discovered a Dolphin node number issue
resulting from adding two more nodes to the fabric. Erik had used the next two numbers in
the sequence, 108 and 112. Dmesg on h1sush7 gave the error: Illegal NodeId 108.

He next tried two unused lower IDs: 4 and 92. We found that 92 works on either machine,
but 4 does not. At this point we discovered that IX only supports node IDs 8 through 104
in increments of 4 (25 nodes in total) and we have reached this limit. For now we have
taken h1seih7 back out of Dolphin to keep within this limit.
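
As a quick sanity check on the arithmetic (a sketch only, using the 8-to-104, step-4 limit
described above), the set of legal IX node IDs can be enumerated in Python:

    # Legal Dolphin IX node IDs per the limit we ran into: 8..104 in steps of 4.
    valid_ids = list(range(8, 105, 4))
    print(len(valid_ids))      # 25 -> the fabric is full once 25 nodes are assigned
    print(108 in valid_ids)    # False -> "Illegal NodeId 108" on h1sush7
    print(92 in valid_ids)     # True  -> 92 worked
    print(4 in valid_ids)      # False -> 4 did not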

Now that we had some front ends running, the DAQ was started, but it needed a clean restart
to sync the channel lists.

h1cdsrfm was powered up and started running automatically.

The end station SUS, SEI and ISC front ends were powered up via IPMI.

We powered up h1seih16 and h1seih23, but they could not see their IO Chassis. Fil
verified that all systems were operational in the CER, and we remembered a similar
problem during the last upgrade whereby V1 front end machines needed a reboot following
a power-up before they could see their chassis. We then completed the full power-up of
the front end machines, including h1build and h1ecatmon0.

Jim reported binary IO issues with several of the SEI models.

The h1iopseih[16,23] models were found to have too many DIO parts; the extra parts were removed.

The h1psliss model failed to start. It reported problems with its safe.snap file, which was
found to be a static file rather than a symbolic link into userapps. Erik investigated
and found that even a safe.snap containing just the BURT_RESTORE channel caused problems;
it was best to delete the file to get the model to start.
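
The symlink check itself is trivial; a minimal sketch in Python, noting that the target-area
path shown is an assumption for illustration, not copied from the machine:

    import os

    # Hypothetical location of the h1psliss safe.snap in the target area.
    snap = "/opt/rtcds/lho/h1/target/h1psliss/h1pslissepics/burt/safe.snap"

    if os.path.islink(snap):
        # Expected: a symbolic link pointing back into userapps.
        print("symlink ->", os.path.realpath(snap))
    else:
        # What we found today: a static file sitting in the target area.
        print("static file, not a link into userapps")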

After only a few minutes of running, h1seib3 stopped with an ADC timing error. This
happened a second time, but on a third try a different timing error (DAC FIFO) stopped
the IOP. Fil went to the IO Chassis, verified its timing and power cycled the chassis.
The timing error has not reappeared.

The SEI binary issue was tracked down to user models which use the old all-in-one
DIO parts, one part per Contec 6464 card covering all 64 bits, both input and output.
Models using the newer cdsCDI64/cdsCDO64 parts did not have this problem. Jeff and Jim
reworked h1isi[bs, itmx, itmy, etmx, etmy, ham6]. Jeff reworked h1sus[im, mc1, mc2] to
replace the DIO parts. The h1susmc1 compile failed with an error that the maximum number
of parts is 8 (the model has 10). Rolf, Jonathan and Erik made code changes to raise
this limit, and all models on h1sush2a were built against the new code.

To get the PEM mid station IOP models to run, we added the lhomid=1 parameter to the CDS
block.
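
The change itself amounts to a single extra line in each mid-station IOP model's CDS
parameter block; sketched below, with the models' existing parameter lines abbreviated
since they were not recorded here:

    ...existing parameters unchanged...
    lhomid=1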

The EDC was generating a lot of DAQ CRC errors. Jonathan rebuilt it to take its timing
from the IOP model, which fixed the issue.
The EDC also did not have a complete CA_ADDR_LIST, which meant it could only connect to
IOCs on its local H1-FE LAN. Jonathan installed a temporary environment so that all of its
channels could be connected.
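
For context, a Channel Access client such as the EDC finds remote IOCs through its
EPICS_CA_ADDR_LIST; a minimal sketch (the addresses and channel name are invented for
illustration, and pyepics is assumed to be available):

    import os

    # Without an address list the client only sees IOCs reachable by
    # broadcast on its own subnet (the H1-FE LAN in the EDC's case).
    os.environ["EPICS_CA_AUTO_ADDR_LIST"] = "NO"
    os.environ["EPICS_CA_ADDR_LIST"] = "10.101.0.255 10.105.0.255"  # example addresses only

    import epics  # pyepics, assumed installed

    # Hypothetical channel served by an IOC outside the local subnet.
    print(epics.caget("H1:SYS-EXAMPLE_CHANNEL"))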

At 18:26 we had recovered all the front ends. Jeff, Jim, Jenne and Camilla had brought
all of the IFO systems back into operation.

Summary of outstanding issues:
1. h1seib3 could fail again; the first ADC is suspect, having failed its autocal
2. h1psliss is not reading its safe.snap; we will investigate tomorrow
3. the EDC CA_ADDR_LIST is temporary and needs to be resolved in puppet
4. the extended BIO max-parts limit needs to be rolled out in a new release
5. the $bdroot install fix needs to be rolled out in a new release
6. we have apparently run out of Dolphin node IDs; we need 2 more to complete A+
