Displaying report 1-1 of 1.
Reports until 11:09, Friday 26 January 2024
H1 CDS
david.barker@LIGO.ORG - posted 11:09, Friday 26 January 2024 - last comment - 11:42, Friday 26 January 2024(75572)
FMCS IOC stable after power cycle and code upgrade

FRS30184

Patrick, Jonathan, Erik, Dave:

A summary of, and follow-up on, the recent FMCS EPICS IOC freeze-ups.

Timeline (all times local)

Sat 13 Jan 15:57 FMCS IOC flatlining started. After restart IOC would run anywhere from 1 to 14 hours before flatlining again

Tue 16 Jan 23:12 FMCS IOC running under systemd control, auto-restart code running which restarts IOC if flatlined for 10 mins

Wed 17 Jan 10:48 fmcs-epics-cds machine power cycled

Thu 18 Jan 11:25 After power cycle, IOC ran 25 hours with no flatline, previous longest run 14 hours [Power cycle of computer fixed it]

Thu 18 Jan 11:25 Patrick installed new version of the FMCS IOC code. This added new diagnostic EPICS channels.

Thu 18 Jan 14:58 systemd control of FMCS IOC under puppet configuration management

Tue 23 Jan 10:33 DAQ+EDC restarted to trend new FMCS diagnostic channels

Summary:

After running for many years error free, and 90 days after the last computer reboot, the FMCS IOC code became unstable. At random times it would stop updating its EPICS values, flatlining them at their last value.

The code at this point was started manually and ran in a screen environment.

To facilitate the auto restart of the code while the problem was being investigated, the IOC code was moved to a procServ environment and put under systemd control. A script running on cdsmanager monitored the FMCS channel H0:FMC-EX_CY_H2O_SUP_DEGF every minute. If its value did not change for 10 minutes, the systemd fmcs_ioc.service was restarted on fmcs-epics-cds.
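The restart logic described above can be sketched as follows. This is a minimal illustration, not the script that actually ran on cdsmanager: the class and function names are hypothetical, reading the channel with the EPICS `caget` command-line tool is an assumption, and restarting the service over ssh with sudo is also an assumption about how cdsmanager reaches fmcs-epics-cds. Only the channel name, service name, host name, one-minute poll and 10-minute threshold come from this report.

```python
import subprocess
import time
from typing import Optional

PV = "H0:FMC-EX_CY_H2O_SUP_DEGF"   # channel named in this report
IOC_HOST = "fmcs-epics-cds"        # machine running the IOC
SERVICE = "fmcs_ioc.service"       # systemd unit named in this report
POLL_S = 60                        # check once a minute
FLATLINE_S = 600                   # restart after 10 minutes without change


class FlatlineDetector:
    """Reports when a sampled value has not changed for timeout_s seconds."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_value: Optional[str] = None
        self.last_change: Optional[float] = None

    def update(self, value: str, now: float) -> bool:
        """Record one sample; return True if the value is flatlined."""
        if self.last_change is None or value != self.last_value:
            self.last_value = value
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout_s

    def reset(self, now: float) -> None:
        """Restart the timer, e.g. just after the IOC has been restarted."""
        self.last_change = now


def read_pv(pv: str) -> str:
    # Assumption: EPICS base 'caget' is on PATH; -t prints the value only.
    # A failed read yields an unchanging string, so it also counts as a flatline.
    result = subprocess.run(["caget", "-t", pv], capture_output=True, text=True)
    return result.stdout.strip()


def restart_ioc() -> None:
    # Assumption: passwordless ssh and sudo rights on the IOC machine.
    subprocess.run(
        ["ssh", IOC_HOST, "sudo", "systemctl", "restart", SERVICE], check=True
    )


def run_watchdog() -> None:
    """Poll the channel forever and restart the IOC when it flatlines."""
    detector = FlatlineDetector(FLATLINE_S)
    while True:
        now = time.monotonic()
        if detector.update(read_pv(PV), now):
            restart_ioc()
            detector.reset(now)  # give the IOC a fresh 10 minutes
        time.sleep(POLL_S)
```

Keeping the flatline decision in its own class, separate from the `caget` and `systemctl` calls, means the timing logic can be checked without touching the live IOC.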

A soft reboot of fmcs-epics-cds did not fix the problem. We then tried a hard power cycle: power down, wait 30 seconds, then power back on. This appeared to fix the immediate problem.

We decided to upgrade the software to see if this would prevent future occurrences in 90+ days' time.

At the time of writing, 8 days after the upgrade, there have been zero flatline instances.

Comments related to this report
david.barker@LIGO.ORG - 11:42, Friday 26 January 2024 (75576)

One caveat to the new FMCS code is that we have not directly verified that it records when the fire pumps run. However, the new code's diagnostic channels do permit verification that the fire pump bacnet device is functioning correctly.

The fire pump status is a binary device. The new code does not allow binary records to directly read device data, so Patrick created intermediate analog input records to read the bacnet devices. These records are defined in the EPICS db file fmcs_bacnet_bi_to_ai.db

For the fire pumps, the AI record device names are:

    field(INP, "@bacnet12075 3 5 85")
    field(INP, "@bacnet12075 3 6 85")
 

In bacnet-speak, the device string is @bacnet<dev_id> <data_type> <dev_chan_num> <chan_type>

In this case fire_pump_1 reads channel 5 and fire_pump_2 reads channel 6. Device 12075 reads only the two fire pump operational statuses.
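As an illustration of the device string layout, here is a small sketch that splits an INP string into its four numeric fields. The function and class names are hypothetical and are not part of the FMCS IOC code; the field names simply follow the layout given above.

```python
import re
from typing import NamedTuple


class BacnetAddress(NamedTuple):
    """Fields of an '@bacnet<dev_id> <data_type> <dev_chan_num> <chan_type>' string."""
    dev_id: int
    data_type: int
    dev_chan_num: int
    chan_type: int


# '@bacnet' followed directly by the device id, then three integers.
_INP_RE = re.compile(r"^@bacnet(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")


def parse_bacnet_inp(inp: str) -> BacnetAddress:
    """Parse the value of an INP field; raise ValueError if it does not match."""
    match = _INP_RE.match(inp.strip())
    if match is None:
        raise ValueError(f"not a bacnet device string: {inp!r}")
    return BacnetAddress(*(int(group) for group in match.groups()))


pump1 = parse_bacnet_inp("@bacnet12075 3 5 85")
pump2 = parse_bacnet_inp("@bacnet12075 3 6 85")
print(pump1.dev_id, pump1.dev_chan_num, pump2.dev_chan_num)
```

Both strings share device 12075 and differ only in the channel number, which matches the statement that this device serves just the two fire pump records.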

The new code provides diagnostic channels for bacnet devices, in this case the three channels:

H0:FMC-BACNET_12075_TX

H0:FMC-BACNET_12075_RX

H0:FMC-BACNET_12075_ER

Trending these channels since they were added to the DAQ on Tuesday morning shows zero errors, with the TX and RX counts increasing linearly in step. The TX and RX values are almost, but not quite, identical to each other.
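The same checks could be automated on trend data along these lines. This is only a sketch: the function name, the `max_gap` tolerance and the sample numbers are all hypothetical, it assumes the ER channel holds an error count that should stay at zero, and fetching the trends from the DAQ is left out.

```python
from typing import Dict, Sequence


def check_bacnet_counters(
    tx: Sequence[int],
    rx: Sequence[int],
    er: Sequence[int],
    max_gap: int = 5,  # hypothetical tolerance for "almost identical"
) -> Dict[str, bool]:
    """Summarise the health checks described above for one bacnet device."""
    return {
        # ER should stay at zero for every sample.
        "no_errors": all(e == 0 for e in er),
        # TX and RX should each grow from one sample to the next.
        "tx_increasing": all(b > a for a, b in zip(tx, tx[1:])),
        "rx_increasing": all(b > a for a, b in zip(rx, rx[1:])),
        # TX and RX should track each other to within max_gap counts.
        "tx_rx_close": all(abs(t - r) <= max_gap for t, r in zip(tx, rx)),
    }


# Made-up sample values, not real trend data.
result = check_bacnet_counters(
    tx=[100, 160, 220, 280],
    rx=[99, 159, 218, 279],
    er=[0, 0, 0, 0],
)
print(result)
```

A device that stopped responding would be expected to show up as RX falling behind TX or as a nonzero ER, so a failed check here would be a prompt to look at the fire pump bacnet device directly.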
