Reports until 16:45, Tuesday 15 July 2014
H1 CDS
cyrus.reed@LIGO.ORG - posted 16:45, Tuesday 15 July 2014 (12768)
AutoBurt Changes
I finally got tired of watching the hourly autoburts spam everything on the network with broadcasts, as they've done since, well, pretty much forever.  So I stood up a small virtual machine to handle this task (aptly named 'autoburt').  The main difference is a local EPICS gateway process that acts as a proxy for all CA requests.

The gateway code caches the channel name -> IOC mapping for a configurable interval (default 7200 seconds, or '2 hours' if you prefer).  This means that if we route all the CA requests through this local gateway (listening on the loopback address, 127.0.0.1), it should only have to broadcast for channel names once, because the burt snapshots are taken every hour and each one refreshes the cache before it expires.  After the initial broadcast, the only things that should trigger a CA search request are: a channel moving to a different IOC, a channel that is in a request file but does not exist (such channels are marked 'dead' after 2 minutes in the gateway and unlearned/discarded), or new channels being added.  Since the local gateway process is close to the source of the packets, it can learn the status and location of every requested channel (assuming it's valid), so it does not need to re-broadcast for channels it would never find (as is the case with a central gateway for an individual subnet).

Attached are plots (in bits and in number of packets) of the reduction in traffic after making this change.  These plots are from the interface for Vlan20 on the core switch, which all the broadcast traffic would traverse.  The initial 'fuzziness' is the overlap between the new machine taking over and the old snapshots finishing up.  In addition, there were some other model changes that made channels unavailable for a bit.  The remaining peaks in traffic are from invalid channels in the autoburt snapshots.  To address those, I will have to track down the responsible parties and 'educate' them.
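The caching argument above can be illustrated with a toy model. This is only a sketch, not the gateway's actual code: the `NameCache` class, the `count_searches` helper, and the channel and IOC names are all made up for illustration, and a "search" here simply stands in for one broadcast name lookup. It assumes valid channels are cached for the configured interval and refreshed on every use, while nonexistent channels are never cached (a channel moving between IOCs is not modeled).

```python
class NameCache:
    """Toy model of a channel name -> IOC cache with an expiry interval."""

    def __init__(self, ttl_s=7200):
        self.ttl_s = ttl_s
        self.entries = {}   # channel name -> (ioc, time last used)
        self.searches = 0   # number of simulated broadcast searches

    def lookup(self, name, now, directory):
        entry = self.entries.get(name)
        if entry is not None and now - entry[1] < self.ttl_s:
            ioc = entry[0]              # cache hit: no broadcast needed
        else:
            self.searches += 1          # cache miss: one broadcast search
            ioc = directory.get(name)
            if ioc is None:             # channel does not exist: forget it
                self.entries.pop(name, None)
                return None
        self.entries[name] = (ioc, now)  # (re)learn and refresh the entry
        return ioc


def count_searches(ttl_s, hours, request, directory):
    """Count searches for one snapshot per hour over the given span."""
    cache = NameCache(ttl_s)
    for hour in range(hours):
        for name in request:
            cache.lookup(name, hour * 3600, directory)
    return cache.searches


# Hypothetical request file: two valid channels and one invalid one.
DIRECTORY = {"H1:EXAMPLE-A": "ioc1", "H1:EXAMPLE-B": "ioc2"}
REQUEST = ["H1:EXAMPLE-A", "H1:EXAMPLE-B", "H1:EXAMPLE-BOGUS"]

if __name__ == "__main__":
    print(count_searches(7200, 24, REQUEST, DIRECTORY))  # 26
    print(count_searches(0, 24, REQUEST, DIRECTORY))     # 72
```

In this model, with the 2 hour interval the valid channels are searched for once in the first hour and never again, so only the invalid channel keeps generating a search every hour. That is the same pattern as the remaining peaks in the plots.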

One additional change I made is that the 'H1' autoburt now runs at 10 minutes after the hour.  This gives the 'H0' autoburt time to run and finish, and should help sort out whether one or the other is having issues.

Yes, it's perhaps quibbling to worry about 500k of traffic (a whole 0.05% of 1Gbps!), but the previous behavior was just dumb and I wanted to look at something prettier.
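For anyone checking the percentage, a quick sketch of the arithmetic, assuming '500k' means roughly 500 kbit/s and the link is 1 Gbit/s (the function name is just for illustration):

```python
def link_utilization_percent(traffic_bps, link_bps):
    """Return traffic as a percentage of link capacity."""
    return 100.0 * traffic_bps / link_bps


if __name__ == "__main__":
    # Assumed values: 500 kbit/s of broadcast traffic on a 1 Gbit/s link.
    print(link_utilization_percent(500e3, 1e9))  # 0.05
```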
Images attached to this report