First, a bit of history: h1boot-backup is an identical machine to h1boot and can be used to replace h1boot if it were to fail. The /opt filesystem on h1boot used to be rsynced every hour to keep h1boot-backup current. Over the course of summer 2015 the rsync took progressively longer and became linked to the EPICS freeze-ups of the front end computers. We suspect the main reason is that the NDS jobs directory had filled with many small files. Over Christmas I ran selected rsyncs manually, skipping these huge directories of NDS logs.
Over the last week I have reduced the number of files in these directories to a manageable number. This took some time because when a directory has 6.5 million files in it, standard commands like find or ls won't work. I used customized C code to scan the files.
We now have NDS logs which look back over the past 7 days. The full rsync now takes only 15 minutes (previously it either took 10 hours or did not complete).
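For anyone facing a similar cleanup, here is a minimal sketch of how a directory with millions of entries can be scanned in C. It is not the code that was actually run on h1boot. The function name prune_old_files, its parameters, and the choice of modification time as the age test are all assumptions made for illustration. The point it shows is that readdir() returns one entry at a time, so the scan never has to load or sort the whole listing.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper (not the original tool): walk one directory entry by
 * entry and count regular files whose mtime is older than `cutoff`.
 * If do_delete is nonzero, each matching file is also removed.
 * Returns the number of matching files (only the successfully removed ones
 * when deleting), or -1 if the directory cannot be opened. */
long prune_old_files(const char *dir_path, time_t cutoff, int do_delete)
{
    DIR *dir = opendir(dir_path);
    if (dir == NULL) {
        perror(dir_path);
        return -1;
    }

    int dfd = dirfd(dir);   /* lets us stat/unlink relative to the directory */
    long matched = 0;
    struct dirent *entry;

    /* readdir() streams entries one at a time; nothing is sorted or
     * accumulated in memory, which is what makes huge directories workable. */
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        struct stat st;
        if (fstatat(dfd, entry->d_name, &st, AT_SYMLINK_NOFOLLOW) != 0)
            continue;                       /* entry vanished or unreadable */

        /* Assumption: "old" means last modified before the cutoff. */
        if (!S_ISREG(st.st_mode) || st.st_mtime >= cutoff)
            continue;

        if (do_delete && unlinkat(dfd, entry->d_name, 0) != 0) {
            perror(entry->d_name);
            continue;
        }
        matched++;
    }

    closedir(dir);
    return matched;
}
```

To match a 7 day retention, the cutoff would be time(NULL) - 7*24*3600. Calling the function first with do_delete set to 0 gives a dry-run count before anything is removed.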
Because there is a nonzero chance these fast rsyncs could freeze EPICS, I have scheduled the backups three times daily, at 8am, noon and 6pm. If anyone sees EPICS freeze up at these times, this would be the likely reason.