Reports until 09:15, Saturday 07 January 2017
H1 CDS (DAQ)
david.barker@LIGO.ORG - posted 09:15, Saturday 07 January 2017 - last comment - 09:41, Saturday 07 January 2017(33066)
h1tw1 minute trend writer no longer NFS exporting its file system to NDS machines

At 17:22 PST on Friday 6 Jan, h1tw1 stopped exporting its minute trend files via NFS to the NDS machines (syslog report shown below).

h1tw1's daqd continues to run and minute trends are still being written to its SSD-RAID; the data just cannot be served by the NDS machines. I'll work with the control room on scheduling restarts to fix this. In the meantime, as TJ said, only second trends will be available.

Jan  6 17:22:54 h1tw1 kernel: [497056.519668] divide error: 0000 [#1] SMP 

Jan  6 17:22:54 h1tw1 kernel: [497056.519879] last sysfs file: /sys/devices/system/cpu/cpu11/cache/index2/shared_cpu_map

Jan  6 17:22:54 h1tw1 kernel: [497056.520287] CPU 10 

Jan  6 17:22:54 h1tw1 kernel: [497056.520293] Modules linked in: ext4 jbd2 crc16 arcmsr myri10ge

Jan  6 17:22:54 h1tw1 kernel: [497056.520710] 

Jan  6 17:22:54 h1tw1 kernel: [497056.520913] Pid: 4886, comm: nfsd Not tainted 2.6.35.3 #5 X8DTU/X8DTU

Jan  6 17:22:54 h1tw1 kernel: [497056.521122] RIP: 0010:[<ffffffff810311e1>]  [<ffffffff810311e1>] find_busiest_group+0x3c9/0x759

Jan  6 17:22:54 h1tw1 kernel: [497056.521537] RSP: 0018:ffff88122929f1a0  EFLAGS: 00010046

Jan  6 17:22:54 h1tw1 kernel: [497056.521744] RAX: 0000000000000000 RBX: ffff88000234e460 RCX: 0000000000000001

Jan  6 17:22:54 h1tw1 kernel: [497056.522150] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880002340000

Jan  6 17:22:54 h1tw1 kernel: [497056.522555] RBP: ffff88122929f310 R08: 0000000000000000 R09: 0000000000000000

Jan  6 17:22:54 h1tw1 kernel: [497056.522961] R10: 00000000ffffffff R11: 000000000000000a R12: ffff88000234e570

Jan  6 17:22:54 h1tw1 kernel: [497056.523366] R13: 0000000000012540 R14: 0000000000000000 R15: ffffffffffffffff

Jan  6 17:22:54 h1tw1 kernel: [497056.523773] FS:  0000000000000000(0000) GS:ffff880002340000(0000) knlGS:0000000000000000

Jan  6 17:22:54 h1tw1 kernel: [497056.524181] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

Jan  6 17:22:54 h1tw1 kernel: [497056.524388] CR2: 00007f9bcd423800 CR3: 0000000001a09000 CR4: 00000000000006e0

Jan  6 17:22:54 h1tw1 kernel: [497056.524795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

Jan  6 17:22:54 h1tw1 kernel: [497056.525200] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

Jan  6 17:22:54 h1tw1 kernel: [497056.525605] Process nfsd (pid: 4886, threadinfo ffff88122929e000, task ffff881232183090)

Jan  6 17:22:54 h1tw1 kernel: [497056.526017] Stack:

Jan  6 17:22:54 h1tw1 kernel: [497056.526223]  ffff88000234e570 0000000000000000 0000000000012501 ffff88000234e570

Jan  6 17:22:54 h1tw1 kernel: [497056.526438] <0> ffff88122929f3e0 0000000000012548 0000000000012540 0000000000012540

Jan  6 17:22:54 h1tw1 kernel: [497056.526852] <0> ffff88000234e450 0000000000000008 0000000000012540 ffff88122929f3ec

Jan  6 17:22:54 h1tw1 kernel: [497056.527463] Call Trace:

Jan  6 17:22:54 h1tw1 kernel: [497056.527668]  [<ffffffff8103536c>] load_balance+0xcf/0x62d

Jan  6 17:22:54 h1tw1 kernel: [497056.527878]  [<ffffffff8102d8f7>] ? update_curr+0xf2/0xfb

Jan  6 17:22:54 h1tw1 kernel: [497056.528089]  [<ffffffff8102e5bd>] ? dequeue_entity+0x1b/0x1b6

Jan  6 17:22:54 h1tw1 kernel: [497056.528298]  [<ffffffff8102f8a6>] ? dequeue_task_fair+0x69/0x72

Jan  6 17:22:54 h1tw1 kernel: [497056.528509]  [<ffffffff8150bcce>] schedule+0x1fb/0x52f

Jan  6 17:22:54 h1tw1 kernel: [497056.528722]  [<ffffffffa0055518>] ? brelse+0xe/0x10 [ext4]

Jan  6 17:22:54 h1tw1 kernel: [497056.528931]  [<ffffffff8150c03a>] io_schedule+0x38/0x4d

Jan  6 17:22:54 h1tw1 kernel: [497056.529140]  [<ffffffff8119ec5f>] get_request_wait+0xac/0x13a

Jan  6 17:22:54 h1tw1 kernel: [497056.529350]  [<ffffffff8104c157>] ? autoremove_wake_function+0x0/0x34

Jan  6 17:22:54 h1tw1 kernel: [497056.529560]  [<ffffffff8119bdd4>] ? elv_merge+0x166/0x19e

Jan  6 17:22:54 h1tw1 kernel: [497056.529768]  [<ffffffff8119ef80>] __make_request+0x293/0x3b5

Jan  6 17:22:54 h1tw1 kernel: [497056.529981]  [<ffffffffa0043de6>] ? check_block_validity+0x30/0x63 [ext4]

Jan  6 17:22:54 h1tw1 kernel: [497056.530192]  [<ffffffff8119d81b>] generic_make_request+0x174/0x1d7

Jan  6 17:22:54 h1tw1 kernel: [497056.530402]  [<ffffffff8119d92d>] submit_bio+0xaf/0xb8

Jan  6 17:22:54 h1tw1 kernel: [497056.530611]  [<ffffffff810d7217>] mpage_bio_submit+0x22/0x26

Jan  6 17:22:54 h1tw1 kernel: [497056.530820]  [<ffffffff810d7719>] do_mpage_readpage+0x355/0x47e

Jan  6 17:22:54 h1tw1 kernel: [497056.531031]  [<ffffffff8108ea7e>] ? __inc_zone_page_state+0x29/0x2b

Jan  6 17:22:54 h1tw1 kernel: [497056.536111]  [<ffffffff8107effd>] ? add_to_page_cache_locked+0x75/0xb6

Jan  6 17:22:54 h1tw1 kernel: [497056.536321]  [<ffffffff810d7976>] mpage_readpages+0xd7/0x11b

Jan  6 17:22:54 h1tw1 kernel: [497056.536532]  [<ffffffffa0046a03>] ? ext4_get_block+0x0/0x13 [ext4]

Jan  6 17:22:54 h1tw1 kernel: [497056.536744]  [<ffffffffa0046a03>] ? ext4_get_block+0x0/0x13 [ext4]

Jan  6 17:22:54 h1tw1 kernel: [497056.536955]  [<ffffffff810a62a9>] ? alloc_pages_current+0xa2/0xc5

Jan  6 17:22:54 h1tw1 kernel: [497056.537166]  [<ffffffffa00451ec>] ext4_readpages+0x18/0x1a [ext4]

Jan  6 17:22:54 h1tw1 kernel: [497056.537377]  [<ffffffff81085f35>] __do_page_cache_readahead+0x10e/0x1a4

Jan  6 17:22:54 h1tw1 kernel: [497056.537588]  [<ffffffff811af81a>] ? radix_tree_gang_lookup_slot+0x69/0x8c

Jan  6 17:22:54 h1tw1 kernel: [497056.537799]  [<ffffffff81085fe7>] ra_submit+0x1c/0x20

Jan  6 17:22:54 h1tw1 kernel: [497056.538008]  [<ffffffff81086244>] ondemand_readahead+0x189/0x19c

Jan  6 17:22:54 h1tw1 kernel: [497056.538219]  [<ffffffff8108632b>] page_cache_sync_readahead+0x38/0x3a

Jan  6 17:22:54 h1tw1 kernel: [497056.538430]  [<ffffffff810ce30d>] __generic_file_splice_read+0x119/0x44c

Jan  6 17:22:54 h1tw1 kernel: [497056.538643]  [<ffffffff8103d884>] ? local_bh_enable_ip+0x9/0xb

Jan  6 17:22:54 h1tw1 kernel: [497056.538856]  [<ffffffff810c332e>] ? wait_on_inode+0x22/0x27

Jan  6 17:22:54 h1tw1 kernel: [497056.539066]  [<ffffffff810c3555>] ? ifind_fast+0x4e/0x60

Jan  6 17:22:54 h1tw1 kernel: [497056.539277]  [<ffffffff810c36cb>] ? iget_locked+0x39/0x131

Jan  6 17:22:54 h1tw1 kernel: [497056.539491]  [<ffffffffa0043e40>] ? ext4_iget+0x27/0x6df [ext4]

Jan  6 17:22:54 h1tw1 kernel: [497056.539702]  [<ffffffff814b07ea>] ? cache_get+0x15/0x1c

Jan  6 17:22:54 h1tw1 kernel: [497056.539915]  [<ffffffff810c04ee>] ? __d_find_alias+0x54/0x80

Jan  6 17:22:54 h1tw1 kernel: [497056.540126]  [<ffffffff810c24ce>] ? iput+0x2f/0x65

Jan  6 17:22:54 h1tw1 kernel: [497056.540336]  [<ffffffff81156a06>] ? nfsd_acceptable+0x0/0xd7

Jan  6 17:22:54 h1tw1 kernel: [497056.540547]  [<ffffffff8115388f>] ? find_acceptable_alias+0x23/0xd5

Jan  6 17:22:54 h1tw1 kernel: [497056.540761]  [<ffffffff81153a02>] ? exportfs_decode_fh+0xc1/0x20f

Jan  6 17:22:54 h1tw1 kernel: [497056.540975]  [<ffffffff810cd277>] ? spd_release_page+0x0/0x14

Jan  6 17:22:54 h1tw1 kernel: [497056.541190]  [<ffffffff810ce684>] generic_file_splice_read+0x44/0x70

Jan  6 17:22:54 h1tw1 kernel: [497056.541405]  [<ffffffff810ccc22>] do_splice_to+0x6f/0x7c

Jan  6 17:22:54 h1tw1 kernel: [497056.541615]  [<ffffffff810cd353>] splice_direct_to_actor+0xc8/0x193

Jan  6 17:22:54 h1tw1 kernel: [497056.541828]  [<ffffffff811588de>] ? nfsd_direct_splice_actor+0x0/0x12

Jan  6 17:22:54 h1tw1 kernel: [497056.542040]  [<ffffffff811587db>] nfsd_vfs_read+0x256/0x359

Jan  6 17:22:54 h1tw1 kernel: [497056.542249]  [<ffffffff81158ee2>] nfsd_read+0xa1/0xbf

Jan  6 17:22:54 h1tw1 kernel: [497056.542459]  [<ffffffff814b0b1f>] ? cache_revisit_request+0x47/0xf7

Jan  6 17:22:54 h1tw1 kernel: [497056.542672]  [<ffffffff8115ee30>] nfsd3_proc_read+0xe2/0x121

Jan  6 17:22:54 h1tw1 kernel: [497056.542883]  [<ffffffff8103d884>] ? local_bh_enable_ip+0x9/0xb

Jan  6 17:22:54 h1tw1 kernel: [497056.543093]  [<ffffffff81153eb6>] nfsd_dispatch+0xec/0x1c7

Jan  6 17:22:54 h1tw1 kernel: [497056.543304]  [<ffffffff814a997d>] svc_process+0x436/0x637

Jan  6 17:22:54 h1tw1 kernel: [497056.543513]  [<ffffffff811543ee>] nfsd+0xf1/0x135

Jan  6 17:22:54 h1tw1 kernel: [497056.543720]  [<ffffffff811542fd>] ? nfsd+0x0/0x135

Jan  6 17:22:54 h1tw1 kernel: [497056.543929]  [<ffffffff8104bd53>] kthread+0x7a/0x82

Jan  6 17:22:54 h1tw1 kernel: [497056.544138]  [<ffffffff81003654>] kernel_thread_helper+0x4/0x10

Jan  6 17:22:54 h1tw1 kernel: [497056.544347]  [<ffffffff8104bcd9>] ? kthread+0x0/0x82

Jan  6 17:22:54 h1tw1 kernel: [497056.544556]  [<ffffffff81003650>] ? kernel_thread_helper+0x0/0x10

Jan  6 17:22:54 h1tw1 kernel: [497056.544767] Code: 1b 48 8b 78 10 31 d2 48 89 f8 44 8b 40 08 48 8b 00 4c 01 c2 48 39 f8 75 f1 89 56 08 41 8b 74 24 08 48 8b 45 a8 31 d2 48 c1 e0 0a <48> f7 f6 48 8b 75 b0 48 89 45 a0 31 c0 48 85 f6 74 09 48 8b 45 

Jan  6 17:22:54 h1tw1 kernel: [497056.545513] RIP  [<ffffffff810311e1>] find_busiest_group+0x3c9/0x759

Jan  6 17:22:54 h1tw1 kernel: [497056.545724]  RSP <ffff88122929f1a0>

Jan  6 17:22:54 h1tw1 kernel: [497056.546243] ---[ end trace e8c9db712a6847e8 ]---

Jan  6 17:22:54 h1tw1 kernel: [497056.546491] nfsd used greatest stack depth: 2120 bytes left
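
For anyone triaging a similar trace later, the headline facts can be pulled out of the syslog text with a short script. The sketch below is illustrative only: the function name `summarize_oops` and the regular expressions are assumptions written against this one excerpt, not an existing CDS tool. It extracts the error type, the faulting process, the RIP symbol and the RSI register. As a side observation, the bytes marked `<48> f7 f6` in the `Code:` line appear to decode as a 64-bit divide by RSI, and RSI reads zero in the register dump, which would be consistent with the reported divide error in `find_busiest_group`.

```python
import re

# Four lines copied verbatim from the syslog excerpt above (not the full trace).
SAMPLE = """\
Jan  6 17:22:54 h1tw1 kernel: [497056.519668] divide error: 0000 [#1] SMP
Jan  6 17:22:54 h1tw1 kernel: [497056.520913] Pid: 4886, comm: nfsd Not tainted 2.6.35.3 #5 X8DTU/X8DTU
Jan  6 17:22:54 h1tw1 kernel: [497056.521122] RIP: 0010:[<ffffffff810311e1>]  [<ffffffff810311e1>] find_busiest_group+0x3c9/0x759
Jan  6 17:22:54 h1tw1 kernel: [497056.522150] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880002340000
"""


def summarize_oops(text):
    """Collect the headline facts of a kernel oops from syslog text.

    Hypothetical helper: the patterns below were written against the
    2.6.35-style oops in this entry and may need adjusting elsewhere.
    """
    patterns = {
        # e.g. "divide error: 0000 [#1]"
        "error": r"\] ([a-z ]+): [0-9a-f]{4} \[#\d+\]",
        # e.g. "Pid: 4886, comm: nfsd"
        "pid": r"Pid: (\d+), comm:",
        "comm": r"Pid: \d+, comm: (\S+)",
        # First RIP line: segment, address twice, then symbol+offset/size
        "rip": r"RIP: [0-9a-f]{4}:\[<[0-9a-f]+>\]\s+\[<[0-9a-f]+>\] (\S+)",
        # RSI register value as 16 hex digits
        "rsi": r"RSI: ([0-9a-f]{16})",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            summary[key] = match.group(1)
    return summary


info = summarize_oops(SAMPLE)
for key, value in info.items():
    print(f"{key}: {value}")
```

Running it on the full syslog text instead of `SAMPLE` should give the same fields, since each pattern takes the first match.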

Comments related to this report
david.barker@LIGO.ORG - 09:41, Saturday 07 January 2017 (33067)

I just spoke with Cheryl. We decided that unless anyone desperately needs minute trends this weekend, we'll hold off on the reboots until Monday.