kernel-packages team mailing list archive

[Bug 1568729] Re: divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Dongwon Cho <eastbest1@xxxxxxxxx>
Date: Fri, 03 Jun 2016 04:03:15 -0000
Reply-to: Bug 1568729 <1568729@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

dmesg -T
[Fri Jun  3 01:07:11 2016] divide error: 0000 [#1] SMP 
[Fri Jun  3 01:07:11 2016] Modules linked in: iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 8021q garp mrp binfmt_misc veth vhost_net vhost macvtap macvlan ebtable_filter ebtables ip6table_filter ip6_tables openvswitch nf_defrag_ipv6 nf_conntrack xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables x_tables bonding zfs(PO) zunicode(PO) zcommon(PO) znvpair(PO) spl(O) zavl(PO) ipmi_ssif ipmi_devintf dcdbas intel_rapl x86_pkg_temp_thermal coretemp sb_edac edac_core mei_me mei shpchp ipmi_si ipmi_msghandler 8250_fintek lpc_ich mac_hid acpi_power_meter kvm_intel kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
[Fri Jun  3 01:07:11 2016]  raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul igb aesni_intel aes_x86_64 lrw dca gf128mul glue_helper ptp ahci ablk_helper pps_core mxm_wmi cryptd libahci megaraid_sas i2c_algo_bit fjes wmi
[Fri Jun  3 01:07:11 2016] CPU: 4 PID: 4915 Comm: vhost-4913 Tainted: P           O    4.4.0-22-generic #40-Ubuntu
[Fri Jun  3 01:07:11 2016] Hardware name: Dell Inc. PowerEdge R730xd/0H21J3, BIOS 2.0.2 03/15/2016
[Fri Jun  3 01:07:11 2016] task: ffff8807e6a80000 ti: ffff880046a3c000 task.ti: ffff880046a3c000
[Fri Jun  3 01:07:11 2016] RIP: 0010:[<ffffffff810b593d>]  [<ffffffff810b593d>] task_numa_find_cpu+0x2cd/0x710
[Fri Jun  3 01:07:11 2016] RSP: 0018:ffff880046a3f7d8  EFLAGS: 00010257
[Fri Jun  3 01:07:11 2016] RAX: 0000000000000000 RBX: ffff880046a3f878 RCX: 0000000000000001
[Fri Jun  3 01:07:11 2016] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880dc42efa00
[Fri Jun  3 01:07:12 2016] RBP: ffff880046a3f840 R08: 0000000000000001 R09: 0000000000aaaaaa
[Fri Jun  3 01:07:12 2016] R10: 0000000000000335 R11: 0000000000000000 R12: ffff880e2fbeee00
[Fri Jun  3 01:07:12 2016] R13: 0000000000000001 R14: ffff880dc42efa00 R15: 0000000000000335
[Fri Jun  3 01:07:12 2016] FS:  0000000000000000(0000) GS:ffff88085e680000(0000) knlGS:0000000000000000
[Fri Jun  3 01:07:12 2016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Jun  3 01:07:12 2016] CR2: 00007f442814d400 CR3: 0000000e6d094000 CR4: 00000000001426e0
[Fri Jun  3 01:07:12 2016] Stack:
[Fri Jun  3 01:07:12 2016]  00000000d08a1709 ffff8807e6a806ac ffff8807e6a80000 0000000000000001
[Fri Jun  3 01:07:12 2016]  0000000000000335 000000000000030d 0000000000016d00 0000000000000001
[Fri Jun  3 01:07:12 2016]  ffff8807e6a80000 ffff880046a3f878 000000000000006d 0000000000000059
[Fri Jun  3 01:07:12 2016] Call Trace:
[Fri Jun  3 01:07:12 2016]  [<ffffffff810b61be>] task_numa_migrate+0x43e/0x9b0
[Fri Jun  3 01:07:12 2016]  [<ffffffff810b67a9>] numa_migrate_preferred+0x79/0x80
[Fri Jun  3 01:07:12 2016]  [<ffffffff810badc4>] task_numa_fault+0x7f4/0xd40
[Fri Jun  3 01:07:12 2016]  [<ffffffff810ba435>] ? should_numa_migrate_memory+0x55/0x130
[Fri Jun  3 01:07:13 2016]  [<ffffffff811bf860>] handle_mm_fault+0xbc0/0x1820
[Fri Jun  3 01:07:13 2016]  [<ffffffff8105a50e>] ? physflat_send_IPI_mask+0xe/0x10
[Fri Jun  3 01:07:13 2016]  [<ffffffff8106b537>] __do_page_fault+0x197/0x400
[Fri Jun  3 01:07:13 2016]  [<ffffffff8106b7c2>] do_page_fault+0x22/0x30
[Fri Jun  3 01:07:13 2016]  [<ffffffff81827478>] page_fault+0x28/0x30
[Fri Jun  3 01:07:13 2016]  [<ffffffff813f69b5>] ? copy_user_enhanced_fast_string+0x5/0x10
[Fri Jun  3 01:07:13 2016]  [<ffffffff813fc789>] ? copy_to_iter+0x79/0x260
[Fri Jun  3 01:07:13 2016]  [<ffffffff815eee49>] tun_do_read+0x1c9/0x3f0
[Fri Jun  3 01:07:13 2016]  [<ffffffff815ef103>] tun_recvmsg+0x93/0xb0
[Fri Jun  3 01:07:13 2016]  [<ffffffffc08c865d>] handle_rx+0x43d/0x7e0 [vhost_net]
[Fri Jun  3 01:07:13 2016]  [<ffffffffc08c8a15>] handle_rx_net+0x15/0x20 [vhost_net]
[Fri Jun  3 01:07:13 2016]  [<ffffffffc08ba723>] vhost_worker+0xf3/0x190 [vhost]
[Fri Jun  3 01:07:13 2016]  [<ffffffffc08ba630>] ? vhost_poll_wakeup+0x30/0x30 [vhost]
[Fri Jun  3 01:07:14 2016]  [<ffffffff810a0588>] kthread+0xd8/0xf0
[Fri Jun  3 01:07:14 2016]  [<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0
[Fri Jun  3 01:07:14 2016]  [<ffffffff8182568f>] ret_from_fork+0x3f/0x70
[Fri Jun  3 01:07:14 2016]  [<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0
[Fri Jun  3 01:07:14 2016] Code: d0 4c 89 f7 e8 95 c7 ff ff 49 8b 84 24 d8 01 00 00 49 8b 76 78 31 d2 49 0f af 86 b0 00 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48> f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 de c5 ff ff 
[Fri Jun  3 01:07:14 2016] RIP  [<ffffffff810b593d>] task_numa_find_cpu+0x2cd/0x710
[Fri Jun  3 01:07:14 2016]  RSP <ffff880046a3f7d8>
[Fri Jun  3 01:07:14 2016] ---[ end trace c2e57ae327861148 ]---

After that, the KVM instances hung, and some commands such as 'w' and
'ps -ef' hang as well without returning any output.

Additional information:

cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"

uname -a
Linux infra02 4.4.0-22-generic #40-Ubuntu SMP Thu May 12 22:03:46 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

dpkg -l | grep qemu
ii  ipxe-qemu                          1.0.0+git-20150424.a25a16d-1ubuntu1 all          PXE boot firmware - ROM images for qemu
ii  qemu-block-extra:amd64             1:2.5+dfsg-5ubuntu10.1              amd64        extra block backend modules for qemu-system and qemu-utils
ii  qemu-kvm                           1:2.5+dfsg-5ubuntu10.1              amd64        QEMU Full virtualization
ii  qemu-system-common                 1:2.5+dfsg-5ubuntu10.1              amd64        QEMU full system emulation binaries (common files)
ii  qemu-system-x86                    1:2.5+dfsg-5ubuntu10.1              amd64        QEMU full system emulation binaries (x86)
ii  qemu-utils                         1:2.5+dfsg-5ubuntu10.1              amd64        QEMU utilities

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1568729

Title:
  divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Xenial:
  In Progress

Bug description:
  While running qemu 2.5 on a trusty host running 4.4.0-15.31~14.04.1
  the host system has crashed (load > 200) 3 times in the last 3 days.

  Always with this stack trace:

  Apr  9 19:01:09 cnode9.0 kernel: [197071.195577] divide error: 0000 [#1] SMP 
  Apr  9 19:01:09 cnode9.0 kernel: [197071.195633] Modules linked in: vhost_net vhost macvtap macvlan arc4 md4 nls_utf8 cifs nfnetlink_queue nfnetlink xt_CHECKSUM xt_nat iptable_nat nf_nat_ipv4 xt_NFQUEUE xt_CLASSIFY ip6table_mangle sch_sfq sch_htb veth dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag ebtable_filter ebtables nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_mangle xt_CT iptable_raw xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack iptable_filter ip_tables x_tables dummy bridge stp llc ipmi_ssif ipmi_devintf intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm dcdbas irqbypass crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev input_leds nf_nat_ftp sb_edac nf_conntrack_ftp edac_core cdc_ether nf_nat_pptp usbnet nf_conntrack_pptp mii nf_nat_proto_gre lpc_ich nf_nat_sip ioatdma nf_nat nf_conntrack_sip nfsd ipmi_si 8250_fintek nf_conntrack_proto_gre ipmi_msghandler acpi_pad wmi shpchp nf_conntrack acpi_power_meter mac_hid auth_rpcgss nfs_acl bonding nfs lp lockd parport grace sunrpc fscache tcp_htcp xfs btrfs hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor ixgbe raid6_pq libcrc32c igb vxlan raid1 i2c_algo_bit ip6_udp_tunnel dca udp_tunnel ahci raid0 ptp libahci megaraid_sas multipath pps_core mdio linear fjes
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197014] CPU: 13 PID: 3147726 Comm: ceph-osd Not tainted 4.4.0-15-generic #31~14.04.1-Ubuntu
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197085] Hardware name: Dell Inc. PowerEdge R720/0XH7F2, BIOS 2.5.2 01/28/2015
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197154] task: ffff88252be1ee00 ti: ffff8824fc0d4000 task.ti: ffff8824fc0d4000
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197221] RIP: 0010:[<ffffffff810afec8>]  [<ffffffff810afec8>] task_numa_find_cpu+0x238/0x700
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197300] RSP: 0000:ffff8824fc0d7ba8  EFLAGS: 00010257
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197340] RAX: 0000000000000000 RBX: ffff8824fc0d7c48 RCX: 0000000000000000
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197406] RDX: 0000000000000000 RSI: ffff88479f180000 RDI: ffff884782a47600
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197473] RBP: ffff8824fc0d7c10 R08: 0000000102eea157 R09: 00000000000001a8
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197540] R10: 000000000002404b R11: 000000000000023f R12: ffff882380930000
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197606] R13: 0000000000000008 R14: 000000000000008c R15: 0000000000000124
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197673] FS:  00007f19aab5b700(0000) GS:ffff88479f180000(0000) knlGS:0000000000000000
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197741] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197782] CR2: 0000000025469600 CR3: 00000023846bc000 CR4: 00000000000426e0
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197848] Stack:
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197880]  ffffffff817425fb ffff8829af3e9e00 00000000000000f6 ffff88252be1ee00
  Apr  9 19:01:09 cnode9.0 kernel: [197071.197965]  000000000000008d 0000000000000225 0000000000016d40 000000000000008d
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198047]  ffff88252be1ee00 00000000000001ad ffff8824fc0d7c48 00000000000000e1
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198132] Call Trace:
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198172]  [<ffffffff817425fb>] ? tcp_schedule_loss_probe+0x12b/0x1b0
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198219]  [<ffffffff810b0830>] task_numa_migrate+0x4a0/0x930
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198264]  [<ffffffff816d2957>] ? release_sock+0x117/0x160
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198306]  [<ffffffff810b0d39>] numa_migrate_preferred+0x79/0x80
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198350]  [<ffffffff810b557d>] task_numa_fault+0x91d/0xcc0
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198395]  [<ffffffff811d35ae>] ? mpol_misplaced+0x14e/0x190
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198439]  [<ffffffff811b06b8>] handle_pte_fault+0x5a8/0x14c0
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198485]  [<ffffffff810f8531>] ? futex_wake+0x81/0x150
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198526]  [<ffffffff810b0de4>] ? set_next_entity+0xa4/0x700
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198569]  [<ffffffff810fab44>] ? do_futex+0xf4/0x4d0
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198610]  [<ffffffff811b2440>] handle_mm_fault+0x250/0x540
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198654]  [<ffffffff81067d19>] __do_page_fault+0x199/0x430
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198696]  [<ffffffff81067fd2>] do_page_fault+0x22/0x30
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198740]  [<ffffffff817ef878>] page_fault+0x28/0x30
  Apr  9 19:01:09 cnode9.0 kernel: [197071.198775] Code: 4d b0 4c 89 f7 e8 29 d5 ff ff 48 8b 4d b0 49 8b 86 b0 00 00 00 31 d2 48 0f af 81 d8 01 00 00 49 8b 4e 78 4c 8b 73 78 48 83 c1 01 <48> f7 f1 48 8b 4b 20 49 89 c1 48 29 c1 4c 03 4b 48 4c 39 7d d0 
  Apr  9 19:01:09 cnode9.0 kernel: [197071.199217] RIP  [<ffffffff810afec8>] task_numa_find_cpu+0x238/0x700
  Apr  9 19:01:09 cnode9.0 kernel: [197071.199264]  RSP <ffff8824fc0d7ba8>
  Apr  9 19:01:09 cnode9.0 kernel: [197071.199900] ---[ end trace e938a840610a79f7 ]---

  This appears to be the same bug as reported upstream in
  http://lkml.iu.edu/hypermail/linux/kernel/1603.2/01659.html

  According to this thread the issue is:

  27: 48 83 c1 01 add $0x1,%rcx
  2b:* 48 f7 f1 div %rcx <-- trapping instruction

  This suggests the CONFIG_FAIR_GROUP_SCHED version of task_h_load:

  update_cfs_rq_h_load(cfs_rq);
  return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
                  cfs_rq_load_avg(cfs_rq) + 1);

  So the load avg is -1; after adding 1, the divisor is 0 and the
  division traps.
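
  The arithmetic behind that conclusion can be sketched in user-space C.
  This is only an illustration, not kernel code: h_load_divisor and
  checked_h_load are hypothetical names, and the sketch assumes the load
  average is held in an unsigned long, so that "-1" is really ULONG_MAX
  and adding 1 wraps around to 0.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not a kernel function): the divisor from the quoted
 * expression, i.e. load_avg + 1 evaluated in unsigned long arithmetic. */
static unsigned long h_load_divisor(unsigned long cfs_rq_load_avg)
{
    /* Unsigned overflow is well defined in C: ULONG_MAX + 1 wraps to 0. */
    return cfs_rq_load_avg + 1;
}

/* Hypothetical guarded variant of the division. It returns false where an
 * unguarded divide would raise the CPU divide error seen in the trace. */
static bool checked_h_load(unsigned long se_load_avg, unsigned long h_load,
                           unsigned long cfs_rq_load_avg, unsigned long *out)
{
    unsigned long divisor = h_load_divisor(cfs_rq_load_avg);

    assert(out != NULL);
    if (divisor == 0)
        return false; /* load_avg was ULONG_MAX, i.e. "-1" */

    *out = (se_load_avg * h_load) / divisor;
    return true;
}
```

  Calling checked_h_load with cfs_rq_load_avg set to ULONG_MAX returns
  false, which corresponds to the case where the kernel executes the div
  instruction with a zero divisor.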

  The LKML reporter's fix was to include the patches to kernel/sched/fair.c
  up to 4.5. A specific patch was not identified.

  Please backport these patches for Xenial and lts-xenial kernel in
  trusty.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1568729/+subscriptions
References

[Bug 1568729] [NEW] divide error: 0000 [#1] SMP in task_numa_migrate - handle_mm_fault
From: Markus Schade, 2016-04-11