
kernel-packages team mailing list archive

[Bug 1484919] [NEW] Kernel oops associated with BIRD/netlink

 

Public bug reported:

While scale testing our product, which uses the BIRD BGP daemon, on
Google Compute Engine (GCE), we see frequent kernel oopses and reboots
(on about 40% of hosts) on kernel 3.19.0-25-generic #26~14.04.1-Ubuntu
with BIRD running.  This is the standard GCE-provided Ubuntu image.

If we replace the image with a stock Ubuntu one (kernel
3.13.0-61-generic #100-Ubuntu), installed from ISO, then we do not see
the issue.

If we stop BIRD then we no longer see the issue.

I suspect that the way BIRD uses netlink is triggering a kernel bug.
It seems to happen more often at scale, when BIRD is doing more with
netlink and we have thousands of routes in place.
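
For readers unfamiliar with the netlink side, here is a minimal sketch
of the kind of request involved: an rtnetlink RTM_NEWROUTE message of
the sort a routing daemon submits with sendto() for each route.  This
is not BIRD's code.  The constants are the values from the kernel's
linux/netlink.h and linux/rtnetlink.h headers; the flag combination,
the helper names, and the example prefix and gateway (documentation
address ranges) are assumptions made for illustration.  The sketch only
builds the bytes; it does not open a socket or send anything.

```python
import socket
import struct

# Values from linux/netlink.h and linux/rtnetlink.h.
RTM_NEWROUTE = 24
NLM_F_REQUEST = 0x001
NLM_F_ACK = 0x004
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400
RT_TABLE_MAIN = 254
RTPROT_BIRD = 12
RT_SCOPE_UNIVERSE = 0
RTN_UNICAST = 1
RTA_DST = 1
RTA_GATEWAY = 5

NLMSG_HDRLEN = 16  # sizeof(struct nlmsghdr)


def rtattr(rta_type, payload):
    """Encode one struct rtattr: u16 len, u16 type, payload, 4-byte padding."""
    length = 4 + len(payload)
    padding = b"\0" * (-length % 4)
    return struct.pack("=HH", length, rta_type) + payload + padding


def build_newroute(dst, prefix_len, gateway, seq):
    """Build an IPv4 RTM_NEWROUTE request (hypothetical helper, not BIRD's)."""
    # struct rtmsg: family, dst_len, src_len, tos, table, protocol,
    # scope, type (all u8), followed by u32 flags.
    rtmsg = struct.pack(
        "=BBBBBBBBI",
        socket.AF_INET, prefix_len, 0, 0,
        RT_TABLE_MAIN, RTPROT_BIRD, RT_SCOPE_UNIVERSE, RTN_UNICAST,
        0,
    )
    attrs = rtattr(RTA_DST, socket.inet_aton(dst))
    attrs += rtattr(RTA_GATEWAY, socket.inet_aton(gateway))
    body = rtmsg + attrs
    # Assumed flags: create a new route and ask the kernel for an ACK.
    flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL
    # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid.
    header = struct.pack(
        "=IHHII", NLMSG_HDRLEN + len(body), RTM_NEWROUTE, flags, seq, 0
    )
    return header + body


msg = build_newroute("198.51.100.0", 24, "192.0.2.1", seq=7)
print(len(msg), msg.hex())
```

Each accepted request of this type also makes the kernel broadcast a
route-change notification to rtnetlink listeners, so a daemon that
installs thousands of routes exercises both the request and the
notification paths heavily.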

Here's a sample kernel oops:

[  266.033276] BUG: unable to handle kernel paging request at 000000190000003c
[  266.035142] IP: [<ffffffff811d1f0b>] __kmalloc_node_track_caller+0xfb/0x2c0
[  266.036009] PGD b9e5e067 PUD 0 
[  266.036009] Oops: 0000 [#1] SMP 
[  266.036009] Modules linked in: bridge stp llc dummy xt_mac xt_mark nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_set ip_set_hash_ip ip_set nfnetlink ebtable_nat ebtables xt_nat ipip tunnel4 ip_tunnel ipt_REJECT nf_reject_ipv4 xt_conntrack xt_CHECKSUM xt_tcpudp iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip6table_filter ip6_tables iptable_filter ip_tables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_crypt ppdev dm_multipath scsi_dh 8250_fintek parport_pc i2c_piix4 mac_hid serio_raw parport crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse virtio_scsi
[  266.036009] CPU: 2 PID: 3456 Comm: bird Tainted: G         C     3.19.0-25-generic #26~14.04.1-Ubuntu
[  266.036009] Hardware name: Google Google, BIOS Google 01/01/2011
[  266.036009] task: ffff8801210775c0 ti: ffff880036a08000 task.ti: ffff880036a08000
[  266.036009] RIP: 0010:[<ffffffff811d1f0b>]  [<ffffffff811d1f0b>] __kmalloc_node_track_caller+0xfb/0x2c0
[  266.036009] RSP: 0018:ffff880036a0b7f8  EFLAGS: 00010246
[  266.036009] RAX: 0000000000000000 RBX: 00000000000102d0 RCX: 000000000008b0f6
[  266.036009] RDX: 000000000008b0f5 RSI: 0000000000000000 RDI: 00000000000171c0
[  266.036009] RBP: ffff880036a0b848 R08: ffff8801263171c0 R09: ffff880121c01600
[  266.036009] R10: 0000000000000000 R11: ffff880121c01600 R12: 00000000000102d0
[  266.036009] R13: 0000000000000180 R14: 00000000ffffffff R15: 000000190000003c
[  266.036009] FS:  00007fd470753740(0000) GS:ffff880126300000(0000) knlGS:0000000000000000
[  266.036009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  266.036009] CR2: 000000190000003c CR3: 00000000bb207000 CR4: 00000000001406e0
[  266.036009] Stack:
[  266.036009]  0000000100000180 00000000000000c3 ffff880121c01600 ffffffff816992ea
[  266.036009]  0000000000000001 ffff8800b1486a00 0000000000000000 00000000000000d0
[  266.036009]  0000000000000180 00000000ffffffff ffff880036a0b898 ffffffff81697261
[  266.036009] Call Trace:
[  266.036009]  [<ffffffff816992ea>] ? pskb_expand_head+0x6a/0x260
[  266.036009]  [<ffffffff81697261>] __kmalloc_reserve.isra.27+0x31/0x90
[  266.036009]  [<ffffffff816992ea>] pskb_expand_head+0x6a/0x260
[  266.036009]  [<ffffffff816d6d13>] netlink_trim+0xa3/0xe0
[  266.036009]  [<ffffffff816d984e>] netlink_unicast+0x3e/0x200
[  266.036009]  [<ffffffff816da323>] nlmsg_notify+0x93/0xb0
[  266.036009]  [<ffffffff816b8d3e>] rtnl_notify+0x2e/0x40
[  266.036009]  [<ffffffff81727525>] rtmsg_fib+0x115/0x160
[  266.036009]  [<ffffffff8172a09d>] ? trie_rebalance+0x10d/0x130
[  266.036009]  [<ffffffff8172a34a>] fib_table_insert+0x1da/0x8e0
[  266.036009]  [<ffffffff817242a8>] inet_rtm_newroute+0x48/0x60
[  266.036009]  [<ffffffff816b97c5>] rtnetlink_rcv_msg+0x95/0x250
[  266.036009]  [<ffffffff813bb4a6>] ? rhashtable_lookup_compare+0x36/0x70
[  266.036009]  [<ffffffff816d631e>] ? __netlink_lookup+0x3e/0x50
[  266.036009]  [<ffffffff816b9730>] ? rtnetlink_rcv+0x40/0x40
[  266.036009]  [<ffffffff816da271>] netlink_rcv_skb+0xc1/0xe0
[  266.036009]  [<ffffffff816b971c>] rtnetlink_rcv+0x2c/0x40
[  266.036009]  [<ffffffff816d9906>] netlink_unicast+0xf6/0x200
[  266.036009]  [<ffffffff816d9d1c>] netlink_sendmsg+0x30c/0x680
[  266.036009]  [<ffffffff81351610>] ? aa_sk_perm.isra.4+0x70/0x150
[  266.036009]  [<ffffffff8168f2ec>] do_sock_sendmsg+0x8c/0x100
[  266.036009]  [<ffffffff81209a13>] ? __fdget+0x13/0x20
[  266.036009]  [<ffffffff8168f547>] SYSC_sendto+0x157/0x200
[  266.036009]  [<ffffffff81690252>] ? __sys_recvmsg+0x42/0x80
[  266.036009]  [<ffffffff8168fd2e>] SyS_sendto+0xe/0x10
[  266.036009]  [<ffffffff817b668d>] system_call_fastpath+0x16/0x1b
[  266.036009] Code: fb 41 8b 53 18 0f 1f 44 00 00 48 83 c4 28 48 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 40 00 49 63 41 20 48 8d 4a 01 49 8b 39 <49> 8b 1c 07 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 0f 84 53 ff 
[  266.036009] RIP  [<ffffffff811d1f0b>] __kmalloc_node_track_caller+0xfb/0x2c0
[  266.036009]  RSP <ffff880036a0b7f8>
[  266.036009] CR2: 000000190000003c
[  266.131166] ---[ end trace 246ae06038901786 ]---
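
One way to read the register dump above, offered as a sketch rather
than a diagnosis.  By my reading of the x86-64 encoding, the faulting
bytes marked in the Code: line (<49> 8b 1c 07) decode to
"mov (%r15,%rax,1),%rbx", so the faulting address should equal
R15 + RAX.  The Python below checks that against CR2 using four lines
copied from the oops.  The helper name parse_registers is made up for
this example.

```python
import re

# Four lines copied verbatim from the oops above.
OOPS_EXCERPT = """\
[  266.033276] BUG: unable to handle kernel paging request at 000000190000003c
[  266.036009] RAX: 0000000000000000 RBX: 00000000000102d0 RCX: 000000000008b0f6
[  266.036009] R13: 0000000000000180 R14: 00000000ffffffff R15: 000000190000003c
[  266.036009] CR2: 000000190000003c CR3: 00000000bb207000 CR4: 00000000001406e0
"""


def parse_registers(text):
    """Map register names (RAX, R15, CR2, ...) to their integer values."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_registers(OOPS_EXCERPT)

# Address the decoded instruction would have read from.
computed = regs["R15"] + regs["RAX"]
print("R15 + RAX matches CR2:", computed == regs["CR2"])

# Split the bad pointer into its two 32-bit halves.
high, low = computed >> 32, computed & 0xFFFFFFFF
print("high half:", hex(high), "low half:", hex(low))
```

The value in R15 does not look like a kernel pointer (the other
pointers in the dump start with ffff88); it looks like two small 32-bit
integers.  That would be consistent with the allocator following a
free-list pointer that had been overwritten by unrelated data, for
example by a use-after-free or an overrun elsewhere, but that is a
hypothesis and nothing in this report confirms it.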

Running our product on CoreOS, we see similar but less frequent
crashes.  The CoreOS kernel is 4.1-based:
https://github.com/coreos/bugs/issues/435

** Affects: linux-meta (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-meta in Ubuntu.
https://bugs.launchpad.net/bugs/1484919

Title:
  Kernel oops associated with BIRD/netlink

Status in linux-meta package in Ubuntu:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/1484919/+subscriptions

