
kernel-packages team mailing list archive

[Bug 1403152] Re: unregister_netdevice: waiting for lo to become free. Usage count


A summary of the bug so far:

Occasionally, starting and stopping many containers that carry network traffic
leaves new containers unable to start, because new network namespaces can no
longer be created.

The following message repeats in the kernel log until reboot:

unregister_netdevice: waiting for lo to become free. Usage count = 1
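
As a rough aid for spotting this in collected logs, the sketch below scans log
lines for the message above and counts how often each device appears. The
function name and the returned structure are illustrative assumptions, not part
of any existing tool.

```python
import re

# Matches the warning quoted above; the group names are illustrative.
WAIT_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<dev>\S+) to become free\. "
    r"Usage count = (?P<count>-?\d+)"
)


def find_stuck_devices(log_lines):
    """Return {device: (occurrences, last reported usage count)}."""
    seen = {}
    for line in log_lines:
        match = WAIT_RE.search(line)
        if match is None:
            continue
        dev = match.group("dev")
        occurrences, _ = seen.get(dev, (0, None))
        seen[dev] = (occurrences + 1, int(match.group("count")))
    return seen
```

The input can be any iterable of strings, for example the lines of saved
dmesg output. A device that keeps reappearing with the same non-zero count is
consistent with the leak described here.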

Eventually, creating a new container produces this hung task backtrace:

  ? __kmalloc+0x1e9/0x230
  ? call_rcu_sched+0x1d/0x20
  ? system_call_fastpath+0x1a/0x1f

I've been able to test the following conditions:
- If CONFIG_BRIDGE_NETFILTER is disabled, this problem does not occur.
- If net.bridge.bridge-nf-call-iptables is disabled, this problem does not occur.
- This problem can happen on single-processor machines.
- This problem can happen with IPv6 disabled.
- This problem can happen with xt_conntrack enabled.
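
To check which state a given machine is in, the sketch below reads the procfs
file that backs the net.bridge.bridge-nf-call-iptables sysctl. The helper name
is an assumption made for illustration, and the file is only present when
bridge netfilter support is available in the running kernel.

```python
from pathlib import Path

# procfs file behind the net.bridge.bridge-nf-call-iptables sysctl.
SYSCTL_PATH = Path("/proc/sys/net/bridge/bridge-nf-call-iptables")


def bridge_nf_call_iptables(path=SYSCTL_PATH):
    """Return True if enabled, False if disabled, None if the file is absent."""
    try:
        return Path(path).read_text().strip() != "0"
    except FileNotFoundError:
        # Bridge netfilter support is not built in or not loaded.
        return None
```

A result of False or None corresponds to the two configurations listed above
in which the problem was not observed. That is an observation from testing, not
a confirmed fix.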

The unregister_netdevice warning always waits on lo, and the device always has
reg_state set to NETREG_UNREGISTERING. It follows that the device has been
through the unregister_netdevice_many path and is being unregistered. This path
is ultimately where net_mutex is held, which prevents copy_net_ns from executing.

In addition, when the unregister_netdevice warning happens, a crashdump reveals
that dst_busy_list always contains a dst_entry that references the device above.
This dst_entry has already been through ___dst_free, since it is already marked
DST_OBSOLETE_DEAD. 'dst->ops' is always set to ipv4_dst_ops.
dst->callback_head.next is NULL, and the next pointer is NULL. Use is also zero.

We can trace where the kernel tries to free the dst_entry. When
free_fib_info_rcu is called, if nh_rth_input is set, it eventually calls
dst_free. Because a refcnt is still held, the dst is not destroyed immediately
and continues on to __dst_free. This puts the dst into the dst_garbage list,
which is then examined periodically by the dst_gc_work worker thread. Each time
the worker tries to clean it up, it fails because the refcnt is still non-zero.
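
The behaviour described above can be pictured with a small toy model. This is
not kernel code: the class names, fields, and list handling are simplified
assumptions meant only to show why an entry with a leaked reference survives
every garbage-collection pass.

```python
from dataclasses import dataclass


@dataclass
class Dst:
    """Toy stand-in for a dst_entry, with only the fields discussed here."""
    dev: str
    refcnt: int
    obsolete_dead: bool = False


class DstGc:
    """Simplified model of the deferred-free path; not the kernel's logic."""

    def __init__(self):
        self.garbage = []    # entries queued by dst_free, not yet examined
        self.busy = []       # entries examined but still referenced
        self.destroyed = []  # entries that were actually released

    def dst_free(self, dst):
        dst.obsolete_dead = True  # analogous to marking DST_OBSOLETE_DEAD
        if dst.refcnt == 0:
            self.destroyed.append(dst)
        else:
            self.garbage.append(dst)  # defer to the periodic worker

    def gc_pass(self):
        """One worker pass; returns how many entries are still busy."""
        candidates = self.busy + self.garbage
        self.garbage = []
        self.busy = []
        for dst in candidates:
            if dst.refcnt == 0:
                self.destroyed.append(dst)
            else:
                self.busy.append(dst)
        return len(self.busy)
```

In this model, a Dst("lo", refcnt=1) passed to dst_free stays on the busy list
on every gc_pass until something drops the last reference. That mirrors the
entry seen in the crashdump, which keeps the device from being released.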

The faulty dst_entry is being allocated via ip_rcv..ip_route_input_noref. In
addition, this dst is most likely being held in response to a new packet via
the ip_rcv..inet_sk_rx_dst_set path.

You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.

  unregister_netdevice: waiting for lo to become free. Usage count

Status in The Linux Kernel:
Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Trusty:
Status in linux source package in Utopic:

Bug description:
  I am currently running trusty with the latest patches, and I get this error
  on the following hardware and software:

  Ubuntu 3.13.0-43.72-generic

  processor	: 7
  vendor_id	: GenuineIntel
  cpu family	: 6
  model		: 77
  model name	: Intel(R) Atom(TM) CPU  C2758  @ 2.40GHz
  stepping	: 8
  microcode	: 0x11d
  cpu MHz		: 2400.000
  cache size	: 1024 KB
  physical id	: 0
  siblings	: 8
  core id		: 7
  cpu cores	: 8
  apicid		: 14
  initial apicid	: 14
  fpu		: yes
  fpu_exception	: yes
  cpuid level	: 11
  wp		: yes
  flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch arat epb dtherm tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms
  bogomips	: 4799.48
  clflush size	: 64
  cache_alignment	: 64
  address sizes	: 36 bits physical, 48 bits virtual
  power management:

  The error in the subject is somewhat reproducible. LXC still works, but it
  is no longer manageable until a reboot.

  "Not manageable" means every command hangs.

  I saw there are a lot of bugs about this, but they seem to relate to older
  versions and are closed, so I decided to file a new one.

  I run a lot of machines with trusty and LXC containers, but only this kind
  of machine produces these errors; all the others don't show this odd
  behavior.

  Thanks in advance.

