kernel-packages team mailing list archive

[Bug 1486670] Re: using ipsec, many connections result in no buffer space error

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Dan Streetman <dan.streetman@xxxxxxxxxxxxx>
Date: Tue, 09 Feb 2016 01:02:44 -0000
Reply-to: Bug 1486670 <1486670@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
> The LXC images failed to start under linux-image-4.2.0-28-generic,
> with a kernel oops.

this bug isn't about kernel oopses.

> Setting /proc/sys/net/ipv4/xfrm4_gc_thresh to 5 causes the failure almost immediately.
> 
> I would like to confirm my procedure however. I've been changing /proc/sys/net/ipv4/xfrm4_gc_thresh inside the containers,
> not the host. Is this correct?

no, that's not correct, and unfortunately the "reproducer" in this bug
description is completely invalid (it was copied from a private bug for
this issue).  ipsec will NEVER work with gc_thresh set to 5, due to the
internal details between the xfrm gc_thresh and the hardcoded flowcache
per-cpu hashtable size limit.  In upstream the xfrm4_gc_thresh has been
changed to INT_MAX, because using a gc_thresh doesn't make any sense for
xfrm dst entries; the total number of them possible is entirely
dependent on the number of cpus in the system.

There is 1 xfrm dst entry per flowcache entry, and the flowcache entries
are kept in a per-cpu hashtable that is strictly limited to 4096 entries
(per cpu).  So, the total number of xfrm dst entries can be at most
4096 * num_active_cpus().  Since the dst code stops allowing new dst
allocation once the dst entry count is >= 2 * gc_thresh, the threshold
(with the current default 32k gc_thresh) where dst allocation failures
can begin to be seen is (32k * 2) / 4096 = 16 cpus.  On systems with
fewer than 16 cpus, at the default gc_thresh of 32k, there will never
be any dst allocation failures (except due to the real bug this
addresses).  On systems with 16 or more cpus, with the default gc_thresh
of 32k, there can be dst alloc failures, given a high enough rate of new
ipsec connections.  The flowcache clears all its entries every 10
minutes, so a lightly loaded ipsec system with 16 or more cpus could be
fine.
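
As a rough illustration of the arithmetic above, here is a minimal shell
sketch.  The function names are hypothetical and exist only for this
example; the 4096 per-cpu limit and the 2 * gc_thresh cutoff are taken
from the explanation above, and "32k" is assumed to mean 32768.

```shell
#!/bin/sh
# Sketch only: hypothetical helper names, not part of any kernel tooling.

FLOWCACHE_ENTRIES_PER_CPU=4096   # per-cpu flowcache hashtable limit

# Upper bound on xfrm dst entries for a given cpu count.
max_xfrm_dst_entries() {   # arg: cpus
    echo $(( FLOWCACHE_ENTRIES_PER_CPU * $1 ))
}

# Entry count at which the dst code stops allowing new allocations.
dst_alloc_limit() {        # arg: gc_thresh
    echo $(( 2 * $1 ))
}

# Prints "yes" if the flowcache can hold enough entries to reach the
# allocation limit, "no" otherwise.
alloc_failures_possible() {   # args: cpus gc_thresh
    if [ "$(max_xfrm_dst_entries "$1")" -ge "$(dst_alloc_limit "$2")" ]; then
        echo yes
    else
        echo no
    fi
}
```

For example, "alloc_failures_possible 16 32768" prints "yes" and
"alloc_failures_possible 8 32768" prints "no", matching the 16 cpu
threshold worked out above.  On a live system the arguments could come
from something like "$(nproc)" and
"$(cat /proc/sys/net/ipv4/xfrm4_gc_thresh)", assuming nproc is a close
enough stand-in for num_active_cpus().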

Setting the xfrm4_gc_thresh value to anything less than (4k * 2 * CPUS)
will result in failures, and in fact there is no point in setting
xfrm4_gc_thresh to ANYTHING other than INT_MAX because it doesn't
actually remove any dst entries.


All that, unfortunately, is actually tangential to the real bug here.
The problem is that with multiple net namespaces (i.e. multiple
containers) all running ipsec, the dst entry counter changes for one
container can incorrectly be given to a different container.  So one
container could have its dst entry count steadily go down (incorrectly)
while another container's dst entry count keeps going up (incorrectly).
The latter container would eventually reach its 2 * gc_thresh limit and
encounter dst allocation failures, making its ipsec network unusable.
Unfortunately, that error looks identical to the error seen when the dst
entry counter is correct but the number of system cpus is >= 16, as
described above.


To test this fix, multiple containers must be started (just 2 is fine).
In each container, new ipsec connections should be created as fast as
possible; e.g. something like:

while true ; do ping -c 1 OTHER_CONTAINER ; done

so that the containers are all pinging each other - but it's important
to use -c 1 so that each ping creates a new ipsec dst entry; just normal
ping will re-use the existing dst entry.
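
A slightly more instrumented variant of that loop might look like the
sketch below.  The function name ping_until_failure is hypothetical, and
the loop stops on ANY ping failure, so the failure still has to be
checked by hand to see whether it is the "No buffer space available"
error rather than an unrelated network problem.

```shell
#!/bin/sh
# Sketch only: ping_until_failure is a made-up helper for this example.

# Ping the peer once per iteration (-c 1, so every ping needs a new
# ipsec dst entry) until a ping fails, then report how many succeeded.
ping_until_failure() {   # arg: peer address
    attempts=0
    while ping -c 1 "$1" >/dev/null 2>&1; do
        attempts=$(( attempts + 1 ))
    done
    echo "ping to $1 failed after $attempts successful attempts"
}
```

It would be run inside each container as e.g.
"ping_until_failure OTHER_CONTAINER", where OTHER_CONTAINER is the
peer's address as in the loop above.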

After a sufficiently long period (depending on the number of containers
and number of cpus, and the luck of randomly incorrectly assigning dst
entry count changes to different containers) one or more containers
should get dst allocation failures, and their ipsec networks should not
be usable anymore.  To speed up reproduction of this bug, lower the
xfrm4_gc_thresh to a value ABOVE (2 * 4096 * CPUS), but close to it -
e.g. something like 10k * CPUS.  With that gc_thresh value set, this bug
should be reproducible fairly quickly (order of days or less) without
the patch, but not reproducible with the patch (i.e. with the -proposed
repo kernel).
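
For picking the test value, a small sketch of that calculation follows.
The function names are hypothetical, and "10k" is assumed here to mean
10240 (in the same way "4k" means 4096 above); 10000 would work just as
well since it is also above the floor.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for choosing a test gc_thresh.

# The value the test gc_thresh must stay above: 2 * 4096 * CPUS.
gc_thresh_floor() {   # arg: cpus
    echo $(( 2 * 4096 * $1 ))
}

# The suggested test value: 10k * CPUS (10k assumed to be 10240).
repro_gc_thresh() {   # arg: cpus
    echo $(( 10240 * $1 ))
}
```

On a 4 cpu system, "gc_thresh_floor 4" prints 32768 and
"repro_gc_thresh 4" prints 40960, so 40960 would be the value to write
to /proc/sys/net/ipv4/xfrm4_gc_thresh for the test.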

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1486670

Title:
  using ipsec, many connections result in no buffer space error

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Precise:
  Invalid
Status in linux source package in Trusty:
  Fix Committed
Status in linux source package in Vivid:
  Fix Committed
Status in linux source package in Wily:
  Fix Committed

Bug description:
  Reproduction info:

  set up two LXC containers (although this probably isn't specific to
  LXC containers), and inside each setup ipsec with something similar
  to:

  conn nodeN
  aggressive=yes 
  authby=secret 
  auto=start 
  closeaction=restart 
  dpdaction=restart 
  esp=aes256-aes256gmac-modp1024 
  ike=aes256-sha512-modp1024 
  keyexchange=ikev2 
  left=10.0.3.145 
  leftid=10.0.3.145 
  lifetime=12h 
  reauth=no 
  right=10.0.3.199 
  type=transport 

  
  then repeatedly open connections to the peer, e.g.:

  while true; do ping -c1 10.0.3.199 ; sleep 0.1 ; done

  eventually, the connections will fail with:

  connect: No buffer space available

  the reproduction can be sped up by reducing the xfrm4_gc_thresh, e.g.:

  echo 5 > /proc/sys/net/ipv4/xfrm4_gc_thresh

  
  Once the error occurs, no more connections can be made to the peer
  (all fail with no buffer space available); however, after a long
  period (e.g. overnight) the buffers will be cleaned up and connections
  can be made again.

  this happens even on the latest net-next kernel.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1486670/+subscriptions