yahoo-eng-team team mailing list archive

[Bug 1799124] Re: Path MTU discovery fails for VMs with Floating IP behind DVR routers

 

Bug closed due to lack of activity, please feel free to reopen if
needed.

** Changed in: neutron
       Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1799124

Title:
  Path MTU discovery fails for VMs with Floating IP behind DVR routers

Status in neutron:
  Won't Fix

Bug description:
  Tenant VMs using an overlay network with an MTU <1500 and less than
  the MTU of the external network are unable to negotiate MTU using Path
  MTU discovery.

  In most cases, since the instance MTU is configured by DHCP, direct
  instance traffic is not affected. However, if the VM acts as a router
  for other traffic (e.g. a bridge for Docker, LXD, Libvirt, etc.) whose
  interfaces have the MTU set to 1500 (the default in most cases), then
  those hosts rely on Path MTU discovery to discover the 1450 MTU.

  On normal routers and DVR routers where the VM does not have a
  floating IP (and thus is routed via a centralized node), this works as
  expected.

  However, on DVR routers where the VM has a Floating IP (and thus
  traffic is routed directly from the compute node), this fails. When a
  packet larger than the overlay network's MTU arrives from the external
  network for the VM, the packet is dropped and the external host
  receives no ICMP "fragmentation needed" (packet too big) response.
  This prevents Path MTU discovery from repairing the connection. As a
  result, most TCP connections stall if they attempt to send more than
  1500 bytes, e.g. a simple HTTP download.
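
  As a rough way to observe this from the external host, the probe size
  can be derived from the MTU under test. The sketch below assumes IPv4
  without IP options and the iputils version of ping (whose "-M do"
  option sets the don't-fragment bit); the helper names
  icmp_payload_for_mtu and probe_path_mtu are illustrative and not part
  of the original report.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus the 20-byte IPv4 header (no options) minus the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Hypothetical helper: send three echo requests with don't-fragment set,
# sized to fill exactly one packet of the given MTU.
#   $1 = target host (e.g. the Floating IP), $2 = MTU to test
probe_path_mtu() {
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$2")" "$1"
}

# Example, run from a host on the external network (not executed here):
#   probe_path_mtu <floating-ip> 1500
```

  On a working path, a 1500-byte probe towards the Floating IP should
  draw a fragmentation-needed error from the router; in the failing DVR
  case the probes would be expected to time out with no error at all.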

  My diagnosis is that the qrouter namespace on the compute host has no
  default route. It has a default route in the alternative routing table
  (16) used for traffic matching an "ip rule" which selects all traffic
  being sent from the VM subnet but there is no default route in the
  global default routing table.

  I have not 100% confirmed this part, but my understanding is that
  since there is no global default route, the kernel is unable to select
  a source IP for the ICMP error. Additionally, even if it did somehow
  select a source IP, the appropriate default route appears to be via
  the RFP interface on the 169.254.0.0/16 subnet back to the FIP
  namespace, and that traffic would not match the rule that sends
  traffic from the VM subnet to the alternate routing table anyway.

  In testing, if I add a default route through the rfp interface then
  ICMP errors are sent and Path MTU discovery successfully works,
  allowing TCP connections to work.

  root@maas-node02:~# ip netns exec qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c ip r
  103.245.215.0/28 dev qr-ec03268e-fb  proto kernel  scope link  src 103.245.215.1 
  169.254.106.114/31 dev rfp-1752c73a-b  proto kernel  scope link  src 169.254.106.114 

  root@maas-node02:~# ip -n qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c route show table 16
  default via 169.254.106.115 dev rfp-1752c73a-b 

  root@maas-node02:~# ip -n qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c route add default via 169.254.106.115 dev rfp-1752c73a-b
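
  The check behind this diagnosis can be scripted roughly as follows.
  This is a minimal sketch, assuming the output format of "ip route
  show" seen above, where a default route line begins with the word
  "default"; has_default_route is a hypothetical helper name, and the
  namespace and table number must be adjusted for each deployment.

```shell
#!/bin/sh
# Read routing table text on stdin; succeed (exit 0) only if at least one
# line is a default route, i.e. its first field is the word "default".
has_default_route() {
    awk '$1 == "default" { found = 1 } END { exit !found }'
}

# Example usage against a router namespace (not executed here):
#   ns=qrouter-1752c73a-be9f-4326-97cc-99dbe0988b3c
#   ip -n "$ns" route show          | has_default_route || echo "main table: no default route"
#   ip -n "$ns" route show table 16 | has_default_route && echo "table 16: default route present"
```

  With the tables shown above, the main table check would fail and the
  table 16 check would succeed, which matches the described symptom.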

  It's not clear to me whether there is an intentional reason not to
  install a default route here, particularly since such a route exists
  for non-DVR routers. I would appreciate input from anyone who knows
  whether this was an intentional design decision or simply an oversight.

   = Steps to reproduce =

  (1) Deploy a cloud with DVR and global-physnet-mtu=1500
  (2) Create an overlay tenant network (MTU: 1450), VLAN/flat external network (MTU: 1500), router.
  (3) Deploy an Ubuntu 16.04 container
  (4) Verify that a large download works; "wget http://archive.ubuntu.com/ubuntu-releases/18.04.1/ubuntu-18.04.1-live-server-amd64.iso.zsync";
  (5) Configure LXD to use a private subnet and NAT; "dpkg-reconfigure -pmedium lxd" - you can basically just hit yes and accept the defaults
  (6) Create an LXD container, "lxc launch ubuntu:16.04 test", then test a download
  (7) lxc exec test -- wget http://archive.ubuntu.com/ubuntu-releases/18.04.1/ubuntu-18.04.1-live-server-amd64.iso.zsync

  A simpler alternative to using LXD/Docker is to force the MTU of the
  VM back to 1500 with "ip link set eth0 mtu 1500". This same scenario
  will fail with DVR and work without DVR.
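
  When using this shortcut it helps to confirm the MTU that the
  interface actually has before and after the change. The sketch below
  assumes the one-line output format of "ip -o link show", in which the
  value follows the word "mtu"; link_mtu is a hypothetical helper name
  and eth0 is taken from the command above.

```shell
#!/bin/sh
# Read "ip -o link show dev <interface>" output on stdin and print the
# number that follows the "mtu" keyword on the first line containing it.
link_mtu() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}

# Example usage inside the VM (not executed here, needs root for the change):
#   ip -o link show dev eth0 | link_mtu    # MTU handed out by DHCP
#   ip link set eth0 mtu 1500
#   ip -o link show dev eth0 | link_mtu    # MTU after forcing it
```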

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1799124/+subscriptions


