yahoo-eng-team team mailing list archive

[Bug 1689952] [NEW] conntrack race can blackhole flows to Floating IP

 

Public bug reported:

We have some users who want to receive continuous unidirectional flows
of UDP-over-IPv4 datagrams on their instances (sent by some sort of
sensors) via Floating IP.  After we migrate or restart the Neutron
routers serving those instances, the users complain that their instances
stop receiving those packets.

After debugging this for a long time, we have observed that there are
incorrect conntrack entries for those flows in the router's namespace.
Apparently these conntrack entries don't NAT the Floating IP to the
instance's Fixed IP.  When we delete the conntrack entries, they are
quickly replaced with the correct entries, and the instance starts
receiving traffic again.

  $ sudo ip netns exec qrouter-fe77a8ff-769b-4469-8490-1d37873a5671 conntrack -L -d 192.0.2.67
  ...
  udp      17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 [UNREPLIED] src=192.0.2.67 dst=192.0.2.7 sport=12345 dport=58254 mark=0 use=1
  ...

Note that the reply "src" is identical to the original "dst"
(192.0.2.67), i.e. the Floating IP has not been translated to the
instance's Fixed IP.

After deleting the entries (sudo ip netns exec ... conntrack -D -d
192.0.2.67), the (new) entries look like this:

  $ sudo ip netns exec qrouter-fe77a8ff-769b-4469-8490-1d37873a5671 conntrack -L conntrack -d 192.0.2.67
  ...
  udp      17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 [UNREPLIED] src=10.0.0.107 dst=192.0.2.7 sport=12345 dport=58254 mark=0 use=1
  ...

These entries are correct, because the reply "src" is now the Fixed IP
of the instance (10.0.0.107).
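
The difference between the two listings can be checked mechanically.
The following is a minimal sketch, not part of Neutron: it parses lines
of "conntrack -L" output and flags entries whose reply "src" still
equals the Floating IP.  The function names (parse_entry,
is_untranslated, delete_command) are hypothetical helpers introduced
for illustration; the delete command it builds mirrors the one used
above.

```python
import re
from typing import Dict, List, Optional

# Matches key=value pairs such as src=192.0.2.7 or dport=12345.
_PAIR = re.compile(r"(\w+)=(\S+)")

# Sample lines copied from the listings in this report.
STALE = ("udp      17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 "
         "[UNREPLIED] src=192.0.2.67 dst=192.0.2.7 sport=12345 dport=58254 "
         "mark=0 use=1")
GOOD = ("udp      17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 "
        "[UNREPLIED] src=10.0.0.107 dst=192.0.2.7 sport=12345 dport=58254 "
        "mark=0 use=1")


def parse_entry(line: str) -> Optional[Dict[str, Dict[str, str]]]:
    """Split one conntrack line into its original and reply tuples."""
    pairs = _PAIR.findall(line)
    src_positions = [i for i, (key, _) in enumerate(pairs) if key == "src"]
    if len(src_positions) < 2:
        return None  # not a flow line (for example "...")
    first, second = src_positions[0], src_positions[1]
    return {
        "original": dict(pairs[first:second]),
        "reply": dict(pairs[second:]),
    }


def is_untranslated(line: str, floating_ip: str) -> bool:
    """True if the flow targets the Floating IP but was never DNATed."""
    entry = parse_entry(line)
    if entry is None:
        return False
    return (entry["original"].get("dst") == floating_ip
            and entry["reply"].get("src") == floating_ip)


def delete_command(router_id: str, floating_ip: str) -> List[str]:
    """Build (but do not run) the conntrack delete command shown above."""
    return ["sudo", "ip", "netns", "exec", f"qrouter-{router_id}",
            "conntrack", "-D", "-d", floating_ip]
```

Feeding each line of the listing through is_untranslated() with the
Floating IP would single out the first kind of entry; the command list
from delete_command() could then be handed to subprocess.run() on the
network node.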

We assume that there is a race condition: when packets for a given
Floating IP arrive at the router namespace before the NAT rules(?) for
that Floating IP have been completely set up, conntrack creates these
incorrect entries.  This is likely if these packets arrive at a high
rate (we see hundreds of such packets per second).  The incorrect
entries never time out as long as the traffic flows continuously,
because every matching packet refreshes the entry's timeout.
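
To illustrate why such an entry persists, here is a simplified model of
the timeout behaviour, not the kernel's implementation.  It assumes a
30-second timeout for unreplied UDP flows (consistent with the 29
seconds remaining in the listings above; the real value depends on the
kernel's UDP conntrack timeout setting) and that every packet resets
the timer.  The function name first_expiry is hypothetical.

```python
from typing import Iterable, Optional


def first_expiry(packet_times: Iterable[float],
                 timeout: float = 30.0) -> Optional[float]:
    """Return when the entry first expires, or None if it never does
    within the given trace of packet arrival times (in seconds)."""
    deadline = None
    for t in packet_times:
        if deadline is not None and t > deadline:
            # The gap before this packet exceeded the timeout, so the
            # entry was dropped and a fresh one would be created.
            return deadline
        # Each packet matching the entry pushes the deadline forward.
        deadline = t + timeout
    return None


# 100 packets per second for 100 seconds: the timer is always refreshed.
steady = [i * 0.01 for i in range(10000)]
# A sender that pauses for longer than the timeout.
paused = [0.0, 1.0, 40.0]
```

In this model the steady trace never lets the entry expire, whereas the
paused trace does; this matches the observation that only deleting the
entries (or a pause in traffic longer than the timeout) gets the flow
re-evaluated against the NAT rules.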

We have observed this frequently over the years, including recently
after we upgraded our network nodes to Newton.

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: conntrack floating ipv4

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1689952
