yahoo-eng-team team mailing list archive
Message #63889
[Bug 1689952] [NEW] conntrack race can blackhole flows to Floating IP
Public bug reported:
We have some users who want to receive continuous unidirectional flows
of UDP-over-IPv4 datagrams on their instances (sent by some sort of
sensors) via Floating IP. After we migrate or restart the Neutron
routers serving those instances, the users complain that their instances
stop receiving those packets.
After debugging this for a long time, we have observed that there are
incorrect conntrack entries for those flows in the router's namespace.
Apparently these conntrack entries don't NAT the Floating IP to the
instance's Fixed IP. When we delete the conntrack entries, they are
quickly replaced with the correct entries, and the instance starts
receiving traffic again.
$ sudo ip netns exec qrouter-fe77a8ff-769b-4469-8490-1d37873a5671 conntrack -L -d 192.0.2.67
...
udp 17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 [UNREPLIED] src=192.0.2.67 dst=192.0.2.7 sport=12345 dport=58254 mark=0 use=1
...
Note that the response "src" is identical to the original "dst" (the
Floating IP, 192.0.2.67), i.e. no DNAT to the Fixed IP has been applied.
After deleting the entries (sudo ip netns exec ... conntrack -D -d
192.0.2.67), the (new) entries look like this:
$ sudo ip netns exec qrouter-fe77a8ff-769b-4469-8490-1d37873a5671 conntrack -L conntrack -d 192.0.2.67
...
udp 17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 [UNREPLIED] src=10.0.0.107 dst=192.0.2.7 sport=12345 dport=58254 mark=0 use=1
...
These entries are much better, because the response "src" is now the
Fixed IP of the instance (10.0.0.107).
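The difference between the two entries can be checked mechanically. The
following is a minimal sketch, assuming the text format printed by
"conntrack -L" in the listings above. The names parse_entry and
is_missing_dnat are hypothetical and are not part of Neutron or
conntrack-tools. It flags an entry whose response-direction "src" still
equals the original-direction "dst", which means no DNAT was applied.

```python
import re

# Matches key=value tokens such as src=192.0.2.7 or dport=12345.
_KV = re.compile(r"(\w+)=(\S+)")


def parse_entry(line):
    """Split one `conntrack -L` line into (original, reply) dicts.

    The first src/dst/sport/dport group is the original direction, the
    second group is the expected reply direction.
    """
    original, reply = {}, {}
    for key, value in _KV.findall(line):
        if key not in ("src", "dst", "sport", "dport"):
            continue  # skip mark=, use= and similar fields
        target = original if key not in original else reply
        target[key] = value
    return original, reply


def is_missing_dnat(line, floating_ip):
    """True if the entry targets floating_ip but was never translated."""
    original, reply = parse_entry(line)
    return (original.get("dst") == floating_ip
            and reply.get("src") == floating_ip)


# Sample lines taken from the listings in this report.
BAD = ("udp 17 29 src=192.0.2.7 dst=192.0.2.67 sport=58254 dport=12345 "
       "[UNREPLIED] src=192.0.2.67 dst=192.0.2.7 sport=12345 dport=58254 "
       "mark=0 use=1")
GOOD = BAD.replace("[UNREPLIED] src=192.0.2.67", "[UNREPLIED] src=10.0.0.107")

print(is_missing_dnat(BAD, "192.0.2.67"))   # True
print(is_missing_dnat(GOOD, "192.0.2.67"))  # False
```

Such a check could be fed the output of the "conntrack -L -d" command
shown above to find flows that need their entries deleted.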
We assume that there is a race condition: When packets for a given
Floating IP arrive at the router namespace before the NAT rules(?) for
that Floating IP have been completely set up, conntrack creates these
incorrect entries. This is likely if these packets arrive at a high
rate (we have hundreds of those packets per second). Because every packet
refreshes the entry's timer, the incorrect entries never time out as long
as the traffic flows continuously.
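Why a continuous flow keeps a bad entry alive can be illustrated with a
toy model. This is a sketch, not kernel code: it assumes a 30-second
timeout for unreplied UDP entries (the usual default; the countdown
value 29 in the listings above is consistent with it) and that every
matching packet resets the timer. The function name entry_expires is
hypothetical.

```python
def entry_expires(packet_times, timeout=30.0):
    """Return the time at which the entry expires, or None if no gap
    between the given packets (seconds, ascending) was long enough."""
    if not packet_times:
        return None
    deadline = packet_times[0] + timeout  # entry created by first packet
    for t in packet_times[1:]:
        if t >= deadline:
            return deadline  # the flow paused long enough to expire
        deadline = t + timeout  # each packet refreshes the timer
    return None


# 200 packets per second for one minute: the entry never expires.
steady = [i / 200 for i in range(200 * 60)]
print(entry_expires(steady))         # None

# A sender that pauses for more than 30 seconds lets the entry expire.
print(entry_expires([0, 1, 2, 40]))  # 32.0
```

Under these assumptions, only a pause longer than the timeout (or an
explicit "conntrack -D") removes the incorrect entry.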
We have observed this frequently over the years, including recently
after we upgraded our network nodes to Newton.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: conntrack floating ipv4
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1689952
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1689952/+subscriptions