yahoo-eng-team team mailing list archive

[Bug 1814002] [NEW] Packets getting lost during SNAT with too many connections using the same source and destination on Network Node

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Swaminathan Vasudevan <souminathan@xxxxxxxxx>
Date: Wed, 30 Jan 2019 23:30:50 -0000
Reply-to: Bug 1814002 <1814002@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

We appear to have a problem with SNAT on the network nodes when too many connections use the same source and destination.
 
We have reproduced the bug with DNS requests, but we assume that it affects other kinds of traffic as well.
 
When we send a lot of DNS requests, we see that sometimes a packet does not pass through the NAT and simply "gets lost".

 
In addition, the conntrack statistics show that the "insert_failed" counter increases.
 
ip netns exec snat-848819dc-efa2-45d9-9bc3-d96f093fa87a conntrack -S | grep insert_failed | grep -v insert_failed=0
cpu=0   searched=1166140 found=5587918 new=6659 invalid=5 ignore=0 delete=27726 delete_list=27712 insert=6645 insert_failed=14 drop=0 early_drop=0 error=0 search_restart=0
cpu=2   searched=12015 found=64626 new=2467 invalid=0 ignore=0 delete=15205 delete_list=15204 insert=2466 insert_failed=1 drop=0 early_drop=0 error=0 search_restart=0
cpu=3   searched=1348502 found=6097345 new=4093 invalid=0 ignore=0 delete=23200 delete_list=23173 insert=4066 insert_failed=27 drop=0 early_drop=0 error=0 search_restart=0
cpu=4   searched=1068516 found=5398514 new=3299 invalid=0 ignore=0 delete=14144 delete_list=14126 insert=3281 insert_failed=18 drop=0 early_drop=0 error=0 search_restart=0
cpu=5   searched=2280948 found=9908854 new=6770 invalid=0 ignore=0 delete=17224 delete_list=17185 insert=6731 insert_failed=39 drop=0 early_drop=0 error=0 search_restart=0
cpu=6   searched=1123341 found=5264368 new=9749 invalid=0 ignore=0 delete=17272 delete_list=17247 insert=9724 insert_failed=25 drop=0 early_drop=0 error=0 search_restart=0
cpu=7   searched=1553934 found=7234262 new=8734 invalid=0 ignore=0 delete=15658 delete_list=15634 insert=8710 insert_failed=24 drop=0 early_drop=0 error=0 search_restart=0
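
For anyone who wants to track this counter over time, here is a minimal sketch that parses `conntrack -S` output of the form shown above and sums insert_failed across CPUs. The function names are made up for illustration, and it assumes every per-CPU line is a run of whitespace-separated name=value pairs as in the output above.

```python
import re


def parse_conntrack_stats(text):
    """Return {cpu: {counter: value}} parsed from `conntrack -S` style output."""
    stats = {}
    for line in text.splitlines():
        # Each per-CPU line is a sequence of name=value pairs.
        fields = dict(re.findall(r"(\w+)=(\d+)", line))
        if "cpu" not in fields:
            continue  # skip blank or unrelated lines
        cpu = int(fields.pop("cpu"))
        stats[cpu] = {name: int(value) for name, value in fields.items()}
    return stats


def total_insert_failed(text):
    """Sum the insert_failed counter over all CPUs found in the output."""
    per_cpu = parse_conntrack_stats(text)
    return sum(counters.get("insert_failed", 0) for counters in per_cpu.values())
```

Calling total_insert_failed() on two snapshots taken a few seconds apart and comparing the results would show whether failed insertions are still occurring under load.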

This might be a generic problem with conntrack and Linux.
We suspect that we encounter the following "limitation / bug" in the kernel:
https://github.com/torvalds/linux/blob/24de3d377539e384621c5b8f8f8d8d01852dddc8/net/netfilter/nf_nat_core.c#L290-L291
 
There seems to be a workaround that alleviates this behavior: setting the --random-fully flag on the iptables SNAT rule. Unfortunately, this flag is only available since iptables 1.6.2.

Also, neutron does not currently support this for its SNAT rules; it
only uses --to-source.
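
To illustrate what such support could look like, here is a rough sketch of building the SNAT rule arguments and appending --random-fully only when the installed iptables is at least 1.6.2. This is not neutron's actual code: the helper names, the interface name, and the address in the usage note are all hypothetical, and the version string is assumed to look like the output of `iptables --version` (for example "iptables v1.6.2").

```python
import re

# --random-fully for SNAT is available from iptables 1.6.2 onwards.
RANDOM_FULLY_MIN_VERSION = (1, 6, 2)


def parse_iptables_version(version_output):
    """Extract (major, minor, patch) from text such as 'iptables v1.6.2'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("unrecognized iptables version: %r" % version_output)
    return tuple(int(part) for part in match.groups())


def build_snat_rule(out_interface, snat_ip, iptables_version):
    """Return SNAT rule arguments, adding --random-fully when supported."""
    rule = ["-o", out_interface, "-j", "SNAT", "--to-source", snat_ip]
    if iptables_version >= RANDOM_FULLY_MIN_VERSION:
        rule.append("--random-fully")
    return rule
```

For example, build_snat_rule("qg-example", "203.0.113.10", parse_iptables_version("iptables v1.6.1")) would return the rule without the extra flag, so older hosts keep the current behavior.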

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1814002

Title:
  Packets getting lost during SNAT with too many connections using the
  same source and destination on Network Node

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1814002/+subscriptions

Follow ups

[Bug 1814002] Re: Packets getting lost during SNAT with too many connections using the same source and destination on Network Node
From: OpenStack Infra, 2019-04-24