yahoo-eng-team team mailing list archive

[Bug 1945306] [NEW] north-south traffic not working when VM and main router are not on the same host

 

Public bug reported:

Some newly created VMs cannot reach "outside" resources (e.g. apt
repositories) in an l3ha + dvr environment. The problem is easy to
reproduce whenever the VM and the main router are not on the same host:
'apt update' fails inside the VM, so north-south traffic is broken.

Steps to reproduce:

1, set up a wallaby or ussuri vrrp + dvr env (it works on train, but not on ussuri or wallaby)
2, create a test VM and query its host with: nova show <VM> |grep host
3, query the main router with: neutron l3-agent-list-hosting-router $(openstack router show provider-router -fvalue -cid)
4, make sure the VM and the main router are not on the same host
5, on the main router host, this command fails: ip netns exec snat-xxx ping <VM-IP> -c1
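
The precondition in steps 2 to 4 can be checked with a small helper. This
is only a minimal sketch: the function name check_placement is
hypothetical, and the two host names are assumed to be copied by hand
from the output of the commands in steps 2 and 3.

```shell
#!/bin/sh
# Hypothetical helper: decide whether the reproduction precondition holds.
# $1 = host of the VM (from step 2), $2 = host of the main router (from step 3).
check_placement() {
    vm_host="$1"
    router_host="$2"
    if [ -z "$vm_host" ] || [ -z "$router_host" ]; then
        echo "usage: check_placement <vm-host> <router-host>" >&2
        return 2
    fi
    if [ "$vm_host" = "$router_host" ]; then
        # VM and main router share a host: the report says this case works.
        echo "same host: problem not expected"
        return 1
    fi
    echo "different hosts: problem expected"
    return 0
}
```

When the helper returns 0, the ping in step 5 is the command expected to
fail on ussuri and wallaby.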

I did some bisecting and found:

15.3.4 (bionic-train)  - no problem
1c2e10f859             - no problem
16.4.0 (bionic-ussuri) - has problem
16.0.0-0ubuntu3        - has problem, and also has the multiple active routers problem
16.0.0~b3~git2020041516.5f42488a9a-0ubuntu2 - BAD version, all routers are in standby state so no test is possible
16.1.0 (focal) - has problem, and also has the multiple active routers problem
16.2.0 (focal) - has problem
16.3.0 (focal) - has problem
16.4.0 (focal-ussuri) - has problem
focal-wallaby - has problem

I often hit the multiple standby issue with some commits (e.g.
14dd3e95ca), so I could not continue the bisect.
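
One way to keep a bisect going past untestable commits is git's own
"bisect skip". The sketch below only prints the commands rather than
running them. It assumes the bisect is done in a neutron git checkout;
the function name print_bisect_plan is hypothetical, and the known-bad
ref is left as an argument because the report gives package versions
rather than a git ref for it.

```shell
#!/bin/sh
# Print (do not run) a git bisect session that skips untestable commits.
# $1 = known-good ref, $2 = known-bad ref, remaining args = commits to skip.
print_bisect_plan() {
    if [ "$#" -lt 2 ]; then
        echo "usage: print_bisect_plan <good> <bad> [skip...]" >&2
        return 2
    fi
    good="$1"
    bad="$2"
    shift 2
    # git bisect start takes the bad ref first, then the good ref.
    echo "git bisect start $bad $good"
    for commit in "$@"; do
        # Skipped commits are excluded from testing; git picks a neighbour.
        echo "git bisect skip $commit"
    done
}
```

For example, print_bisect_plan 1c2e10f859 <bad-ref> 14dd3e95ca uses the
good commit and the untestable commit named above.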

I also used 'ovs-appctl ofproto/trace' and tcpdump for debugging; the
results are as follows.

train - works
sg-xxx -> vm - https://pastebin.ubuntu.com/p/MHNVf8wXtb/
tcpdump on sg-xxx - https://pastebin.ubuntu.com/p/Fqxp4mvkgV/
tcpdump on vm's tap - https://pastebin.ubuntu.com/p/YppWc2Pg33/
tcpdump on qr-xxx - https://pastebin.ubuntu.com/p/MPmQ5xbnT2/     - can get icmp reply

ussuri - does not work
sg-xxx -> vm - https://pastebin.ubuntu.com/p/hKfSB9gmd9/
tcpdump on sg-xxx - https://pastebin.ubuntu.com/p/NCcnGS4gdj/     - sg-xxx can't get icmp reply
tcpdump on vm's tap - https://pastebin.ubuntu.com/p/DHdVbB66NT/   - VM can't get sg-xxx's arp reply
tcpdump on qr-xxx - https://pastebin.ubuntu.com/p/4hJ7vdRRC4/     - can't get arp reply

It looks like the VM cannot get an ARP reply for the sg-xxx interface.
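
A capture limited to ARP and ICMP makes the missing reply easier to
spot. The helper below is a minimal sketch that only prints the command
to run; the function name capture_cmd is hypothetical, and the namespace
and interface names are the placeholders used in this report, not real
values.

```shell
#!/bin/sh
# Print (do not run) a tcpdump command restricted to ARP and ICMP traffic
# inside a network namespace.
# $1 = network namespace (e.g. snat-xxx), $2 = interface (e.g. sg-xxx).
capture_cmd() {
    if [ "$#" -ne 2 ]; then
        echo "usage: capture_cmd <netns> <iface>" >&2
        return 2
    fi
    # -n: no name resolution, -e: show link-level (MAC) headers,
    # -i: interface to listen on.
    echo "ip netns exec $1 tcpdump -n -e -i $2 'arp or icmp'"
}
```

Running the printed command for sg-xxx while pinging from the snat-xxx
namespace should show whether the ARP request is answered.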

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1945306

Title:
  north-south traffic not working when VM and main router are not on the
  same host

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1945306/+subscriptions