yahoo-eng-team team mailing list archive

[Bug 1794991] [NEW] Inconsistent flows with DVR l2pop VxLAN on br-tun

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Gaëtan Trellu <gaetan.trellu@xxxxxxxxxxxxx>
Date: Fri, 28 Sep 2018 13:56:14 -0000
Reply-to: Bug 1794991 <1794991@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

We are running Neutron (Pike) configured as DVR with l2pop, the ARP
responder and VxLAN. For the past few weeks we have been experiencing
unexpected behaviors:

- [1] Some instances are unable to obtain a DHCP address
- [2] Instances are unable to ping instances on a different compute node

This is completely random: sometimes everything works as expected, and
sometimes we see the behaviors described above.

After checking the flows between the network and compute nodes, we found
that behavior [1] is caused by missing flows on the compute nodes that
should point to the DHCP agent on the network node.

Behavior [2] is also caused by missing flows: some compute nodes lack
the output to other compute nodes (vxlan-xxxxxx), which prevents an
instance on compute 1 from communicating with an instance on compute 2.
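
One way to spot the missing outputs described above is to compare the
tunnel ports referenced by the br-tun flood flows against the ports that
are expected for the network. The following is only a minimal sketch and
was not part of our original investigation. It assumes the input is the
text printed by `ovs-ofctl dump-flows br-tun`, and that the flood flows
live in table 22 (an assumption about the agent's table layout, which may
differ in your deployment). The function names and the expected port
list are hypothetical.

```python
import re

# Matches output actions such as output:5 or output:"vxlan-0a000002".
OUTPUT_RE = re.compile(r'output:"?([\w.-]+)"?')
TABLE_RE = re.compile(r'\btable=(\d+)')


def flood_outputs(dump_text, table=22):
    """Return the set of output ports used by flows in the given table.

    table=22 is an assumed default for the flood table; adjust as needed.
    """
    ports = set()
    for line in dump_text.splitlines():
        match = TABLE_RE.search(line)
        if match is None or int(match.group(1)) != table:
            continue
        # Only look at the action list, not at the match fields.
        _, _, actions = line.partition("actions=")
        ports.update(OUTPUT_RE.findall(actions))
    return ports


def missing_outputs(dump_text, expected_ports, table=22):
    """Return expected ports that no flow in the table outputs to."""
    return set(expected_ports) - flood_outputs(dump_text, table)
```

The expected ports would have to come from elsewhere, for example from
the tunnel ports listed by `ovs-vsctl list-ports br-tun`, filtered to the
compute nodes that actually host ports on the network.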

When we manually add the missing flows for [1] and [2], the issues are
fixed, but if we restart neutron-openvswitch-agent the flows are missing
again.
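
To see exactly which flows disappear across an agent restart, two flow
dumps can be compared after stripping the fields that change on every
dump. This is a rough sketch under stated assumptions, not something from
the original report: it assumes both snapshots are the text output of
`ovs-ofctl dump-flows br-tun`, it ignores the cookie because the cookie
is expected to differ after a restart, and all function names are
hypothetical.

```python
import re

# Fields that differ between dumps and must be ignored when comparing.
VOLATILE_RE = re.compile(
    r'\b(?:cookie|duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*'
)


def normalize(line):
    """Drop counters, ages and the cookie from one dump-flows line."""
    return VOLATILE_RE.sub("", line).strip()


def flow_set(dump_text):
    """Return the set of normalized flows found in a dump."""
    flows = set()
    for line in dump_text.splitlines():
        if "actions=" not in line:
            continue  # skip the reply header and blank lines
        flows.add(normalize(line))
    return flows


def vanished_flows(before, after):
    """Return flows present in 'before' but absent from 'after'."""
    return sorted(flow_set(before) - flow_set(after))
```

Taking one snapshot after adding the flows by hand and another after
restarting the agent, then passing both to `vanished_flows`, should list
the flows that the agent did not restore.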

For [1], disabling and re-enabling the port on the network nodes
associated with each DHCP agent sometimes solves the problem and
sometimes does not.

For [2], the only way we have found to fix the flows without adding them
manually is to remove all instances of a network from the compute node
and create a new instance on that network, which sends a notification
message to all compute and network nodes. However, when
neutron-openvswitch-agent restarts, the flows vanish again.

We cherry-picked these commits but nothing changed:
  - https://review.openstack.org/#/c/600151/
  - https://review.openstack.org/#/c/573785/

Any ideas?

** Affects: neutron
     Importance: Undecided
         Status: New

** Summary changed:

- Inconsistent flows with DVR l2pop on br-tun
+ Inconsistent flows with DVR l2pop VxLAN on br-tun

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1794991

Title:
  Inconsistent flows with DVR l2pop VxLAN on br-tun

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1794991/+subscriptions
Follow ups

[Bug 1794991] Re: Inconsistent flows with DVR l2pop VxLAN on br-tun
From: OpenStack Infra, 2019-03-23