yahoo-eng-team team mailing list archive

[Bug 1794991] Re: Inconsistent flows with DVR l2pop VxLAN on br-tun

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1794991@xxxxxxxxxxxxxxxxxx>
Date: Sat, 23 Mar 2019 04:47:21 -0000
Reply-to: Bug 1794991 <1794991@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Reviewed:  https://review.openstack.org/640797
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch:    master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong <i@xxxxxxxxxxxx>
Date:   Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer
    
    The ovs-agent can take a very long time to process a large number
    of ports. By then, the ovs-agent status report may have exceeded
    the configured timeout value, so some flow update operations
    will not be triggered. This results in flow loss during agent
    restart, especially for host-to-host vxlan tunnel flows.
    
    This fix lets the ovs-agent explicitly indicate, in the first rpc
    loop, that it has restarted. l2pop is then required to update the
    fdb entries.
    
    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178
    
    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
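
The idea in the commit message can be sketched as a toy model. This is not the neutron code: `PortUpReport`, `AGENT_BOOT_TIME`, and both functions are hypothetical names made up for illustration. It only shows why an uptime-based guess can miss a slow restart while an explicit flag sent in the first rpc loop does not.

```python
from dataclasses import dataclass

# Hypothetical threshold standing in for a server-side "agent just booted" window.
AGENT_BOOT_TIME = 180  # seconds


@dataclass
class PortUpReport:
    """Hypothetical message an agent sends when it reports a port as up."""
    host: str
    agent_uptime: float            # seconds since the agent process started
    agent_restarted: bool = False  # set explicitly during the first rpc loop


def needs_full_fdb_by_uptime(report: PortUpReport) -> bool:
    # Heuristic only: a restart is assumed if the agent looks freshly booted.
    return report.agent_uptime < AGENT_BOOT_TIME


def needs_full_fdb_by_flag(report: PortUpReport) -> bool:
    # The explicit flag wins; the uptime heuristic remains as a fallback.
    return report.agent_restarted or report.agent_uptime < AGENT_BOOT_TIME


# An agent that needed 10 minutes to process its ports after restarting.
slow = PortUpReport(host="compute-1", agent_uptime=600.0, agent_restarted=True)
print(needs_full_fdb_by_uptime(slow))  # False: the restart goes unnoticed
print(needs_full_fdb_by_flag(slow))    # True: fdb entries are sent again
```

With a large number of ports the first loop can outlast any fixed window, which is the case the explicit flag is meant to cover.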


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1794991

Title:
  Inconsistent flows with DVR l2pop VxLAN on br-tun

Status in neutron:
  Fix Released

Bug description:
  We are using Neutron (Pike) configured as DVR with l2pop, ARP
  responder, and VxLAN. For the past few weeks we have been
  experiencing unexpected behaviors:

  - [1] Some instances are not able to get a DHCP address
  - [2] Instances are not able to ping other instances on a different compute node

  This is totally random: sometimes it works as expected and sometimes
  we see the behaviors described above.

  After checking the flows between the network and compute nodes, we
  found that behavior [1] is caused by missing flows on the compute
  nodes pointing to the DHCP agent on the network node.

  Behavior [2] is also caused by missing flows: some compute nodes are
  missing outputs to other compute nodes (vxlan-xxxxxx), which prevents
  an instance on compute 1 from communicating with an instance on
  compute 2.
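
A check like the one described above can be scripted. The following is a minimal sketch, assuming the text comes from `ovs-ofctl dump-flows br-tun` and that the flood-to-tunnel flows live in table 22 (an assumption to verify against your release). `SAMPLE`, the port numbers, and the `expected` set are made-up illustrative values, not data from this deployment.

```python
import re

# Assumed table number for flood-to-tunnel flows on br-tun.
FLOOD_TO_TUN = 22

# Illustrative dump text; real output comes from `ovs-ofctl dump-flows br-tun`.
SAMPLE = """\
 cookie=0x1, duration=10.0s, table=22, n_packets=5, n_bytes=300, priority=1,dl_vlan=1 actions=strip_vlan,load:0x3e9->NXM_NX_TUN_ID[],output:2,output:3
 cookie=0x1, duration=10.0s, table=22, n_packets=0, n_bytes=0, priority=0 actions=drop
"""


def flood_outputs(dump_text, table=FLOOD_TO_TUN):
    """Map each dl_vlan in `table` to the set of output ports in its actions."""
    result = {}
    for line in dump_text.splitlines():
        if not re.search(r"\btable=%d\b" % table, line):
            continue
        vlan = re.search(r"\bdl_vlan=(\d+)", line)
        if not vlan or " actions=" not in line:
            continue
        actions = line.split(" actions=", 1)[1]
        # Ports may be printed as numbers or as quoted names (vxlan-...).
        ports = set(re.findall(r'output:"?([\w.-]+)"?', actions))
        result.setdefault(int(vlan.group(1)), set()).update(ports)
    return result


# Hypothetical set of tunnel ports this node should flood to for local VLAN 1.
expected = {"2", "3", "4"}
missing = expected - flood_outputs(SAMPLE).get(1, set())
print(sorted(missing))  # ['4']
```

Running the same comparison before and after restarting neutron-openvswitch-agent would show which tunnel outputs disappear.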

  When we add the missing flows for [1] and [2] we are able to fix the
  issues but if we restart neutron-openvswitch-agent the flows are
  missing again.

  For [1], disabling and re-enabling the DHCP port on the network
  nodes sometimes solves the problem and sometimes does not.

  For [2], the only way we found to fix the flows without adding them
  manually is to remove all instances of a network from the compute
  node and create a new instance on this network, which sends a
  notification message to all compute and network nodes. But again,
  when neutron-openvswitch-agent restarts, the flows vanish.

  We cherry-picked these commits but nothing changed:
    - https://review.openstack.org/#/c/600151/
    - https://review.openstack.org/#/c/573785/

  Information about our deployment:
    - OS: Ubuntu 16.04.5
    - Deployer: Kolla
    - Docker: 18.06
    - OpenStack: Pike/Rocky

  Any ideas?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1794991/+subscriptions

References

[Bug 1794991] [NEW] Inconsistent flows with DVR l2pop VxLAN on br-tun
From: Gaëtan Trellu, 2018-09-28