
yahoo-eng-team team mailing list archive

[Bug 1758868] Re: ovs restart can lead to critical ovs flows missing

 

*** This bug is a duplicate of bug 1584647 ***
    https://bugs.launchpad.net/bugs/1584647

On the assumption that bug 1584647 resolved this issue, marking as a
dupe - please comment if this is not the case or the issue remains.

** This bug has been marked a duplicate of bug 1584647
   [SRU] "Interface monitor is not active" can be observed at ovs-agent start

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1758868

Title:
  ovs restart can lead to critical ovs flows missing

Status in neutron:
  New
Status in neutron package in Ubuntu:
  New

Bug description:
  Hi,

  Running mitaka on xenial (neutron 2:8.4.0-0ubuntu6). We have l2pop and
  no l3ha. Using ovs with GRE tunnels.

  The cloud has around 30 compute nodes (mostly arm64). Last week, ovs
  got restarted during a package upgrade:

  2018-03-21 17:17:25 upgrade openvswitch-common:arm64
  2.5.2-0ubuntu0.16.04.3 2.5.4-0ubuntu0.16.04.1

  This led to instances on 2 arm64 compute nodes losing networking
  completely. Upon closer inspection, I realized that a flow was
  missing in br-tun table 3: https://pastebin.ubuntu.com/p/VXRJJX8J3k/

  I believe this is due to a race in ovs_neutron_agent.py. These flows
  in table 3 are set up in provision_local_vlan() (a paraphrased sketch
  follows the call chain below):
  https://github.com/openstack/neutron/blob/mitaka-
  eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L675

  which is called by port_bound():
  https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L789-L791

  which is called by treat_vif_port():
  https://github.com/openstack/neutron/blob/mitaka-
  eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1405-L1410

  which is called by treat_devices_added_or_updated():
  https://github.com/openstack/neutron/blob/mitaka-
  eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1517-L1525

  which is called by process_network_ports():
  https://github.com/openstack/neutron/blob/mitaka-
  eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1618-L1623

  which is called by the big rpc_loop():
  https://github.com/openstack/neutron/blob/mitaka-
  eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2023-L2029
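
  For reference, the interesting part of provision_local_vlan() looks
  roughly like this (paraphrased from memory, not a verbatim copy of
  the mitaka code; names may differ slightly):

      # provision_local_vlan(), simplified: allocate a local VLAN for
      # the network and, for tunnel network types, install the br-tun
      # translation flow (table 3 for GRE) that maps the incoming
      # tunnel id to that local VLAN
      lvid = self.available_local_vlans.pop()
      if network_type in constants.TUNNEL_NETWORK_TYPES:
          # this is the flow that was missing on the broken nodes
          self.tun_br.provision_local_vlan(
              network_type=network_type, lvid=lvid,
              segmentation_id=segmentation_id)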

  
  So how does the agent know when to create these table 3 flows? Well,
  in rpc_loop(), it checks for OVS restarts
  (https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1947-L1948),
  and if OVS did restart, it does some basic OVS setup (default flows,
  etc.) and (very important for later) it restarts the OVS polling
  manager.
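
  Paraphrasing that restart handling (not verbatim, but close to what
  the linked code does):

      # inside rpc_loop(): detect an OVS restart and redo the basic
      # bridge setup
      ovs_status = self.check_ovs_status()
      if ovs_status == constants.OVS_RESTARTED:
          self.setup_integration_br()
          self.setup_physical_bridges(self.bridge_mappings)
          if self.enable_tunneling:
              self.setup_tunnel_br()
              self.setup_tunnel_br_flows()
          # ...and, crucially, the async OVS polling manager gets
          # restarted here (see the link above), which is what should
          # make the existing ports show up as "added" again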

  Later (still in rpc_loop()), it sets "ovs_restarted" to True, and
  processes the ports as usual. The expected behaviour here is that,
  since the polling manager got restarted, any port that is up will be
  marked as "added" and processed as such in port_bound() (see call
  stack above). If this function is called on a port while
  ovs_restarted is True, then provision_local_vlan() will get called
  and will install the table 3 flows.
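
  The relevant check in port_bound() is roughly (paraphrased, not
  verbatim):

      # port_bound(), simplified: (re)provision the local VLAN - and
      # with it the br-tun table 3 flow - if the network has no local
      # VLAN mapping yet OR if OVS just restarted
      if net_uuid not in self.local_vlan_map or ovs_restarted:
          self.provision_local_vlan(net_uuid, network_type,
                                    physical_network, segmentation_id)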

  This all works fine under the assumption that the polling manager
  (which is an async process) will raise the "I got a new port!" event
  before the rpc_loop() checks for it (in process_port_events(), called
  by process_port_info()). However, if for example the node is under
  load, this may not always be the case.
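
  For context, process_port_info() roughly does the following
  (paraphrased, arguments trimmed):

      # process_port_info(), simplified: when sync is False the agent
      # only consumes the events already queued by the async polling
      # manager; if the "port added" event hasn't arrived yet, this
      # iteration sees no port changes at all
      if sync:
          port_info = self.scan_ports(registered_ports, sync,
                                      updated_ports)
      else:
          events = polling_manager.get_events()
          port_info = self.process_port_events(events, ...)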

  What happens then is that the rpc_loop iteration in which OVS is
  detected as restarted doesn't see any change on the ports, and so
  does nothing. The next run of the rpc_loop will process the "I got a
  new port!" events, but that loop will not be running with
  ovs_restarted set to True, so the ports won't be brought up properly
  - more specifically, the table 3 flows in br-tun will be missing.
  This is shown in the debug logs: https://pastebin.ubuntu.com/p/M8yYn3YnQ6/
  - you can see that the loop in which "OVS is restarted" is detected
  (loop iteration 320773) doesn't process any port ("iteration:320773
  completed. Processed ports statistics: {'regular': {'updated': 0,
  'added': 0, 'removed': 0}}."), but the next iteration does process 3
  "added" ports. You can see that "output received" is logged in the
  first loop, 49ms after "starting polling" is logged, which is
  presumably the problem. On all the non-failing nodes, the output is
  received before "starting polling".

  I believe the proper thing to do is to set "sync" to True (in
  rpc_loop()) if an ovs restart is detected, forcing process_port_info()
  to not use async events and scan the ports itself using scan_ports().
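
  Something along these lines (illustrative only, untested):

      # in rpc_loop(), where the OVS restart is detected
      if ovs_status == constants.OVS_RESTARTED:
          # ... existing bridge setup ...
          # don't rely on the async events for this iteration: force
          # process_port_info() down the scan_ports() path so the
          # re-provisioning happens in the same loop iteration that
          # has ovs_restarted = True
          sync = True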

  Thanks

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1758868/+subscriptions