yahoo-eng-team team mailing list archive
Message #72077
[Bug 1758868] Re: ovs restart can lead to critical ovs flows missing
I've added upstream neutron to the bug. Keep in mind that this is mitaka
and unsupported by upstream, but perhaps someone from upstream knows
whether this is fixed in a mitaka+ release or not.
** Also affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1758868
Title:
ovs restart can lead to critical ovs flows missing
Status in neutron:
New
Status in neutron package in Ubuntu:
New
Bug description:
Hi,
Running mitaka on xenial (neutron 2:8.4.0-0ubuntu6). We have l2pop and
no l3ha. Using ovs with GRE tunnels.
The cloud has around 30 compute nodes (mostly arm64). Last week, ovs
got restarted during a package upgrade:
2018-03-21 17:17:25 upgrade openvswitch-common:arm64
2.5.2-0ubuntu0.16.04.3 2.5.4-0ubuntu0.16.04.1
This led to instances on 2 arm64 compute nodes losing networking
completely. Upon closer inspection, I realized that a flow was missing
in br-tun table 3: https://pastebin.ubuntu.com/p/VXRJJX8J3k/
I believe this is due to a race in ovs_neutron_agent.py. These flows
in table 3 are set up in provision_local_vlan():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L675
which is called by port_bound():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L789-L791
which is called by treat_vif_port():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1405-L1410
which is called by treat_devices_added_or_updated():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1517-L1525
which is called by process_network_ports():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1618-L1623
which is called by the big rpc_loop():
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2023-L2029
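To make the chain easier to follow, here is a heavily condensed Python
sketch of it. The method names match the links above, but the signatures,
bodies and the local_vlan_map bookkeeping are simplified paraphrases for
illustration only, not the actual mitaka code:

    # Condensed, illustrative paraphrase of the call chain above.
    # Signatures and bodies are simplified; the linked mitaka-eol source
    # is the authoritative version.
    class OVSAgentSketch(object):

        def __init__(self):
            self.local_vlan_map = {}  # net_uuid -> locally allocated VLAN

        def provision_local_vlan(self, net_uuid, network_type,
                                 segmentation_id):
            # For tunnelled networks this is where the br-tun table 3
            # flow (tunnel/segmentation id -> local VLAN) is installed.
            self.local_vlan_map[net_uuid] = object()  # placeholder

        def port_bound(self, port, net_uuid, network_type,
                       segmentation_id, ovs_restarted):
            # The table 3 flow is only (re)installed when the network is
            # not yet provisioned locally, or when an OVS restart was
            # detected in the current rpc_loop iteration.
            if net_uuid not in self.local_vlan_map or ovs_restarted:
                self.provision_local_vlan(net_uuid, network_type,
                                          segmentation_id)

        def treat_vif_port(self, port, details, ovs_restarted):
            self.port_bound(port, details['net_uuid'],
                            details['network_type'],
                            details['segmentation_id'], ovs_restarted)

        def treat_devices_added_or_updated(self, devices, ovs_restarted):
            for port, details in devices:
                self.treat_vif_port(port, details, ovs_restarted)

        def process_network_ports(self, port_info, ovs_restarted):
            # rpc_loop() passes the ovs_restarted flag down on every
            # iteration; only "added"/"updated" ports reach port_bound().
            self.treat_devices_added_or_updated(
                port_info.get('added', []), ovs_restarted)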
So how does the agent know when to create these table 3 flows? Well, in
rpc_loop(), it checks for OVS restarts
(https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1947-L1948),
and if OVS did restart, it does some basic ovs setup (default flows,
etc.) and, very important for later, it restarts the OVS polling manager.
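Paraphrased, that part of rpc_loop() looks roughly like the sketch below.
process_port_info() and process_network_ports() are the methods named in
this report and check_ovs_status()/OVS_RESTARTED are how I remember the
restart check being spelled; the setup helper name is a simplified
stand-in of mine, and the real loop does a lot more:

    # Rough paraphrase of the OVS-restart handling in rpc_loop(); this is
    # not the real mitaka code, just the shape of it.
    OVS_RESTARTED = 'restarted'   # stands in for constants.OVS_RESTARTED

    def rpc_loop(agent, polling_manager):
        sync = True
        while True:   # simplified loop condition
            ovs_status = agent.check_ovs_status()
            if ovs_status == OVS_RESTARTED:
                # Re-create the default flows on br-int/br-tun, rebuild
                # tunnels, etc. (simplified stand-in name) ...
                agent.setup_bridges_and_default_flows()
                # ... and restart the asynchronous polling manager so that
                # every existing port should be reported as "added" again.
                polling_manager.stop()
                polling_manager.start()
            ovs_restarted = (ovs_status == OVS_RESTARTED)
            # Port processing relies on the polling manager having already
            # delivered those "added" events (this is where the race is).
            port_info = agent.process_port_info(sync, polling_manager)
            agent.process_network_ports(port_info, ovs_restarted)
            sync = False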
Later (still in rpc_loop()), it sets "ovs_restarted" to True and
processes the ports as usual. The expected behaviour here is that since
the polling manager got restarted, any port that is up will be marked as
"added" and processed as such, in port_bound() (see call stack above).
If this function is called on a port when ovs_restarted is True, then
provision_local_vlan() will get called and will add the table 3 flows.
This is all working great under the assumption that the polling
manager (which is an async process) will raise the "I got new port!"
event before the rpc_loop() checks it (in process_port_events(),
called by process_port_info()). However, if, for example, the node is
under load, this may not always be the case.
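To illustrate where the race bites: process_port_info() only falls back
to a full scan_ports() when sync is set; otherwise it trusts whatever
events the async polling manager has produced so far, which right after
a restart may still be nothing. Again a simplified paraphrase, not the
real code:

    # Simplified paraphrase of the decision made in process_port_info().
    def process_port_info(agent, sync, polling_manager):
        if sync:
            # Authoritative path: ask OVS which ports exist right now and
            # diff that against the agent's cached view.
            return agent.scan_ports()
        # Fast path: rely on the async polling manager's events. Right
        # after the polling manager has been restarted, and especially
        # under load, the "port added" events may not have arrived yet,
        # so this can return an empty diff even though ports are present.
        return agent.process_port_events(polling_manager.get_events())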
What happens then is that the rpc_loop in which OVS is detected as
restarted doesn't see any change on the ports, and so does nothing.
The next run of the rpc_loop will process the "I got new port!"
events, but that loop will not be running with ovs_restarted set to
True, so the ports won't be brought up properly - more specifically,
the table 3 flows in br-tun will be missing. This is shown in the
debug logs: https://pastebin.ubuntu.com/p/M8yYn3YnQ6/ - you can see
that the loop in which "OVS is restarted" is detected (loop iteration
320773) doesn't process any port ("iteration:320773 completed.
Processed ports statistics: {'regular': {'updated': 0, 'added': 0,
'removed': 0}}."), but the next iteration does process 3 "added" ports.
You can see that the "output received" is logged in the first loop,
49ms after "starting polling" is logged, which is presumably the
problem. On all the non-failing nodes, the output is received before
"starting polling".
I believe the proper thing to do is to set "sync" to True (in
rpc_loop()) if an ovs restart is detected, forcing process_port_info()
to not use async events and scan the ports itself using scan_ports().
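A sketch of what that could look like, in terms of the rpc_loop()
paraphrase earlier in this report (not an actual patch against the
mitaka code):

    # Proposed fix, relative to the earlier sketch: force a full port
    # scan on the very iteration that detects the OVS restart, instead
    # of trusting the (possibly late) events from the freshly restarted
    # polling manager.
    ovs_status = agent.check_ovs_status()
    if ovs_status == OVS_RESTARTED:
        # existing restart handling (default flows, polling manager
        # restart, ...) stays unchanged
        sync = True   # new: make process_port_info() use scan_ports()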
Thanks
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1758868/+subscriptions