yahoo-eng-team team mailing list archive
Message #76147
[Bug 1803919] Re: [L2] dataplane down during ovs-agent restart
Reviewed: https://review.openstack.org/618720
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0385868848f8c18c8a37fd4c661d1b1a5078e044
Submitter: Zuul
Branch: master
commit 0385868848f8c18c8a37fd4c661d1b1a5078e044
Author: LIU Yulong <i@xxxxxxxxxxxx>
Date: Thu Nov 15 17:49:12 2018 +0800
Check if agent can reach neutron server
The ovs agent first installs some basic drop flows for the
physical bridge mappings during the init procedure. If the message
queue is not connected, or the neutron-servers are all down, the real
traffic flows will never be refreshed. This brings the data plane
down if the tenant network and provider network share the
physical NICs.
This patch adds an RPC check during L2 agent init. When the
ovs-agent is restarted, if the MQ is OK and a neutron-server is
available, it proceeds to the next step. Otherwise, an RPC timeout is
raised, the L2 agent fails to start, and the physical bridge mapping
drop flows are not installed. The original flows are not replaced, so
traffic keeps working properly.
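The ordering described above can be illustrated with a small, self-contained sketch. This is not the code from the patch: `FakeBridge`, `RpcTimeout`, `init_agent`, `server_up` and `server_down` are hypothetical names used only for illustration, and the flow strings are placeholders. The only point is that the reachability check runs before anything touches the bridge, so a timeout leaves the existing flows alone.

```python
class RpcTimeout(Exception):
    """Stand-in for a messaging timeout (hypothetical, not a real library class)."""


class FakeBridge:
    """Hypothetical physical bridge modelled as a list of flow strings."""

    def __init__(self, flows):
        self.flows = list(flows)

    def install_drop_flows(self):
        # Simplification: the init procedure leaves only basic drop flows
        # until the real traffic flows are refreshed later.
        self.flows = ["in_port=phys,actions=drop"]


def init_agent(bridge, ping_server):
    """Check that the server is reachable first, then set up the bridge."""
    ping_server()  # raises RpcTimeout if the MQ or all servers are down
    bridge.install_drop_flows()
    return bridge.flows


def server_down():
    raise RpcTimeout("no reply from neutron-server")


def server_up():
    return "pong"


if __name__ == "__main__":
    bridge = FakeBridge(["existing-forwarding-flow"])
    try:
        init_agent(bridge, server_down)
    except RpcTimeout:
        # The agent fails to start; the original flows are untouched.
        print("agent start failed, flows kept:", bridge.flows)
```

With `server_up` passed instead, the check returns and the drop flows are installed as before, which matches the "go next step" branch of the commit message.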
Closes-Bug: #1803919
Change-Id: Ie15cf625b3710eaf290d6aafecb3f65df664b9df
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1803919
Title:
[L2] dataplane down during ovs-agent restart
Status in neutron:
Fix Released
Bug description:
ENV:
neutron: stable/queens
tenant network type: vlan
provider network type: vlan
kernel: 3.10.0-862.3.2.el7.x86_64
Problem description:
This is an extreme case for the neutron ovs-agent during restart.
(1) condition 1: the tenant network and provider network share the same physical NIC, i.e. they send traffic to the same physical NIC, so the bridge mapping will be: br-provider:bond1. No other mappings.
(2) condition 2: the neutron-servers are all down, or the message queue is down.
Then, if the L2 ovs-agent is restarted, the dataplane goes down.
This issue was seen during the upgrade of a large deployment:
when neutron-server and ovs-agent were restarted at the same time, some
ovs-agents got message timeouts, and the VM traffic went down.
Code digging:
stable/queens and the master branch follow basically the same procedure for this issue.
The ovs-agent init procedure calls `setup_physical_bridges`, which installs two drop flows:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1225-L1226
Once these two drop flows are installed, the VMs' traffic goes down.
If the MQ or neutron server is not up, the VMs remain unreachable. Once the MQ and neutron server are back up, the ovs-agent still needs another manual restart to recover the traffic.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1803919/+subscriptions