yahoo-eng-team team mailing list archive

[Bug 1803919] [NEW] [L2] dataplane down during ovs-agent restart

 

Public bug reported:

ENV:
neutron: stable/queens
tenant network type: vlan
provider network type: vlan
kernel: 3.10.0-862.3.2.el7.x86_64

Problem description:
This is an extreme case that occurs when the neutron ovs-agent restarts. Two conditions must hold:
(1) Condition 1: the tenant network and the provider network share the same physical NIC, i.e. their traffic goes out through the same physical NIC, so the bridge mapping is br-provider:bond1, with no other mappings.
(2) Condition 2: all neutron-servers are down, or the message queue is down.
If the L2 ovs-agent is then restarted, the dataplane goes down.

This issue was seen during the upgrade of a large deployment: when
neutron-server and the ovs-agents were restarted at the same time, some
ovs-agents hit message timeouts and VM traffic went down.

Code digging:
The stable/queens and master branches follow basically the same procedure here.
The ovs-agent init procedure calls `setup_physical_bridges`, which installs two drop flows:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1225-L1226
Once these two drop flows are installed, VM traffic goes down.
If the MQ or neutron-server is not up, the VMs stay unreachable. Even after the MQ and neutron-server are all back up, the ovs-agent has to be restarted manually once more to recover the traffic.
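
The failure sequence described above can be illustrated with a toy model. This is only a sketch, not neutron code: the `Flow` and `Bridge` classes, the `restart_agent` function, the port name `phy-port`, and the priority values are all hypothetical names chosen for illustration. It also assumes that no per-VLAN flows from before the restart remain usable, which the report implies but does not state explicitly. The model shows a blanket drop flow that is installed unconditionally at startup, and higher-priority per-VLAN flows that can only be installed once the agent has reached neutron-server over the message queue.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical priorities for this model; only their relative order matters.
DROP_PRIORITY = 2
VLAN_PRIORITY = 4


@dataclass(frozen=True)
class Flow:
    priority: int
    in_port: str          # "*" matches any port
    vlan: Optional[int]   # None matches any VLAN
    action: str           # "drop" or "forward"


class Bridge:
    """Toy flow table: the highest-priority matching flow wins."""

    def __init__(self) -> None:
        # Default low-priority rule that forwards everything.
        self.flows: List[Flow] = [Flow(0, "*", None, "forward")]

    def add_flow(self, flow: Flow) -> None:
        self.flows.append(flow)

    def lookup(self, in_port: str, vlan: int) -> str:
        matches = [
            f for f in self.flows
            if f.in_port in ("*", in_port) and f.vlan in (None, vlan)
        ]
        return max(matches, key=lambda f: f.priority).action


def restart_agent(bridge: Bridge, server_reachable: bool,
                  vlans: List[int]) -> None:
    """Model of the agent start-up order described in the report."""
    # Step 1: always runs. Blanket drop on the inter-bridge port.
    bridge.add_flow(Flow(DROP_PRIORITY, "phy-port", None, "drop"))
    # Step 2: needs port details from neutron-server via the MQ.
    if not server_reachable:
        return  # RPC timed out: the drop flow stays in effect
    for vlan in vlans:
        bridge.add_flow(Flow(VLAN_PRIORITY, "phy-port", vlan, "forward"))


if __name__ == "__main__":
    down = Bridge()
    restart_agent(down, server_reachable=False, vlans=[100])
    print("server down, vlan 100:", down.lookup("phy-port", 100))

    up = Bridge()
    restart_agent(up, server_reachable=True, vlans=[100])
    print("server up, vlan 100:", up.lookup("phy-port", 100))
```

In this model the VM traffic on VLAN 100 is dropped whenever the restart happens while the server side is unreachable, and nothing in the model re-runs step 2 later, which mirrors the need for a second manual restart.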

** Affects: neutron
     Importance: Undecided
     Assignee: LIU Yulong (dragon889)
         Status: New

** Description changed:

  ENV:
  neutron: stable/queens
  tenant network type: vlan
  provider network type: vlan
  kernel: 3.10.0-862.3.2.el7.x86_64
  
  Problem description:
  This is an extreme case that occurs when the neutron ovs-agent restarts. Two conditions must hold:
  (1) Condition 1: the tenant network and the provider network share the same physical NIC, i.e. their traffic goes out through the same physical NIC, so the bridge mapping is br-provider:bond1, with no other mappings.
  (2) Condition 2: all neutron-servers are down, or the message queue is down.
  If the L2 ovs-agent is then restarted, the dataplane goes down.
  
  This issue was seen during the upgrade of a large deployment: when
  neutron-server and the ovs-agents were restarted at the same time, some
  ovs-agents hit message timeouts and VM traffic went down.
  
  Code digging:
  The stable/queens and master branches follow basically the same procedure here.
  The ovs-agent init procedure calls `setup_physical_bridges`, which installs two drop flows:
- https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1221-L1222
+ https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1225-L1226
  Once these two drop flows are installed, VM traffic goes down.
  If the MQ or neutron-server is not up, the VMs stay unreachable. Even after the MQ and neutron-server are all back up, the ovs-agent has to be restarted manually once more to recover the traffic.

** Changed in: neutron
     Assignee: (unassigned) => LIU Yulong (dragon889)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1803919

Title:
  [L2] dataplane down during ovs-agent restart

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1803919/+subscriptions

