
yahoo-eng-team team mailing list archive

[Bug 1887148] [NEW] Network loop between physical networks with DVR


Public bug reported:

Our CI experienced a network loop caused by
https://review.opendev.org/#/c/733568/ . It occurs when DVR is enabled,
there is more than one physical bridge mapping, and the neutron server
is not available when the ovs agents are started.

Steps
=====
# add more physical bridges
ovs-vsctl add-br br-physnet1
ip link set dev br-physnet1 up

ovs-vsctl add-br br-physnet2
ip link set dev br-physnet2 up

# send a broadcast from one bridge
ip address add 1.1.1.1/31 dev br-physnet1
arping -b -I br-physnet1 1.1.1.1

# listen on the other
tcpdump -eni br-physnet2

# Update /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2_type_vlan]
network_vlan_ranges = public,physnet1,physnet2

[ovs]
datapath_type = system
bridge_mappings = public:br-ex,physnet1:br-physnet1,physnet2:br-physnet2
tunnel_bridge = br-tun
local_ip = 127.0.0.1

[agent]
tunnel_types = vxlan
root_helper_daemon = sudo /usr/local/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
root_helper = sudo /usr/local/bin/neutron-rootwrap /etc/neutron/rootwrap.conf
enable_distributed_routing = True
l2_population = True

# stop server and agent
systemctl stop devstack@q-svc
systemctl stop devstack@q-agt

# clear all flows
for BR in $(sudo ovs-vsctl list-br); do echo $BR; sudo ovs-ofctl del-flows $BR; done

# start agent
systemctl start devstack@q-agt

$ sudo tcpdump -eni br-physnet2
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br-physnet2, link-type EN10MB (Ethernet), capture size 262144 bytes
09:46:56.577183 e2:ab:d4:16:46:4d > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.1 (ff:ff:ff:ff:ff:ff) tell 1.1.1.1, length 28
09:46:57.577568 e2:ab:d4:16:46:4d > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 1.1.1.1 (ff:ff:ff:ff:ff:ff) tell 1.1.1.1, length 28
...

If there is more than one node running the ovs agent in this state,
there will be a network loop: packets can multiply quickly and
overwhelm the network. We saw ~1 million packets/sec.
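A toy model (my own illustration, not neutron code) of why the packet count explodes: with the DVR drop flows missing, each node's bridges flood every broadcast they receive back onto the shared physical networks, so each reflood cycle multiplies the in-flight frame count by the number of other nodes.

```python
def amplified_frames(nodes: int, hops: int) -> int:
    """Toy model of broadcast amplification in a bridge loop.

    Each node refloods every broadcast frame it receives to all
    other nodes, so after each hop the frame count grows by a
    factor of (nodes - 1).
    """
    frames = 1
    for _ in range(hops):
        frames *= nodes - 1  # every other node refloods the frame
    return frames

# With 3 nodes, a single ARP request becomes 1024 frames after
# only 10 reflood cycles; with 2 nodes the one frame just
# circulates forever.
print(amplified_frames(3, 10))  # → 1024
```
This is only a back-of-the-envelope growth model, but it matches the observed behaviour: the loop saturates the links within seconds.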

I think that because the neutron server is not available, the get_dvr_mac_address RPC call blocks and the required drop flows are never installed:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L138
https://github.com/openstack/neutron/blob/5999716cfc4a00ac426e016eabbb51247ba0b190/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L230-L234
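A simplified sketch of the ordering problem (hypothetical function names, not the actual agent code): if the agent installs the broadcast drop flows *before* blocking on the server RPC, an unreachable neutron server fails closed instead of leaving the physical bridges looping traffic.

```python
import time

def setup_dvr_flows(get_dvr_mac_address, install_drop_flows,
                    retries=3, delay=0.01):
    """Hypothetical sketch of a fail-closed agent start-up order.

    The drop flows go in first, so that if the neutron server is
    down the bridges stay closed while the RPC is retried.
    """
    install_drop_flows()  # fail closed: drops in place before any RPC
    for _ in range(retries):
        try:
            # May raise if the server is unreachable.
            return get_dvr_mac_address()
        except ConnectionError:
            time.sleep(delay)  # server not up yet; drops stay in place
    return None  # server never answered; bridges remain safely closed

# Example: a fake server that only answers on the third attempt.
calls = []
answers = iter([None, None, "fa:16:3f:00:00:01"])

def fake_rpc():
    mac = next(answers)
    if mac is None:
        raise ConnectionError("server unavailable")
    return mac

mac = setup_dvr_flows(fake_rpc, lambda: calls.append("drops installed"))
print(calls, mac)  # → ['drops installed'] fa:16:3f:00:00:01
```
The point of the sketch is only the ordering: the drop flows are installed exactly once, before the first (failing) RPC attempt, so the window in which the bridges forward broadcasts between physnets never opens.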

** Affects: neutron
     Importance: Undecided
         Status: New

** Summary changed:

- Network loop between physical network with DVR
+ Network loop between physical networks with DVR

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1887148

Title:
  Network loop between physical networks with DVR

Status in neutron:
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1887148/+subscriptions

