yahoo-eng-team team mailing list archive

[Bug 1651672] [NEW] dhcp_ready_on_ports causing race with Neutron OVS agent boot

 

Public bug reported:

When neutron-openvswitch-agent starts, it indirectly causes all of its
ports to transition into the BUILD state. If DHCP is enabled, a DHCP
agent receives a port.update.end notification and refreshes its
configuration. After this, a dhcp_ready_on_ports RPC call is made.

At this stage there are no provisioning blocks: none have been created,
because no new ports were actually created. However, a
PROVISIONING_COMPLETE event is still emitted, which causes the ports to
transition into the ACTIVE state. If l2pop is enabled, fdb entries are
sent at this stage.
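
The behaviour described above can be illustrated with a toy model. This
is only a sketch: the class and method names (ProvisioningTracker,
add_block, provisioning_complete) are hypothetical and are not Neutron's
actual code. It assumes that a completion report from any entity emits
PROVISIONING_COMPLETE whenever no blocks remain for the port, including
the case where no block was ever created.

```python
# Toy model only: hypothetical names, not Neutron's real implementation.
from collections import defaultdict


class ProvisioningTracker:
    def __init__(self):
        self.blocks = defaultdict(set)  # port_id -> entities still blocking
        self.status = {}                # port_id -> "BUILD" or "ACTIVE"
        self.events = []                # emitted (event, port_id) tuples

    def add_block(self, port_id, entity):
        self.blocks[port_id].add(entity)

    def provisioning_complete(self, port_id, entity):
        # Removing a block that was never added is a silent no-op.
        self.blocks[port_id].discard(entity)
        if self.blocks[port_id]:
            return  # some other entity is still provisioning
        # No blocks left (or none ever existed): the port goes ACTIVE.
        self.events.append(("PROVISIONING_COMPLETE", port_id))
        self.status[port_id] = "ACTIVE"


tracker = ProvisioningTracker()
tracker.status["p1"] = "BUILD"  # existing port, no blocks created on agent restart
tracker.status["p2"] = "BUILD"
tracker.add_block("p2", "L2")   # for contrast: a port that does have an L2 block

tracker.provisioning_complete("p1", "DHCP")  # stands in for dhcp_ready_on_ports
tracker.provisioning_complete("p2", "DHCP")

print(tracker.status)
```

In this model p1 becomes ACTIVE on the DHCP report alone, while p2 stays
in BUILD because an L2 block is still outstanding.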

The problem: with a large number of ports, the OVS agent is most likely
still processing and allocating local VLANs. This causes some (or all)
of the fdb entries to be discarded, because the local VLANs they need do
not exist yet. When the OVS agent reaches the point where it uses the
update_device_list RPC call to transition ports into ACTIVE, they are
already in that state, so no fdb entries are emitted.
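
The ordering problem can be reproduced in a small simulation. This is a
simplified sketch with hypothetical names (ToyOvsAgent, ToyServer,
set_active); it assumes one fdb entry is sent to the booting agent each
time a port first becomes ACTIVE, which is a simplification of what
l2pop really does.

```python
# Toy simulation of the race: hypothetical names, not Neutron code.
class ToyOvsAgent:
    def __init__(self):
        self.local_vlans = {}  # network_id -> local VLAN id
        self.applied = []      # fdb entries that were installed
        self.discarded = []    # fdb entries dropped for lack of a local VLAN

    def allocate_local_vlan(self, network_id):
        self.local_vlans[network_id] = len(self.local_vlans) + 1

    def fdb_add(self, network_id, entry):
        if network_id not in self.local_vlans:
            self.discarded.append(entry)  # no local VLAN yet: entry is lost
            return
        self.applied.append(entry)


class ToyServer:
    def __init__(self, agent, port_networks):
        self.agent = agent
        self.port_networks = port_networks  # port_id -> network_id
        self.status = {p: "BUILD" for p in port_networks}

    def set_active(self, port_id):
        """Return True only if the port actually changed state."""
        if self.status[port_id] == "ACTIVE":
            return False  # already ACTIVE: nothing is re-sent
        self.status[port_id] = "ACTIVE"
        self.agent.fdb_add(self.port_networks[port_id], f"fdb:{port_id}")
        return True


agent = ToyOvsAgent()
server = ToyServer(agent, {"p1": "net-a", "p2": "net-b"})

agent.allocate_local_vlan("net-a")  # agent has only processed net-a so far
server.set_active("p1")             # DHCP path: VLAN exists, entry applied
server.set_active("p2")             # DHCP path: no VLAN yet, entry discarded

agent.allocate_local_vlan("net-b")  # agent catches up
resent = server.set_active("p2")    # update_device_list path: no-op

print(agent.applied, agent.discarded, resent)
```

The entry for p2 is never installed: it was discarded when it arrived
early, and the later transition is a no-op because the port is already
ACTIVE.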

Version: observed in Newton (neutron 9.0.0)

Pre-conditions:
  - standalone network node with the L3 agent in legacy mode
  - DHCP agent running on another node
  - ovsdb_interface in vsctl mode (due to performance issues with IDL)

To reproduce:
  - have an L3 node with a large number of ports (we had about 1000)
  - have a DHCP agent running on some other node
  - cold boot the L3 node (no ports in br-int, no existing flows in br-tun) and start the OVS agent and the L3 agent at the same time
  - observe fdb entries arriving before the ports are actually provisioned

Expected behaviour: the DHCP agent should not cause these ports to
transition into ACTIVE. fdb entries should be emitted only when the OVS
agent issues the update_device_list call.
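
The expected behaviour can be stated as a small sketch. The names here
(GatedPort, report_ready, the entity constants) are hypothetical and
this is not a proposed patch; it only shows the intended ordering, where
a readiness report from the DHCP side alone never moves a port to
ACTIVE.

```python
# Toy sketch of the expected behaviour: hypothetical names, not Neutron code.
L2_ENTITY = "L2"
DHCP_ENTITY = "DHCP"


class GatedPort:
    def __init__(self, port_id):
        self.port_id = port_id
        self.status = "BUILD"
        self.fdb_emitted = 0  # how many times fdb entries were sent

    def report_ready(self, entity):
        """Return True if this report moved the port to ACTIVE."""
        if entity != L2_ENTITY or self.status == "ACTIVE":
            return False  # only the L2 agent may activate the port
        self.status = "ACTIVE"
        self.fdb_emitted += 1  # fdb entries go out exactly once, at this point
        return True


port = GatedPort("p1")
dhcp_result = port.report_ready(DHCP_ENTITY)  # dhcp_ready_on_ports: ignored
status_after_dhcp = port.status
l2_result = port.report_ready(L2_ENTITY)      # update_device_list: activates

print(status_after_dhcp, port.status, port.fdb_emitted)
```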

Impact: if a network node is rebooted (due to hardware failure or some
other reason), the node is left in an inconsistent state after the
reboot. A random number of fdb entries are missing, causing disruption
to user traffic.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1651672
