
yahoo-eng-team team mailing list archive

[Bug 1853613] Re: VMs don't get ip from dhcp after compute restart

 

Reviewed:  https://review.opendev.org/697655
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=93e9dc5426764b791ac69e62c6d60be7591c16ab
Submitter: Zuul
Branch:    master

commit 93e9dc5426764b791ac69e62c6d60be7591c16ab
Author: Darragh O'Reilly <doreilly@xxxxxxxx>
Date:   Fri Dec 6 10:06:21 2019 +0000

    ovs agent: signal to plugin if tunnel refresh needed
    
    Currently the ovs agent calls update_device_list with the
    agent_restarted flag set only on the first loop iteration. Then the
    server knows to send the l2pop flooding entries for the network to
    the agent. But when a compute node with many instances on many
    networks reboots, it takes time to readd all the active devices and
    some may be readded after the first loop iteration. Then the server
    can fail to send the flooding entries which means there will be no
    flood_to_tuns flow and broadcasts like dhcp will fail.
    
    This patch fixes that by renaming the agent_restarted flag to
    refresh_tunnels and setting it if the agent has not received the
    flooding entries for the network.
    
    Change-Id: I607aa8fa399e72b037fd068ad4f02b6210e57e91
    Closes-Bug: #1853613
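
The change described in the commit message can be illustrated with a minimal sketch. This is a simplified model, not the neutron code: the class FloodingTracker, its methods, and the function agent_restarted are hypothetical names chosen for illustration. It assumes the agent can tell, per network, whether the flooding entries have arrived.

```python
# Simplified, hypothetical model of the agent-side flag; not neutron code.


class FloodingTracker:
    """Remembers which networks already have l2pop flooding entries."""

    def __init__(self):
        self._networks_with_flooding = set()

    def flooding_received(self, network_id):
        # Called when the server's flooding entries for a network arrive.
        self._networks_with_flooding.add(network_id)

    def refresh_tunnels(self, network_id):
        # New rule: ask for a refresh while the entries are still missing.
        return network_id not in self._networks_with_flooding


def agent_restarted(iteration):
    # Old rule: the flag is set only on the first loop iteration.
    return iteration == 0


tracker = FloodingTracker()
late_iteration = 2  # a device that is re-added after the first iteration

old_flag = agent_restarted(late_iteration)
new_flag_before = tracker.refresh_tunnels("net-a")
tracker.flooding_received("net-a")
new_flag_after = tracker.refresh_tunnels("net-a")
```

With the old rule a late device gets no flag, so the server has no reason to resend the entries. With the new rule the flag stays set until the entries for that network are actually received.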


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1853613

Title:
  VMs don't get ip from dhcp after compute restart

Status in neutron:
  Fix Released

Bug description:
  Env: Pike + OVS + VXLAN + l2pop + iptables_hybrid.
  DHCP agent on a different node than the compute.

  Steps:
  1. Boot 4 or more VMs on the same compute node and the same VXLAN network.
  2. Wait until they are fully running, then reboot the compute node.
  3. After the reboot the VMs are in status SHUTOFF. Start the VMs.

  VMs don't get an IP address from the neutron DHCP agent. The
  flood-to-tunnels flow (br-tun table 22) for the network is missing, so
  broadcasts such as DHCP requests are not sent over a tunnel to the
  node running the DHCP agent. The neutron server did not send the
  flooding entry to the agent. It only does that for the first or second
  active port, or if the agent is restarted.
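
The server-side condition described above can be sketched as follows. This is a simplified model based only on the wording of this report, not the code in mech_driver.py: the function name sends_flooding_entries and the first_ports threshold are assumptions made for illustration.

```python
# Hypothetical model of the server-side decision described in this report.


def sends_flooding_entries(agent_active_ports, agent_restarted, first_ports=2):
    """Return True if the server would send flooding entries to the agent.

    first_ports is an assumed threshold taken from the phrase
    "first or second active port"; the real check may differ.
    """
    return agent_active_ports <= first_ports or agent_restarted


# First active port on the network for this agent: entries are sent.
first_port = sends_flooding_entries(agent_active_ports=1, agent_restarted=False)

# After the reboot the four ports are still counted as ACTIVE and the
# restart flag is already false: entries are not sent.
after_reboot = sends_flooding_entries(agent_active_ports=4, agent_restarted=False)

# Same port count, but the agent signals a restart: entries are sent.
with_flag = sends_flooding_entries(agent_active_ports=4, agent_restarted=True)
```

In the failing case neither branch of the condition is true, which is why the table 22 flow is never installed.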

  After the compute boots, neutron-ovs-cleanup runs first and deletes
  the qvo ports from br-int [4]. Then the ovs-agent starts and nova-
  compute after it. Nova-compute destroys the domains and moves the vms
  to SHUTOFF status. It also (for some reason) recreates the qbr linux
  bridges and qvb/qvo veths connected to br-int. So neutron continues to
  see the ports as ACTIVE even though the vms are SHUTOFF, and
  agent_active_ports [1] never drops below 3. Also, nova-compute might
  start a short time after the ovs-agent, so the new ports are not
  detected in the first iteration of the ovs agent loop and
  agent_restarted will be false here [2].

  Before [3], agent_restarted was true if the agent had been running for
  less than agent_boot_time (default 180 sec), and the problem did not occur.
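
As a rough comparison, the time-based rule used before [3] can be sketched next to the later first-iteration rule. The function names and the example timings are hypothetical. Only agent_boot_time and its 180 second default come from this report.

```python
# Hypothetical comparison of the two ways of deciding agent_restarted.


def restarted_by_uptime(agent_uptime, agent_boot_time=180):
    # Rule before [3]: true while the agent is younger than agent_boot_time.
    return agent_uptime < agent_boot_time


def restarted_by_iteration(iteration):
    # Later rule: true only on the first iteration of the agent loop.
    return iteration == 0


# Assumed example: ports appear 30 seconds after the agent starts,
# during the agent's second loop iteration (index 1).
uptime_rule = restarted_by_uptime(agent_uptime=30)
iteration_rule = restarted_by_iteration(iteration=1)
```

Under these assumed timings the uptime rule still reports a restart while the iteration rule does not, which matches the observation that the problem did not show before [3].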

  The problem does not occur if neutron-ovs-cleanup is disabled. In that
  case the ovs agent first treats the ports as skipped_devices and they
  get status DOWN.

  [1] https://github.com/openstack/neutron/blob/21a52f7ae597f7992f32ff41cedff0c31e35c762/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L306
  [2] https://github.com/openstack/neutron/blob/21a52f7ae597f7992f32ff41cedff0c31e35c762/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L310 
  [3] https://opendev.org/openstack/neutron/commit/62fe7852bbd70a24174853997096c52ee015e269
  [4] https://bugs.launchpad.net/neutron/+bug/1853582

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1853613/+subscriptions

