yahoo-eng-team team mailing list archive

[Bug 1864711] [NEW] DHCP port rescheduling causes ports to grow, internal DNS to be broken

 

Public bug reported:

Suppose the number of DHCP servers per network is 2, and the number of
DHCP agents is greater than 2.

During a period of network instability, RabbitMQ issues, or even a DHCP
host temporarily going down, the DHCP port will get rescheduled.

Except it looks like the port is not so much rescheduled as replaced: a
brand new port with its own IP/MAC is created on a new host. The old
port is only updated and marked as reserved, not deleted.
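
To make the observed behaviour concrete, here is a minimal model of it in
Python. It is only a sketch of what the symptoms suggest, not Neutron's
actual scheduler code. The Port class, the new_port and reschedule
functions, the host names, and the IP numbering are all hypothetical, and
"reserved_dhcp_port" is assumed to be the device ID that marks a reserved
DHCP port.

```python
from dataclasses import dataclass
from itertools import count

RESERVED = "reserved_dhcp_port"  # assumed marker for a reserved DHCP port
_ids = count(1)                  # hypothetical port ID generator


@dataclass
class Port:
    port_id: int
    ip: str
    device_id: str  # agent host name, or RESERVED


def new_port(host):
    """Create a port with a fresh ID and IP (illustrative numbering only)."""
    n = next(_ids)
    return Port(port_id=n, ip=f"10.0.0.{n + 1}", device_id=host)


def reschedule(ports, dead_host, new_host):
    """Model the observed behaviour: reserve the old port, create a new one."""
    for port in ports:
        if port.device_id == dead_host:
            port.device_id = RESERVED   # old port is kept, not deleted
    ports.append(new_port(new_host))    # replacement gets a new IP
    return ports


# Two DHCP servers per network, then two hosts miss their heartbeats.
ports = [new_port("dhcp-host-1"), new_port("dhcp-host-2")]
reschedule(ports, "dhcp-host-1", "dhcp-host-3")
reschedule(ports, "dhcp-host-2", "dhcp-host-4")

active = [p for p in ports if p.device_id != RESERVED]
reserved = [p for p in ports if p.device_id == RESERVED]
print(len(ports), len(active), len(reserved))  # 4 2 2
```

In this model the network still has 2 active DHCP ports, but 4 ports in
total, and the 2 reserved ports hold the IPs that existing VMs were given
as DNS servers.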

This causes two issues:

1. The # of DHCP ports grows. Even when the old host starts heartbeating
again, its port is not deleted. For example, we had an environment with
3 DHCP servers per network and a dozen or so DHCP hosts. For some
networks, we observed 10+ DHCP ports allocated.

2. DNS is broken temporarily for VMs that still point to the old IPs.
/etc/resolv.conf can only store 3 servers, and either way, Linux's
5-second default DNS timeout means that losing the first server, or the
first two, adds a delay of 5+ or 10+ seconds respectively, which breaks
many other apps.
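
The delay arithmetic in point 2 can be sketched as follows, assuming the
resolver tries nameservers in the order listed and waits the full timeout
on each unresponsive one before moving on. The function name and the
addresses are made up for illustration; the 5 second timeout and the 3
server limit are the figures quoted above.

```python
MAX_NAMESERVERS = 3   # resolv.conf limit quoted above
TIMEOUT_SECONDS = 5   # default per-server timeout quoted above


def lookup_delay(nameservers, dead):
    """Seconds spent waiting on dead servers before a live one answers.

    Returns None if none of the usable servers is alive.
    """
    waited = 0
    for server in nameservers[:MAX_NAMESERVERS]:
        if server not in dead:
            return waited
        waited += TIMEOUT_SECONDS  # full timeout before trying the next one
    return None


servers = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]  # illustrative addresses
print(lookup_delay(servers, dead=set()))                     # 0
print(lookup_delay(servers, dead={"10.0.0.2"}))              # 5
print(lookup_delay(servers, dead={"10.0.0.2", "10.0.0.3"}))  # 10
```

Every lookup pays this penalty for as long as the VM keeps the stale
server list, which is why applications with short timeouts of their own
start failing.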


I'm not sure if this is a bug or by design. For example, if the same
IP/MAC were re-used, we could have a conflict on the data plane.
Neutron-server has no idea if DHCP/DNS services are actually down; it
just knows it's not receiving heartbeats over the control plane. Is that
why a new port is allocated, to mitigate the risk of conflict?

As for why the old ports aren't deleted or scaled down when connectivity
is restored, is this by design too?

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: dns l3-ipam-dhcp

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1864711

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1864711/+subscriptions