yahoo-eng-team team mailing list archive

[Bug 1852504] [NEW] DHCP reserved ports that were unscheduled are advertised as DNS servers

 

Public bug reported:

We have 2 DHCP servers per network. After network outages, when hosts
come back online, the number of ACTIVE DHCP servers grows. This happened
again after more outages, with some networks ending up with 9-10 or more
DHCP ports, many in ACTIVE state, even though neutron-server's
neutron.conf only sets dhcp_agents_per_network = 2.

It turns out these ports have their device_id set to
"reserved_dhcp_port".

As you can see here:
https://github.com/openstack/neutron/blob/master/neutron/db/agentschedulers_db.py#L399

When a network is rescheduled to a new DHCP agent, the old port is not
deleted, nor is its status marked as DOWN. It is only marked as reserved
and the port is updated.

However, VMs on the network are now advertised every DHCP port on the
network as an internal DNS server, which in our case leaves several
stale entries in /etc/resolv.conf. The problem is that some of these
DHCP agents have been unscheduled, so the DNS servers don't actually
exist. In addition, the VMs do not query anything beyond the first 3
entries.

Here is resolv.conf on a VM:

[root@arjunpmk-master ~]# vim /etc/resolv.conf

# Generated by NetworkManager
search mpt1.pf9.io
nameserver 10.128.144.16
nameserver 10.128.144.23
nameserver 10.128.144.15
# NOTE: the libc resolver may not support more than 3 nameservers.
# The nameservers listed below may not be recognized.
nameserver 10.128.144.7
nameserver 10.128.144.4
nameserver 10.128.144.8
nameserver 10.128.144.9
nameserver 10.128.144.17
nameserver 10.128.144.12
nameserver 10.128.144.45
nameserver 10.128.144.46
nameserver 10.128.144.51
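
To make the 3-entry limit concrete, here is a minimal sketch that splits the nameserver lines of a resolv.conf into the ones the resolver would query and the ones it would ignore. It assumes the glibc limit of 3 nameservers (MAXNS) that the NetworkManager note above refers to; the function name split_nameservers and the shortened sample file are illustrative only.

```python
MAXNS = 3  # assumed glibc limit on nameservers read from resolv.conf


def split_nameservers(resolv_conf_text, limit=MAXNS):
    """Return (queried, ignored) nameserver lists, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only "nameserver <address>" lines count; comments and
        # "search" lines are skipped.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers[:limit], servers[limit:]


# Shortened copy of the resolv.conf shown above.
sample = """\
# Generated by NetworkManager
search mpt1.pf9.io
nameserver 10.128.144.16
nameserver 10.128.144.23
nameserver 10.128.144.15
nameserver 10.128.144.7
nameserver 10.128.144.4
"""

queried, ignored = split_nameservers(sample)
print("queried:", queried)
print("ignored:", ignored)
```

With the file above, only 10.128.144.16, 10.128.144.23 and 10.128.144.15 would ever be tried, so whether those three are live decides whether DNS works at all.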


Here you can see all the DHCP ports for the network of this VM:

[root@df-us-mpt1-kvm arjun(admin)]# openstack port list --network ead88ed3-f1e0-4498-8c1e-6d091083ae33 --device-owner network:dhcp
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
| ID                                   | Name | MAC Address       | Fixed IP Addresses                                                           | Status |
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
| 02ff0f4c-f39d-4207-90b4-2a69585f4c8a |      | fa:16:3e:a9:36:82 | ip_address='10.128.144.16', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| 0b612f86-ad06-4bce-a333-bc18f3e9e7b1 |      | fa:16:3e:bb:d8:3d | ip_address='10.128.144.23', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN   |
| 402338ac-2ca6-4312-a2df-a306fc589f10 |      | fa:16:3e:a3:a8:57 | ip_address='10.128.144.15', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| 5d2edc73-4eff-44c0-8993-125636973384 |      | fa:16:3e:6c:cd:2b | ip_address='10.128.144.7', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6'  | ACTIVE |
| 78241da3-9674-479a-8b45-a580c7f8b117 |      | fa:16:3e:d0:9d:ef | ip_address='10.128.144.4', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6'  | ACTIVE |
| 7b41bf47-d4d4-434a-b704-4c67182ffcaa |      | fa:16:3e:4c:cf:54 | ip_address='10.128.144.8', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6'  | ACTIVE |
| 96897190-1aa8-4c17-a7d1-c3744f1bf962 |      | fa:16:3e:e8:55:29 | ip_address='10.128.144.45', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| af87dde6-fb46-4516-9569-e46496398b64 |      | fa:16:3e:0e:61:14 | ip_address='10.128.144.9', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6'  | ACTIVE |
| c2a2112d-c6ef-4411-a415-1a453d74a838 |      | fa:16:3e:d0:39:67 | ip_address='10.128.144.46', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN   |
| c8298fbd-06e7-4488-a3e1-874e9341d4cf |      | fa:16:3e:d6:3c:ac | ip_address='10.128.144.51', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN   |
| d6f0206f-ae3c-4ebf-95cb-104dad786724 |      | fa:16:3e:ab:ab:22 | ip_address='10.128.144.17', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | ACTIVE |
| e2be0f98-3333-4645-b58a-435e5513a4d3 |      | fa:16:3e:b4:ba:c0 | ip_address='10.128.144.12', subnet_id='9757ae4a-ccfb-49b0-a9cc-53b8664631a6' | DOWN   |
+--------------------------------------+------+-------------------+------------------------------------------------------------------------------+--------+
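
The listing above does not show device_id, which is the field that distinguishes a reserved port from a scheduled one. The following is a minimal sketch of separating the two, assuming port records shaped like Neutron API port dicts (device_id, status, fixed_ips). The helper name split_dhcp_ports and the non-reserved device_id value are hypothetical; only the "reserved_dhcp_port" value and the three addresses come from this report.

```python
RESERVED_DEVICE_ID = "reserved_dhcp_port"  # value reported in the device_id


def split_dhcp_ports(ports):
    """Return (scheduled_ips, reserved_ips) for a list of DHCP port dicts."""
    scheduled, reserved = [], []
    for port in ports:
        ips = [ip["ip_address"] for ip in port.get("fixed_ips", [])]
        if port.get("device_id") == RESERVED_DEVICE_ID:
            reserved.extend(ips)
        else:
            scheduled.extend(ips)
    return scheduled, reserved


# Three of the ports from the table. The report states that .16 and .23
# are reserved and .15 is valid; the device_id of the valid port is a
# made-up placeholder.
ports = [
    {"device_id": "reserved_dhcp_port", "status": "ACTIVE",
     "fixed_ips": [{"ip_address": "10.128.144.16"}]},
    {"device_id": "reserved_dhcp_port", "status": "DOWN",
     "fixed_ips": [{"ip_address": "10.128.144.23"}]},
    {"device_id": "dhcp-example-agent-owned", "status": "ACTIVE",
     "fixed_ips": [{"ip_address": "10.128.144.15"}]},
]

scheduled, reserved = split_dhcp_ports(ports)
print("scheduled:", scheduled)
print("reserved:", reserved)
```

Note that status alone is not enough to tell them apart here: a reserved port can still be ACTIVE.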


If I look up the first DNS server in the VM's resolv.conf (10.128.144.16), its status is ACTIVE but it's actually a reserved port. The 2nd nameserver entry is also a reserved port. Luckily the 3rd entry is valid, but every DNS lookup still takes 10 seconds because the first two entries have to time out first. VMs on other networks aren't so lucky: there, all 3 nameservers are reserved.
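
The 10 second figure is consistent with a resolver that tries the nameservers in order and waits 5 seconds on each dead one. Below is a small sketch of that arithmetic, assuming the default resolv.conf timeout of 5 seconds, no "rotate" option, and the 3-nameserver limit; the function name is hypothetical.

```python
def delay_before_first_answer(nameservers, dead, timeout=5, limit=3):
    """Seconds spent waiting on dead servers before a live one is tried.

    Returns None if none of the first `limit` servers is live.
    """
    waited = 0
    for server in nameservers[:limit]:
        if server not in dead:
            return waited
        waited += timeout
    return None


nameservers = ["10.128.144.16", "10.128.144.23", "10.128.144.15"]

# This VM: the first two entries are reserved ports, the third is valid.
delay = delay_before_first_answer(
    nameservers, dead={"10.128.144.16", "10.128.144.23"})
print(delay)

# The unlucky case: all three usable entries are reserved ports.
no_answer = delay_before_first_answer(nameservers, dead=set(nameservers))
print(no_answer)
```

Under these assumptions the first case waits 2 x 5 = 10 seconds per lookup, and the second case never reaches a working server.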


Expectation: Only DHCP ports that are actually scheduled (not reserved) should be advertised as DNS nameservers. I don't know whether this means marking the port as DOWN or deleting the port when it is unscheduled.

Maybe the status also needs to be updated here?
https://github.com/openstack/neutron/blob/master/neutron/db/agentschedulers_db.py#L417
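
To illustrate the first option (marking the port DOWN when it is reserved), here is a rough sketch on plain dicts. This is not Neutron's code: reserve_dhcp_port and advertised_dns_servers are hypothetical helpers, and it assumes that whatever builds the DNS server list could skip ports that are reserved or not ACTIVE, which this report does not confirm.

```python
import copy

RESERVED_DEVICE_ID = "reserved_dhcp_port"


def reserve_dhcp_port(port, mark_down=True):
    """Return a copy of `port` marked as reserved, optionally set DOWN."""
    updated = copy.deepcopy(port)
    updated["device_id"] = RESERVED_DEVICE_ID
    if mark_down:
        # The extra step suggested in the report.
        updated["status"] = "DOWN"
    return updated


def advertised_dns_servers(ports):
    """IPs of ports that are ACTIVE and not reserved (assumed rule)."""
    return [
        ip["ip_address"]
        for port in ports
        if port["device_id"] != RESERVED_DEVICE_ID
        and port["status"] == "ACTIVE"
        for ip in port["fixed_ips"]
    ]


# Placeholder device_id values; only the addresses come from the report.
old_port = {"device_id": "dhcp-example-old-agent", "status": "ACTIVE",
            "fixed_ips": [{"ip_address": "10.128.144.16"}]}
new_port = {"device_id": "dhcp-example-new-agent", "status": "ACTIVE",
            "fixed_ips": [{"ip_address": "10.128.144.15"}]}

released = reserve_dhcp_port(old_port)
servers = advertised_dns_servers([released, new_port])
print(released["status"], servers)
```

Filtering on both device_id and status means the stale address is dropped even if only one of the two fields is updated when the port is unscheduled.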

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: dns

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1852504

Title:
  DHCP reserved ports that were unscheduled are advertised as DNS
  servers

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1852504/+subscriptions