← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1288493] [NEW] race betwen ip allocations and dhcp agent notifications

 

Public bug reported:

I came across a race condition while doing some performance work. The
symptom is messages in syslog like this:

Mar  5 17:26:06 ctrl dnsmasq-dhcp[32356]: duplicate IP address 10.0.0.18
in .

Sure enough, there were duplicate entries in dnsmasq's host file for the
same IP address. In this example, the MACs fa:16:3e:6d:15:03 and
fa:16:3e:06:82:26 were both assigned 10.0.0.18.

I instrumented neutron-dhcp-agent to check the ports it emits into the
host file for duplicate IP addresses:

2014-03-05 17:26:02.519 32309 ERROR neutron.agent.dhcp_agent [req-
8f5d9c5d-4c20-4c75-95f5-f5f27e971a87 6e8f4d378eb147f8b9f055c13c113f4e
68153eceb4c64a049dea2de621858662] duplicates for ip 10.0.0.18:
[(u'56f01383-92e5-43ff-948a-8549ce247b6c', u'fa:16:3e:6d:15:03'),
(u'086b514f-c843-4f25-bad8-24f65c4932fa', u'fa:16:3e:06:82:26')]

I took a look in neutron-server.log for these port IDs and MAC
addresses:

2014-03-05 17:25:52.623 21630 DEBUG neutron.db.db_base_plugin_v2 [req-6ff63c51-d3fc-4a59-bdc8-1f90f92a16f4 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Generated mac for network 11df69c5-89eb-4ba6-b99e-69087237f4dc is fa:16:3e:6d:15:03 _generate_mac /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:302
2014-03-05 17:25:52.629 21630 DEBUG neutron.db.db_base_plugin_v2 [req-6ff63c51-d3fc-4a59-bdc8-1f90f92a16f4 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Allocated IP - 10.0.0.18 from 10.0.0.18 to 10.0.255.254 _generate_ip /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:472
2014-03-05 17:26:02.062 21718 DEBUG neutron.db.db_base_plugin_v2 [req-f8a8fdf7-0f29-451e-b6dc-b174ead51b3a 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Generated mac for network 11df69c5-89eb-4ba6-b99e-69087237f4dc is fa:16:3e:06:82:26 _generate_mac /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:302
2014-03-05 17:26:02.071 21681 DEBUG neutron.db.db_base_plugin_v2 [req-5bdd9f64-2ef8-4733-8fe1-1ae05e69c34a 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Delete allocated IP 10.0.0.18 (11df69c5-89eb-4ba6-b99e-69087237f4dc/887baebc-9208-42e8-8243-9b8cea38d176) _delete_ip_allocation /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:444
2014-03-05 17:26:02.082 21718 DEBUG neutron.db.db_base_plugin_v2 [req-f8a8fdf7-0f29-451e-b6dc-b174ead51b3a 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Allocated IP - 10.0.0.18 from 10.0.0.18 to 10.0.0.18 _generate_ip /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:472
2014-03-05 17:26:17.267 21676 DEBUG neutron.db.db_base_plugin_v2 [req-7225f965-5e1f-4123-a28c-a688b3c40cce 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Delete allocated IP 10.0.0.18 (11df69c5-89eb-4ba6-b99e-69087237f4dc/887baebc-9208-42e8-8243-9b8cea38d176) _delete_ip_allocation /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:444

Correctly, 10.0.0.18 was allocated, then deleted, then allocated, then
deleted.

So why did neutron-dhcp-agent think 10.0.0.18 was allocated to two
ports? The answer, I believe, is a race between the port allocation &
de-allocation notifications that neutron-server sends to neutron-dhcp-
agent and how neutron-dhcp-agent incrementally updates its cache of
network state. The following schedule of events illustrates the problem:

1. neutron-server: allocates 10.0.0.18 for MAC 1
2. neutron-server: emits event "port_create_end for (10.0.0.18, MAC1)"
3. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC1)" and adds (10.0.0.18, MAC1) to host file
4. neutron-server: deallocates 10.0.0.18 for MAC1
5. neutron-server: allocates 10.0.0.18 for MAC2
6. neutron-server: emits event "port_create_end for (10.0.0.18, MAC2)"
7. neutron-server: emits event "port_delete_end for (10.0.0.18, MAC1)"
8. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC2)" and adds (10.0.0.18, MAC2) to host file
... uh oh, 10.0.0.18 is in the host file twice ...
9. neutron-dhcp-agent: receives event "port_delete_end for (10.0.0.18, MAC1)" and removes (10.0.0.18, MAC1) from the host file
...

The key thing to observe is that 8 happens before 9 (i.e., neutron-dhcp-
agent finds out about the new port before it finds out about the old
port being deleted). The database transaction semantics only ensured
that (1, 4, 5) happened in order. There's no guarantee about the
relative order of event emission (6 and 7) with respect to transaction
commits because the events are emitted *after* the database transactions
are committed.

The consequence of these duplicate host entries is that dnsmasq doesn't
give MAC2 its IP address until after MAC1 is removed from the file.
Given an inopportune timing for both MAC addresses to appear in the file
and typical dhclient retry intervals of 30s, this race can delay the VM
with MAC2 from getting its IP address for a substantial amount of time.

To fix the bug, I think neutron-dhcp-agent should be changed to
accommodate the racy notifications. Whenever neutron-dhcp-agent is
notified that an (IP=X, MAC=Y) pair is added to a network, neutron-dhcp-
agent proactively drop existing ports from its cache and host entries
with the ip address X instead of waiting for the pending notification.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1288493

Title:
  race betwen ip allocations and dhcp agent notifications

Status in OpenStack Neutron (virtual network service):
  New

Bug description:
  I came across a race condition while doing some performance work. The
  symptom is messages in syslog like this:

  Mar  5 17:26:06 ctrl dnsmasq-dhcp[32356]: duplicate IP address
  10.0.0.18 in .

  Sure enough, there were duplicate entries in dnsmasq's host file for
  the same IP address. In this example, the MACs fa:16:3e:6d:15:03 and
  fa:16:3e:06:82:26 were both assigned 10.0.0.18.

  I instrumented neutron-dhcp-agent to check the ports it emits into the
  host file for duplicate IP addresses:

  2014-03-05 17:26:02.519 32309 ERROR neutron.agent.dhcp_agent [req-
  8f5d9c5d-4c20-4c75-95f5-f5f27e971a87 6e8f4d378eb147f8b9f055c13c113f4e
  68153eceb4c64a049dea2de621858662] duplicates for ip 10.0.0.18:
  [(u'56f01383-92e5-43ff-948a-8549ce247b6c', u'fa:16:3e:6d:15:03'),
  (u'086b514f-c843-4f25-bad8-24f65c4932fa', u'fa:16:3e:06:82:26')]

  I took a look in neutron-server.log for these port IDs and MAC
  addresses:

  2014-03-05 17:25:52.623 21630 DEBUG neutron.db.db_base_plugin_v2 [req-6ff63c51-d3fc-4a59-bdc8-1f90f92a16f4 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Generated mac for network 11df69c5-89eb-4ba6-b99e-69087237f4dc is fa:16:3e:6d:15:03 _generate_mac /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:302
  2014-03-05 17:25:52.629 21630 DEBUG neutron.db.db_base_plugin_v2 [req-6ff63c51-d3fc-4a59-bdc8-1f90f92a16f4 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Allocated IP - 10.0.0.18 from 10.0.0.18 to 10.0.255.254 _generate_ip /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:472
  2014-03-05 17:26:02.062 21718 DEBUG neutron.db.db_base_plugin_v2 [req-f8a8fdf7-0f29-451e-b6dc-b174ead51b3a 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Generated mac for network 11df69c5-89eb-4ba6-b99e-69087237f4dc is fa:16:3e:06:82:26 _generate_mac /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:302
  2014-03-05 17:26:02.071 21681 DEBUG neutron.db.db_base_plugin_v2 [req-5bdd9f64-2ef8-4733-8fe1-1ae05e69c34a 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Delete allocated IP 10.0.0.18 (11df69c5-89eb-4ba6-b99e-69087237f4dc/887baebc-9208-42e8-8243-9b8cea38d176) _delete_ip_allocation /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:444
  2014-03-05 17:26:02.082 21718 DEBUG neutron.db.db_base_plugin_v2 [req-f8a8fdf7-0f29-451e-b6dc-b174ead51b3a 6e8f4d378eb147f8b9f055c13c113f4e 68153eceb4c64a049dea2de621858662] Allocated IP - 10.0.0.18 from 10.0.0.18 to 10.0.0.18 _generate_ip /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:472
  2014-03-05 17:26:17.267 21676 DEBUG neutron.db.db_base_plugin_v2 [req-7225f965-5e1f-4123-a28c-a688b3c40cce 7132e53040704655bda2dd1c355a60d9 1e38af7b2793478aba3a8a462d82ff84] Delete allocated IP 10.0.0.18 (11df69c5-89eb-4ba6-b99e-69087237f4dc/887baebc-9208-42e8-8243-9b8cea38d176) _delete_ip_allocation /usr/lib/python2.7/dist-packages/neutron/db/db_base_plugin_v2.py:444

  Correctly, 10.0.0.18 was allocated, then deleted, then allocated, then
  deleted.

  So why did neutron-dhcp-agent think 10.0.0.18 was allocated to two
  ports? The answer, I believe, is a race between the port allocation &
  de-allocation notifications that neutron-server sends to neutron-dhcp-
  agent and how neutron-dhcp-agent incrementally updates its cache of
  network state. The following schedule of events illustrates the
  problem:

  1. neutron-server: allocates 10.0.0.18 for MAC 1
  2. neutron-server: emits event "port_create_end for (10.0.0.18, MAC1)"
  3. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC1)" and adds (10.0.0.18, MAC1) to host file
  4. neutron-server: deallocates 10.0.0.18 for MAC1
  5. neutron-server: allocates 10.0.0.18 for MAC2
  6. neutron-server: emits event "port_create_end for (10.0.0.18, MAC2)"
  7. neutron-server: emits event "port_delete_end for (10.0.0.18, MAC1)"
  8. neutron-dhcp-agent: receives event "port_create_end for (10.0.0.18, MAC2)" and adds (10.0.0.18, MAC2) to host file
  ... uh oh, 10.0.0.18 is in the host file twice ...
  9. neutron-dhcp-agent: receives event "port_delete_end for (10.0.0.18, MAC1)" and removes (10.0.0.18, MAC1) from the host file
  ...

  The key thing to observe is that 8 happens before 9 (i.e., neutron-
  dhcp-agent finds out about the new port before it finds out about the
  old port being deleted). The database transaction semantics only
  ensured that (1, 4, 5) happened in order. There's no guarantee about
  the relative order of event emission (6 and 7) with respect to
  transaction commits because the events are emitted *after* the
  database transactions are committed.

  The consequence of these duplicate host entries is that dnsmasq
  doesn't give MAC2 its IP address until after MAC1 is removed from the
  file. Given an inopportune timing for both MAC addresses to appear in
  the file and typical dhclient retry intervals of 30s, this race can
  delay the VM with MAC2 from getting its IP address for a substantial
  amount of time.

  To fix the bug, I think neutron-dhcp-agent should be changed to
  accommodate the racy notifications. Whenever neutron-dhcp-agent is
  notified that an (IP=X, MAC=Y) pair is added to a network, neutron-
  dhcp-agent proactively drop existing ports from its cache and host
  entries with the ip address X instead of waiting for the pending
  notification.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1288493/+subscriptions


Follow ups

References