← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1732456] [NEW] dhcp agent race condition between between network_create_end and port_delete_end RPC

 

Public bug reported:

A race condition exists in the DHCP agent between the network_create_end
and port_delete_end RPC callbacks.  If a DHCP agent is fetching network
info [1] at the exact same time as a port is being deleted [2] it is
possible that the port delete RPC arrives after the network info RPC
completed on the server but before the result is processed on the agent.
That leads to a condition where the agent has a port in its network
cache that has already been deleted.  That then has the potential to add
duplicate entries in the dnsmasq host file for one IP address (two
different MAC addresses).

Because this is a timing issue I am not able to create a
standalone/isolated test case to show this behavior.  In one of our QA
test labs this happens ~20% of the time when a VM is deleted at the
roughly the same time that a DHCP server is moved from one node to
another.

This is happening because there is no synchronization between the
port_delete_end and the network_create_end RPC event handlers.  Since
the port_delete_end RPC does not have any network_id information there
is no way to synchronize the two operations in the agent.  In our system
we have addressed this by changing the *_delete_end RPC notifications
coming from [3] to also include the network_id and then changing [2] to
acquire _net_lock(network_id) before processing the port_delete_end
notification.  This approach addressed all occurrences of the issue.   I
have included a copy of our patch for reference here [4].

This an example of the series of logs generated by dnsmasq when a
duplicate entry is added because of the stale port described above.
dnsmasq then refuses to serve the IP address because of the duplicate
and the VM never gets an IP address.

2017-11-06T10:24:57.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host 
2017-11-06T10:24:57.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/opts 
2017-11-06T10:24:58.000 compute-0 dnsmasq[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/addn_hosts - 13 addresses 
2017-11-06T10:24:58.000 compute-0 dnsmasq[57037]: err duplicate dhcp-host IP address 192.168.216.75 at line 13 of /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host 
2017-11-06T10:24:58.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host 
2017-11-06T10:24:58.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/opts


[1] neutron.agent.dhcp.agent.DhcpPluginApi.get_network_info
[2] neutron.agent.dhcp.agent.DhcpAgent.port_delete_end
[3] neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api.DhcpAgentNotifyAPI.notify
[4] http://paste.openstack.org/show/626381/

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1732456

Title:
  dhcp agent race condition between between network_create_end and
  port_delete_end RPC

Status in neutron:
  New

Bug description:
  A race condition exists in the DHCP agent between the
  network_create_end and port_delete_end RPC callbacks.  If a DHCP agent
  is fetching network info [1] at the exact same time as a port is being
  deleted [2] it is possible that the port delete RPC arrives after the
  network info RPC completed on the server but before the result is
  processed on the agent.  That leads to a condition where the agent has
  a port in its network cache that has already been deleted.  That then
  has the potential to add duplicate entries in the dnsmasq host file
  for one IP address (two different MAC addresses).

  Because this is a timing issue I am not able to create a
  standalone/isolated test case to show this behavior.  In one of our QA
  test labs this happens ~20% of the time when a VM is deleted at the
  roughly the same time that a DHCP server is moved from one node to
  another.

  This is happening because there is no synchronization between the
  port_delete_end and the network_create_end RPC event handlers.  Since
  the port_delete_end RPC does not have any network_id information there
  is no way to synchronize the two operations in the agent.  In our
  system we have addressed this by changing the *_delete_end RPC
  notifications coming from [3] to also include the network_id and then
  changing [2] to acquire _net_lock(network_id) before processing the
  port_delete_end notification.  This approach addressed all occurrences
  of the issue.   I have included a copy of our patch for reference here
  [4].

  This an example of the series of logs generated by dnsmasq when a
  duplicate entry is added because of the stale port described above.
  dnsmasq then refuses to serve the IP address because of the duplicate
  and the VM never gets an IP address.

  2017-11-06T10:24:57.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host 
  2017-11-06T10:24:57.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/opts 
  2017-11-06T10:24:58.000 compute-0 dnsmasq[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/addn_hosts - 13 addresses 
  2017-11-06T10:24:58.000 compute-0 dnsmasq[57037]: err duplicate dhcp-host IP address 192.168.216.75 at line 13 of /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host 
  2017-11-06T10:24:58.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host 
  2017-11-06T10:24:58.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/opts


  
  [1] neutron.agent.dhcp.agent.DhcpPluginApi.get_network_info
  [2] neutron.agent.dhcp.agent.DhcpAgent.port_delete_end
  [3] neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api.DhcpAgentNotifyAPI.notify
  [4] http://paste.openstack.org/show/626381/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1732456/+subscriptions


Follow ups