yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #72840
[Bug 1732456] Re: dhcp agent race condition between between network_create_end and port_delete_end RPC
Reviewed: https://review.openstack.org/524711
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fa78b580105111b1238e5b18d415b08e8fb35d97
Submitter: Zuul
Branch: master
commit fa78b580105111b1238e5b18d415b08e8fb35d97
Author: Allain Legacy <allain.legacy@xxxxxxxxxxxxx>
Date: Fri Nov 10 10:44:50 2017 -0600
dhcp: serializing port delete and network rpc calls
The port delete events are not synchronized with network rpc events. This
creates a condition which makes it possible for a port delete event to be
processed just before a previously started network query completes.
The problematic order of operations is as follows:
1) a network is scheduled to an agent; a network rpc is sent to the
agent
2) the agent queries the network data from the server
3) while that query is in progress a port on that network is deleted; a
port rpc is sent to the agent
4) that port delete rpc is received before the network query rpc
completes
5) the port delete results in no action because the port was not present
on the agent
6) the network query finishes and adds the port to the cache (even
though the port has already been deleted)
7) some time passes and a new port is configured with the same IP
address as the port that was deleted in (3)
8) the dhcp host file is corrupted with 2 entries for the same IP
address.
9) dhcp queries for the newest port is rejected because of the duplicate
entry in the dhcp host file.
The solution is to add the network_id to the port_delete_end rpc event
so that the _net_lock(network_id) synchronization point can be acquired
so that it is processed serially with other network related events.
To ensure backwards compatibility with newer agents running against older
servers the determination of which network_id value to use in the lock is
handled using a utility that will fallback to the previous mode of operation
whenever the network_id attribute is not present in the *_delete_end RPC
events. That utility can be removed in the future when it is guaranteed
that the network_id attribute will be present in RPC messages from the
server.
Closes-Bug: #1732456
Change-Id: I735f8b1c9248b12e5feb6cbe970cf67f321e6ebc
Signed-off-by: Allain Legacy <allain.legacy@xxxxxxxxxxxxx>
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1732456
Title:
dhcp agent race condition between between network_create_end and
port_delete_end RPC
Status in neutron:
Fix Released
Bug description:
A race condition exists in the DHCP agent between the
network_create_end and port_delete_end RPC callbacks. If a DHCP agent
is fetching network info [1] at the exact same time as a port is being
deleted [2] it is possible that the port delete RPC arrives after the
network info RPC completed on the server but before the result is
processed on the agent. That leads to a condition where the agent has
a port in its network cache that has already been deleted. That then
has the potential to add duplicate entries in the dnsmasq host file
for one IP address (two different MAC addresses).
Because this is a timing issue I am not able to create a
standalone/isolated test case to show this behavior. In one of our QA
test labs this happens ~20% of the time when a VM is deleted at the
roughly the same time that a DHCP server is moved from one node to
another.
This is happening because there is no synchronization between the
port_delete_end and the network_create_end RPC event handlers. Since
the port_delete_end RPC does not have any network_id information there
is no way to synchronize the two operations in the agent. In our
system we have addressed this by changing the *_delete_end RPC
notifications coming from [3] to also include the network_id and then
changing [2] to acquire _net_lock(network_id) before processing the
port_delete_end notification. This approach addressed all occurrences
of the issue. I have included a copy of our patch for reference here
[4].
This an example of the series of logs generated by dnsmasq when a
duplicate entry is added because of the stale port described above.
dnsmasq then refuses to serve the IP address because of the duplicate
and the VM never gets an IP address.
2017-11-06T10:24:57.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host
2017-11-06T10:24:57.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/opts
2017-11-06T10:24:58.000 compute-0 dnsmasq[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/addn_hosts - 13 addresses
2017-11-06T10:24:58.000 compute-0 dnsmasq[57037]: err duplicate dhcp-host IP address 192.168.216.75 at line 13 of /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host
2017-11-06T10:24:58.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/host
2017-11-06T10:24:58.000 compute-0 dnsmasq-dhcp[57037]: info read /var/run/neutron/dhcp/c3b9af9c-2e6b-42af-8fb7-d73b552fb24b/opts
[1] neutron.agent.dhcp.agent.DhcpPluginApi.get_network_info
[2] neutron.agent.dhcp.agent.DhcpAgent.port_delete_end
[3] neutron.api.rpc.agentnotifiers.dhcp_rpc_agent_api.DhcpAgentNotifyAPI.notify
[4] http://paste.openstack.org/show/626381/
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1732456/+subscriptions
References