← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1913723] [NEW] [DHCP] Race condition during port processing events in DHCP agent

 

Public bug reported:

In the DHCP agent, network, subnet and port events are stored in a
"ResourceProcessingQueue". When
"neutron.agent.dhcp.agent._SYNC_STATE_LOCK" is free, the events are
processes according to (1) a priority table and (2) the timestamp.

Since [1], the Neutron server sends a high priority cast to one random
DHCP agent to speed up the port creation process. That means this event,
regardless of the timestamp, will be processed first. That improves the
VM creation process.

This bug exposes a problem detected in case that (1) only one single
DHCP agent is attending to those events and (2) the queue processing is
not fast enough to handle the events when arrive.

For example (seen in a customer deployment with thousands of ports assigned to the same DHCP agent):
- A port is created and processed. The port ID is stored in the DHCP agent cache.
- The port is deleted and the event arrives to the DHCP agent (*but* the event is not processed).
- Another port with the same IP address is created. The Neutron server will allow this process because the IP address is not used anymore. This port creation event arrives to the DHCP agent.
- The event queue is processed. At this point, this queue has both events: one deleting a port with IP address x.x.x.x and another event creating a port with the same IP address. Because the port creation event is processed before, a resync process occurs [2].

There is a very easy way to reproduce this error with a small trick: to add a time.sleep(5) at the beginning of [3]. When the port deletion event is received, the DHCP agent will wait this time. At the same time we can send a port creation event that will trigger the error reported. E.g.:
- Add the time.sleep(5) in the DHCP agent and restart it.
- Add a port with an IP. The DHCP agent will process this event correctly.
  $ openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`
- Delete the port and create another one with the same IP address. Because of the sleep, both events will be processed at the same time and a resync will be executed:
  $ openstack port delete $port; openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`

Log: http://paste.openstack.org/show/802126/

[1]https://review.opendev.org/c/openstack/neutron/+/626830
[2]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L664-L679
[3]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L682

** Affects: neutron
     Importance: Medium
     Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez)
         Status: New

** Changed in: neutron
   Importance: Undecided => Medium

** Changed in: neutron
     Assignee: (unassigned) => Rodolfo Alonso (rodolfo-alonso-hernandez)

** Description changed:

  In the DHCP agent, network, subnet and port events are stored in a
  "ResourceProcessingQueue". When
  "neutron.agent.dhcp.agent._SYNC_STATE_LOCK" is free, the events are
  processes according to (1) a priority table and (2) the timestamp.
  
  Since [1], the Neutron server sends a high priority cast to one random
  DHCP agent to speed up the port creation process. That means this event,
  regardless of the timestamp, will be processed first. That improves the
  VM creation process.
  
  This bug exposes a problem detected in case that (1) only one single
  DHCP agent is attending to those events and (2) the queue processing is
  not fast enough to handle the events when arrive.
  
  For example (seen in a customer deployment with thousands of ports assigned to the same DHCP agent):
  - A port is created and processed. The port ID is stored in the DHCP agent cache.
  - The port is deleted and the event arrives to the DHCP agent (*but* the event is not processed).
  - Another port with the same IP address is created. The Neutron server will allow this process because the IP address is not used anymore. This port creation event arrives to the DHCP agent.
  - The event queue is processed. At this point, this queue has both events: one deleting a port with IP address x.x.x.x and another event creating a port with the same IP address. Because the port creation event is processed before, a resync process occurs [2].
  
- There is a very easy way to reproduce this error with an small trick: to add a time.sleep(5) at the beginning of [3]. When the port deletion event is received, the DHCP agent will wait this time. At the same time we can send a port creation event that will trigger the error reported. E.g.:
+ There is a very easy way to reproduce this error with a small trick: to add a time.sleep(5) at the beginning of [3]. When the port deletion event is received, the DHCP agent will wait this time. At the same time we can send a port creation event that will trigger the error reported. E.g.:
  - Add the time.sleep(5) in the DHCP agent and restart it.
  - Add a port with an IP. The DHCP agent will process this event correctly.
-   $ openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`
+   $ openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`
  - Delete the port and create another one with the same IP address. Because of the sleep, both events will be processed at the same time and a resync will be executed:
-   $ openstack port delete $port; openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`
+   $ openstack port delete $port; openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`
  
  Log: http://paste.openstack.org/show/802126/
- 
  
  [1]https://review.opendev.org/c/openstack/neutron/+/626830
  [2]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L664-L679
  [3]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L682

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1913723

Title:
  [DHCP] Race condition during port processing events in DHCP agent

Status in neutron:
  New

Bug description:
  In the DHCP agent, network, subnet and port events are stored in a
  "ResourceProcessingQueue". When
  "neutron.agent.dhcp.agent._SYNC_STATE_LOCK" is free, the events are
  processes according to (1) a priority table and (2) the timestamp.

  Since [1], the Neutron server sends a high priority cast to one random
  DHCP agent to speed up the port creation process. That means this
  event, regardless of the timestamp, will be processed first. That
  improves the VM creation process.

  This bug exposes a problem detected in case that (1) only one single
  DHCP agent is attending to those events and (2) the queue processing
  is not fast enough to handle the events when arrive.

  For example (seen in a customer deployment with thousands of ports assigned to the same DHCP agent):
  - A port is created and processed. The port ID is stored in the DHCP agent cache.
  - The port is deleted and the event arrives to the DHCP agent (*but* the event is not processed).
  - Another port with the same IP address is created. The Neutron server will allow this process because the IP address is not used anymore. This port creation event arrives to the DHCP agent.
  - The event queue is processed. At this point, this queue has both events: one deleting a port with IP address x.x.x.x and another event creating a port with the same IP address. Because the port creation event is processed before, a resync process occurs [2].

  There is a very easy way to reproduce this error with a small trick: to add a time.sleep(5) at the beginning of [3]. When the port deletion event is received, the DHCP agent will wait this time. At the same time we can send a port creation event that will trigger the error reported. E.g.:
  - Add the time.sleep(5) in the DHCP agent and restart it.
  - Add a port with an IP. The DHCP agent will process this event correctly.
    $ openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`
  - Delete the port and create another one with the same IP address. Because of the sleep, both events will be processed at the same time and a resync will be executed:
    $ openstack port delete $port; openstack port create --fixed-ip ip-address=10.10.0.20 --network private port10; port=`openstack port show -c id -f value port10`

  Log: http://paste.openstack.org/show/802126/

  [1]https://review.opendev.org/c/openstack/neutron/+/626830
  [2]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L664-L679
  [3]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L682

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1913723/+subscriptions


Follow ups