← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1913723] Re: [DHCP] Race condition during port processing events in DHCP agent

 

** Changed in: neutron
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1913723

Title:
  [DHCP] Race condition during port processing events in DHCP agent

Status in neutron:
  Fix Released

Bug description:
  In the DHCP agent, network, subnet and port events are stored in a
  "ResourceProcessingQueue". When
  "neutron.agent.dhcp.agent._SYNC_STATE_LOCK" is free, the events are
  processes according to (1) a priority table and (2) the timestamp.

  Since [1], the Neutron server sends a high priority cast to one random
  DHCP agent to speed up the port creation process. That means this
  event, regardless of the timestamp, will be processed first. That
  improves the VM creation process.

  This bug exposes a problem detected in case that (1) only one single
  DHCP agent is attending to those events and (2) the queue processing
  is not fast enough to handle the events when arrive.

  For example (seen in a customer deployment with thousands of ports assigned to the same DHCP agent):
  - A port is created and processed. The port ID is stored in the DHCP agent cache.
  - The port is deleted and the event arrives to the DHCP agent (*but* the event is not processed).
  - Another port with the same IP address is created. The Neutron server will allow this process because the IP address is not used anymore. This port creation event arrives to the DHCP agent.
  - The event queue is processed. At this point, this queue has both events: one deleting a port with IP address x.x.x.x and another event creating a port with the same IP address. Because the port creation event is processed before, a resync process occurs [2].

  There is a very easy way to reproduce this error with a small trick: to add a time.sleep(10) at the beginning of [3]. Then we need to finish this processing thread sending a trivial operation, e.g. creating another port. Then process loop [4] will spawn a new thread that will stop in the sleep command; at this point we need to send both events, the deletion and the creation:
  - Add the time.sleep(10) in the resource processing queue and restart the DHCP agent.
  - Add a port with an IP. The DHCP agent will process this event correctly.

    $ openstack port create --fixed-ip ip-address=10.10.0.20 --network
  private port10; port=`openstack port show -c id -f value port10`

  - This command should be done at the same time:
  a) Create another port to consume the waiting thread.
  b) Wait a small amount of time to let the thread finish and allow the process_loop to start another one. This new thread will stop in the sleep command.
  c) Delete the port.
  d) Create a port with the same IP --> the thread will have both events in the queue. As commented, because the queue processes first the creation event, that will trigger the unwanted resync:

    $ openstack port create --network private port_trivial; sleep 3;
  openstack port delete $port; openstack port create --fixed-ip ip-
  address=10.10.0.20 --network private port10; port=`openstack port show
  -c id -f value port10`

  Log: http://paste.openstack.org/show/802126/

  [1]https://review.opendev.org/c/openstack/neutron/+/626830
  [2]https://github.com/openstack/neutron/blob/0fb63f729751d7e62ac107d3275d440632936978/neutron/agent/dhcp/agent.py#L664-L679
  [3]https://github.com/openstack/neutron/blob/a9fc746249cd34cb7cc594c0b4d74d8ddf65bd46/neutron/agent/common/resource_processing_queue.py#L171
  [4]https://github.com/openstack/neutron/blob/2bf558f6ec1dab70a81ba988f78d587433f43539/neutron/agent/dhcp/agent.py#L567

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1913723/+subscriptions



References