yahoo-eng-team team mailing list archive

[Bug 1939432] Re: Concurrent DHCP agent updates can result in a DB lock

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1939432@xxxxxxxxxxxxxxxxxx>
Date: Thu, 02 Sep 2021 04:05:02 -0000
Reply-to: Bug 1939432 <1939432@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Reviewed:  https://review.opendev.org/c/openstack/neutron/+/804218
Committed: https://opendev.org/openstack/neutron/commit/668b1cc652f076e555ef1fc1289684367159186a
Submitter: "Zuul (22348)"
Branch:    master

commit 668b1cc652f076e555ef1fc1289684367159186a
Author: Rodolfo Alonso Hernandez <ralonsoh@xxxxxxxxxx>
Date:   Wed Aug 11 09:13:55 2021 +0000

    Do not fail if the agent load is not bumped
    
    When a new network and its first subnet are created, the DHCP agent
    bumps the "load" parameter to reflect the number of networks handled.
    This "load" parameter is modified in two cases:
    - When the first subnet of a network is created, the "load" value is
      bumped.
    - When the DHCP agent periodically sends its status, reporting the
      current number of networks handled.
    
    If this "load" value is not updated during the subnet creation, it
    will be updated in the next periodic update of the agent.
    
    This "load" value is used by the scheduler to equally distribute the
    objects to be managed by any agent type (DHCP agents manage networks).
    
    The bug refers to DHCP but is valid for any other agent.
    
    Change-Id: Ief402048d99d40b64d81fcf58eb2e39b1ba7ebbb
    Closes-Bug: #1939432
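
As a rough illustration of the role the "load" value plays in the commit
message above, here is a minimal sketch of least-loaded scheduling with a
periodic correction. The Agent class and both helper functions are
hypothetical stand-ins written for this example; they are not Neutron's
models or scheduler code.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for an agent record; not Neutron's model."""
    host: str
    load: int = 0


def pick_least_loaded(agents):
    """Return the agent with the smallest load (first one wins on ties)."""
    return min(agents, key=lambda agent: agent.load)


def report_state(agent, networks_handled):
    """Periodic report: overwrite the estimate with the real count."""
    agent.load = networks_handled


agents = [Agent("dhcp-a", load=3), Agent("dhcp-b", load=1)]
chosen = pick_least_loaded(agents)
chosen.load += 1          # optimistic bump when the network is bound
report_state(chosen, 2)   # the next periodic report sets the exact value
```

Because the periodic report overwrites the value, the bump at bind time
only needs to be approximately right between two reports.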


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939432

Title:
  Concurrent DHCP agent updates can result in a DB lock

Status in neutron:
  Fix Released

Bug description:
  Bugzilla reference:
  https://bugzilla.redhat.com/show_bug.cgi?id=1982981

  When a new network and the first subnet are created, the DHCP agent is
  updated. The agent scheduler increases the "load" field [1] of the
  DHCP agent register; this field is used when scheduling new networks
  to the agent.

  If multiple networks (each with its first subnet) are created
  concurrently, the agent "load" will be modified concurrently. The DB
  guarantees that only one transaction can increase the agent "load"
  parameter at once; the other transactions will fail and be retried. E.g.:
  https://paste.opendev.org/show/807984/

  NOTE: I say "network and the first subnet" because that is what
  triggers the spawn of a new dnsmasq process. This is the event that
  increases the "load" value by 1. Any other new subnet added to this
  network will modify the dnsmasq config but won't increase the "load"
  value.

  As commented in the "BaseResourceFilter.bind" method [2], "the
  resource being bound might or might not be of the same type which is
  accounted for the load. It isn't a problem because "+ 1" here does not
  meant to predict precisely what the load of the agent will be. The
  value will be corrected by the agent on the next report interval." In
  other words, when the DHCP agent reports its status, it accurately
  updates the number of resources (networks) that it is handling.

  This bug proposes catching the DB errors in the
  "BaseResourceFilter.bind" method [2] to avoid the DB retry action. The
  retry is unnecessary because the DHCP agent, as commented, will update
  the "load" value in its next report. Avoiding this retry saves
  unnecessary Neutron server and DB operations and command delays (for
  example when creating a subnet).

  [1] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/db/models/agent.py#L55
  [2] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/scheduler/base_resource_filter.py#L35-L39
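
The proposal in the bug description, treating the "load" bump as best
effort instead of retrying on a DB conflict, can be sketched as follows.
This is not the code from the merged change: StaleAgentRow, FakeAgentStore
and bind_network are hypothetical names made up for the example, and the
in-memory store only imitates a failed concurrent update. It also
simplifies by keeping the binding separate from the bump, which may not
match how the real transaction is structured.

```python
import logging

LOG = logging.getLogger(__name__)


class StaleAgentRow(Exception):
    """Hypothetical error standing in for a DB conflict on the agent row."""


class FakeAgentStore:
    """In-memory stand-in for the agent record; fails the first N bumps."""

    def __init__(self, load=0, failures=0):
        self.load = load
        self._failures = failures

    def bump_load(self):
        if self._failures > 0:
            self._failures -= 1
            raise StaleAgentRow("agent row was modified concurrently")
        self.load += 1

    def report_state(self, networks_handled):
        # The periodic report replaces the estimate with the real count.
        self.load = networks_handled


def bind_network(store, bindings, network_id):
    """Record the binding and treat the load bump as best effort."""
    bindings.append(network_id)
    try:
        store.bump_load()
    except StaleAgentRow:
        # No retry: the next periodic report corrects the value anyway.
        LOG.debug("load not bumped for network %s", network_id)
        return False
    return True


store = FakeAgentStore(load=0, failures=1)
bindings = []
first = bind_network(store, bindings, "net-1")    # bump fails, binding kept
second = bind_network(store, bindings, "net-2")   # bump succeeds
stale_load = store.load                           # off by one for now
store.report_state(len(bindings))                 # periodic report fixes it
```

The trade-off is a "load" value that can be slightly stale until the next
report, in exchange for skipping the retry of the whole operation.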

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1939432/+subscriptions

References

[Bug 1939432] [NEW] Concurrent DHCP agent updates can result in a DB lock
From: Rodolfo Alonso, 2021-08-10