yahoo-eng-team team mailing list archive

[Bug 1939432] [NEW] Concurrent DHCP agent updates can result in a DB lock

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Rodolfo Alonso <1939432@xxxxxxxxxxxxxxxxxx>
Date: Tue, 10 Aug 2021 16:59:17 -0000
Reply-to: Bug 1939432 <1939432@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Public bug reported:

Bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1982981

When a new network and its first subnet are created, the DHCP agent is
updated. The agent scheduler increases the "load" field [1] of the DHCP
agent register; this field is used when scheduling new networks onto
this agent.

If multiple networks (each with its first subnet) are created
concurrently, the agent "load" will be modified concurrently. The DB
guarantees that only one transaction can increase the agent "load"
parameter at a time; the other transactions will fail and be retried.
E.g.:
https://paste.opendev.org/show/807984/
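
The fail-and-retry behaviour described above can be sketched with a toy
in-memory model. This is only an illustration, not Neutron code:
"AgentRegister", "StaleUpdateError", "increase_load" and the version
counter are hypothetical names, and the real error depends on the DB
backend in use.

```python
class StaleUpdateError(Exception):
    """Hypothetical stand-in for the DB error raised on a conflicting write."""


class AgentRegister:
    """Toy in-memory model of an agent DB register (not Neutron code)."""

    def __init__(self):
        self.load = 0
        self.version = 0  # bumped on every successful write

    def read(self):
        return self.load, self.version

    def write(self, new_load, seen_version):
        # Only one writer per version may succeed; the rest must start over.
        if seen_version != self.version:
            raise StaleUpdateError()
        self.load = new_load
        self.version += 1


def increase_load(register, max_attempts=5):
    """Increase "load" by one, retrying on conflict. Returns attempts used."""
    for attempt in range(1, max_attempts + 1):
        load, version = register.read()
        try:
            register.write(load + 1, version)
            return attempt
        except StaleUpdateError:
            continue
    raise RuntimeError("too many conflicting updates")


def simulate_conflict():
    """Two transactions read the same snapshot; the second write fails."""
    register = AgentRegister()
    load_a, version_a = register.read()
    load_b, version_b = register.read()
    register.write(load_a + 1, version_a)  # first transaction wins
    try:
        register.write(load_b + 1, version_b)  # second one is stale
    except StaleUpdateError:
        # The retry needs a fresh read and a second write.
        retry_attempts = increase_load(register)
        return register.load, retry_attempts
    return register.load, 0
```

Each failed transaction costs at least one extra read and write, which
is the overhead this report is about.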

NOTE: I say "network and the first subnet" because that is what
triggers the spawn of a new dnsmasq process. This is the event that
increases the "load" value by 1. Any other subnet added to this network
will modify the dnsmasq config but won't increase the "load" value.

As the comment in the "BaseResourceFilter.bind" method [2] says, "the
resource being bound might or might not be of the same type which is
accounted for the load. It isn't a problem because '+ 1' here does not
meant to predict precisely what the load of the agent will be. The value
will be corrected by the agent on the next report interval." In other
words, when the DHCP agent reports its status, it accurately updates the
number of resources (networks) that it is handling.
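
A minimal sketch of why a missed "+ 1" is harmless, assuming the agent
report simply overwrites the stored value with the real network count.
"DhcpAgent", "bind_network" and "report_state" are hypothetical names
for illustration, not the Neutron API.

```python
class DhcpAgent:
    """Toy model: "load" is the DB estimate, "networks" is the real state."""

    def __init__(self):
        self.load = 0
        self.networks = set()

    def bind_network(self, network_id, load_update_ok=True):
        # The network is hosted either way; only the "+ 1" may be lost.
        self.networks.add(network_id)
        if load_update_ok:
            self.load += 1

    def report_state(self):
        # The periodic report overwrites the estimate with the real count.
        self.load = len(self.networks)
        return self.load


agent = DhcpAgent()
agent.bind_network("net-1")
agent.bind_network("net-2", load_update_ok=False)  # "+ 1" was skipped
load_before_report = agent.load  # the estimate is off by one here
load_after_report = agent.report_state()  # corrected by the report
```

The estimate is only wrong until the next report interval.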

This bug proposes catching the DB errors in the
"BaseResourceFilter.bind" method [2] to avoid the DB retry. The retry is
unnecessary because, as noted above, the DHCP agent will correct the
"load" value on its next report. Skipping the retry avoids unnecessary
Neutron server and DB operations, and the command delays they cause
(for example, when creating a subnet).
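
The idea could look roughly like the sketch below. This is not the
actual patch: "DBConflictError" and the "increase_load" callable are
assumptions standing in for whatever DB error and update call the real
method uses.

```python
import logging

LOG = logging.getLogger(__name__)


class DBConflictError(Exception):
    """Hypothetical stand-in for the DB error seen on concurrent updates."""


def bind_best_effort(increase_load, agent_id):
    """Try the "+ 1" update once; on a DB conflict, log it and move on.

    ``increase_load`` is a hypothetical callable that performs the DB
    update and raises DBConflictError when another transaction won.
    Returns True if the load was increased, False if it was skipped.
    """
    try:
        increase_load(agent_id)
    except DBConflictError:
        # No retry: the next agent report will correct the value.
        LOG.debug("Load of agent %s not increased; it will be corrected "
                  "on the next agent report.", agent_id)
        return False
    return True
```

The error is swallowed only for the "load" update, so the conflict never
reaches the retry logic of the caller.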

[1] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/db/models/agent.py#L55
[2] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/scheduler/base_resource_filter.py#L35-L39

** Affects: neutron
Importance: Undecided
Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez)
Status: New

** Changed in: neutron
Assignee: (unassigned) => Rodolfo Alonso (rodolfo-alonso-hernandez)

** Description changed:

+ Bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1982981
+
When a new network and the first subnet are created, the DHCP agent is
updated. The agent scheduler increases the DHCP agent register "load"
[1] field that will be used to schedule new networks into the same
agent.

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939432

Title:
Concurrent DHCP agent updates can result in a DB lock

Status in neutron:
New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1939432/+subscriptions

Follow ups

[Bug 1939432] Re: Concurrent DHCP agent updates can result in a DB lock
From: OpenStack Infra, 2021-09-02