yahoo-eng-team team mailing list archive

[Bug 1939432] [NEW] Concurrent DHCP agent updates can result in a DB lock

 

Public bug reported:

Bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1982981

When a new network and its first subnet are created, the DHCP agent is
updated. The agent scheduler increments the "load" field [1] of the DHCP
agent DB register; this field is later used when scheduling new networks
to the same agent.

If multiple networks (each with its first subnet) are created
concurrently, the agent "load" is modified concurrently. The DB
guarantees that only one transaction can increase the agent "load"
parameter at a time; the other transactions fail and are retried. E.g.:
https://paste.opendev.org/show/807984/
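
The fail-and-retry behaviour described above can be pictured with a
small, simplified model. This is only a sketch, not Neutron code: the
report does not say how the DB serializes the updates, so a revision
check is used here as a hypothetical stand-in, and all names (AgentRow,
StaleUpdateError, increment_with_retry) are made up for illustration.

```python
class StaleUpdateError(Exception):
    """Hypothetical error: the row changed since it was read."""


class AgentRow:
    """Hypothetical in-memory stand-in for the agent DB register."""

    def __init__(self):
        self.load = 0
        self.revision = 0


def read(row):
    """Take a snapshot of the row, as a transaction would on start."""
    return (row.load, row.revision)


def commit_increment(row, snapshot):
    """Apply "load + 1" only if nobody else committed in between."""
    load, revision = snapshot
    if row.revision != revision:
        raise StaleUpdateError("agent load was modified concurrently")
    row.load = load + 1
    row.revision += 1


def increment_with_retry(row, snapshot, max_retries=3):
    """Retry the increment on conflict; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        try:
            commit_increment(row, snapshot)
            return attempts
        except StaleUpdateError:
            if attempts > max_retries:
                raise
            # Re-read the row and try again: this is the extra work
            # that every losing transaction has to pay.
            snapshot = read(row)
```

If two transactions read the row before either commits, the first one
commits in a single attempt and the second needs a retry; with many
concurrent network creations, most transactions end up in the retry
path.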

NOTE: I say "network and the first subnet" because that is what
triggers the spawn of a new dnsmasq process. This is the event that
increases the "load" value by 1. Any other subnet added to this
network will modify the dnsmasq config but won't increase the "load"
value.

As commented in the "BaseResourceFilter.bind" method [2], "the resource
being bound might or might not be of the same type which is accounted
for the load. It isn't a problem because "+ 1" here does not meant to
predict precisely what the load of the agent will be. The value will be
corrected by the agent on the next report interval." In other words,
when the DHCP agent reports its status, it accurately updates the number
of resources (networks) that it is handling.

This bug proposes catching the DB errors in the
"BaseResourceFilter.bind" method [2] to avoid the DB retry. The retry is
unnecessary because the DHCP agent, as noted above, will update the
"load" value itself. By avoiding it, we avoid unnecessary Neutron server
and DB operations and command delays (for example, when creating a
subnet).
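
A minimal sketch of the proposed idea, under stated assumptions: this
is not the actual "BaseResourceFilter.bind" code or the final patch.
DBConflictError is a hypothetical stand-in for whatever error the DB
layer raises on a conflicting update, and Agent, bump_load_best_effort
and report_state are illustrative names only.

```python
class DBConflictError(Exception):
    """Hypothetical stand-in for the error raised on a DB conflict."""


class Agent:
    """Hypothetical in-memory stand-in for an agent DB register."""

    def __init__(self, load=0):
        self.load = load


def bump_load_best_effort(agent, increment_load):
    """Try to add 1 to the agent load; do not retry on a DB conflict.

    Returns True if the increment was applied, False if it was skipped.
    """
    try:
        increment_load(agent)
        return True
    except DBConflictError:
        # The "+ 1" is only an estimate. The next agent status report
        # overwrites "load" with the real number of networks handled,
        # so skipping the update here is harmless.
        return False


def report_state(agent, networks_handled):
    """Simulate the periodic agent report that corrects the load."""
    agent.load = networks_handled


def ok_increment(agent):
    """An update that succeeds."""
    agent.load += 1


def conflicting_increment(agent):
    """An update that loses the race with another transaction."""
    raise DBConflictError("concurrent update of agent load")
```

The trade-off is that "load" may be temporarily lower than the real
value until the next report interval, which the code comment quoted
above already accepts.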

[1] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/db/models/agent.py#L55
[2] https://github.com/openstack/neutron/blob/0ccfed0ae13182f820e6a8c11a2fa801506f3a3a/neutron/scheduler/base_resource_filter.py#L35-L39

** Affects: neutron
     Importance: Undecided
     Assignee: Rodolfo Alonso (rodolfo-alonso-hernandez)
         Status: New

** Changed in: neutron
     Assignee: (unassigned) => Rodolfo Alonso (rodolfo-alonso-hernandez)

** Description changed:

+ Bugzilla reference: https://bugzilla.redhat.com/show_bug.cgi?id=1982981

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1939432

Title:
  Concurrent DHCP agent updates can result in a DB lock

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1939432/+subscriptions


