
yahoo-eng-team team mailing list archive

[Bug 1952730] Re: Segment updates may cause unnecessary overload


Reviewed:  https://review.opendev.org/c/openstack/neutron/+/819777
Committed: https://opendev.org/openstack/neutron/commit/176503e610aee16cb5799a77466579bc55129450
Submitter: "Zuul (22348)"
Branch:    master

commit 176503e610aee16cb5799a77466579bc55129450
Author: Bence Romsics <bence.romsics@xxxxxxxxx>
Date:   Mon Nov 29 09:40:42 2021 +0100

    Avoid writing segments to the DB repeatedly
    
    When:
    * the segments service plugin is enabled and
    * we have multiple rpc worker processes (as in the sum of rpc_workers
      and rpc_state_report_workers, since both kinds of worker process
      agent state_reports) and
    * many ovs-agents report physnets,
    then rabbitmq dispatches the state_report messages between the workers
    in a round-robin fashion, so eventually the state_reports of the
    same agent will hit all rpc workers.
    
    Unfortunately each worker process has its own 'reported_hosts' set to
    remember from which hosts it has already seen agent reports. But right
    after a server start, when that set is still empty, each worker will
    unconditionally write the received physnet-segment information into
    the db. This multiplies the load on the db and rpc workers by a
    factor of the rpc worker count.
    
    This patch tries to reduce the load on the db by adding another early
    return before the unconditional db write.
    
    Change-Id: I935186b6ee95f0cae8dc05869d9742c8fb3353c3
    Closes-Bug: #1952730
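
For illustration, here is a minimal sketch of the early-return idea
described above (not the literal patch; the names approximate those in
neutron/services/segments/db.py, and get_segment_ids_mapped_to_host is
a hypothetical helper):

    reported_hosts = set()  # still private to each rpc worker process

    def update_segment_host_mapping_for_agent(context, host, segment_ids):
        if host in reported_hosts:
            return  # this worker has already handled this host
        reported_hosts.add(host)
        # Added early return: another worker (or a previous run) may have
        # persisted the same mapping already, so write to the db only
        # when the mapping actually changed.
        if get_segment_ids_mapped_to_host(context, host) == segment_ids:
            return  # hypothetical helper; nothing changed, skip the write
        update_segment_host_mapping(context, host, segment_ids)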


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1952730

Title:
  Segment updates may cause unnecessary overload

Status in neutron:
  Fix Released

Bug description:
  When:

  * the segments service plugin is enabled and
  * we have many rpc worker processes (as in the sum of rpc_workers and rpc_state_report_workers, since both kinds of worker process agent state_reports) and
  * many ovs-agents report physnets and
  * neutron-server is restarted,

  then rpc workers may get overloaded by state_report messages. That is:
  they may run at 100% CPU utilization for tens of minutes, and during
  that time they are not able to process the ovs-agents' state_reports
  in a timely manner, which in turn causes the agent states to go down
  and come back up, maybe multiple times. Eventually, as the workers get
  through the initial processing, the load lessens and the system
  stabilizes. The same rate of incoming state_report messages is not a
  problem at that point.

  (Colleagues working downstream observed this on a stable/victoria base
  with circa 150 ovs-agents and 3 neutron-servers, each having maybe
  rpc_workers=6 and rpc_state_report_workers=6. The relevant code has
  not changed at all since victoria, so I believe the same would happen
  on master.)
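
  As a rough back-of-the-envelope model of the write amplification under
  those figures (all numbers assumed from the observation above, not
  measured):

      servers = 3
      workers_per_server = 6 + 6  # rpc_workers + rpc_state_report_workers
      agents = 150

      # Right after a restart, every worker process eventually receives
      # a state_report from every agent and writes the mapping once:
      redundant_writes = servers * workers_per_server * agents  # 5400
      # One write per reporting host would have sufficed:
      needed_writes = agents  # 150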

  I think the root cause is the following:

  rabbitmq dispatches the state_report messages between the workers in a
  round-robin fashion, so eventually the state_reports of the same
  agent will hit all rpc workers. Each worker has logic to update the
  host-segment mapping if either the server or the agent got restarted:

  https://opendev.org/openstack/neutron/src/commit/90b5456b8c11011c41f2fcd53a8943cb45fb6479/neutron/services/segments/db.py#L304-L305

  Unfortunately the 'reported_hosts' set (to remember from which hosts
  the server has already seen agent reports) is private to each worker
  process. But right after a server (re-)start, when that set is still
  empty, each worker will unconditionally write the received
  physnet-segment information into the db. This means we multiply the
  load on the db and rpc workers by a factor of the total rpc worker
  count.
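
  A minimal sketch of the problematic pattern (illustrative only; the
  real logic lives in the db.py linked above, and
  _write_segment_host_mapping is a hypothetical helper):

      reported_hosts = set()  # module-level, so private to each worker

      def update_segment_host_mapping_for_agent(host, physnets):
          # After a restart this set is empty in every worker process,
          # so the first state_report a given worker sees from 'host'
          # falls through to the db write, even if another worker wrote
          # the same mapping moments ago.
          if host in reported_hosts:
              return
          reported_hosts.add(host)
          _write_segment_host_mapping(host, physnets)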

  Pushing a fix attempt soon.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1952730/+subscriptions


