yahoo-eng-team team mailing list archive
Message #87825
[Bug 1952730] Re: Segment updates may cause unnecessary overload
Reviewed: https://review.opendev.org/c/openstack/neutron/+/819777
Committed: https://opendev.org/openstack/neutron/commit/176503e610aee16cb5799a77466579bc55129450
Submitter: "Zuul (22348)"
Branch: master
commit 176503e610aee16cb5799a77466579bc55129450
Author: Bence Romsics <bence.romsics@xxxxxxxxx>
Date: Mon Nov 29 09:40:42 2021 +0100
Avoid writing segments to the DB repeatedly
When:
* the segments service plugin is enabled and
* we have multiple rpc worker processes (as in the sum of rpc_workers
and rpc_state_report_workers, since both kinds process agent
state_reports) and
* many ovs-agents report physnets,
then rabbitmq dispatches the state_report messages among the workers
in a round-robin fashion, therefore eventually the state_reports of the
same agent will hit all rpc workers.
Unfortunately, each worker process has its own 'reported_hosts' set to
remember which hosts it has already seen agent reports from. Right
after a server start, when that set is still empty, each worker will
unconditionally write the received physnet-segment information into
the db. This multiplies the load on the db and rpc workers by
a factor of the rpc worker count.
This patch tries to reduce the load on the db by adding another early
return before the unconditional db write.
Change-Id: I935186b6ee95f0cae8dc05869d9742c8fb3353c3
Closes-Bug: #1952730
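The idea of the patch can be illustrated with a toy in-memory "DB". This is a minimal sketch only; the names (FakeDB, update_segment_host_mapping, and the helper methods) are illustrative, not neutron's actual API:

```python
# Sketch of the fix: before writing the host-segment mapping, compare it
# with what the DB already holds and return early when nothing changed,
# so only the first worker to process a host's report pays the write cost.
# All names here are hypothetical stand-ins for the real neutron code.

class FakeDB:
    def __init__(self):
        self.mapping = {}   # host -> set of segment ids
        self.writes = 0     # count of DB write operations

    def get_segment_ids_for_host(self, host):
        return self.mapping.get(host, set())

    def replace_segment_host_mapping(self, host, segment_ids):
        self.mapping[host] = set(segment_ids)
        self.writes += 1

def update_segment_host_mapping(db, host, reported_segment_ids):
    # The added early return: skip the write when the DB already holds
    # the mapping this agent just reported.
    if db.get_segment_ids_for_host(host) == set(reported_segment_ids):
        return
    db.replace_segment_host_mapping(host, reported_segment_ids)

db = FakeDB()
update_segment_host_mapping(db, "compute-1", {"seg-a", "seg-b"})  # writes
update_segment_host_mapping(db, "compute-1", {"seg-a", "seg-b"})  # no-op
print(db.writes)  # 1
```

With this guard in place, repeated state_reports from the same host, even when they land on a worker with an empty 'reported_hosts' set, no longer translate into repeated DB writes.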
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1952730
Title:
Segment updates may cause unnecessary overload
Status in neutron:
Fix Released
Bug description:
When:
* the segments service plugin is enabled and
* we have many rpc worker processes (as in the sum of rpc_workers and rpc_state_report_workers, since both kinds process agent state_reports) and
* many ovs-agents report physnets and
* neutron-server is restarted,
then rpc workers may get overloaded by state_report messages. That is,
they may run at 100% CPU utilization for tens of minutes, during which
they are unable to process ovs-agents' state_reports in a timely
manner. This in turn causes the agent state to go down and come back
up, maybe multiple times. Eventually, as the workers get through the
initial processing, the load lessens and the system stabilizes. The
same rate of incoming state_report messages is not a problem at that
point.
(Colleagues working downstream observed this on a stable/victoria base
with ca. 150 ovs-agents and 3 neutron-servers, each having maybe
rpc_workers=6 and rpc_state_report_workers=6. The relevant code has
not changed at all since victoria, so I believe the same would happen
on master.)
I think the root cause is the following:
rabbitmq dispatches the state_report messages among the workers in a
round-robin fashion, therefore eventually the state_reports of the
same agent will hit all rpc workers. Each worker has logic to update
the host-segment mapping if either the server or the agent got
restarted:
https://opendev.org/openstack/neutron/src/commit/90b5456b8c11011c41f2fcd53a8943cb45fb6479/neutron/services/segments/db.py#L304-L305
Unfortunately, the 'reported_hosts' set (which remembers the hosts the server has already seen agent reports from) is private to each worker process. Right after a server (re-)start, when that set is still empty, each worker will unconditionally write the received physnet-segment information into the db. This multiplies the load on the db and rpc workers by a factor of the total rpc worker count.
Pushing a fix attempt soon.
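The multiplication effect described above can be simulated with a few lines of Python. This is a hypothetical model, not neutron code: Worker, handle_state_report, and the delivery loop are illustrative, and the drifting round-robin offset stands in for rabbitmq's dispatch over time:

```python
# Simulation of the problem: each rpc worker keeps its own private
# 'reported_hosts' set, so after a restart every worker performs the
# "unconditional" DB write once per host it sees for the first time.

NUM_WORKERS = 6          # e.g. rpc_workers + rpc_state_report_workers
NUM_HOSTS = 150          # e.g. number of ovs-agents
db_writes = 0            # total segment-host mapping writes

class Worker:
    def __init__(self):
        self.reported_hosts = set()  # private to this worker process

    def handle_state_report(self, host):
        global db_writes
        if host in self.reported_hosts:
            return  # already seen, but only by *this* worker
        self.reported_hosts.add(host)
        db_writes += 1  # unconditional DB write on first sight

workers = [Worker() for _ in range(NUM_WORKERS)]
hosts = [f"compute-{i}" for i in range(NUM_HOSTS)]

# Round-robin delivery whose offset drifts between reporting rounds,
# so eventually every worker sees every host's state_report.
for r in range(NUM_WORKERS):
    for h, host in enumerate(hosts):
        workers[(h + r) % NUM_WORKERS].handle_state_report(host)

print(db_writes)  # 900: 150 hosts * 6 workers, instead of 150
```

With one shared set (or a check against the DB state itself), the write count would stay at one per host; with per-process sets it is multiplied by the worker count, which matches the overload seen after a neutron-server restart.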
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1952730/+subscriptions