[Bug 1952730] [NEW] Segment updates may cause unnecessary overload

Public bug reported:

When:

* the segments service plugin is enabled and
* we have many rpc worker processes (as in the sum of rpc_workers and rpc_state_report_workers, since both kinds of workers process agent state_reports) and
* many ovs-agents report physnets and
* neutron-server is restarted,

then rpc workers may get overloaded by state_report messages. That is:
they may run at 100% CPU utilization for tens of minutes, during which
they are not able to process the ovs-agents' state_reports in a timely
manner. This in turn causes the agent states to go down and come back
up, possibly multiple times. Eventually, as the workers get through the
initial processing, the load lessens and the system stabilizes. The
same rate of incoming state_report messages is not a problem at that
point.

(Colleagues working downstream observed this on a stable/victoria base
with circa 150 ovs-agents and 3 neutron-servers, each configured with
roughly rpc_workers=6 and rpc_state_report_workers=6. The relevant code
has not changed at all since victoria, so I believe the same would
happen on master.)

I think the root cause is the following:

rabbitmq dispatches the state_report messages among the workers in a
round-robin fashion, therefore eventually the state_reports of the same
agent will hit every rpc worker. Each worker has logic to update the
host segment mapping if either the server or the agent got restarted:

https://opendev.org/openstack/neutron/src/commit/90b5456b8c11011c41f2fcd53a8943cb45fb6479/neutron/services/segments/db.py#L304-L305
    
Unfortunately the 'reported_hosts' set (used to remember from which
hosts the server has already seen agent reports) is private to each
worker process. But right after a server (re-)start, when that set is
still empty, each worker will unconditionally write the received
physnet-segment information into the db. This means we multiply the
load on the db and the rpc workers by a factor of the total rpc worker
count.
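
To make the mechanism concrete, here is a minimal sketch of the
pattern (simplified, not the actual neutron code; the helper name
_write_segment_host_mapping is made up for illustration). With the
numbers above, 3 servers * (6+6) workers = 36 consumers share the
state_report queue, so after a restart each of the ~150 hosts can
trigger up to 36 redundant writes instead of 1:

    # Module-level cache: created anew in every forked rpc worker, so
    # each worker starts with its own empty set after a server restart.
    reported_hosts = set()

    def update_segment_host_mapping_for_agent(context, host, agent):
        start_flag = agent.get('start_flag', False)  # agent restarted?
        if host in reported_hosts and not start_flag:
            # Cheap path, but only once *this* worker has seen the host.
            return
        reported_hosts.add(host)
        # Right after a server restart every worker's set is still
        # empty, so every worker falls through here once per host and
        # repeats the expensive db write below.
        _write_segment_host_mapping(context, host, agent)

    def _write_segment_host_mapping(context, host, agent):
        # Placeholder for the physnet/segment db update.
        pass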

Pushing a fix attempt soon.
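
Just to illustrate one conceivable direction (an assumption, not
necessarily the actual fix), the guard could be keyed on shared db
state instead of per-process memory, so duplicate reports degrade into
cheap reads; the helpers in this sketch are hypothetical stand-ins:

    def _segments_for_physnets(context, physnets):
        return set(physnets)  # stand-in for the real segment lookup

    def _current_segments_for_host(context, host):
        return set()  # stand-in for a db read of existing mappings

    def update_segment_host_mapping_if_needed(context, host, physnets):
        # The point is only that the guard lives in the db, which all
        # workers share, rather than in per-process memory.
        desired = _segments_for_physnets(context, physnets)
        current = _current_segments_for_host(context, host)
        if current == desired:
            return  # another worker already stored this mapping
        _write_segment_host_mapping(context, host, desired)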

** Affects: neutron
     Importance: High
     Assignee: Bence Romsics (bence-romsics)
         Status: In Progress

--
https://bugs.launchpad.net/bugs/1952730
