[Bug 1952730] [NEW] Segment updates may cause unnecessary overload

Public bug reported:

When:

* the segments service plugin is enabled and
* we have many rpc worker processes (as in the sum of rpc_workers and rpc_state_report_workers, since both kinds of workers process agent state_reports) and
* many ovs-agents report physnets and
* neutron-server is restarted,

then rpc workers may get overloaded by state_report messages. That is:
they may run at 100% CPU utilization for tens of minutes, during which
they are not able to process the ovs-agents' state_reports in a timely
manner. This in turn causes the agent states to go down and come back
up, possibly multiple times. Eventually, as the workers get through the
initial processing, the load lessens and the system stabilizes. The
same rate of incoming state_report messages is not a problem at that
point.

(Colleagues working downstream observed this on a stable/victoria base
with circa 150 ovs-agents and 3 neutron-servers, each configured with
roughly rpc_workers=6 and rpc_state_report_workers=6. The relevant code
has not changed at all since victoria, so I believe the same would
happen on master.)

I think the root cause is the following:

rabbitmq dispatches the state_report messages among the workers in a
round-robin fashion, therefore eventually the state_reports of the same
agent will hit every rpc worker. Each worker has logic to update the
host segment mapping if either the server or the agent got restarted:

https://opendev.org/openstack/neutron/src/commit/90b5456b8c11011c41f2fcd53a8943cb45fb6479/neutron/services/segments/db.py#L304-L305
    
Unfortunately the 'reported_hosts' set (used to remember from which
hosts the server has already seen agent reports) is private to each
worker process. But right after a server (re-)start, when that set is
still empty, each worker will unconditionally write the received
physnet-segment information into the db. This means we multiply the
load on the db and the rpc workers by a factor of the total rpc worker
count.
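
To make the mechanism concrete, here is a minimal sketch of the
pattern (simplified, not the actual neutron code; the helper name
_write_segment_host_mapping is made up for illustration). With the
numbers above, 3 servers * (6+6) workers = 36 consumers share the
state_report queue, so after a restart each of the ~150 hosts can
trigger up to 36 redundant writes instead of 1:

    # Module-level cache: created anew in every forked rpc worker, so
    # each worker starts with its own empty set after a server restart.
    reported_hosts = set()

    def update_segment_host_mapping_for_agent(context, host, agent):
        start_flag = agent.get('start_flag', False)  # agent restarted?
        if host in reported_hosts and not start_flag:
            # Cheap path, but only once *this* worker has seen the host.
            return
        reported_hosts.add(host)
        # Right after a server restart every worker's set is still
        # empty, so every worker falls through here once per host and
        # repeats the expensive db write below.
        _write_segment_host_mapping(context, host, agent)

    def _write_segment_host_mapping(context, host, agent):
        # Placeholder for the physnet/segment db update.
        pass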

Pushing a fix attempt soon.
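
Just to illustrate one conceivable direction (an assumption, not
necessarily the actual fix), the guard could be keyed on shared db
state instead of per-process memory, so duplicate reports degrade into
cheap reads; the helpers in this sketch are hypothetical stand-ins:

    def _segments_for_physnets(context, physnets):
        return set(physnets)  # stand-in for the real segment lookup

    def _current_segments_for_host(context, host):
        return set()  # stand-in for a db read of existing mappings

    def update_segment_host_mapping_if_needed(context, host, physnets):
        # The point is only that the guard lives in the db, which all
        # workers share, rather than in per-process memory.
        desired = _segments_for_physnets(context, physnets)
        current = _current_segments_for_host(context, host)
        if current == desired:
            return  # another worker already stored this mapping
        _write_segment_host_mapping(context, host, desired)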

** Affects: neutron
     Importance: High
     Assignee: Bence Romsics (bence-romsics)
         Status: In Progress

--
https://bugs.launchpad.net/bugs/1952730
