yahoo-eng-team team mailing list archive
Message #47602
[Bug 1554332] [NEW] neutron agents are too aggressive under server load
Public bug reported:
If a server operation takes long enough to trigger a timeout on an agent's
call to the server, the agent simply gives up and immediately issues a
new call. This pattern is pervasive throughout the agents, and it leads
to two issues:
First, if the server is busy and requests take longer than the timeout
window to fulfill, the agent continually hammers the server with calls
that are bound to fail until the server load drops enough for the query
to be fulfilled. If the load is itself caused by calls from agents, this
produces a stampede effect in which the server cannot fulfill any
requests until an operator intervenes.
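A minimal sketch of the aggressive retry pattern described above follows. The names `CallTimeout`, `OverloadedServer`, `fetch_state`, and `naive_sync` are hypothetical and are not neutron's actual code. `CallTimeout` stands in for an RPC timeout such as `oslo_messaging.MessagingTimeout`. The toy server raises immediately instead of after a real timeout period, so the sketch only counts how many doomed calls reach the server.

```python
class CallTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout such as oslo_messaging.MessagingTimeout."""


class OverloadedServer:
    """Toy server that times out on the first `busy_calls` requests it receives."""

    def __init__(self, busy_calls: int) -> None:
        self.busy_calls = busy_calls
        self.received = 0  # every attempt reaches the server, even doomed ones

    def fetch_state(self) -> str:
        self.received += 1
        if self.received <= self.busy_calls:
            # Models a call that outlived the agent's RPC timeout.
            raise CallTimeout("no reply within the RPC timeout")
        return "state"


def naive_sync(server: OverloadedServer) -> str:
    """The aggressive pattern: on timeout, immediately re-issue the same call."""
    while True:
        try:
            return server.fetch_state()
        except CallTimeout:
            continue  # no delay and no backoff, so the busy server is hit again at once
```

With `server = OverloadedServer(busy_calls=5)`, calling `naive_sync(server)` returns `"state"` only after the server has received 6 calls. The first 5 of those calls added load and produced no useful result.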
Second, the server builds up a backlog of call requests, and the window
of time it has to process each message shrinks as the backlog grows.
With enough clients making calls, the timeout threshold can be crossed
before a call even starts processing. For example, suppose the server
handles these calls one at a time, each call takes 6 seconds, and the
clients are configured with a 60-second timeout. If 30 agents make the
call simultaneously, 20 of them will never get a response. The first 10
calls are fulfilled within 60 seconds (10 calls at 6 seconds each). The
last 20 agents end up in a loop where the server spends its time
replying to calls that have already expired by the time it processes
them.
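The arithmetic in this example can be checked with a small simulation. It is a toy model under stated assumptions:

- all 30 calls arrive at t = 0;
- the server handles them strictly one at a time in FIFO order;
- a reply that arrives exactly at the timeout still counts as on time.

The model ignores retries. In practice, retries re-enter the queue behind the calls that have already expired. `simulate` and `Call` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Call:
    agent: int
    sent_at: float  # seconds since the burst started


def simulate(num_agents: int, service_time: float, timeout: float) -> tuple[list[int], list[int]]:
    """Serve one simultaneous burst of calls FIFO, one call at a time.

    Returns (answered, expired): ids of agents whose reply arrived within
    their timeout, and ids of agents whose reply arrived too late to be used.
    """
    queue = [Call(agent=i, sent_at=0.0) for i in range(num_agents)]
    clock = 0.0
    answered: list[int] = []
    expired: list[int] = []
    for call in queue:
        clock += service_time  # the server finishes this call at `clock`
        # A reply exactly at the deadline is treated as on time.
        if clock - call.sent_at <= timeout:
            answered.append(call.agent)
        else:
            # The server still did the work, but the caller already gave up.
            expired.append(call.agent)
    return answered, expired


answered, expired = simulate(num_agents=30, service_time=6.0, timeout=60.0)
wasted_seconds = len(expired) * 6.0  # server time spent on replies nobody reads
print(len(answered), len(expired), wasted_seconds)  # prints: 10 20 120.0
```

Under these assumptions, the server spends 120 seconds of work on the 20 expired calls before it can reach any retries queued behind them.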
See the push notification spec for a proposal to eliminate heavy agent
calls: https://review.openstack.org/#/c/225995/
However, even with that spec, we need more intelligent handling of the
cases where calls are required (e.g. initial sync) or where converting a
call into a push notification would be too invasive.
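One possible form of "more intelligent handling" is to back off on timeouts. This is sketched here as an assumption, not as the fix adopted for this bug. The sketch:

- doubles the per-call timeout after each failure, up to a cap, so a slow but healthy server eventually gets enough time to answer;
- waits a random, jittered interval before retrying, so that many agents do not re-send in lockstep.

`call_with_backoff`, its parameters, and `CallTimeout` are hypothetical. `rpc_call` is assumed to accept the timeout in seconds.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CallTimeout(Exception):
    """Hypothetical stand-in for an RPC timeout such as oslo_messaging.MessagingTimeout."""


def call_with_backoff(
    rpc_call: Callable[[float], T],
    base_timeout: float = 60.0,
    max_timeout: float = 600.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    rand: Callable[[], float] = random.random,
) -> T:
    """Call `rpc_call(timeout)`, growing the timeout and pausing after each timeout.

    Gives up by re-raising CallTimeout after `max_attempts` failed attempts.
    `sleep` and `rand` are injectable so the behaviour can be tested.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    timeout = base_timeout
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc_call(timeout)
        except CallTimeout:
            if attempt == max_attempts:
                raise
            # Give the (possibly overloaded) server more time on the next try.
            timeout = min(timeout * 2, max_timeout)
            # Jitter: wait a random fraction of the new timeout before retrying.
            sleep(rand() * timeout)
    raise AssertionError("unreachable")
```

Doubling the timeout helps when calls are slow rather than lost. The random pause spreads retries from many agents over time instead of adding them to the server's backlog all at once.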
** Affects: neutron
Importance: Undecided
Assignee: Kevin Benton (kevinbenton)
Status: New
** Changed in: neutron
Assignee: (unassigned) => Kevin Benton (kevinbenton)
** Changed in: neutron
Milestone: None => mitaka-rc1
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1554332
Title:
neutron agents are too aggressive under server load
Status in neutron:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1554332/+subscriptions