yahoo-eng-team team mailing list archive
Message #50300
[Bug 1554332] Re: neutron agents are too aggressive under server load
Reviewed: https://review.openstack.org/280595
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3e668b6a3720c1509ffef4ad5b91b4242dfd47b3
Submitter: Jenkins
Branch: master
commit 3e668b6a3720c1509ffef4ad5b91b4242dfd47b3
Author: Kevin Benton <kevin@xxxxxxxxxx>
Date: Tue Feb 16 01:50:23 2016 -0800
Add exponential back-off RPC client
This adds an exponential backoff mechanism for timeout values
on any RPC calls in Neutron that don't explicitly request a timeout
value. This will prevent the clients from DDoSing the server by
giving up on requests and retrying them before they are fulfilled.
Each RPC call method in each namespace gets its own timeout value since
some calls are expected to be much more expensive than others and we
don't want to modify the timeouts of cheap calls.
The backoff currently has no reduction mechanism under the assumption
that timeouts not legitimately caused by heavy system load
(i.e. messages completely dropped by AMQP) are rare enough that the
cost of shrinking the timeout back down and potentially causing
another server timeout isn't worth it. The timeout does have a ceiling
of 10 times the configured default timeout value.
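The per-method timeout growth described above could be sketched roughly as follows. This is an illustration only, not the code from the commit: the class name BackoffTimeouts, the doubling factor, and the 60 second default are assumptions made for the example. The ceiling of 10 times the default and the per-namespace, per-method bookkeeping come from the description.

```python
from collections import defaultdict

DEFAULT_TIMEOUT = 60   # assumed configured default timeout, in seconds
MAX_MULTIPLIER = 10    # ceiling stated in the commit message: 10x the default


class BackoffTimeouts:
    """Hypothetical helper tracking one timeout per (namespace, method)."""

    def __init__(self, default=DEFAULT_TIMEOUT, max_multiplier=MAX_MULTIPLIER):
        self.default = default
        self.ceiling = default * max_multiplier
        # Every method starts at the configured default timeout.
        self._timeouts = defaultdict(lambda: default)

    def get(self, namespace, method):
        """Return the timeout to use for the next call to this method."""
        return self._timeouts[(namespace, method)]

    def record_timeout(self, namespace, method):
        """Grow the timeout after a timeout exception; never shrink it."""
        key = (namespace, method)
        # Doubling is an assumption; the source only says "exponential".
        self._timeouts[key] = min(self._timeouts[key] * 2, self.ceiling)
        return self._timeouts[key]
```

Because the table is keyed by method, a timeout on an expensive call leaves the timeouts of cheap calls at the default, which matches the stated goal.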
Whenever a timeout exception occurs, the client will also sleep for a
random value between 0 and the configured default timeout value to
introduce a splay across all of the agents that may be trying to
communicate with the server.
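A minimal sketch of the random sleep described above, assuming a plain uniform distribution. The function name splay_after_timeout and its injectable sleep and rng parameters are hypothetical and exist only to make the example easy to exercise without actually waiting.

```python
import random
import time


def splay_after_timeout(default_timeout, sleep=time.sleep, rng=random.uniform):
    """Sleep for a random delay between 0 and default_timeout seconds.

    Intended to be called after a timeout exception so that agents that
    timed out together do not all retry at the same instant.
    """
    delay = rng(0, default_timeout)
    sleep(delay)
    return delay
```

Note that the upper bound is the configured default timeout, not the backed-off per-method value, as the description states.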
This patch is intended to be non-invasive so that it remains a candidate
for back-porting. A larger refactor of how data is delivered to the agents
is being discussed in I3af200ad84483e6e1fe619d516ff20bc87041f7c.
Closes-Bug: #1554332
Change-Id: I923e415c1b8e9a431be89221c78c14f39c42c80f
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1554332
Title:
neutron agents are too aggressive under server load
Status in neutron:
Fix Released
Bug description:
If a server operation takes long enough to trigger a timeout on an
agent call to the server, the agent will just give up and issue a new
call immediately. This pattern is pervasive throughout the agents and
it leads to two issues:
First, if the server is busy and the requests take more than the
timeout window to fulfill, the agent will just continually hammer the
server with calls that are bound to fail until the server load is
reduced enough to fulfill the query. If the load is itself a result of
calls from agents, this leads to a stampede in which the server cannot
fulfill requests until an operator intervenes.
Second, the server will build a backlog of call requests that makes
the window of time to process a message smaller as the backlog grows.
With enough clients making calls, the timeout threshold can be crossed
before a call even starts to process. For example, if it takes the
server 6 seconds to process a given call and the clients are
configured with a 60 second timeout, 30 agents making the call
simultaneously will result in a situation where 20 of the agents will
never get a response. The first 10 will get their calls filled and the
last 20 will end up in a loop where the server is just spending time
replying to calls that are expired by the time it processes them.
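The arithmetic in this example can be checked with a short calculation. The sketch below assumes a simplified model that the description implies but does not spell out: all calls arrive at the same moment, the server handles them strictly one at a time, and a reply that lands exactly at the timeout still counts as answered. The function name is hypothetical.

```python
def count_answered_in_time(num_agents, service_time, timeout):
    """Count calls whose reply is ready before the caller's timeout expires."""
    answered = 0
    for position in range(1, num_agents + 1):
        # The call at this queue position finishes after every call ahead
        # of it, plus itself, has been processed.
        finish_time = position * service_time
        if finish_time <= timeout:
            answered += 1
    return answered


answered = count_answered_in_time(num_agents=30, service_time=6, timeout=60)
print(answered, 30 - answered)  # prints: 10 20
```

Only the first round of calls is modeled here. In the scenario described, the 20 agents that timed out retry, so the server keeps working through requests whose callers have already given up.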
See the push notification spec for a proposal to eliminate heavy agent
calls: https://review.openstack.org/#/c/225995/
However, even with that spec, we need more intelligent handling of the
cases where calls are still required (e.g. initial sync) or where
converting a call to a push notification would be too invasive.