yahoo-eng-team team mailing list archive
Message #54414
[Bug 1606827] Re: Agents might be reported as down for 10 minutes after all controllers restart
Reviewed: https://review.openstack.org/347708
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb989be99db84a2789abe2849c786a075e3f5ab7
Submitter: Jenkins
Branch: master
commit bb989be99db84a2789abe2849c786a075e3f5ab7
Author: John Schwarz <jschwarz@xxxxxxxxxx>
Date: Wed Jul 27 12:09:30 2016 +0300
Don't use exponential back-off for report_state
If an agent tries to report_state to the neutron-server and it fails
because of a timeout (raising oslo_messaging.MessagingTimeout), then
there is an exponential back-off effect, which causes the
seemingly-simple report_state RPC call to take 60 seconds, then 120,
then 240 and so on. This can happen if all the controllers are
restarted simultaneously a number of times, as the bug report describes.
Since the feature was intended for heavy RPC calls (like get_routers())
and not for light calls such as report_state, it's safe to reduce the
timeout to a constant 60 seconds interval.
Closes-Bug: #1606827
Change-Id: I15aeea9f8265b859bb1a8ee933b8b2ce1e64b695
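To make the effect described in the commit message concrete, here is a minimal sketch of the arithmetic only. It is not neutron's code: the helper names backoff_timeouts and constant_timeouts are hypothetical, and the 60-second base is taken from the figures quoted in the commit message.

```python
def backoff_timeouts(base, failures):
    """Timeout used on each of `failures` consecutive timed-out attempts,
    assuming the timeout doubles after every failure (hypothetical helper)."""
    timeouts = []
    current = base
    for _ in range(failures):
        timeouts.append(current)
        current *= 2
    return timeouts


def constant_timeouts(base, failures):
    """Timeout used on each attempt when the timeout never grows."""
    return [base] * failures


# Three consecutive timeouts, starting from an assumed 60-second base.
doubling = backoff_timeouts(60, 3)
constant = constant_timeouts(60, 3)

# Total seconds an agent would spend waiting before a fourth attempt.
print(doubling, sum(doubling))
print(constant, sum(constant))
```

Under these assumptions the doubling schedule waits 60 + 120 + 240 seconds across three failed calls, while the constant schedule waits 60 seconds each time.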
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606827
Title:
Agents might be reported as down for 10 minutes after all controllers
restart
Status in neutron:
Fix Released
Bug description:
The scenario which initially revealed this issue involved multiple
controllers and an extra compute node (total of 4) but it should also
reproduce on deployments smaller than described.
The issue is that if an agent tries to report_state to the neutron-
server and it fails because of a timeout (raising
oslo_messaging.MessagingTimeout), then there is an exponential back-
off effect which was put in place by [1]. The feature was intended for
heavy RPC calls (like get_routers()) and not for light calls such as
report_state, so this can be considered a regression. This can be
reproduced by restarting the controllers on a triple-O deployment as
specified before.
A solution would be to ensure PluginReportStateAPI doesn't use the
exponential backoff, instead seeking to always time out after
rpc_response_timeout.
[1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py
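The proposed solution can be pictured with a small self-contained sketch. This is an illustration of the idea only and not the change that was merged: the TimeoutPolicy class, its method names, and the no_backoff parameter are all hypothetical, and base_timeout merely stands in for rpc_response_timeout.

```python
class TimeoutPolicy:
    """Hypothetical per-method timeout tracker (not neutron's actual code).

    Methods listed in `no_backoff` always use the base timeout; every other
    method has its timeout doubled each time a timeout is recorded.
    """

    def __init__(self, base_timeout, no_backoff=()):
        self.base_timeout = base_timeout
        self.no_backoff = frozenset(no_backoff)
        self._current = {}  # method name -> current timeout in seconds

    def timeout_for(self, method):
        """Return the timeout the next call to `method` should use."""
        if method in self.no_backoff:
            return self.base_timeout
        return self._current.get(method, self.base_timeout)

    def record_timeout(self, method):
        """Note that a call to `method` timed out, doubling its timeout
        unless the method is exempt from back-off."""
        if method in self.no_backoff:
            return
        self._current[method] = self.timeout_for(method) * 2


# Assumed base of 60 seconds; report_state is exempt, get_routers is not.
policy = TimeoutPolicy(base_timeout=60, no_backoff={"report_state"})
for _ in range(3):
    policy.record_timeout("report_state")
    policy.record_timeout("get_routers")

print(policy.timeout_for("report_state"))
print(policy.timeout_for("get_routers"))
```

In this sketch the light call keeps timing out after the base interval no matter how many failures occur, while the heavy call still benefits from the growing timeout.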
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1606827/+subscriptions