yahoo-eng-team team mailing list archive

[Bug 1606827] Re: Agents might be reported as down for 10 minutes after all controllers restart

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1606827@xxxxxxxxxxxxxxxxxx>
Date: Fri, 29 Jul 2016 16:21:43 -0000
Reply-to: Bug 1606827 <1606827@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Reviewed:  https://review.openstack.org/347708
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb989be99db84a2789abe2849c786a075e3f5ab7
Submitter: Jenkins
Branch:    master

commit bb989be99db84a2789abe2849c786a075e3f5ab7
Author: John Schwarz <jschwarz@xxxxxxxxxx>
Date:   Wed Jul 27 12:09:30 2016 +0300

    Don't use exponential back-off for report_state
    
    If an agent tries to report_state to the neutron-server and it fails
    because of a timeout (raising oslo_messaging.MessagingTimeout), then
    there is an exponential back-off effect, which causes the
    seemingly-simple report_state RPC call to take 60 seconds, then 120,
    then 240 and so on. This can happen if all the controllers are
    restarted simultaneously a number of times, as the bug report describes.
    
    Since the feature was intended for heavy RPC calls (like get_routers())
    and not for light calls such as report_state, it's safe to reduce the
    timeout to a constant 60-second interval.
    
    Closes-Bug: #1606827
    Change-Id: I15aeea9f8265b859bb1a8ee933b8b2ce1e64b695
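
The timeout progression described in the commit message (60 seconds, then
120, then 240 and so on) can be illustrated with a small sketch. This is
only arithmetic under stated assumptions, not neutron's implementation: the
function names and the optional `ceiling` parameter are hypothetical, and
the 60-second base is the figure quoted in the commit message.

```python
def backoff_timeouts(base, attempts, ceiling=None):
    """Timeouts used by consecutive failed calls when each failure doubles it.

    `ceiling` is a hypothetical optional cap, included for illustration only.
    """
    timeouts = []
    current = base
    for _ in range(attempts):
        timeouts.append(current)
        current *= 2
        if ceiling is not None:
            current = min(current, ceiling)
    return timeouts


def constant_timeouts(base, attempts):
    """Timeouts used by consecutive failed calls when the timeout never grows."""
    return [base] * attempts


if __name__ == "__main__":
    exponential = backoff_timeouts(60, 4)
    constant = constant_timeouts(60, 4)
    # Time spent blocked across four consecutive timeouts, in seconds.
    print(exponential, sum(exponential))  # [60, 120, 240, 480] 900
    print(constant, sum(constant))        # [60, 60, 60, 60] 240
```

With doubling, the fourth consecutive call alone can block for 480 seconds,
whereas with a constant timeout every call gives up after 60 seconds and the
agent can retry report_state promptly once the server is reachable again.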


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606827

Title:
  Agents might be reported as down for 10 minutes after all controllers
  restart

Status in neutron:
  Fix Released

Bug description:
  The scenario which initially revealed this issue involved multiple
  controllers and an extra compute node (total of 4) but it should also
  reproduce on deployments smaller than described.

  The issue is that if an agent tries to report_state to the
  neutron-server and it fails because of a timeout (raising
  oslo_messaging.MessagingTimeout), then there is an exponential
  back-off effect, which was put in place by [1]. The feature was
  intended for heavy RPC calls (like get_routers()) and not for light
  calls such as report_state, so this can be considered a regression.
  This can be reproduced by restarting the controllers on a TripleO
  deployment, as described above.

  A solution would be to ensure PluginReportStateAPI doesn't use the
  exponential backoff, instead seeking to always time out after
  rpc_response_timeout.

  [1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py
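
The idea in the bug description (heavy calls keep the exponential back-off,
while report_state always times out after a fixed interval) might look
roughly like the sketch below. Everything here is hypothetical and
self-contained: `BackoffClient`, its `use_backoff` flag and the `send`
callable are illustrative names, not part of neutron or oslo.messaging, and
the built-in TimeoutError stands in for oslo_messaging.MessagingTimeout.

```python
class BackoffClient:
    """Toy RPC client: doubles a method's timeout after each timeout failure.

    Hypothetical illustration only; not the neutron or oslo.messaging API.
    """

    def __init__(self, send, base_timeout=60):
        self._send = send          # callable(method, timeout) doing the "RPC"
        self._base = base_timeout  # plays the role of rpc_response_timeout
        self._timeouts = {}        # current timeout per method name

    def call(self, method, use_backoff=True):
        if use_backoff:
            timeout = self._timeouts.get(method, self._base)
        else:
            # Light calls opt out and always use the base timeout.
            timeout = self._base
        try:
            return self._send(method, timeout)
        except TimeoutError:
            if use_backoff:
                # Remember a doubled timeout for the next attempt.
                self._timeouts[method] = timeout * 2
            raise


def demo():
    """Record which timeout each call used while the server is unreachable."""
    seen = []

    def always_timeout(method, timeout):
        seen.append((method, timeout))
        raise TimeoutError(method)

    client = BackoffClient(always_timeout, base_timeout=60)
    for method, use_backoff in (("get_routers", True), ("report_state", False)):
        for _ in range(3):
            try:
                client.call(method, use_backoff=use_backoff)
            except TimeoutError:
                pass
    return seen


if __name__ == "__main__":
    for method, timeout in demo():
        print(method, timeout)
```

In this sketch the three get_routers attempts use 60, 120 and 240 seconds,
while every report_state attempt uses 60 seconds. The sketch does not reset
the stored timeout after a successful call; whether and how that happens in
the real code is not covered by this bug report.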

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1606827/+subscriptions

References

[Bug 1606827] [NEW] Agents might be reported as down for 10 minutes after all controllers restart
From: John Schwarz, 2016-07-27