yahoo-eng-team team mailing list archive

[Bug 1606827] [NEW] Agents might be reported as down for 10 minutes after all controllers restart

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: John Schwarz <jschwarz@xxxxxxxxxx>
Date: Wed, 27 Jul 2016 09:09:04 -0000
Reply-to: Bug 1606827 <1606827@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

The scenario which initially revealed this issue involved multiple
controllers and an extra compute node (4 nodes in total), but it should
also reproduce on smaller deployments.

The issue is that if an agent's report_state call to the neutron-server
fails because of a timeout (raising oslo_messaging.MessagingTimeout),
the exponential back-off put in place by [1] takes effect, so later
calls wait progressively longer before timing out. The feature was
intended for heavy RPC calls (like get_routers()) and not for light
calls such as report_state, so this can be considered a regression.
This can be reproduced by restarting the controllers on a TripleO
deployment as described above.
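
To illustrate how the back-off can stretch a light call's timeout, here
is a minimal sketch. The doubling factor and the ceiling of 10 times the
base timeout are assumptions chosen so that a 60-second
rpc_response_timeout ends up at the 10 minutes mentioned in the title;
they are not taken from this report, and backoff_timeouts is a
hypothetical helper, not a neutron function.

```python
def backoff_timeouts(base_timeout, attempts, factor=2, ceiling_multiplier=10):
    """Return the timeout applied to each consecutive timed-out call.

    factor and ceiling_multiplier are illustrative assumptions, not
    values confirmed by the bug report.
    """
    ceiling = base_timeout * ceiling_multiplier
    timeouts = []
    current = base_timeout
    for _ in range(attempts):
        timeouts.append(current)
        # Each timeout grows the next one, capped at the ceiling.
        current = min(current * factor, ceiling)
    return timeouts


# Six consecutive timeouts starting from a 60-second base timeout.
timeouts = backoff_timeouts(base_timeout=60, attempts=6)
print(timeouts)  # [60, 120, 240, 480, 600, 600]
```

Under these assumptions, a handful of failed report_state calls during
the controller restart is enough for the next call to wait the full 600
seconds.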

A solution would be to ensure that PluginReportStateAPI doesn't use the
exponential back-off and instead always times out after
rpc_response_timeout.
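
The idea can be sketched as a per-method timeout policy in which some
methods opt out of the back-off. This is only an illustration under
stated assumptions: BackoffPolicy, its exempt argument, the doubling
factor and the 10x ceiling are all hypothetical and do not reflect the
actual neutron or oslo.messaging code.

```python
class BackoffPolicy:
    """Track per-method RPC timeouts, with optional exempt methods.

    Hypothetical helper for illustration; not part of neutron.
    """

    def __init__(self, base_timeout, exempt=(), factor=2, ceiling_multiplier=10):
        self.base_timeout = base_timeout
        self.exempt = frozenset(exempt)
        self.factor = factor
        self.ceiling = base_timeout * ceiling_multiplier
        self._timeouts = {}

    def timeout_for(self, method):
        # Exempt methods always use the base timeout.
        if method in self.exempt:
            return self.base_timeout
        return self._timeouts.get(method, self.base_timeout)

    def record_timeout(self, method):
        # Called when a call to `method` timed out; exempt methods never grow.
        if method in self.exempt:
            return
        grown = self.timeout_for(method) * self.factor
        self._timeouts[method] = min(grown, self.ceiling)


policy = BackoffPolicy(base_timeout=60, exempt={"report_state"})
for _ in range(5):
    policy.record_timeout("report_state")
    policy.record_timeout("get_routers")

print(policy.timeout_for("report_state"))  # 60
print(policy.timeout_for("get_routers"))   # 600
```

Heavy calls keep the back-off, while report_state keeps timing out after
the base value, which stands in for rpc_response_timeout here.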

[1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py

** Affects: neutron
     Importance: Undecided
     Assignee: John Schwarz (jschwarz)
         Status: In Progress


** Tags: liberty-backport-potential mitaka-backport-potential

** Description changed:

  The scenario which initially revealed this issue involved multiple
  controllers and an extra compute node (total of 4) but it should also
  reproduce on deployments smaller than described.
  
  The issue is that if an agent tries to report_state to the neutron-
  server and it fails because of a timeout (raising
  oslo_messaging.MessagingTimeout), then there is an exponential back-off
  effect which was put in place by [1]. The feature was intended for heavy
  RPC calls (like get_routers()) and not for light calls such as
- report_state, so this can be considered a regression.
+ report_state, so this can be considered a regression. This can be
+ reproduced by restarting the controllers on a TripleO deployment as
+ described above.
  
  A solution would be to ensure PluginReportStateAPI doesn't use the
  exponential backoff, instead seeking to always time out after
  rpc_response_timeout.
  
  [1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py

** Tags added: mitaka-backport-potential

** Tags added: liberty-backport-potential

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606827

Title:
  Agents might be reported as down for 10 minutes after all controllers
  restart

Status in neutron:
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1606827/+subscriptions

Follow ups

[Bug 1606827] Re: Agents might be reported as down for 10 minutes after all controllers restart
From: OpenStack Infra, 2016-07-29