yahoo-eng-team team mailing list archive
Message #54305
[Bug 1606827] [NEW] Agents might be reported as down for 10 minutes after all controllers restart
Public bug reported:
The scenario which initially revealed this issue involved multiple
controllers and an extra compute node (total of 4) but it should also
reproduce on deployments smaller than described.
The issue is that if an agent tries to report_state to the neutron-
server and it fails because of a timeout (raising
oslo_messaging.MessagingTimeout), then there is an exponential back-off
effect which was put in place by [1]. The feature was intended for heavy
RPC calls (like get_routers()) and not for light calls such as
report_state, so this can be considered a regression. This can be
reproduced by restarting the controllers on a TripleO deployment as
specified above.
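To illustrate how a back-off of this kind could add up to the 10 minutes
in the title, here is a minimal sketch. The base timeout of 60 seconds,
the doubling factor and the cap of 10 times the base are assumptions made
for the example, not values taken from this report, and
backoff_timeouts() is a hypothetical helper rather than a neutron
function.

```python
def backoff_timeouts(base_timeout, attempts, factor=2, max_multiplier=10):
    """Return the timeout applied to each successive attempt.

    Assumed behaviour: every timed-out attempt multiplies the next
    timeout by `factor`, up to `base_timeout * max_multiplier`.
    """
    ceiling = base_timeout * max_multiplier
    timeouts = []
    current = base_timeout
    for _ in range(attempts):
        timeouts.append(current)
        # Grow the timeout for the next attempt, but never past the ceiling.
        current = min(current * factor, ceiling)
    return timeouts


# With an assumed 60 second base, the wait reaches 600 seconds
# (10 minutes) after four consecutive timeouts.
print(backoff_timeouts(base_timeout=60, attempts=6))
# [60, 120, 240, 480, 600, 600]
```

Under these assumptions, an agent whose report_state calls keep timing
out while the controllers restart may end up blocked on a single call for
up to 10 minutes, during which it sends no further state reports.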
A solution would be to ensure PluginReportStateAPI doesn't use the
exponential backoff, instead seeking to always time out after
rpc_response_timeout.
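The proposed behaviour could be sketched as follows. This is only an
illustration under stated assumptions, not the actual neutron change:
TimeoutPolicy and call() are hypothetical names, MessagingTimeout is a
local stand-in for oslo_messaging.MessagingTimeout, and the doubling
factor and 10x cap are assumed.

```python
class MessagingTimeout(Exception):
    """Local stand-in for oslo_messaging.MessagingTimeout."""


class TimeoutPolicy:
    """Per-method timeouts with back-off, except for exempt methods."""

    def __init__(self, rpc_response_timeout, no_backoff_methods=(),
                 factor=2, max_multiplier=10):
        self.base = rpc_response_timeout
        self.ceiling = rpc_response_timeout * max_multiplier
        self.factor = factor
        self.no_backoff = frozenset(no_backoff_methods)
        self._current = {}  # method name -> current timeout

    def timeout_for(self, method):
        # Exempt (light) calls always use the plain rpc_response_timeout.
        if method in self.no_backoff:
            return self.base
        return self._current.get(method, self.base)

    def record_timeout(self, method):
        # Only heavy calls have their timeout grown after a failure.
        if method in self.no_backoff:
            return
        grown = self.timeout_for(method) * self.factor
        self._current[method] = min(grown, self.ceiling)


def call(policy, method, send):
    """Invoke send(method, timeout) and update the policy on timeout."""
    timeout = policy.timeout_for(method)
    try:
        return send(method, timeout)
    except MessagingTimeout:
        policy.record_timeout(method)
        raise


def always_times_out(method, timeout):
    raise MessagingTimeout("%s timed out after %ss" % (method, timeout))


policy = TimeoutPolicy(60, no_backoff_methods={"report_state"})
for method in ("report_state", "get_routers"):
    for _ in range(3):
        try:
            call(policy, method, always_times_out)
        except MessagingTimeout:
            pass

print(policy.timeout_for("report_state"))  # 60
print(policy.timeout_for("get_routers"))   # 480
```

In this sketch the exemption is keyed on the method name, so heavy calls
such as get_routers() keep the back-off while report_state keeps timing
out after the base value.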
[1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py
** Affects: neutron
Importance: Undecided
Assignee: John Schwarz (jschwarz)
Status: In Progress
** Tags: liberty-backport-potential mitaka-backport-potential
** Description changed:
The scenario which initially revealed this issue involved multiple
controllers and an extra compute node (total of 4) but it should also
reproduce on deployments smaller than described.
The issue is that if an agent tries to report_state to the neutron-
server and it fails because of a timeout (raising
oslo_messaging.MessagingTimeout), then there is an exponential back-off
effect which was put in place by [1]. The feature was intended for heavy
RPC calls (like get_routers()) and not for light calls such as
- report_state, so this can be considered a regression.
+ report_state, so this can be considered a regression. This can be
+ reproduced by restarting the controllers on a TripleO deployment as
+ specified above.
A solution would be to ensure PluginReportStateAPI doesn't use the
exponential backoff, instead seeking to always time out after
rpc_response_timeout.
[1]: https://review.openstack.org/#/c/280595/14/neutron/common/rpc.py
** Tags added: mitaka-backport-potential
** Tags added: liberty-backport-potential
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1606827