yahoo-eng-team team mailing list archive
Message #85725
[Bug 1920977] Re: Error 504 when disabling a nova-compute service recently down
You shouldn't disable the host by calling the host API. Instead, either
wait for the periodic verification (indeed, around 60 seconds) or call
the force-down API:
https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down
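As a rough illustration of the force-down call mentioned above, the sketch below only builds the request and does not send it. It assumes compute API microversion 2.53 or later, where a service is addressed by its UUID. The endpoint URL, service UUID and token are placeholders, and `build_force_down_request` is a hypothetical helper, not part of any OpenStack client.

```python
import json


def build_force_down_request(compute_endpoint, service_id, token, forced_down=True):
    """Describe the PUT request that sets forced_down on a compute service.

    Hypothetical helper: it returns the request parts instead of sending them.
    """
    return {
        "method": "PUT",
        "url": f"{compute_endpoint.rstrip('/')}/os-services/{service_id}",
        "headers": {
            "X-Auth-Token": token,
            "Content-Type": "application/json",
            # Assumption: 2.53+, where services are addressed by UUID.
            "OpenStack-API-Version": "compute 2.53",
        },
        "body": json.dumps({"forced_down": forced_down}),
    }


# Placeholder values; substitute the real endpoint, service UUID and token.
request = build_force_down_request(
    "http://controller:8774/v2.1/", "<service-uuid>", "<token>"
)
```

With the unified CLI, `openstack compute service set --down <host> nova-compute` should have the same effect.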
** Changed in: nova
Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1920977
Title:
Error 504 when disabling a nova-compute service recently down
Status in OpenStack Compute (nova):
Invalid
Bug description:
Description
===========
When a host fails and the nova-compute service stops working, it takes
some time for the nova control plane to detect it and mark the service
as "down" (I believe up to 60 seconds by default?).
During this window, while nova-compute is dead but not yet marked as
"down" in nova, if an operator tries to set the compute service to
'disabled', the command hangs for quite some time before returning an
error.
Showing the status of compute services immediately after this error
indicates that the service was actually updated and marked as
disabled.
If the host is already seen as "down" in nova-api when trying to
update the status, the command completes successfully.
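The detection window described above can be modelled roughly as a heartbeat staleness check. This is a simplified sketch, not nova's actual code: it assumes the default `service_down_time` of 60 seconds, and the function name and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: nova's default service_down_time, in seconds.
SERVICE_DOWN_TIME = 60


def is_considered_up(last_heartbeat, now, service_down_time=SERVICE_DOWN_TIME):
    """Return True while the last heartbeat is recent enough to count as "up"."""
    elapsed = abs((now - last_heartbeat).total_seconds())
    return elapsed <= service_down_time


# Arbitrary example: the last heartbeat is written just before the crash.
last_heartbeat = datetime(2021, 3, 23, 12, 0, 0)

# 30 seconds later the dead service still counts as "up".
still_up = is_considered_up(last_heartbeat, last_heartbeat + timedelta(seconds=30))

# After the threshold has passed it is finally reported as "down".
finally_down = not is_considered_up(last_heartbeat, last_heartbeat + timedelta(seconds=61))
```

Any disable request issued while `is_considered_up` still returns True falls into the window where the timeout described here occurs.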
Steps to reproduce
==================
- On a working and enabled nova-compute host, stop the nova-compute service
- Before the host is reported as down in nova-api, run:
$ openstack compute service set --disable <host> nova-compute
Expected result
===============
- nova-compute service is marked as disabled in nova-api
- the command returns successfully
- a nova-api log says something like "The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs or when the service is restarted."
Actual result
=============
- nova-compute service is marked as disabled in nova-api
- the command hangs for some time before returning an error:
```
Failed to set service status to disabled
Compute service nova-compute of host <host> failed to set.
```
Logs & Configs
==============
When nova-api still thinks nova-compute is up and the command fails, nova-api shows a stack trace with the following error:
```
An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host <host>. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID <id>
```
When nova-api already knows the service is down, there is only an info log:
```
Compute service on host <host> is down. The COMPUTE_STATUS_DISABLED trait will be synchronized when the service is restarted.
```
Environment
===========
Encountered on Ussuri
Impact
======
I would say disabling nova-compute is likely to be one of the first actions an operator tries when a host is failing.
This behavior also has a bad impact when using Masakari, as the first action taken by default is to disable the nova-compute service (see https://docs.openstack.org/masakari/latest/configuration/recovery_workflow_custom_task.html).
As a result, the recovery process in Masakari ends in an error (even if a retry mechanism saves the day).
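Since the report notes that the service does end up disabled despite the error, a caller could in principle tolerate the failure by re-reading the status. The sketch below shows that idea only. It is not Masakari's actual retry logic, and `disable` and `get_status` are hypothetical callables standing in for whatever client is in use.

```python
def disable_and_verify(disable, get_status):
    """Try to disable a service; on error, trust the recorded status instead.

    Both arguments are hypothetical zero-argument callables supplied by the caller.
    """
    try:
        disable()
        return True
    except Exception:
        # Per the report, the status may have been updated despite the error.
        return get_status() == "disabled"


# Stand-ins that mimic the reported behavior: the update is recorded,
# then the call fails with a timeout.
state = {"status": "enabled"}


def fake_disable():
    state["status"] = "disabled"
    raise TimeoutError("Timed out waiting for a reply")


result = disable_and_verify(fake_disable, lambda: state["status"])
```

Here `result` is True because the status check succeeds even though the disable call itself raised.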
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1920977/+subscriptions