yahoo-eng-team team mailing list archive

[Bug 1920977] Re: Error 504 when disabling a nova-compute service recently down

 

You shouldn't disable the host by calling the host API. Instead, either
wait for the periodic verification (indeed, around 60 seconds) or call
the force-down API.

https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down
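
As a rough illustration of the two options above, here is a minimal Python
sketch. It assumes the default service_down_time of 60 seconds and the
force-down request shape from compute API microversion 2.11 (PUT
/os-services/force-down). The endpoint URL, token, host name and the helper
names is_reported_up and build_force_down_request are placeholders invented
for this example, not part of nova. The request is only built, never sent.

```python
import json
import urllib.request
from datetime import datetime, timedelta

# Assumption: nova.conf [DEFAULT]/service_down_time left at its default (60 s).
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_reported_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Option 1: the service counts as up until its last heartbeat is stale."""
    return abs(now - last_heartbeat) <= down_time


def build_force_down_request(compute_url, token, host):
    """Option 2: build (without sending) a force-down request for a host."""
    body = {"host": host, "binary": "nova-compute", "forced_down": True}
    return urllib.request.Request(
        url=compute_url.rstrip("/") + "/os-services/force-down",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,
            # Forced-down was introduced in compute API microversion 2.11.
            "X-OpenStack-Nova-API-Version": "2.11",
        },
    )


# Hypothetical timeline: nova-compute sent its last heartbeat at 12:00:00.
last_seen = datetime(2021, 3, 23, 12, 0, 0)
print(is_reported_up(last_seen, last_seen + timedelta(seconds=30)))  # within window
print(is_reported_up(last_seen, last_seen + timedelta(seconds=61)))  # window elapsed

# Placeholder endpoint, token and host name.
req = build_force_down_request("http://controller:8774/v2.1", "TOKEN", "compute-1")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it to a real deployment.
```

From the command line, the equivalent of the second option should be
"openstack compute service set --down <host> nova-compute" with
--os-compute-api-version 2.11 or later.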


** Changed in: nova
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1920977

Title:
  Error 504 when disabling a nova-compute service recently down

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Description
  ===========

  When a host fails and the nova-compute service stops working, it takes
  some time for the nova control plane to detect it and mark the service
  as "down" (I believe up to 60 seconds by default?).

  During this time where nova-compute is dead but not marked as "down"
  in nova, if an operator tries to set the compute service as
  'disabled', the command hangs for quite some time before returning an
  error.

  Showing the status of compute services immediately after this error
  indicates that the service was actually updated and marked as
  disabled.

  If the host is already seen as "down" in nova-api when trying to
  update the status, the command completes successfully.

  Steps to reproduce
  ==================

  - On a working and enabled nova-compute host, stop nova-compute service
  - Before host is reported as down in nova-api, run:

      $ openstack compute service set --disable <host> nova-compute

  Expected result
  ===============

  - nova-compute service is marked as disabled in nova-api
  - command returns with a success
  - a nova-api log says something like "The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs or when the service is restarted."

  Actual result
  =============

  - nova-compute service is marked as disabled in nova-api
  - command hangs for some time before returning an error:
  ```
  Failed to set service status to disabled
  Compute service nova-compute of host <host> failed to set.
  ```

  Logs & Configs
  ==============

  When nova-api still thinks nova-compute is up and command fails, nova-api shows a stack trace with the following error:
  ```
  An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host <host>. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID <id>
  ```

  When nova-api already knows service is down, there is only an info log:
  ```
  Compute service on host <host> is down. The COMPUTE_STATUS_DISABLED trait will be synchronized when the service is restarted.
  ```

  Environment
  ===========

  Encountered on ussuri

  Impact
  ======

  I would say disabling nova-compute may be one of the first actions an operator will try when a host is failing.
  This behavior also has a bad impact when using Masakari, as the first action taken by default is to disable the nova-compute service (see https://docs.openstack.org/masakari/latest/configuration/recovery_workflow_custom_task.html).
  As a result, the recovery process in Masakari ends up in error (even if a retry mechanism saves the day).

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1920977/+subscriptions

