yahoo-eng-team team mailing list archive

[Bug 1920977] [NEW] Error 504 when disabling a nova-compute service recently down

 

Public bug reported:

Description
===========

When a host fails and the nova-compute service stops working, it takes
some time for the nova control plane to detect the failure and mark the
service as "down" (up to 60 seconds by default, I believe).
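
As a rough illustration of that detection window, the sketch below models a heartbeat-based liveness check. It assumes the default value of 60 seconds for nova's `service_down_time` option. It is a simplified model, not nova's actual code: the function name, the constant and the timestamp are hypothetical and chosen only for the example.

```python
from datetime import datetime, timedelta

# Assumed default of nova's [DEFAULT] service_down_time option, in seconds.
SERVICE_DOWN_TIME = 60


def is_service_up(last_heartbeat, now, service_down_time=SERVICE_DOWN_TIME):
    """Return True while the last heartbeat is recent enough."""
    elapsed = (now - last_heartbeat).total_seconds()
    return abs(elapsed) <= service_down_time


# Arbitrary example timestamp: nova-compute dies right after reporting in.
t0 = datetime(2021, 3, 23, 12, 0, 0)
for seconds_since_crash in (5, 30, 59, 61):
    now = t0 + timedelta(seconds=seconds_since_crash)
    state = "up" if is_service_up(t0, now) else "down"
    print(f"{seconds_since_crash:>2}s after the crash: reported as {state}")
```

Under these assumptions, a dead service is still reported as up until the last heartbeat is more than 60 seconds old, and that is the window in which the problem described here occurs.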

During this window, while nova-compute is dead but not yet marked as
"down" in nova, if an operator tries to set the compute service to
'disabled', the command hangs for quite some time before returning an error.

Showing the status of compute services immediately after this error
indicates that the service was actually updated and marked as disabled.

If the host is already seen as "down" in nova-api when trying to update
the status, the command completes successfully.

Steps to reproduce
==================

- On a working and enabled nova-compute host, stop the nova-compute service
- Before the host is reported as down in nova-api, run:

    $ openstack compute service set --disable <host> nova-compute

Expected result
===============

- nova-compute service is marked as disabled in nova-api
- command returns successfully
- a nova-api log says something like "The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs or when the service is restarted."

Actual result
=============

- nova-compute service is marked as disabled in nova-api
- command hangs for some time before returning an error:
```
Failed to set service status to disabled
Compute service nova-compute of host <host> failed to set.
```
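
Because the service does end up disabled despite the error, a caller can tolerate the failure by checking the recorded state afterwards. Below is a minimal sketch of that workaround. `disable` and `get_status` are hypothetical callables standing in for whatever client the operator uses. They are not part of any OpenStack API, and this is not how Masakari implements its retry.

```python
def disable_and_verify(disable, get_status):
    """Try to disable a service; on failure, trust the recorded status.

    `disable` is a hypothetical callable that raises on error.
    `get_status` is a hypothetical callable returning "enabled" or "disabled".
    Returns True if the service ends up disabled.
    """
    try:
        disable()
        return True
    except Exception:
        # The request may have failed after the status was already updated,
        # so check what the API reports before giving up.
        return get_status() == "disabled"
```

This only hides the symptom on the caller's side. The command still hangs until the timeout expires.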

Logs & Configs
==============

When nova-api still thinks nova-compute is up and the command fails, nova-api logs a stack trace with the following error:
```
An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host <host>. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID <id>
```

When nova-api already knows the service is down, there is only an info log:
```
Compute service on host <host> is down. The COMPUTE_STATUS_DISABLED trait will be synchronized when the service is restarted.
```

Environment
===========

Encountered on Ussuri

Impact
======

I would say disabling nova-compute may be one of the first actions an operator will try when a host is failing.
This behavior also has a bad impact when using Masakari, as the first action taken by default is to disable the nova-compute service (see https://docs.openstack.org/masakari/latest/configuration/recovery_workflow_custom_task.html).
As a result, the recovery process in Masakari ends up in error (even if a retry mechanism saves the day).

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1920977

Title:
  Error 504 when disabling a nova-compute service recently down

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1920977/+subscriptions

