
yahoo-eng-team team mailing list archive

[Bug 1685590] [NEW] No retry for removing instance in case of ironic service down

 

Public bug reported:

When the ironic service is briefly down (e.g. the ironic conductor is
down), deleting an instance immediately puts the instance into the
error state, without any retry.

Investigation points to this code segment:
https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L977-L984

When the conductor is down, the exception we receive is not
InstanceDeployFailure, so it is re-raised immediately. As a result, the
retry settings CONF.ironic.api_max_retries and
CONF.ironic.api_retry_interval are not applied.
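
The pattern described above can be sketched in isolation. This is a
minimal, self-contained illustration and not the nova code: the names
set_provision_state, ConductorDown, unprovision and try_unprovision are
hypothetical stand-ins, and the name check is only modeled on the
description in this report.

```python
class ConductorDown(Exception):
    """Hypothetical stand-in for the error seen while the conductor is down."""


# Records every attempted API call so the number of attempts can be checked.
calls = []


def set_provision_state(node_uuid, state):
    # Hypothetical stand-in for the ironic API call; it always fails here.
    calls.append((node_uuid, state))
    raise ConductorDown("no conductor available")


def unprovision(node_uuid):
    try:
        set_provision_state(node_uuid, "deleted")
    except Exception as e:
        # Only an exception named InstanceDeployFailure is tolerated.
        # Anything else propagates after a single attempt, with no retry.
        if type(e).__name__ != "InstanceDeployFailure":
            raise


def try_unprovision(node_uuid):
    # Maps the propagated exception to the resulting instance state.
    try:
        unprovision(node_uuid)
    except ConductorDown:
        return "error"
    return "ok"
```

With these stand-ins, a single failed call is enough to end up in the
"error" branch, which matches the behaviour reported here.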

Steps to reproduce:
1. nova boot a baremetal instance.
2. Reboot the ironic conductor node (or stop the conductor service).
3. Delete the instance while it is spawning.
4. The instance goes into the error state immediately, not after 2
   minutes (default value).

For comparison, comment out L983-984 and repeat the steps above.

Proposed fix:
Improve the exception handling to be more robust.
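
One possible shape for more robust handling is a bounded retry loop
around the failing call. The sketch below is only an assumption about
what such a fix could look like, not a proposed patch: TransientError,
call_with_retries and make_flaky are hypothetical names, and the
max_retries and retry_interval parameters merely mirror the roles of
CONF.ironic.api_max_retries and CONF.ironic.api_retry_interval.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(func, max_retries, retry_interval, sleep=time.sleep):
    """Call func(), retrying on TransientError up to max_retries attempts."""
    attempts = max(1, max_retries)  # always try at least once
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError as e:
            last_error = e
            # Wait between attempts, but not after the final one.
            if attempt < attempts:
                sleep(retry_interval)
    raise last_error


def make_flaky(failures):
    """Build a callable that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def flaky():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("conductor unavailable")
        return "deleted"

    return flaky
```

For example, call_with_retries(make_flaky(2), 5, 2) succeeds on the
third attempt after two waits, while a call that keeps failing still
raises once the attempts are exhausted, so a genuine outage is still
surfaced as an error.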

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: ironic

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1685590
