yahoo-eng-team team mailing list archive

[Bug 2019977] [NEW] Ironic VIF unplug in destroy may fail due to node locked by cleaning

 

Public bug reported:

This bug describes a race condition during deletion of a bare metal
(Ironic) instance when automated cleaning is enabled in Ironic (which is
the default).

# Steps to reproduce

Being a race condition, this is not easy to reproduce, although it has been
seen in the wild on several occasions. The easiest way to reproduce it is in
a development environment (e.g. devstack), by tweaking some configuration and
adding sleeps to particular operations in nova and ironic.

nova.conf configuration: reduce the number of retries for Ironic API polling.
The value needs to stay high enough for normal operations to complete, but
low enough that the injected delays below exhaust the retries:

[ironic]
api_max_retries = 10
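
For reference, the window during which nova-compute keeps retrying is roughly
api_max_retries multiplied by the polling interval. Assuming the default
[ironic]/api_retry_interval of 2 seconds (an assumption; adjust if it has
been overridden), this is far shorter than the 120 second lock injected into
the ironic conductor below (sleep #2):

# Rough retry window for nova's polling of the ironic API with the setting
# above. The 2 second api_retry_interval is assumed to be the default.
api_max_retries = 10
api_retry_interval = 2  # seconds

retry_window = api_max_retries * api_retry_interval
print(retry_window)  # ~20 seconds: exhausted while the conductor still holds
                     # the node lock for the injected 120 seconds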

sleep #1: make nova-compute wait for the node to move into cleaning before
destroy() attempts VIF cleanup (nova/virt/ironic/driver.py):

        try:
            if node.provision_state in _UNPROVISION_STATES:
                self._unprovision(instance, node)
            else:
                # NOTE(hshiina): if spawn() fails before ironic starts
                #                provisioning, instance information should be
                #                removed from ironic node.
                self._remove_instance_info_from_node(node, instance)
        finally:
            ##### HERE #######
            time.sleep(20)
            ##### HERE #######
            # NOTE(mgoddard): We don't need to remove instance info at this
            # point since we will have already done it. The destroy will only
            # succeed if this method returns without error, so we will end up
            # removing the instance info eventually.
            self._cleanup_deploy(node, instance, network_info,
                                 remove_instance_info=False)

sleep #2: make the ironic conductor hold the node lock for an extended period
during cleaning:

@task_manager.require_exclusive_lock
def do_node_clean(task, clean_steps=None, disable_ramdisk=False):
    """Internal RPC method to perform cleaning of a node.

    :param task: a TaskManager instance with an exclusive lock on its node
    :param clean_steps: For a manual clean, the list of clean steps to
                        perform. Is None For automated cleaning (default).
                        For more information, see the clean_steps parameter
                        of :func:`ConductorManager.do_node_clean`.
    :param disable_ramdisk: Whether to skip booting ramdisk for cleaning.
    """
    ##### HERE #####
    import time
    time.sleep(120)
    ##### HERE #####
    node = task.node
    manual_clean = clean_steps is not None

Next, create, then destroy a bare metal instance.
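
For example, using openstacksdk (the cloud, image, flavor and network names
below are placeholders for whatever the devstack environment provides):

import openstack

# Placeholder names; adjust to match the environment.
conn = openstack.connect(cloud='devstack')

server = conn.create_server(
    'bm-race-repro',
    image='cirros-raw',
    flavor='baremetal',
    network='private',
    wait=True,
    timeout=3600,
)

# Deleting the instance triggers destroy() in the nova ironic driver, which
# then races with the conductor's node lock as described above.
conn.delete_server(server.id)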

# Expected results

The instance is deleted.

# Actual results

The instance moves to an ERROR state.

Logs from nova-compute:

2023-05-17 14:30:30.493 7 ERROR ironicclient.common.http [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Error contacting Ironic server: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409). Attempt 6 of 6: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Exception during message handling: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 72, in wrapped
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     context, exc, binary)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self.force_reraise()
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     raise self.value
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 63, in wrapped
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10693, in external_instance_event
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     event.tag)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10498, in _process_instance_vif_deleted_event
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self.driver.detach_interface(context, instance, vif)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1630, in detach_interface
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self.unplug_vifs(instance, [vif])
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1591, in unplug_vifs
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self._unplug_vifs(node, instance, network_info)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1558, in _unplug_vifs
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     port_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/client_wrapper.py", line 180, in call
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return self._multi_getattr(client, method)(*args, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 512, in vif_detach
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     global_request_id=global_request_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 408, in delete
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     global_request_id=global_request_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/base.py", line 292, in _delete
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     headers=headers)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 426, in raw_request
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return self._http_request(url, method, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 287, in wrapper
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return func(self, url, method, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 387, in _http_request
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     error_json.get('debuginfo'), method, url)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)

# Analysis

This issue is similar to https://bugs.launchpad.net/nova/+bug/1815799, in
which removing instance_uuid and instance_info during tear down hit the same
node lock. Ironic has deleted VIFs during tear down since Rocky [1], so it
should also be safe to skip the VIF unplug in the case where the node has
been deprovisioned. Debug logs confirm that the VIF cleanup is a no-op:

2023-05-17 14:24:11.791 7 DEBUG nova.virt.ironic.driver [req-e419f330-fe14-4c4b-8d5a-b7a40bb77651 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] VIF a5c1622b-eb13-43d3-a46d-aa5781a85cdb isn't attached to Ironic node 3152ff2d-a0cb-4e7a-bcc7-81266106fef2 _unplug_vifs /var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py:1561


[1] https://review.opendev.org/q/I8d683d2d506c97535b5a8f9a5de4c070c7e887df
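
To make the direction suggested above concrete, here is a small standalone
sketch (illustration only, not nova source code; the function name and the
set of states are assumptions) of guarding the VIF unplug on the node's
provision state so that nova does not race with the conductor's lock when
ironic will remove the VIFs itself:

# Illustration only: skip nova's VIF unplug when ironic has already taken
# over tear down of the node and will remove the VIFs itself.

STATES_WHERE_IRONIC_REMOVES_VIFS = {'cleaning', 'clean wait', 'available'}


def maybe_unplug_vifs(provision_state, vif_ids, vif_detach):
    """Detach VIFs only when ironic will not do it as part of tear down."""
    if provision_state in STATES_WHERE_IRONIC_REMOVES_VIFS:
        # Nothing to do: this avoids the HTTP 409 while the conductor holds
        # the node lock for cleaning.
        return []
    return [vif_detach(vif) for vif in vif_ids]


# While the conductor is cleaning the node, no detach calls are made.
assert maybe_unplug_vifs('cleaning', ['a5c1622b'], lambda vif: vif) == []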

** Affects: nova
     Importance: Undecided
         Status: New
