
[Bug 2019977] Re: Ironic VIF unplug in destroy may fail due to node locked by cleaning

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/883411
Committed: https://opendev.org/openstack/nova/commit/6ebd8a56d1c79fc15f385f232ce0e22937b2fdac
Submitter: "Zuul (22348)"
Branch:    master

commit 6ebd8a56d1c79fc15f385f232ce0e22937b2fdac
Author: Mark Goddard <mark@xxxxxxxxxxxx>
Date:   Wed May 17 14:20:54 2023 +0100

    ironic: Fix ConflictException when deleting server
    
    When unprovisioning via Ironic succeeds, all operations in
    _cleanup_deploy have already been completed. Prior to this patch,
    we continued attempting all the cleanup steps anyway, which
    eventually errored out with BadRequest or similar, and the delete
    completed.

    However, if cleaning has already started, we hit a conflict
    exception instead, so we never reach the expected error above.
    
    Prior to moving to the SDK, which landed in Caracal,
    we would retry on conflict errors, and the config could
    be tweaked to keep retrying for as long as cleaning
    usually takes in your environment.
    
    After this patch:
    Ieda5636a5e80ea4af25db2e90be241869759e30c
    
    We now hard fail with this error:
    
    openstack.exceptions.ConflictException:
    Failed to detach VIF xxx from bare metal node yyy
    ...
    Node yyy is locked by host zzz,
    please retry after the current operation is completed.
    
    This change simply skips calling the operations that
    will always error out, avoiding the need to wait for
    cleaning to complete before getting the expected
    error message.
    
    Closes-Bug: #2019977
    Related-Bug: #1815799
    Change-Id: I60971b58cf1f24bdb5d40f668e380ebfee2ac495
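
For context on the error quoted above, here is a minimal, self-contained
sketch (using openstacksdk, with placeholder cloud, node and VIF IDs) of
how the 409 now reaches callers instead of being retried:

    import openstack
    from openstack import exceptions

    conn = openstack.connect(cloud="mycloud")  # placeholder cloud name

    try:
        # As the commit message notes, nova's SDK-based driver no longer
        # retries on conflicts, so a node locked by cleaning fails
        # immediately with an HTTP 409.
        conn.baremetal.detach_vif_from_node("node-uuid", "vif-uuid")
    except exceptions.ConflictException:
        # "Node ... is locked by host ..., please retry after the
        # current operation is completed."
        raise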


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2019977

Title:
  Ironic VIF unplug in destroy may fail due to node locked by cleaning

Status in Ironic:
  New
Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  This bug describes a race condition during deletion of a bare metal
  (Ironic) instance when automated cleaning is enabled in Ironic (which
  is the default).

  # Steps to reproduce

  As a race condition, this one is not easy to reproduce, although it
  has been seen in the wild on several occasions. The easiest way to
  reproduce it is in a development environment (e.g. devstack), tweaking
  some configuration and adding some sleeps to particular operations in
  nova and ironic.

  nova.conf configuration: reduce the number of retries for Ironic API
  polling. The retry window still needs to be long enough for normal
  operations to complete, but short enough to give up while cleaning
  holds the node lock.

  [ironic]
  api_max_retries = 10
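
  To see why this makes the race easy to hit, note that the legacy
  ironicclient wrapper retries conflicts roughly api_max_retries times,
  api_retry_interval seconds apart (the [ironic] api_retry_interval
  option defaults to 2 seconds). A back-of-the-envelope check of the
  retry budget, assuming those values:

      # Rough retry window for the tweaked config above.
      api_max_retries = 10
      api_retry_interval = 2  # nova [ironic] default, in seconds
      print("retry window ~= %ds" % (api_max_retries * api_retry_interval))
      # ~20s, which expires well before the 120 second cleaning sleep
      # injected into the conductor below, so the Conflict surfaces.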

  sleep #1: nova-compute waits for the node to move to cleaning
  (nova/virt/ironic/driver.py):

          try:
              if node.provision_state in _UNPROVISION_STATES:
                  self._unprovision(instance, node)
              else:
                  # NOTE(hshiina): if spawn() fails before ironic starts
                  #                provisioning, instance information should be
                  #                removed from ironic node.
                  self._remove_instance_info_from_node(node, instance)
          finally:
              ##### HERE #######
              time.sleep(20)
              ##### HERE #######
              # NOTE(mgoddard): We don't need to remove instance info at this
              # point since we will have already done it. The destroy will only
              # succeed if this method returns without error, so we will end up
              # removing the instance info eventually.
              self._cleanup_deploy(node, instance, network_info,
                                   remove_instance_info=False)

  sleep #2: ironic conductor holds onto the node lock:

  @task_manager.require_exclusive_lock
  def do_node_clean(task, clean_steps=None, disable_ramdisk=False):
      """Internal RPC method to perform cleaning of a node.

      :param task: a TaskManager instance with an exclusive lock on its node
      :param clean_steps: For a manual clean, the list of clean steps to
                          perform. Is None For automated cleaning (default).
                          For more information, see the clean_steps parameter
                          of :func:`ConductorManager.do_node_clean`.
      :param disable_ramdisk: Whether to skip booting ramdisk for cleaning.
      """
      ##### HERE #####
      import time
      time.sleep(120)
      ##### HERE #####
      node = task.node
      manual_clean = clean_steps is not None

  Next, create, then destroy a bare metal instance.
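
  A minimal sketch of that step using openstacksdk (the cloud, flavor,
  image and network names below are placeholders for whatever exists in
  the test environment):

      import openstack

      conn = openstack.connect(cloud="devstack")  # placeholder cloud name

      server = conn.compute.create_server(
          name="bm-race-test",
          flavor_id=conn.compute.find_flavor("baremetal").id,
          image_id=conn.image.find_image("baremetal-image").id,
          networks=[{"uuid": conn.network.find_network("private").id}],
      )
      server = conn.compute.wait_for_server(server, wait=3600)

      # Deleting the active server triggers unprovisioning followed by
      # automated cleaning, which is where the VIF unplug races with
      # the node lock.
      conn.compute.delete_server(server)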

  # Expected results

  The instance is deleted.

  # Actual results

  The instance moves to an ERROR state.

  Logs from nova-compute:

  2023-05-17 14:30:30.493 7 ERROR ironicclient.common.http [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Error contacting Ironic server: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409). Attempt 6 of 6: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Exception during message handling: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 72, in wrapped
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     context, exc, binary)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 227, in __exit__
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self.force_reraise()
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     raise self.value
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 63, in wrapped
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10693, in external_instance_event
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     event.tag)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10498, in _process_instance_vif_deleted_event
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self.driver.detach_interface(context, instance, vif)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1630, in detach_interface
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self.unplug_vifs(instance, [vif])
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1591, in unplug_vifs
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     self._unplug_vifs(node, instance, network_info)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1558, in _unplug_vifs
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     port_id)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/client_wrapper.py", line 180, in call
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return self._multi_getattr(client, method)(*args, **kwargs)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 512, in vif_detach
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     global_request_id=global_request_id)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 408, in delete
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     global_request_id=global_request_id)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/base.py", line 292, in _delete
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     headers=headers)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 426, in raw_request
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return self._http_request(url, method, **kwargs)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 287, in wrapper
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     return func(self, url, method, **kwargs)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 387, in _http_request
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server     error_json.get('debuginfo'), method, url)
  2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)

  # Analysis

  This issue is similar to https://bugs.launchpad.net/nova/+bug/1815799,
  where nova was unnecessarily cleaning up instance_uuid and
  instance_info. Ironic has deleted VIFs itself during tear down since
  Rocky [1], so it should also be safe to skip the VIF unplug when the
  node has already been deprovisioned. Debug logs confirm that the VIF
  cleanup is a noop:

  2023-05-17 14:24:11.791 7 DEBUG nova.virt.ironic.driver [req-e419f330-fe14-4c4b-8d5a-b7a40bb77651 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] VIF a5c1622b-eb13-43d3-a46d-aa5781a85cdb isn't attached to Ironic node 3152ff2d-a0cb-4e7a-bcc7-81266106fef2 _unplug_vifs /var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py:1561

  
  [1] https://review.opendev.org/q/I8d683d2d506c97535b5a8f9a5de4c070c7e887df
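
  A quick way to confirm that premise from outside nova, sketched with
  openstacksdk (placeholder cloud name; node UUID taken from the debug
  log above):

      import openstack

      conn = openstack.connect(cloud="devstack")  # placeholder cloud name
      node_id = "3152ff2d-a0cb-4e7a-bcc7-81266106fef2"

      # Once Ironic has torn the instance down it has already detached
      # the VIFs, so this prints an empty list and nova's unplug is a
      # noop.
      print(conn.baremetal.list_node_vifs(node_id))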

To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/2019977/+subscriptions


