yahoo-eng-team team mailing list archive
Message #95283
[Bug 2019977] Re: Ironic VIF unplug in destroy may fail due to node locked by cleaning
Reviewed: https://review.opendev.org/c/openstack/nova/+/883411
Committed: https://opendev.org/openstack/nova/commit/6ebd8a56d1c79fc15f385f232ce0e22937b2fdac
Submitter: "Zuul (22348)"
Branch: master
commit 6ebd8a56d1c79fc15f385f232ce0e22937b2fdac
Author: Mark Goddard <mark@xxxxxxxxxxxx>
Date: Wed May 17 14:20:54 2023 +0100
ironic: Fix ConflictException when deleting server
When unprovisioning works via Ironic, all operations in _cleanup_deploy
have already been completed. Prior to this patch, we continued
attempting all the clean up steps, which eventually errored out with
BadRequest, or similar, and we completed the delete.
However, if cleaning has started, we instead hit a conflict exception,
so we never reach the expected error above.
Prior to moving to the SDK, which landed in Caracal,
we would retry on conflict errors. You can tweak the
config to keep retrying for the length of time cleaning
usually takes in your environment.
After this patch:
Ieda5636a5e80ea4af25db2e90be241869759e30c
We now hard fail with this error:
openstack.exceptions.ConflictException:
Failed to detach VIF xxx from bare metal node yyy
...
Node yyy is locked by host zzz,
please retry after the current operation is completed.
This change simply skips calling the operations that
will always error out, avoiding the need to wait for
cleaning to complete before getting the expected
error message.
Closes-Bug: #2019977
Related-Bug: #1815799
Change-Id: I60971b58cf1f24bdb5d40f668e380ebfee2ac495
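To illustrate the approach the commit describes, here is a minimal, hypothetical sketch (not nova's actual code; all names are made up for illustration) of skipping cleanup calls once Ironic has already torn the node down:

```python
class ConflictException(Exception):
    """Stand-in for openstack.exceptions.ConflictException."""


def cleanup_deploy(already_unprovisioned, node_locked, detach_vif):
    """Hypothetical helper mirroring the fix.

    If Ironic already deprovisioned the node, VIFs were detached during
    tear down, so calling vif_detach again is pointless and, once
    cleaning holds the node lock, raises a conflict.
    """
    if already_unprovisioned:
        # The fix: nothing left to do, avoid the doomed API call.
        return "skipped"
    if node_locked:
        # Without the fix, this is the hard failure seen in the bug.
        raise ConflictException("Node is locked, please retry")
    detach_vif()
    return "detached"
```

The key point is that the skip happens before any Ironic API call is issued, so the node-lock 409 can no longer surface during delete.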
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2019977
Title:
Ironic VIF unplug in destroy may fail due to node locked by cleaning
Status in Ironic:
New
Status in OpenStack Compute (nova):
Fix Released
Bug description:
This bug describes a race condition during deletion of a bare metal
(Ironic) instance when automated cleaning is enabled in Ironic (which
is the default).
# Steps to reproduce
As a race condition, this one is not easy to reproduce, although it
has been seen in the wild on several occasions. The easiest way to
reproduce it is in a development environment (e.g. devstack), tweaking
some configuration and adding some sleeps to particular operations in
nova and ironic.
nova.conf configuration: Reduce the retry count for ironic API polling.
It still needs to be long enough to allow normal operations to complete.
[ironic]
api_max_retries = 10
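For reference, the total retry window is roughly the retry count multiplied by the retry interval ([ironic] api_retry_interval; 2 seconds is assumed here as the default, so check your deployment's nova.conf). A quick sketch:

```python
def retry_window_seconds(api_max_retries, api_retry_interval=2):
    """Approximate how long nova-compute keeps retrying 409 responses.

    api_retry_interval=2 mirrors the assumed default; the real values
    come from the [ironic] section of nova.conf.
    """
    return api_max_retries * api_retry_interval


# With the reproduction settings above: 10 retries at 2s intervals is
# about 20s, comfortably shorter than the injected 120s cleaning sleep,
# so the retries are exhausted while the node is still locked.
```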
sleep #1: nova-compute waits for the node to move to cleaning
(nova/virt/ironic/driver.py):
    try:
        if node.provision_state in _UNPROVISION_STATES:
            self._unprovision(instance, node)
        else:
            # NOTE(hshiina): if spawn() fails before ironic starts
            # provisioning, instance information should be
            # removed from ironic node.
            self._remove_instance_info_from_node(node, instance)
    finally:
        ##### HERE #######
        time.sleep(20)
        ##### HERE #######
        # NOTE(mgoddard): We don't need to remove instance info at this
        # point since we will have already done it. The destroy will only
        # succeed if this method returns without error, so we will end up
        # removing the instance info eventually.
        self._cleanup_deploy(node, instance, network_info,
                             remove_instance_info=False)
sleep #2: ironic conductor holds onto the node lock:
    @task_manager.require_exclusive_lock
    def do_node_clean(task, clean_steps=None, disable_ramdisk=False):
        """Internal RPC method to perform cleaning of a node.

        :param task: a TaskManager instance with an exclusive lock on its node
        :param clean_steps: For a manual clean, the list of clean steps to
            perform. Is None for automated cleaning (default).
            For more information, see the clean_steps parameter
            of :func:`ConductorManager.do_node_clean`.
        :param disable_ramdisk: Whether to skip booting ramdisk for cleaning.
        """
        ##### HERE #####
        import time
        time.sleep(120)
        ##### HERE #####
        node = task.node
        manual_clean = clean_steps is not None
Next, create, then destroy a bare metal instance.
# Expected results
The node is deleted
# Actual results
The node moves to an ERROR state.
Logs from nova-compute:
2023-05-17 14:30:30.493 7 ERROR ironicclient.common.http [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Error contacting Ironic server: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409). Attempt 6 of 6: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server [req-c8d567dc-7088-4bad-9a09-6ba33b6c7b4d 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] Exception during message handling: ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 72, in wrapped
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server context, exc, binary)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server raise self.value
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/exception_wrapper.py", line 63, in wrapped
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10693, in external_instance_event
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server event.tag)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/compute/manager.py", line 10498, in _process_instance_vif_deleted_event
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self.driver.detach_interface(context, instance, vif)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1630, in detach_interface
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self.unplug_vifs(instance, [vif])
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1591, in unplug_vifs
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server self._unplug_vifs(node, instance, network_info)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py", line 1558, in _unplug_vifs
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server port_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/client_wrapper.py", line 180, in call
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return self._multi_getattr(client, method)(*args, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 512, in vif_detach
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server global_request_id=global_request_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/v1/node.py", line 408, in delete
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server global_request_id=global_request_id)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/base.py", line 292, in _delete
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server headers=headers)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 426, in raw_request
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return self._http_request(url, method, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 287, in wrapper
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server return func(self, url, method, **kwargs)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.6/site-packages/ironicclient/common/http.py", line 387, in _http_request
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server error_json.get('debuginfo'), method, url)
2023-05-17 14:30:30.496 7 ERROR oslo_messaging.rpc.server ironicclient.common.apiclient.exceptions.Conflict: Node 2666c347-ff75-4c04-9059-9b890ad14b08 is locked by host mark-ironic.novalocal, please retry after the current operation is completed. (HTTP 409)
# Analysis
This issue is similar to https://bugs.launchpad.net/nova/+bug/1815799,
where instance_uuid and instance_info were being cleaned up. Ironic
deletes VIFs during tear down since Rocky [1], so we should also be
safe to skip this in the case where the node has been deprovisioned.
Debug logs confirm that the VIF cleanup is a noop:
2023-05-17 14:24:11.791 7 DEBUG nova.virt.ironic.driver [req-e419f330-fe14-4c4b-8d5a-b7a40bb77651 5ef38591b78a42928e8dfbdbaa90acd9 60d7786e349549eeb5a309b96251c789 - default default] VIF a5c1622b-eb13-43d3-a46d-aa5781a85cdb isn't attached to Ironic node 3152ff2d-a0cb-4e7a-bcc7-81266106fef2 _unplug_vifs /var/lib/kolla/venv/lib/python3.6/site-packages/nova/virt/ironic/driver.py:1561
[1] https://review.opendev.org/q/I8d683d2d506c97535b5a8f9a5de4c070c7e887df
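The "isn't attached" debug message above suggests the shape of the guard: treat a VIF that Ironic no longer reports as already cleaned up instead of issuing another API call. A hypothetical sketch (again, not nova's actual code):

```python
def unplug_vifs(attached_vif_ids, requested_vif_ids, vif_detach):
    """Only call the Ironic API for VIFs that are still attached.

    attached_vif_ids: set of VIF IDs Ironic reports on the node.
    requested_vif_ids: VIFs nova wants unplugged.
    vif_detach: callable issuing the (potentially conflicting) API call.
    """
    detached = []
    for vif_id in requested_vif_ids:
        if vif_id not in attached_vif_ids:
            # Ironic already removed it during tear down: skip quietly,
            # matching the "isn't attached" debug log in this report.
            continue
        vif_detach(vif_id)
        detached.append(vif_id)
    return detached
```

Because already-detached VIFs never reach vif_detach, a node locked by cleaning can no longer turn a no-op into a 409.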
To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/2019977/+subscriptions