yahoo-eng-team team mailing list archive
Message #94841
[Bug 2085975] Re: Compute fails to clean up after evacuated instance if the evacuation still in progress
Reviewed: https://review.opendev.org/c/openstack/nova/+/933734
Committed: https://opendev.org/openstack/nova/commit/2c76fd3bafc90b23ed9d9e6a7f84919082dc0076
Submitter: "Zuul (22348)"
Branch: master
commit 2c76fd3bafc90b23ed9d9e6a7f84919082dc0076
Author: Balazs Gibizer <gibi@xxxxxxxxxx>
Date: Wed Oct 30 13:24:41 2024 +0100
Route shared storage RPC to evac dest at startup
If a compute is started up while an evacuation of an instance from this
host is still in progress, then the destroy_evacuated_instances call
tries to check whether the instance is on shared storage to decide
whether the local disk needs to be deleted from the source node. However,
this check uses instance.host to target the RPC call. If the evacuation
is still ongoing, instance.host might still be set to the source node.
This means the source node tries to make an RPC call to itself during
init_host. This will always time out, as the RPC server is only started
after init_host. It is also wrong, as the shared storage check RPC
should be called on another host. Moreover, when this wrongly routed RPC
times out, the source compute logs the exception, ignores it, and then
assumes the disk is on shared storage, so it won't clean it up. This
means that a later evacuation of this VM targeting this node will fail,
as the instance directory is already present on the node.
The fix is simple: the destroy_evacuated_instances call should always
send the shared storage check RPC call to the destination node of the
evacuation based on the migration record. It will be correct even if the
evacuation is still in progress or even if it is already finished.
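
The routing change described above can be illustrated with a minimal
sketch. This is not the code from the commit: the Instance and Migration
classes are simplified stand-ins, and shared_storage_check_target is a
hypothetical helper introduced only for illustration. It assumes the
migration record names the evacuation destination in a dest_compute field.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Simplified stand-in for an instance record (illustrative only)."""
    uuid: str
    host: str  # may still name the source node while the evacuation runs


@dataclass
class Migration:
    """Simplified stand-in for the evacuation's migration record."""
    instance_uuid: str
    source_compute: str
    dest_compute: str


def shared_storage_check_target(instance: Instance, migration: Migration) -> str:
    """Return the host that should answer the shared storage check RPC.

    Hypothetical helper: instance.host is deliberately not used, because
    it can still point at the restarting source node. The migration
    record names the destination whether or not the evacuation finished.
    """
    return migration.dest_compute


# Evacuation from hostA to hostB is still in progress, so instance.host
# has not been updated yet.
instance = Instance(uuid="vm-1", host="hostA")
migration = Migration("vm-1", source_compute="hostA", dest_compute="hostB")
print(shared_storage_check_target(instance, migration))
```

The same target is returned after the evacuation completes and
instance.host has been updated, which is why the routing works in both
cases.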
Closes-Bug: #2085975
Change-Id: If5ad213649d68da995dad146f0a0c3cacc369309
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085975
Title:
Compute fails to clean up after evacuated instance if the evacuation
still in progress
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Reproduce:
* have a two node devstack hostA, hostB both with simple local storage
* start an instance on hostA
* inject a sleep in nova.virt.driver.rebuild to simulate that rebuild takes time
* stop hostA
* evacuate the VM
* while the evacuation is still in progress on hostB start up hostA
Actual:
hostA tries to check whether the VM is using shared storage and sends an RPC call to instance.host. As that is not yet set to the destination, the RPC call hits hostA, which is still in init_host, so the RPC is never answered and hostA's destroy_evacuated_instances call gets a MessagingTimeout exception. That is logged and then ignored. But nova defaults the shared_storage flag to true, so in this case the local instance dir is not cleaned up.
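
The failure mode can be sketched roughly as follows. This is a simplified
model, not nova code: MessagingTimeout here is a local stand-in for the
real exception, and check_shared_storage_via_rpc and
should_delete_local_disk are hypothetical names. The set of hosts with a
running RPC server is an assumption used to simulate hostA still being in
init_host.

```python
import logging

LOG = logging.getLogger(__name__)


class MessagingTimeout(Exception):
    """Local stand-in for the RPC timeout exception named in the report."""


def check_shared_storage_via_rpc(target_host, rpc_servers_running):
    # Hypothetical RPC: only hosts whose RPC server is up can answer.
    if target_host not in rpc_servers_running:
        raise MessagingTimeout(f"no reply from {target_host}")
    # Both hosts use simple local storage in the reproduction steps.
    return False


def should_delete_local_disk(target_host, rpc_servers_running):
    shared_storage = True  # the default described in the report
    try:
        shared_storage = check_shared_storage_via_rpc(
            target_host, rpc_servers_running)
    except MessagingTimeout:
        # Logged and then ignored, so the default above is kept.
        LOG.exception("Shared storage check failed")
    return not shared_storage


# hostA is still in init_host, so only hostB can answer RPC calls.
running = {"hostB"}
leftover = should_delete_local_disk("hostA", running)  # routed to itself
cleaned = should_delete_local_disk("hostB", running)   # routed to the dest
```

In this model, routing the check to hostA leaves the disk in place
because the timeout keeps the default, while routing it to hostB lets the
cleanup proceed.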
Expected:
hostA sends the RPC call to hostB, which responds, and the local instance dir on hostA is cleaned up.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2085975/+subscriptions