yahoo-eng-team team mailing list archive

[Bug 2085975] Re: Compute fails to clean up after evacuated instance if the evacuation still in progress

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/933734
Committed: https://opendev.org/openstack/nova/commit/2c76fd3bafc90b23ed9d9e6a7f84919082dc0076
Submitter: "Zuul (22348)"
Branch:    master

commit 2c76fd3bafc90b23ed9d9e6a7f84919082dc0076
Author: Balazs Gibizer <gibi@xxxxxxxxxx>
Date:   Wed Oct 30 13:24:41 2024 +0100

    Route shared storage RPC to evac dest at startup
    
    If a compute is started up while an evacuation of an instance from this
    host is still in progress, then the destroy_evacuated_instances call will
    try to check if the instance is on shared storage to decide whether the
    local disk needs to be deleted from the source node. However, this call
    uses instance.host to target the RPC call. If the evacuation is still
    ongoing, then instance.host might still be set to the source node. This
    means the source node, during init_host, tries to make an RPC call to
    itself. This will always time out, as the RPC server is only started
    after init_host. It is also wrong, as the shared storage check RPC
    should be called on another host. Moreover, when this wrongly routed RPC
    times out, the source compute logs the exception, ignores it, and then
    assumes the disk is on shared storage, so it won't clean it up. This
    means that a later evacuation of this VM targeting this node will fail,
    as the instance directory is already present on the node.
    
    The fix is simple: the destroy_evacuated_instances call should always
    send the shared storage check RPC call to the destination node of the
    evacuation, based on the migration record. That target is correct
    whether the evacuation is still in progress or already finished.
    
    Closes-Bug: #2085975
    Change-Id: If5ad213649d68da995dad146f0a0c3cacc369309
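
The routing change described in the commit message can be sketched as follows. This is only an illustration, not the nova patch itself: the Instance and Migration classes are simplified stand-ins, and both target-selection functions are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Simplified stand-in for an instance record (not the nova object)."""
    uuid: str
    host: str  # may still name the source node while the evacuation runs


@dataclass
class Migration:
    """Simplified stand-in for the evacuation's migration record."""
    instance_uuid: str
    source_compute: str
    dest_compute: str


def rpc_target_before_fix(instance: Instance, migration: Migration) -> str:
    # Hypothetical helper: the old behaviour targets whatever instance.host says.
    return instance.host


def rpc_target_after_fix(instance: Instance, migration: Migration) -> str:
    # Hypothetical helper: the fixed behaviour always targets the evacuation
    # destination taken from the migration record.
    return migration.dest_compute


# Evacuation from hostA to hostB still in progress: instance.host is stale.
instance = Instance(uuid="vm-1", host="hostA")
migration = Migration(instance_uuid="vm-1",
                      source_compute="hostA",
                      dest_compute="hostB")

print(rpc_target_before_fix(instance, migration))  # the source node itself
print(rpc_target_after_fix(instance, migration))   # the destination node
```

Because the destination comes from the migration record, the result does not change once instance.host is eventually updated to hostB.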


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2085975

Title:
  Compute fails to clean up after evacuated instance if the evacuation
  still in progress

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Reproduce:
  * have a two-node devstack (hostA, hostB), both with simple local storage
  * start an instance on hostA
  * inject a sleep in nova.virt.driver.rebuild to simulate that rebuild takes time
  * stop hostA
  * evacuate the VM
  * while the evacuation is still in progress on hostB, start up hostA

  Actual:
  hostA tries to check whether the VM is using shared storage and sends an RPC call to instance.host. As that is not yet set to the destination, the RPC call hits hostA, which is still in init_host, so the RPC is never answered and hostA's destroy_evacuated_instances call gets a MessagingTimeout exception. That is logged and then ignored. But nova defaults the shared_storage flag to true, so in this case the local instance dir is not cleaned up.

  Expected:
  hostA sends the RPC call to hostB, which responds, and the local instance dir on hostA is cleaned up.
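
The difference between the actual and expected behaviour can be modelled with a small, self-contained sketch. It assumes only what the report states (the timeout is ignored and the shared storage flag defaults to true). The MessagingTimeout class here is a local stand-in, and check_shared_storage and should_delete_local_disk are hypothetical helpers, not nova functions.

```python
class MessagingTimeout(Exception):
    """Local stand-in for the RPC timeout exception named in the report."""


def check_shared_storage(target_host: str, local_host: str,
                         local_rpc_server_running: bool) -> bool:
    # Hypothetical model of the RPC: a call routed to the local host while
    # its RPC server is not yet running can never be answered.
    if target_host == local_host and not local_rpc_server_running:
        raise MessagingTimeout(f"no reply from {target_host}")
    # In the reproduction both hosts use simple local storage.
    return False


def should_delete_local_disk(target_host: str, local_host: str,
                             local_rpc_server_running: bool = False) -> bool:
    shared_storage = True  # default described in the bug report
    try:
        shared_storage = check_shared_storage(
            target_host, local_host, local_rpc_server_running)
    except MessagingTimeout:
        # Logged and then ignored, so the default above stays in effect.
        pass
    return not shared_storage


# Actual: hostA, still in init_host, routes the check to itself.
print(should_delete_local_disk(target_host="hostA", local_host="hostA"))
# Expected: hostA routes the check to the evacuation destination hostB.
print(should_delete_local_disk(target_host="hostB", local_host="hostA"))
```

In this model the self-targeted call leaves the default in place and the cleanup is skipped, while the call routed to hostB reports local storage and the instance directory is removed.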

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2085975/+subscriptions


