yahoo-eng-team team mailing list archive
Message #77866
[Bug 1821594] Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state
Reviewed: https://review.openstack.org/647566
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=03a6d26691c1f182224d59190b79f48df278099e
Submitter: Zuul
Branch: master
commit 03a6d26691c1f182224d59190b79f48df278099e
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Mon Mar 25 14:02:17 2019 -0400
Delete allocations even if _confirm_resize raises
When we are confirming a resize, the guest is on the dest
host and the instance host/node values in the database
are pointing at the dest host, so the _confirm_resize method
on the source is really best effort. If something fails, we
should not leak allocations in placement for the source compute
node resource provider since the instance is not actually
consuming the source node provider resources.
This change refactors the error handling around the _confirm_resize
call so the big nesting for _error_out_instance_on_exception is
moved to confirm_resize and then a try/finally is added around
_confirm_resize so we can be sure to try and cleanup the allocations
even if _confirm_resize fails in some obscure way. If _confirm_resize
does fail, the error gets re-raised along with logging a traceback
and hint about how to correct the instance state in the DB by hard
rebooting the server on the dest host.
Change-Id: I29c5f491ec20a71283190a1599e7732541de736f
Closes-Bug: #1821594
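The error-handling shape described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual nova code: the function name confirm_resize is borrowed from the message, while the confirm_on_source and delete_source_allocations callables and the dict-based migration are hypothetical stand-ins.

```python
import logging

LOG = logging.getLogger(__name__)


def confirm_resize(migration, confirm_on_source, delete_source_allocations):
    """Run the best-effort source-side confirm, always cleaning allocations.

    Both callables are hypothetical placeholders for the driver-facing
    confirm step and the placement cleanup step.
    """
    try:
        confirm_on_source(migration)
    except Exception:
        # Log a traceback plus a recovery hint, then re-raise the error.
        LOG.exception(
            "Confirm resize failed on the source host; hard rebooting the "
            "server on the dest host may correct the instance state.")
        raise
    finally:
        # Runs on success and on failure, so source allocations do not leak.
        delete_source_allocations(migration)
```

With this shape, a failure in the source-side confirm still propagates to the caller, but the cleanup step has already been attempted by the time it does.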
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821594
Title:
Error in confirm_migration leaves stale allocations and 'confirming'
migration state
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) pike series:
Triaged
Status in OpenStack Compute (nova) queens series:
Triaged
Status in OpenStack Compute (nova) rocky series:
Triaged
Status in OpenStack Compute (nova) stein series:
In Progress
Bug description:
Description:
When performing a cold migration, if the driver raises an exception
during confirm_migration (which runs on the source node), the
migration record is stuck in the "confirming" state and the allocations
against the source node are not removed.
The instance is fine at the destination at this stage, but the source
host has allocations that cannot be cleaned up without going to
the database or invoking the Placement API via curl. After several
migration attempts that fail in the same spot, the source node is
filled with these allocations, which prevent new instances from being
created on or migrated to this node.
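As a rough illustration of the manual Placement API cleanup mentioned above, the sketch below only builds (and does not send) a DELETE request against the allocations endpoint for one consumer. The endpoint URL, token, and consumer UUID are placeholders, and the helper name is hypothetical. Which consumer holds the stale source allocation (the instance or the migration record) may depend on the release, and the DELETE removes every allocation for that consumer, so it is worth inspecting it with a GET on the same URL first.

```python
import urllib.request


def build_delete_allocations_request(placement_url, consumer_uuid, token):
    """Build a DELETE /allocations/{consumer_uuid} request (not sent here).

    placement_url, consumer_uuid and token are caller-supplied placeholders.
    """
    url = f"{placement_url.rstrip('/')}/allocations/{consumer_uuid}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"X-Auth-Token": token},
    )
```

Passing the returned object to urllib.request.urlopen would actually issue the call; that step is left out here on purpose.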
When confirm_migration fails at this stage, the migrating instance can
be recovered with a hard reboot or by resetting its state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem
inside driver.confirm_migration when running the libvirt driver. However,
the stale allocations and migration status problem can be easily
reproduced by raising an exception in the libvirt driver's
confirm_migration method, and it would affect any driver.
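The reproduction idea above can be modelled in isolation with a toy example. Everything here (FakeDriver, run_confirm, the dict-based migration and allocations) is hypothetical and only mimics the pre-fix flow, where cleanup happens only if the driver call succeeds; it is not nova code.

```python
from unittest import mock


class FakeDriver:
    """Hypothetical stand-in for a virt driver."""

    def confirm_migration(self, migration):
        return None


def run_confirm(driver, migration, allocations):
    """Toy pre-fix flow: state and allocations are updated only on success."""
    driver.confirm_migration(migration)
    migration["status"] = "confirmed"
    allocations.pop(migration["source_node"], None)


def reproduce():
    migration = {"status": "confirming", "source_node": "src"}
    allocations = {"src": {"VCPU": 2}, "dest": {"VCPU": 2}}
    # Inject a failure into the driver method, as the report suggests.
    with mock.patch.object(FakeDriver, "confirm_migration",
                           side_effect=RuntimeError("injected failure")):
        try:
            run_confirm(FakeDriver(), migration, allocations)
        except RuntimeError:
            pass
    return migration, allocations
```

In this toy model, the injected failure leaves the migration in "confirming" and the "src" entry still present in the allocations, which mirrors the two symptoms in the report.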
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on
March 25th, 2019. They confirmed that allocations not being cleared up
is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node
has stale allocations that prevent new instances from being
created/migrated to the source node. Migration record is stuck in
"confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running
the libvirt KVM driver.