yahoo-eng-team team mailing list archive
Message #78851
[Bug 1821594] Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state
** Also affects: cloud-archive
Importance: Undecided
Status: New
** Also affects: cloud-archive/queens
Importance: Undecided
Status: New
** Also affects: cloud-archive/stein
Importance: Undecided
Status: New
** Also affects: cloud-archive/train
Importance: Undecided
Status: New
** Also affects: cloud-archive/rocky
Importance: Undecided
Status: New
** Also affects: nova (Ubuntu)
Importance: Undecided
Status: New
** Also affects: nova (Ubuntu Bionic)
Importance: Undecided
Status: New
** Also affects: nova (Ubuntu Eoan)
Importance: Undecided
Status: New
** Also affects: nova (Ubuntu Cosmic)
Importance: Undecided
Status: New
** Also affects: nova (Ubuntu Disco)
Importance: Undecided
Status: New
** Tags added: sts-sru-needed
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821594
Title:
[SRU] Error in confirm_migration leaves stale allocations and
'confirming' migration state
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive queens series:
New
Status in Ubuntu Cloud Archive rocky series:
New
Status in Ubuntu Cloud Archive stein series:
New
Status in Ubuntu Cloud Archive train series:
Fix Committed
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) pike series:
Triaged
Status in OpenStack Compute (nova) queens series:
Fix Committed
Status in OpenStack Compute (nova) rocky series:
Fix Committed
Status in OpenStack Compute (nova) stein series:
Fix Committed
Status in nova package in Ubuntu:
Fix Committed
Status in nova source package in Bionic:
New
Status in nova source package in Cosmic:
New
Status in nova source package in Disco:
New
Status in nova source package in Eoan:
Fix Committed
Bug description:
Description:
When performing a cold migration, if an exception is raised by the
driver during confirm_migration (this runs in the source node), the
migration record is stuck in "confirming" state and the allocations
against the source node are not removed.
The instance is fine at the destination at this stage, but the source
host has allocations that cannot be cleaned up without going to
the database or invoking the Placement API via curl. After several
migration attempts that fail in the same spot, the source node is
filled with these allocations, which prevent new instances from being
created on or migrated to this node.
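For illustration, the manual Placement cleanup mentioned above could be
scripted roughly as follows. This is only a sketch under assumptions: the
endpoint URL, token and consumer UUID are placeholders, and you must first
work out which consumer actually holds the stale allocations in your
release. It only builds the DELETE /allocations/{consumer_uuid} request;
sending it is left commented out.

```python
import urllib.request


def build_delete_allocations_request(placement_url, consumer_uuid, token):
    """Build (but do not send) a Placement DELETE /allocations/{uuid} request."""
    url = "{}/allocations/{}".format(placement_url.rstrip("/"), consumer_uuid)
    headers = {
        "X-Auth-Token": token,          # Keystone token (placeholder here)
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method="DELETE")


# All three values below are placeholders, not real endpoints or IDs.
req = build_delete_allocations_request(
    "http://placement.example.com/placement",
    "11111111-2222-3333-4444-555555555555",
    "PLACEHOLDER_TOKEN",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```

Inspect the consumer's allocations with a GET on the same URL before
deleting anything, since removing the wrong consumer's allocations would
affect a healthy instance.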
When confirm_migration fails at this stage, the migrating instance can
be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem
inside driver.confirm_migration when running the libvirt driver. However,
the stale allocations and migration status problem can be easily
reproduced by raising an exception in the libvirt driver's
confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on
March 25th, 2019. They confirmed that allocations not being cleared up
is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node
has stale allocations that prevent new instances from being
created/migrated to the source node. Migration record is stuck in
"confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running
the libvirt KVM driver.
=======================================================================
[Impact]
If a cold migration hits any issue while the virt driver is running
the "Confirm Migration" step, the failure leaves stale allocation
records in the database and the migration record in "confirming"
state. The stale allocations are not cleaned up by nova, consuming the
user's quota indefinitely.
This bug was confirmed on releases from pike to stein, and a fix was
implemented for queens, rocky and stein. It should be backported to
those releases to prevent the issue from reoccurring.
This fix prevents new stale allocations being left over, by cleaning
them up immediately when the failures occur. At the moment, the users
affected by this bug have to clean their previous stale allocations
manually.
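The cleanup-on-failure idea can be sketched with a toy model. This is not
nova's code: FakeReportClient, FailingDriver and this confirm_resize
function are hypothetical stand-ins, and the choice of "error" as the final
migration status is an assumption made for illustration. The point is only
that the allocation held for the source host is released and the migration
leaves "confirming" even when the driver raises.

```python
class FakeReportClient:
    """Hypothetical stand-in for a Placement client."""

    def __init__(self):
        self.allocations = {}  # consumer uuid -> source host name

    def delete_allocation(self, consumer_uuid):
        self.allocations.pop(consumer_uuid, None)


class FailingDriver:
    """Hypothetical driver whose confirm step always fails."""

    def confirm_migration(self, migration):
        raise RuntimeError("TEST")


def confirm_resize(driver, reportclient, migration):
    migration["status"] = "confirming"
    try:
        driver.confirm_migration(migration)
        migration["status"] = "confirmed"
    except Exception:
        # Do not leave the record stuck in "confirming".
        migration["status"] = "error"
        raise
    finally:
        # Release the source-host allocation on success and on failure.
        reportclient.delete_allocation(migration["uuid"])


client = FakeReportClient()
client.allocations["mig-1"] = "source-host"
migration = {"uuid": "mig-1", "status": "finished"}

try:
    confirm_resize(FailingDriver(), client, migration)
except RuntimeError:
    pass

print(migration["status"], client.allocations)
```

The original exception is still re-raised, so the caller sees the failure;
only the bookkeeping is made unconditional.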
[Test Case]
The root cause for this problem may vary for each driver and
environment, so to reproduce the bug, it is necessary first to inject
a failure in the driver's confirm_migration method to cause an
exception to be raised.
An example when using libvirt is to add a line:
raise Exception("TEST")
in
https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012
and then restart the nova-compute service.
Then, invoke a cold migration through "openstack server migrate {id}",
wait for VERIFY_RESIZE status, and then invoke "openstack server
resize {id} --confirm". The confirmation will fail asynchronously and
the instance will be in ERROR status, while the migration database
record is in "confirming" state and the stale allocations for the
source host are still present in the "allocations" database table.
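In an automated test, a similar failure could be injected without editing
driver.py, for example with unittest.mock. The sketch below uses a
hypothetical StubDriver rather than the real libvirt driver, and its
method signature is illustrative only.

```python
from unittest import mock


class StubDriver:
    """Hypothetical stand-in for a virt driver."""

    def confirm_migration(self, context, migration, instance, network_info):
        return None


driver = StubDriver()

# While the patch is active, every call to confirm_migration raises.
with mock.patch.object(StubDriver, "confirm_migration",
                       side_effect=Exception("TEST")):
    try:
        driver.confirm_migration(None, {}, None, None)
        raised = False
    except Exception as exc:
        raised = str(exc) == "TEST"

# Outside the context manager the original method is restored.
after = driver.confirm_migration(None, {}, None, None)
print(raised, after)
```

This keeps the injected failure scoped to the test, so no service restart
or source change is needed to undo it.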
[Regression Potential]
New functional test https://review.opendev.org/#/c/657870/ validated
the fix and was backported all the way to Queens. The fix being
backported caused no functional test to fail.
[Other Info]
None
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1821594/+subscriptions