yahoo-eng-team team mailing list archive
Message #67226
[Bug 1712411] Re: Allocations may not be removed from dest node during failed migrations
Reviewed: https://review.openstack.org/498861
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=94b904abbad1c9655b6dec1a2e58d73bc913ed47
Submitter: Jenkins
Branch: master
commit 94b904abbad1c9655b6dec1a2e58d73bc913ed47
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Tue Aug 29 12:28:43 2017 -0400
Cleanup allocations on invalid dest node during live migration
Starting in Pike, the scheduler creates an allocation on a
chosen destination node during live migration. This happens
before the destination node pre-checks occur, which could
fail and trigger a retry to the scheduler. The allocations
created in Placement against the failed destination node
were not being cleaned up though.
This change adds some cleanup code to the live migration task
in conductor to clean the allocations for the failed destination
node before retrying.
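As a rough illustration of the cleanup described above, here is a minimal, self-contained sketch. It is not the nova code: FakePlacement, PreCheckFailed and find_destination are hypothetical names invented for this example, and the in-memory dict only stands in for Placement.

```python
from collections import defaultdict


class FakePlacement:
    """Hypothetical in-memory stand-in for Placement: node -> {instance: resources}."""

    def __init__(self):
        self.allocations = defaultdict(dict)

    def claim(self, node, instance, resources):
        self.allocations[node][instance] = resources

    def remove(self, node, instance):
        self.allocations[node].pop(instance, None)


class PreCheckFailed(Exception):
    """Raised when a destination node fails a pre-migration check."""


def find_destination(placement, instance, resources, candidates,
                     pre_check, max_retries, cleanup=True):
    """Pick a destination, retrying when the pre-checks reject a node."""
    for attempt, node in enumerate(candidates):
        if attempt > max_retries:
            break
        # The scheduler claims resources on the node it selected.
        placement.claim(node, instance, resources)
        try:
            pre_check(node)
            return node
        except PreCheckFailed:
            if cleanup:
                # Drop the allocation for the rejected node before retrying.
                placement.remove(node, instance)
    raise RuntimeError("no valid destination found")
```

Calling find_destination with cleanup=False models the behavior before the fix: the rejected node keeps an allocation for an instance that never lands on it.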
The functional recreate test for the bug is updated to show
that the bug is fixed now.
Also updates the docstring in the SchedulerReportClient
remove_provider_from_instance_allocation method so that we
no longer have to enumerate all of the places that call it.
Change-Id: I41e5e1fa9938b5e04f7e20f78ccd77eca658885f
Closes-Bug: #1712411
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1712411
Title:
Allocations may not be removed from dest node during failed migrations
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) pike series:
In Progress
Bug description:
This could also be true for cold migrate/resize/unshelve, but I'm
specifically looking at live migration here.
As of this change in Pike:
https://review.openstack.org/#/c/491012/
Once all computes are upgraded, the resource tracker will no longer
"heal" allocations in Placement for its local node, meaning creating
allocations for the node if the instance is on it, or removing
allocations for the instance if the instance is not on the node.
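To make "healing" concrete, the following is a simplified sketch of that kind of reconciliation. The heal_allocations function and its parameters are hypothetical and only model the idea with plain dicts; they are not the resource tracker's actual interface.

```python
def heal_allocations(allocations, node, instances_on_node, resources_for):
    """Reconcile one node's allocations with the instances it actually hosts.

    allocations: dict mapping node -> {instance_id: resources}
    instances_on_node: set of instance ids currently on `node`
    resources_for: callable returning the resources an instance needs
    """
    node_allocs = allocations.setdefault(node, {})
    # Create allocations for instances that are on the node but untracked.
    for instance_id in instances_on_node:
        if instance_id not in node_allocs:
            node_allocs[instance_id] = resources_for(instance_id)
    # Remove allocations for instances that are not on the node.
    for instance_id in list(node_allocs):
        if instance_id not in instances_on_node:
            del node_allocs[instance_id]
    return node_allocs
```

Once this reconciliation no longer runs, nothing on the compute side removes an allocation for an instance that is not on the node.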
During live migration, conductor calls the scheduler to select a
host, and the scheduler also claims resources against the dest node:
https://github.com/openstack/nova/blob/16.0.0.0rc1/nova/conductor/tasks/live_migrate.py#L181
https://github.com/openstack/nova/blob/16.0.0.0rc1/nova/scheduler/filter_scheduler.py#L287
https://github.com/openstack/nova/blob/16.0.0.0rc1/nova/scheduler/client/report.py#L147
The problem during live migration is that once the scheduler picks a
host, conductor performs some additional checks:
https://github.com/openstack/nova/blob/16.0.0.0rc1/nova/conductor/tasks/live_migrate.py#L194
These checks could fail, in which case conductor retries the scheduler
to get another host, until one is found that passes the pre-migration
checks or the number of retries is exhausted.
The problem is that the allocation created in Placement for a
destination node that later fails a pre-migration check is never
cleaned up unless the update_available_resource periodic task in the
compute manager cleans it up, which it no longer does once all
computes are upgraded to Pike.
This leaves the destination node having resources claimed against it
which aren't really on the node.
We could roll back the allocation in conductor on a failure, or we
could put some other kind of periodic cleanup task in the compute
service which looks for failed migrations where the destination node
in the migration record is for that node, and removes any failed
allocations for that node and the given instance.
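The periodic-cleanup alternative could look roughly like the sketch below. MigrationRecord, its status values and cleanup_failed_migrations are hypothetical names for illustration only, and allocations are modeled as a plain dict rather than Placement calls.

```python
from dataclasses import dataclass


@dataclass
class MigrationRecord:
    instance_id: str
    dest_node: str
    status: str  # illustrative values such as "failed" or "completed"


def cleanup_failed_migrations(allocations, node, migrations, instances_on_node):
    """Remove allocations on `node` left behind by failed migrations to it.

    allocations: dict mapping node -> {instance_id: resources}
    instances_on_node: set of instance ids currently on `node`
    Returns the instance ids whose allocations were removed.
    """
    removed = []
    node_allocs = allocations.get(node, {})
    for migration in migrations:
        if migration.dest_node != node or migration.status != "failed":
            continue
        # Keep the allocation if the instance has since landed on this node.
        if migration.instance_id in instances_on_node:
            continue
        if node_allocs.pop(migration.instance_id, None) is not None:
            removed.append(migration.instance_id)
    return removed
```

The check against instances_on_node is an assumption added here so that a later successful migration to the same node does not lose its allocation.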
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1712411/+subscriptions