yahoo-eng-team team mailing list archive
Message #73953
[Bug 1764883] Re: Evacuation fails if the source host returns while the migration is still in progress
Reviewed: https://review.openstack.org/562284
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c4988cdabf311d29cf64af732091068cfabeedaa
Submitter: Zuul
Branch: master
commit c4988cdabf311d29cf64af732091068cfabeedaa
Author: Lee Yarwood <lyarwood@xxxxxxxxxx>
Date: Wed Apr 18 14:35:07 2018 +0100
compute: Ensure pre-migrating instances are destroyed during init_host
Previously _destroy_evacuated_instances would not remove instances
associated with evacuation migration records in a pre-migrating state.
This could lead to a race between the original source host and the new
destination if the source returned early, calling init_instance during
the evacuation process.
This change now includes pre-migrating migration records when looking
for active evacuations. Additionally the dict of evacuating instances is
then returned to init_host and used to skip running init_instance
against such instances ensuring no race occurs between the two computes.
Closes-bug: #1764883
Change-Id: I379678dfdb2609f12a572d4f99c8e9da4deab803
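The behaviour described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the nova implementation: the Migration class, find_evacuations and this init_host are hypothetical simplifications, and apart from 'pre-migrating' (named in the commit message) the status values in the set are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    """Hypothetical, simplified stand-in for a nova migration record."""
    instance_uuid: str
    source_compute: str
    migration_type: str
    status: str


# 'pre-migrating' is the status the commit message adds to the lookup.
# The other two values are assumptions made for this sketch.
EVACUATION_STATUSES = {"accepted", "pre-migrating", "done"}


def find_evacuations(migrations, host):
    """Return {instance_uuid: migration} for evacuations away from host."""
    return {
        m.instance_uuid: m
        for m in migrations
        if m.source_compute == host
        and m.migration_type == "evacuation"
        and m.status in EVACUATION_STATUSES
    }


def init_host(host, local_instance_uuids, migrations):
    """Split local instances into those to destroy and those to initialise.

    Instances being evacuated are destroyed locally and never passed on to
    per-instance initialisation, so the returning source host cannot touch
    their state while the destination is still rebuilding them.
    """
    evacuated = find_evacuations(migrations, host)
    destroyed = [u for u in local_instance_uuids if u in evacuated]
    initialised = [u for u in local_instance_uuids if u not in evacuated]
    return destroyed, initialised
```

With a 'pre-migrating' evacuation record for an instance whose source is this host, the sketch places that instance in the destroyed list and leaves it out of the initialised list, which is the skip the commit message describes.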
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1764883
Title:
Evacuation fails if the source host returns while the migration is
still in progress
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Description
===========
If the source host returns while the evacuation migration record is
still in a 'pre-migrating' state, the source compute manager does not
remove the evacuating instances during _destroy_evacuated_instances.
More importantly, the source host returning online early allows
_init_instance to set instance.vm_state to ERROR and instance.task_state
to None thanks to the following failed rebuild logic:
https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L810-L821
As a result, the in-progress rebuild fails when it attempts to save
the instance while expecting a specific task_state:
https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L3050-L3052
https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L3123
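The conflict can be illustrated with a small, self-contained model of a save guarded by an expected task_state. This is only a sketch under stated assumptions: InstanceRecord, its save signature and simulate_race are hypothetical, the exception class is a local stand-in for nova's UnexpectedTaskStateError, and the 'rebuilding' starting state is assumed for illustration.

```python
class UnexpectedTaskStateError(Exception):
    """Local stand-in for the nova exception of the same name."""


class InstanceRecord:
    """Hypothetical model of one instance row shared by both computes."""

    def __init__(self, vm_state, task_state):
        self.vm_state = vm_state
        self.task_state = task_state

    def save(self, updates, expected_task_state=None):
        # Refuse the update if another writer changed task_state first.
        if (expected_task_state is not None
                and self.task_state not in expected_task_state):
            raise UnexpectedTaskStateError(
                f"Expected: {expected_task_state}. "
                f"Actual: {self.task_state!r}")
        for name, value in updates.items():
            setattr(self, name, value)


def simulate_race(source_returns_early):
    """Return the conflict message, or None if the rebuild completes."""
    record = InstanceRecord(vm_state="active", task_state="rebuilding")
    # Destination: the rebuild reaches its spawning phase.
    record.save({"task_state": "rebuild_spawning"},
                expected_task_state=["rebuilding"])
    if source_returns_early:
        # Source: restart logic sees an unfinished rebuild and marks it
        # as failed, without checking what the destination is doing.
        record.save({"vm_state": "error", "task_state": None})
    try:
        # Destination: final save expects the state it set earlier.
        record.save({"vm_state": "active", "task_state": None},
                    expected_task_state=["rebuild_spawning"])
    except UnexpectedTaskStateError as exc:
        return str(exc)
    return None
```

Calling simulate_race(False) completes without a conflict, while simulate_race(True) returns a message of the same Expected/Actual shape as the error reported in this bug.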
This issue was originally reported downstream while testing an
instance high-availability feature that uses a mixture of Pacemaker
and instance evacuation to keep instances online:
Nova reports overcloud instance in error state after failed double compute failover instance-ha evacuation
https://bugzilla.redhat.com/show_bug.cgi?id=1567606
This report includes an example UnexpectedTaskStateError failure in
comment #8:
2018-04-17 11:11:12.999 1 ERROR nova.compute.manager [req-ac20c023-9abf-412f-987f-2981c7837c57 da4d95c480c343c5bf6abe3b789f4c17 d2c2437b7f6642b4a1d5907fa5f373a9 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Setting instance vm_state to ERROR:
UnexpectedTaskStateError_Remote: Conflict updating instance d9419b05-025e-4193-b3f7-7f0efc23593b. Expected: {'task_state': [u'rebuild_spawning']}. Actual: {'task_state': None}
The Rally-based tests for this feature happen to use the `b`
sysrq-trigger, which reboots the host immediately, so the host
recovers just in time to hit this race.
Steps to reproduce
==================
- Evacuate an instance
- Restart the source compute service before the instance is fully rebuilt
Expected result
===============
The source compute removes the instance and does not attempt to update the instance or task state.
Actual result
=============
The source compute doesn't attempt to remove the instance and attempts to update the instance and task state before the rebuild is complete.
Environment
===========
1. Exact version of OpenStack you are running. See the following
88adde8bba393b8d08ce21e9e3334a76e853b2e0
2. Which hypervisor did you use?
(For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
What's the version of that?
Libvirt + KVM
3. Which storage type did you use?
(For example: Ceph, LVM, GPFS, ...)
What's the version of that?
Local, yet to test with shared storage.
4. Which networking type did you use?
(For example: nova-network, Neutron with OpenVSwitch, ...)
N/A
Logs & Configs
==============
See https://bugzilla.redhat.com/show_bug.cgi?id=1567606#c8
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1764883/+subscriptions