
yahoo-eng-team team mailing list archive

[Bug 1764883] [NEW] Evacuation fails if the source host returns while the migration is still in progress

 

Public bug reported:

Description
===========

If the migration is still in a 'pre-migrating' state when the source
host comes back online, the source compute manager will not remove the
evacuated instances in question during _destroy_evacuated_instances.

More importantly, the source host returning online early allows
_init_instance to set instance.vm_state to ERROR and instance.task_state
to None via the following failed-rebuild logic:

https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L810-L821
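
The skipped cleanup can be reduced to a filtering problem. The sketch below is a hypothetical simplification, not Nova's actual code: the data structures, state names, and the exact set of matched statuses are assumptions based on the behaviour described above.

```python
# Hypothetical simplification of the migration filtering performed by
# _destroy_evacuated_instances on the returning source host.
# A migration still in 'pre-migrating' is not matched, so the
# evacuated instance is never destroyed on the source.

def migrations_to_clean_up(migrations, host):
    """Return the evacuation migrations whose instances the returning
    source host should destroy (assumed filter: 'accepted' or 'done')."""
    return [
        m for m in migrations
        if m["source_compute"] == host
        and m["migration_type"] == "evacuation"
        and m["status"] in ("accepted", "done")  # 'pre-migrating' falls through
    ]
```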

As a result, the in-progress rebuild fails when it attempts to save the
instance while expecting a specific task_state:

https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L3050-L3052

https://github.com/openstack/nova/blob/f106094e961c5ab430687d673063baee379f6bbd/nova/compute/manager.py#L3123
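
The failure itself is essentially a compare-and-set on task_state. This standalone sketch mimics (it is not Nova code; the class and exception here are illustrative stand-ins) how a save guarded by expected_task_state raises once the restarted source's _init_instance has already reset task_state to None:

```python
class UnexpectedTaskStateError(Exception):
    """Stand-in for Nova's exception of the same name."""


class FakeInstance:
    """Illustrative stand-in for an instance object with a guarded save."""

    def __init__(self, task_state):
        self.task_state = task_state

    def save(self, expected_task_state=None):
        # The update only proceeds if the stored task_state matches
        # one of the states the caller expects.
        if (expected_task_state is not None
                and self.task_state not in expected_task_state):
            raise UnexpectedTaskStateError(
                "Expected: %s. Actual: %s"
                % (expected_task_state, self.task_state))


# The rebuild expects 'rebuild_spawning', but _init_instance on the
# restarted source host has already set task_state back to None.
inst = FakeInstance(task_state=None)
try:
    inst.save(expected_task_state=["rebuild_spawning"])
except UnexpectedTaskStateError as e:
    print(e)
```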

This issue was originally reported downstream while testing an instance
high-availability feature that uses a combination of Pacemaker and
instance evacuation to keep instances online:

Nova reports overcloud instance in error state after failed double compute failover instance-ha evacuation
https://bugzilla.redhat.com/show_bug.cgi?id=1567606

That report includes an example UnexpectedTaskStateError failure in
comment #8:

2018-04-17 11:11:12.999 1 ERROR nova.compute.manager [req-ac20c023-9abf-412f-987f-2981c7837c57 da4d95c480c343c5bf6abe3b789f4c17 d2c2437b7f6642b4a1d5907fa5f373a9 - default default] [instance: d9419b05-025e-4193-b3f7-7f0efc23593b] Setting instance vm_state to ERROR: 
UnexpectedTaskStateError_Remote: Conflict updating instance d9419b05-025e-4193-b3f7-7f0efc23593b. Expected: {'task_state': [u'rebuild_spawning']}. Actual: {'task_state': None}

The Rally-based tests for this feature happen to use the `b` sysrq
trigger, which reboots the host immediately, allowing the source host to
come back online just in time to hit this race.

Steps to reproduce
==================
- Evacuate an instance
- Restart the source compute service before the instance is fully rebuilt

Expected result
===============
The source compute removes the instance and does not attempt to update the instance or task state.

Actual result
=============
The source compute does not remove the instance and instead updates the instance vm_state and task_state before the rebuild is complete.

Environment
===========
1. Exact version of OpenStack you are running. See the following

   88adde8bba393b8d08ce21e9e3334a76e853b2e0

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   Local, yet to test with shared storage.

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

  N/A

Logs & Configs
==============

See https://bugzilla.redhat.com/show_bug.cgi?id=1567606#c8

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: evacuate

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1764883

Title:
  Evacuation fails if the source host returns while the migration is
  still in progress

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1764883/+subscriptions

