
yahoo-eng-team team mailing list archive

[Bug 1319797] [NEW] Restarting destination compute manager during live-migration can cause instance data loss


Public bug reported:

During compute manager startup, init_host is called. One of its responsibilities is to delete instance data that does not belong to this host, i.e. _destroy_evacuated_instances. However, this function only checks whether each local instance belongs to the host; it does not check the task_state.
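
In rough terms the check is shaped like the sketch below (a minimal sketch with illustrative names, not the actual Nova implementation):

# Minimal sketch, assuming instances are dicts with 'host' and 'task_state'
# keys; destroy() stands in for the driver call that removes local data.
def destroy_evacuated_instances(local_instances, this_host, destroy):
    for instance in local_instances:
        # Only ownership is compared; task_state is never consulted, so an
        # instance that is mid live-migration to this host is treated as
        # evacuated and its local data is removed.
        if instance['host'] != this_host:
            destroy(instance)

A migrating instance still reports its source host in instance['host'], so on the destination it falls straight into the destroy branch.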

If the destination compute manager is restarted while a live-migration is
in progress, it will treat the migrating instance as not belonging to the
host and destroy its local data. This can result in two outcomes:

1. If the live-migration is still in progress, the source hypervisor will hang, so a rollback can be triggered by killing the migration job.
2. However, if the live-migration has completed and the post-live-migration-at-destination call has already been sent, then by the time the compute manager gets around to processing the message the instance data has already been deleted. Subsequent periodic tasks only get as far as defining the VM, but no disks are left.

2014-05-08 20:42:33.058 16724 WARNING nova.virt.libvirt.driver [-] Periodic task is updating the host stat, it is trying to get disk instance-00000002, but disk file was removed by concurrent operations such as resize.
2014-05-08 20:43:33.370 16724 WARNING nova.virt.libvirt.driver [-] Periodic task is updating the host stat, it is trying to get disk instance-00000002, but disk file was removed by concurrent operations such as resize.
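
A possible direction for a fix (an illustrative sketch, not a reviewed patch; the constant below mirrors nova.compute.task_states.MIGRATING) is to skip instances whose task_state shows an in-flight migration:

MIGRATING = 'migrating'  # value of nova.compute.task_states.MIGRATING

def should_destroy_local_data(instance, this_host):
    """Destroy local data only if the instance is neither owned by this
    host nor in the middle of being migrated."""
    if instance.get('task_state') == MIGRATING:
        # Destination data for an in-progress live-migration must survive a
        # compute-manager restart, otherwise post-live-migration finds no disks.
        return False
    return instance['host'] != this_host

# Example: an instance still mid-migration is left alone on the destination.
assert not should_destroy_local_data(
    {'host': 'source-node', 'task_state': MIGRATING}, 'dest-node')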

Steps to reproduce:

1. Start live-migration
2. Wait for pre-live-migration to define the destination VM
3. Restart destination compute manager

To see what happens in case 2 (the live-migration having completed), put a
breakpoint in init_host, wait until the instance is running on the
destination, and then let nova-compute continue. In this case you end up
with an instance directory like this:


ls -l 06ddbe13-577b-4f9f-ac52-0c038aec04d8
total 8
-rw-r--r-- 1 root root   89 May  8 19:59 disk.info
-rw-r--r-- 1 root root 1548 May  8 19:59 libvirt.xml

I verified this in a tripleo devtest environment.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1319797

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1319797/+subscriptions

