yahoo-eng-team team mailing list archive
Message #14413
[Bug 1319797] [NEW] Restarting destination compute manager during live-migration can cause instance data loss
Public bug reported:
During compute manager startup, init_host is called. One of its tasks is to delete instance data that does not belong to this host, i.e. _destroy_evacuated_instances. However, this function only checks whether the local instance belongs to the host; it does not check the task_state.
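For illustration, here is a minimal Python sketch of the behaviour described above. This is not the actual nova.compute.manager code; the function signature and field names are simplified assumptions:

# Illustrative sketch only -- simplified from the behaviour described in this
# bug report, not the real Nova implementation. Names are assumptions.
def _destroy_evacuated_instances_sketch(local_instances, my_host, driver):
    """Remove local data for instances that appear not to belong to this host.

    Only the instance's host field is consulted; task_state is ignored,
    which is the root of the problem described above.
    """
    for instance in local_instances:
        if instance['host'] != my_host:
            # A live-migrating instance still has host == source host, so on
            # a restarted destination it looks "evacuated" and is destroyed,
            # taking its local disks with it.
            driver.destroy(instance)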
If a live-migration is in progress and the destination compute manager is restarted, it will treat the migrating instance as not belonging to the host and destroy it. This can result in two outcomes:
1. If the live-migration is still in progress, the source hypervisor hangs, so a rollback can be triggered by killing the job.
2. However, if the live-migration has completed and post-live-migration-destination has already been messaged, then by the time the compute manager gets to processing the message the instance data has been deleted. Subsequent periodic tasks only get as far as defining the VM, but there are no disks left.
2014-05-08 20:42:33.058 16724 WARNING nova.virt.libvirt.driver [-] Periodic task is updating the host stat, it is trying to get disk instance-00000002, but disk file was removed by concurrent operations such as resize.
2014-05-08 20:43:33.370 16724 WARNING nova.virt.libvirt.driver [-] Periodic task is updating the host stat, it is trying to get disk instance-00000002, but disk file was removed by concurrent operations such as resize.
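One possible direction for a fix, sketched here purely as an assumption and not as an actual or proposed Nova patch, would be to skip instances whose task_state indicates an in-flight migration before destroying their data:

# Sketch of a possible guard; the task_state value and the structure are
# assumptions for illustration, not actual Nova code.
MIGRATION_TASK_STATES = {'migrating'}

def _destroy_evacuated_instances_guarded(local_instances, my_host, driver):
    for instance in local_instances:
        if instance['host'] == my_host:
            continue
        if instance['task_state'] in MIGRATION_TASK_STATES:
            # Instance is mid live-migration towards this host; leave its
            # data alone and let the migration finish or roll back.
            continue
        driver.destroy(instance)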
Steps to reproduce (a scripted sketch follows the list):
1. Start live-migration
2. Wait for pre-live-migration to define the destination VM
3. Restart destination compute manager
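For convenience, these steps can be scripted roughly as follows with python-novaclient. This is a sketch under assumptions: the credentials, endpoint, instance name and destination host are placeholders, and the exact live_migrate arguments may need adjusting for your deployment:

# Reproduction sketch using python-novaclient; all values below are
# placeholders and the live_migrate arguments are assumptions.
from novaclient.v1_1 import client

nova = client.Client('admin', 'password', 'admin',
                     'http://keystone.example.com:5000/v2.0')

server = nova.servers.find(name='test-instance')  # hypothetical instance

# Step 1: start the live-migration towards the destination host.
nova.servers.live_migrate(server, 'destination-host',
                          block_migration=True, disk_over_commit=False)

# Step 2: wait for pre-live-migration to define the VM on the destination,
# e.g. by watching `virsh list --all` or the nova-compute log there.

# Step 3: restart the destination compute manager on that host, e.g.
#   service nova-compute restart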
To see what happens in case 2 (live-migration having completed), put a breakpoint in init_host, delay until the instance is running on the destination, and then let nova-compute continue. In this case you end up with an instance directory like this:
ls -l 06ddbe13-577b-4f9f-ac52-0c038aec04d8
total 8
-rw-r--r-- 1 root root 89 May 8 19:59 disk.info
-rw-r--r-- 1 root root 1548 May 8 19:59 libvirt.xml
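A small helper along the following lines can be used to spot instances in this broken state (libvirt.xml still defined but the root disk gone). The /var/lib/nova/instances path is the assumed default instances_path, and 'disk' is the usual root disk file name for the libvirt driver; adjust both if your deployment differs:

# Quick check for instance directories that kept libvirt.xml but lost their
# root disk. Path and file names are assumptions based on default settings.
import os

INSTANCES_PATH = '/var/lib/nova/instances'

for entry in sorted(os.listdir(INSTANCES_PATH)):
    inst_dir = os.path.join(INSTANCES_PATH, entry)
    if not os.path.isdir(inst_dir):
        continue
    files = set(os.listdir(inst_dir))
    if 'libvirt.xml' in files and 'disk' not in files:
        print('%s: libvirt.xml present but root disk missing' % entry)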
I verified this in a tripleo devtest environment.
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1319797
Title:
Restarting destination compute manager during live-migration can cause
instance data loss
Status in OpenStack Compute (Nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1319797/+subscriptions