yahoo-eng-team team mailing list archive
Message #85824
[Bug 1924585] [NEW] Live Migration - if libvirt timeout the instance goes to error state but the live migration continues
Public bug reported:
Recently we live migrated an entire cell to new hardware and hit the
following problem several times.
During a live migration, Nova monitors the state of the migration by
querying libvirt every 0.5s:
https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452
If a libvirt call times out, the instance is left in a very bad state.
The instance goes to error state. As far as Nova is concerned, the instance is still on the source compute node. However, libvirt continues with the live migration, and the instance eventually ends up on the destination compute node.
I'm using the Stein release, but looking at the current release the code
path seems to be the same.
Here's the Stein trace:
```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, in _do_live_migration
    block_migration, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7581, in live_migration
    migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8068, in _live_migration
    finish_event, disk_paths)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7873, in _live_migration_monitor
    info = guest.get_job_info()
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 705, in get_job_info
    stats = self._domain.jobStats()
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMemoryStats)
```
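To illustrate the failure mode in the trace above, here is a minimal, self-contained sketch of a polling loop that treats a timed-out job-info query as "state unknown, try again" instead of aborting the monitor. This is not Nova's code and not a proposed patch: `JobInfoTimeout`, `FakeGuest`, `monitor_migration` and `max_consecutive_timeouts` are hypothetical names invented for the example, and the real libvirt bindings are deliberately not used.

```python
import time


class JobInfoTimeout(Exception):
    """Hypothetical stand-in for the libvirtError raised on a lock timeout."""


class FakeGuest(object):
    """Hypothetical guest whose job-info query times out a few times first."""

    def __init__(self, timeouts_before_success, polls_until_done):
        self._timeouts_left = timeouts_before_success
        self._polls_left = polls_until_done

    def get_job_info(self):
        # Simulate "cannot acquire state change lock" for the first N calls.
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise JobInfoTimeout("cannot acquire state change lock")
        self._polls_left -= 1
        return "completed" if self._polls_left <= 0 else "running"


def monitor_migration(guest, interval=0.5, max_consecutive_timeouts=3,
                      sleep=time.sleep):
    """Poll until the job completes, tolerating a bounded run of timeouts.

    Returns "completed" when the job finishes, or "unknown" once the
    timeout budget is exhausted. "unknown" means the caller cannot assume
    the migration stopped; it only means the monitor lost track of it.
    """
    consecutive_timeouts = 0
    while True:
        try:
            status = guest.get_job_info()
        except JobInfoTimeout:
            # A failed query says nothing about the migration itself,
            # which may still be running in the hypervisor.
            consecutive_timeouts += 1
            if consecutive_timeouts > max_consecutive_timeouts:
                return "unknown"
            sleep(interval)
            continue
        consecutive_timeouts = 0
        if status == "completed":
            return "completed"
        sleep(interval)
```

The point of the sketch is the distinction between "the query failed" and "the migration failed": in the behaviour described above, a single failed query is enough to put the instance in error state while the migration itself keeps going.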
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1924585
Title:
Live Migration - if libvirt timeout the instance goes to error state
but the live migration continues
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1924585/+subscriptions