yahoo-eng-team team mailing list archive

[Bug 1924585] [NEW] Live Migration - if libvirt timeout the instance goes to error state but the live migration continues

 

Public bug reported:

Recently we live migrated an entire cell to new hardware and we hit the
following problem several times...

During a live migration, Nova monitors the state of the migration by
querying libvirt every 0.5s:

https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452
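
To make the polling pattern concrete, here is a minimal sketch of such a monitor loop. It is not Nova's code: `monitor_migration`, `get_job_info` (a callable standing in for the libvirt job-stats query) and the status strings are hypothetical names chosen for illustration. The point is only that any exception raised by the query escapes the loop.

```python
import time


def monitor_migration(get_job_info, interval=0.5, sleep=time.sleep):
    """Poll get_job_info() until it reports a terminal status.

    get_job_info is a hypothetical callable standing in for the libvirt
    job-stats query; it returns a status string. Any exception it raises
    propagates straight out of the loop and ends the monitoring.
    """
    polls = 0
    while True:
        status = get_job_info()  # a timeout here aborts the whole loop
        polls += 1
        if status in ("completed", "failed", "cancelled"):
            return status, polls
        sleep(interval)
```

Nothing in a loop shaped like this tells the hypervisor to stop the migration when the query itself fails.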

If libvirt times out, the instance is left in a very bad state.
The instance goes to the error state, and as far as Nova is concerned it is still on the source compute node. However, libvirt continues with the live migration, and the instance eventually ends up on the destination compute node.
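
The divergence can be reproduced with a toy model. This is only a sketch under stated assumptions: `FakeHypervisor`, `JobStatsTimeout` and `run` are hypothetical names, no libvirt or Nova code is involved, and the "record" dictionary merely plays the role of the instance state that Nova keeps.

```python
class JobStatsTimeout(Exception):
    """Hypothetical stand-in for the libvirtError seen on timeout."""


class FakeHypervisor:
    """Toy model: the migration advances on every tick, whether or not
    anyone manages to query its progress."""

    def __init__(self, ticks_to_finish, timeout_on_tick=None):
        self.tick = 0
        self.ticks_to_finish = ticks_to_finish
        self.timeout_on_tick = timeout_on_tick
        self.host = "source"

    def advance(self):
        self.tick += 1
        if self.tick >= self.ticks_to_finish:
            self.host = "dest"

    def job_stats(self):
        if self.tick == self.timeout_on_tick:
            raise JobStatsTimeout("cannot acquire state change lock")
        return "completed" if self.host == "dest" else "running"


def run(hypervisor):
    # Control-plane record, analogous to the instance state kept by Nova.
    record = {"state": "migrating", "host": "source"}
    try:
        while True:
            hypervisor.advance()
            if hypervisor.job_stats() == "completed":
                record.update(state="active", host="dest")
                break
    except JobStatsTimeout:
        # Monitoring aborts; the record stays pinned to the source host.
        record["state"] = "error"
    # The hypervisor keeps migrating even though nobody is watching.
    while hypervisor.host != "dest":
        hypervisor.advance()
    return record, hypervisor.host


record, actual_host = run(FakeHypervisor(ticks_to_finish=5, timeout_on_tick=2))
print(record, actual_host)
```

With a timeout injected on the second tick, the record ends in the error state on the source host while the modelled guest is on the destination, which is the mismatch described above.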

I'm using the Stein release, but looking at the current release, the code
path seems to be the same.

Here's the Stein trace:

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, in _do_live_migration
    block_migration, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7581, in live_migration
    migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8068, in _live_migration
    finish_event, disk_paths)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7873, in _live_migration_monitor
    info = guest.get_job_info()
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 705, in get_job_info
    stats = self._domain.jobStats()
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMemoryStats)
```

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1924585
