
yahoo-eng-team team mailing list archive

[Bug 1838309] [NEW] Live migration might fail when run after revert of previous live migration

 

Public bug reported:

When live-migrating an instance between two compute nodes on Queens that
run different QEMU versions, the first live migration failed and was
rolled back (the traceback follows for reference; it is unrelated to this
issue):

2019-07-26 14:39:44.469 1576 ERROR nova.virt.libvirt.driver [req-26f3a831-8e4f-43a2-83ce-e60645264147 0aa8a4a6ed7d4733871ef79fa0302d43 31ee6aa6bff7498fba21b9807697ec32 - default default] [instance: b0681d51-2924-44be-a8b7-36db0d86b92f] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2019-07-26 14:39:43.479+0000: Domain id=16 is tainted: shell-scripts
2019-07-26T14:39:43.630545Z qemu-system-x86_64: -drive file=rbd:cinder/volume-df3d0060-451c-4b22-8d15-2c579fb47681:id=cinder:auth_supported=cephx\;none:mon_host=192.168.16.14\:6789\;192.168.16.15\:6789\;192.168.16.16\:6789,file.password-secret=virtio-disk2-secret0,format=raw,if=none,id=drive-virtio-disk2,serial=df3d0060-451c-4b22-8d15-2c579fb47681,cache=writeback,discard=unmap: 'serial' is deprecated, please use the corresponding option of '-device' instead
2019-07-26T14:39:44.075108Z qemu-system-x86_64: VQ 2 size 0x80 < last_avail_idx 0xedda - used_idx 0xeddd
2019-07-26T14:39:44.075130Z qemu-system-x86_64: Failed to load virtio-balloon:virtio
2019-07-26T14:39:44.075134Z qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:07.0/virtio-balloon'
2019-07-26T14:39:44.075582Z qemu-system-x86_64: load of migration failed: Operation not permitted: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-07-26 14:39:43.479+0000: Domain id=16 is tainted: shell-scripts

Then, after the revert, the live migration was retried, and this time it
failed with the following error:

{u'message': u'Requested operation is not valid: cannot undefine transient domain', u'code': 500, u'details': u'  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 202, in decorated_function\n    return function(self, context, *args, **kwargs)\n  File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6438, in _post_live_migration\n    destroy_vifs=destroy_vifs)\n  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1100, in cleanup\n    self._undefine_domain(instance)\n  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1012, in _undefine_domain\n    instance=instance)\n  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 999, in _undefine_domain\n    guest.delete_configuration(support_uefi)\n  File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/guest.py", line 271, in delete_configuration\n    self._domain.undefine()\n  File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 186, in doit\n    result = proxy_call(self._autowrap, f, *args, **kwargs)\n  File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 144, in proxy_call\n    rv = execute(f, *args, **kwargs)\n  File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 125, in execute\n    six.reraise(c, e, tb)\n  File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 83, in tworker\n    rv = meth(*args, **kwargs)\n  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 2701, in undefine\n    if ret == -1: raise libvirtError (\'virDomainUndefine() failed\', dom=self)\n', u'created': u'2019-07-29T14:39:41Z'}

This seems to happen because the domain was already undefined once during
the first live migration attempt, so it is now transient and cannot be
undefined a second time. We might need to check whether the domain is
persistent before undefining it in the live migration case.
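
The check suggested above could look roughly like the sketch below. It is
not nova's actual code or fix: undefine_if_persistent and FakeDomain are
hypothetical names used only for illustration, and the only thing assumed
about the domain object is that it offers isPersistent() and undefine()
the way a libvirt-python virDomain does.

```python
def undefine_if_persistent(domain):
    """Undefine `domain` only when it is reported as persistent.

    `domain` is assumed to behave like a libvirt-python virDomain:
    isPersistent() returns 1 for a persistent domain and 0 for a
    transient one, and undefine() removes the stored configuration.
    Returns True if undefine() was called, False if it was skipped.
    """
    if not domain.isPersistent():
        # Transient domain: no stored configuration is left to remove,
        # and undefine() would fail with
        # "cannot undefine transient domain".
        return False
    domain.undefine()
    return True


class FakeDomain(object):
    """Hypothetical stand-in for a libvirt domain, for demonstration only."""

    def __init__(self, persistent):
        self.persistent = persistent
        self.undefine_calls = 0

    def isPersistent(self):
        return 1 if self.persistent else 0

    def undefine(self):
        if not self.persistent:
            raise RuntimeError("cannot undefine transient domain")
        # A running domain that gets undefined becomes transient.
        self.persistent = False
        self.undefine_calls += 1
```

With a guard like this, a second cleanup pass on an already-undefined
domain becomes a no-op instead of raising. Whether skipping silently is
the right behaviour for nova's post-live-migration cleanup is a design
question this sketch does not answer.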

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1838309

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1838309/+subscriptions

