[Bug 1846027] Re: [Error Code 42] Domain not found when hard-reset is used
[Expired for OpenStack Compute (nova) because there has been no activity
for 60 days.]
** Changed in: nova
Status: Incomplete => Expired
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1846027
Title:
[Error Code 42] Domain not found when hard-reset is used
Status in OpenStack Compute (nova):
Expired
Bug description:
Not entirely sure if this is a bug, but at least the underlying logic
seems to mess this up.
I have 7 compute nodes in an OpenStack cluster. This issue happens on
nodes 1 and 5, for two VMs.
When it happens: at hard reboot. Say I have a VM that is blocked for
some reason (out of memory, whatever), so I do a hard reboot. When I
do that, the underlying nova code closes the iSCSI connection to the
Cinder storage (I verified this), then tries to restart the domain,
failing with:
2019-09-30 11:54:00.366 4484 WARNING nova.virt.libvirt.driver [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] Error from libvirt while getting description of instance-000002b1: [Error Code 42] Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1): libvirt.libvirtError: Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1)
Let me stop here for a moment: if at this point I go to the compute
node and run virsh list --all, the instance is not there at all.
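For reference, error code 42 is libvirt's VIR_ERR_NO_DOMAIN, i.e. the
guest definition is gone from libvirt entirely, which matches the empty
virsh output. A minimal sketch with the libvirt-python bindings (the
qemu:///system URI is an assumption) of the lookup that fails here:

    import libvirt

    conn = libvirt.open('qemu:///system')  # connection URI is an assumption
    try:
        conn.lookupByUUIDString('39a02162-7e99-45b8-837c-4db0f20025af')
    except libvirt.libvirtError as e:
        # VIR_ERR_NO_DOMAIN == 42, the "[Error Code 42]" in the log above
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            print('domain is gone from libvirt: %s' % e)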
I also get:
{u'message': u'Volume device not found at .', u'code': 500,
u'created': u'2019-09-29T23:44:32Z'}, with u'details' carrying this
traceback:

  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 202, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3512, in reboot_instance
    self._set_instance_obj_error_state(context, instance)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3486, in reboot_instance
    bad_volumes_callback=bad_volumes_callback)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2739, in reboot
    block_device_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2833, in _hard_reboot
    mdevs=mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5490, in _get_guest_xml
    context, mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5283, in _get_guest_config
    flavor, guest.os_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4093, in _get_guest_storage_config
    self._connect_volume(context, connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1276, in _connect_volume
    vol_driver.connect_volume(connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/volume/iscsi.py", line 64, in connect_volume
    device_info = self.connector.connect_volume(connection_info['data'])
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 137, in trace_logging_wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 518, in connect_volume
    self._cleanup_connection(connection_properties, force=True)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 512, in connect_volume
    return self._connect_single_volume(connection_properties)
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 61, in _wrapper
    return r.call(f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 587, in _connect_single_volume
    raise exception.VolumeDeviceNotFound(device='')
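The chain here is reboot -> _hard_reboot -> _get_guest_xml -> os-brick
connect_volume, where a retry loop gives up and raises
VolumeDeviceNotFound. A rough sketch of that final retry pattern (not
the actual os-brick source; the attempt count and wait are made up):

    import os.path
    from retrying import retry

    class VolumeDeviceNotFound(Exception):
        def __init__(self, device):
            super().__init__("Volume device not found at %s." % device)

    # retrying.Retrying.call() re-runs the function on failure and, once
    # the attempts are exhausted, re-raises the last exception via
    # attempt.get() -- the retrying.py -> six.reraise tail of the traceback.
    @retry(stop_max_attempt_number=3, wait_fixed=2000)  # values are assumptions
    def _connect_single_volume(device_path):
        # os-brick rescans the iSCSI session and waits for the block device;
        # here we only check for the device node to show the shape of it.
        if not os.path.exists(device_path):
            raise VolumeDeviceNotFound(device='')
        return {'path': device_path, 'type': 'block'}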
And in the nova-compute logs I see:
2019-09-30 14:15:21.388 4484 WARNING nova.compute.manager [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] While synchronizing instance power states, found 33 instances in the database and 34 instances on the hypervisor.
Something is not well synchronized and I believe this is the reason everything else is failing.
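That warning comes from nova's periodic power-state sync, which
compares the instances the database lists for the host against the
domains the hypervisor reports. A hedged sketch of that comparison (not
nova's actual _sync_power_states; the input list is hypothetical):

    import libvirt

    def find_unknown_domains(db_instance_uuids):
        # db_instance_uuids: the UUIDs nova's database lists for this host
        # (a hypothetical input for illustration)
        conn = libvirt.open('qemu:///system')
        hyp_uuids = {dom.UUIDString() for dom in conn.listAllDomains()}
        db_uuids = set(db_instance_uuids)
        print('found %d instances in the database and %d instances on '
              'the hypervisor' % (len(db_uuids), len(hyp_uuids)))
        # domains the hypervisor has that nova does not know about
        return hyp_uuids - db_uuids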
My workaround:
When this happens, OpenStack sets the vm_state to ERROR. I change the
state back to active and then stop the instance. Then I detach the
volume (Cinder, iSCSI-based), start the VM, shut it down again, attach
the volume, and start the VM. This fixes it, but if my user does a
hard reboot again, the problem comes back.
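For completeness, the same workaround driven through openstacksdk might
look like the sketch below; the cloud, server and volume names are
placeholders, and reset_server_state availability may vary by SDK
release:

    import openstack

    conn = openstack.connect(cloud='mycloud')            # placeholder cloud name
    server = conn.compute.find_server('stuck-vm')        # instance stuck in ERROR
    volume = conn.block_storage.find_volume('data-vol')  # the iSCSI-backed volume

    conn.compute.reset_server_state(server, 'active')    # like nova reset-state --active
    conn.compute.stop_server(server)
    conn.compute.wait_for_server(server, status='SHUTOFF')

    conn.detach_volume(server, volume)                   # cloud-layer helper
    conn.compute.start_server(server)
    conn.compute.wait_for_server(server, status='ACTIVE')
    conn.compute.stop_server(server)
    conn.compute.wait_for_server(server, status='SHUTOFF')

    conn.attach_volume(server, volume)
    conn.compute.start_server(server)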
Let me know if you need more information and I would be eager to
provide it.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1846027/+subscriptions