yahoo-eng-team team mailing list archive
Message #80218
[Bug 1846027] [NEW] [Error Code 42] Domain not found when hard-reset is used
Public bug reported:
Not entirely sure if this is a bug, but at least the underlying logic
seems to mess this up.
I have 7 compute nodes in an OpenStack cluster. This issue happens on
cluster1 and 5, for two VMs.
When it happens: at hard reboot. Let's say I have a VM that is blocked
for some reason (out of memory, whatever). Then I do a hard reboot.
When I do that, the underlying nova code closes the iSCSI connection to
the cinder storage (I verified this), then tries to restart the domain,
failing with:
2019-09-30 11:54:00.366 4484 WARNING nova.virt.libvirt.driver [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] Error from libvirt while getting description of instance-000002b1: [Error Code 42] Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1): libvirt.libvirtError: Domain not found: no domain with matching uuid '39a02162-7e99-45b8-837c-4db0f20025af' (instance-000002b1)
Let me stop here for a moment. If at this step I go to the compute node
and run virsh list --all, the instance is not there at all.
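The same check can be scripted. Error code 42 is libvirt's VIR_ERR_NO_DOMAIN, so a lookup by UUID that raises it confirms the domain is no longer defined on that host. What follows is a minimal sketch, assuming the libvirt Python bindings are installed on the compute node. The helper names domain_defined and check_on_host are hypothetical, and the connection URI qemu:///system is an assumption about the hypervisor setup.

```python
VIR_ERR_NO_DOMAIN = 42  # same value as libvirt.VIR_ERR_NO_DOMAIN


def domain_defined(conn, uuid):
    """Return True if the connection knows a domain with this UUID.

    Returns False only for error code 42 (domain not found); any other
    error is re-raised so it is not mistaken for a missing domain.
    """
    try:
        conn.lookupByUUIDString(uuid)
        return True
    except Exception as exc:
        # libvirt.libvirtError exposes get_error_code(); it is checked by
        # attribute so this helper can be imported without libvirt.
        get_code = getattr(exc, "get_error_code", None)
        if get_code is not None and get_code() == VIR_ERR_NO_DOMAIN:
            return False
        raise


def check_on_host(uuid, uri="qemu:///system"):
    """Open a read-only libvirt connection and run the check (hypothetical helper)."""
    import libvirt  # imported lazily; requires the libvirt-python package

    conn = libvirt.openReadOnly(uri)
    try:
        return domain_defined(conn, uuid)
    finally:
        conn.close()
```

Calling check_on_host('39a02162-7e99-45b8-837c-4db0f20025af') on the affected compute node would be expected to return False while the instance is in this state.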
I also get:
{u'message': u'Volume device not found at .', u'code': 500, u'details': u'
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 202, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3512, in reboot_instance
    self._set_instance_obj_error_state(context, instance)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 3486, in reboot_instance
    bad_volumes_callback=bad_volumes_callback)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2739, in reboot
    block_device_info)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 2833, in _hard_reboot
    mdevs=mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5490, in _get_guest_xml
    context, mdevs)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 5283, in _get_guest_config
    flavor, guest.os_type)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 4093, in _get_guest_storage_config
    self._connect_volume(context, connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 1276, in _connect_volume
    vol_driver.connect_volume(connection_info, instance)
  File "/usr/lib/python3/dist-packages/nova/virt/libvirt/volume/iscsi.py", line 64, in connect_volume
    device_info = self.connector.connect_volume(connection_info['data'])
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 137, in trace_logging_wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py", line 328, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 518, in connect_volume
    self._cleanup_connection(connection_properties, force=True)
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 512, in connect_volume
    return self._connect_single_volume(connection_properties)
  File "/usr/lib/python3/dist-packages/os_brick/utils.py", line 61, in _wrapper
    return r.call(f, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/retrying.py", line 212, in call
    raise attempt.get()
  File "/usr/lib/python3/dist-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/lib/python3/dist-packages/os_brick/initiator/connectors/iscsi.py", line 587, in _connect_single_volume
    raise exception.VolumeDeviceNotFound(device='')
', u'created': u'2019-09-29T23:44:32Z'}
And in the nova-compute logs I see:
2019-09-30 14:15:21.388 4484 WARNING nova.compute.manager [req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] While synchronizing instance power states, found 33 instances in the database and 34 instances on the hypervisor.
Something is not well synchronized and I believe this is the reason everything else is failing.
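To watch for this mismatch across compute nodes, the two counts can be pulled out of the warning with a regular expression. This is only a sketch: it assumes the message wording stays exactly as in the log line above, and the function name power_sync_mismatch is hypothetical.

```python
import re

# Matches the counts in nova's power-state sync warning as quoted above.
SYNC_WARNING = re.compile(
    r"found (?P<db>\d+) instances in the database and "
    r"(?P<hv>\d+) instances on the hypervisor"
)


def power_sync_mismatch(log_line):
    """Return (db_count, hypervisor_count, hypervisor - db), or None if no match."""
    match = SYNC_WARNING.search(log_line)
    if match is None:
        return None
    db, hv = int(match["db"]), int(match["hv"])
    return db, hv, hv - db


line = (
    "2019-09-30 14:15:21.388 4484 WARNING nova.compute.manager "
    "[req-1c2a5462-50d1-4cfb-b743-a4ea2195acb0 - - - - -] While synchronizing "
    "instance power states, found 33 instances in the database and 34 "
    "instances on the hypervisor."
)
print(power_sync_mismatch(line))  # (33, 34, 1)
```

A non-zero third value means the database and the hypervisor disagree on that host; it does not say which instance is the odd one out.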
My workaround:
When this happens, OpenStack sets the vm-state to ERROR. I change the
state to active and then stop the instance. Then I detach the volume
(cinder, iSCSI-based), start the VM, shut down the VM, attach the volume
again, and start the VM. This fixes it. But if my user does a hard reset
again, it will happen again.
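The workaround sequence can be written down as a list of commands. The report does not say which client was used, so the sketch below assumes the openstack command-line client with admin credentials; the function names and the server/volume arguments are hypothetical. These commands return before the operation finishes, so a real script would need to wait for the instance to reach the expected state between steps, which this sketch does not do.

```python
import subprocess


def workaround_commands(server, volume):
    """Return the workaround as argv lists, in the order described above."""
    return [
        # Reset the instance out of ERROR (requires admin rights).
        ["openstack", "server", "set", "--state", "active", server],
        ["openstack", "server", "stop", server],
        # Detach the cinder volume, then cycle the VM without it.
        ["openstack", "server", "remove", "volume", server, volume],
        ["openstack", "server", "start", server],
        ["openstack", "server", "stop", server],
        # Reattach the volume and bring the VM back up.
        ["openstack", "server", "add", "volume", server, volume],
        ["openstack", "server", "start", server],
    ]


def run_workaround(server, volume, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in workaround_commands(server, volume):
        if dry_run:
            print(" ".join(argv))
        else:
            # No waiting between steps: an assumption that must be
            # replaced with state polling before real use.
            subprocess.run(argv, check=True)
```

Running run_workaround("my-vm", "my-volume") with the default dry_run=True only prints the seven commands, which makes it easy to review them before touching a real instance.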
Let me know if you need more information; I would be glad to provide
it.
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1846027
Title:
[Error Code 42] Domain not found when hard-reset is used
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1846027/+subscriptions