[Bug 1754360] Re: no unquiesce for volume backed on quiesce failure
Reviewed: https://review.openstack.org/550865
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1e77faaa412ab9909dd9491cab4a819b5c84d3e8
Submitter: Zuul
Branch: master
commit 1e77faaa412ab9909dd9491cab4a819b5c84d3e8
Author: Eric M Gonzalez <eric@xxxxxxxxx>
Date: Thu Mar 8 09:11:25 2018 -0600
unquiesce instance after quiesce failure
If the call to compute_rpcapi.quiesce_instance() raises an exception,
any uncaught exception will break out of the function
snapshot_volume_backed(). This can leave the instance in a frozen state.
This patch adds a blanket Exception catch to the try block and calls
compute_rpcapi.unquiesce_instance() before reraising.
This has been seen in the wild with RPC timeouts, but a timeout is not
the only possible origin of an unexpected error from quiesce_instance().
Change-Id: Idca5998da8bb42b29a8fffdf52b4af3a043c6326
Closes-Bug: #1754360
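For illustration, a minimal sketch of the pattern the commit describes
(the helper name and logging are mine, not the literal Nova diff;
quiesce_instance() and unquiesce_instance() are the real compute RPC API
calls, other details are illustrative):

    import logging

    LOG = logging.getLogger(__name__)

    def _quiesce_for_snapshot(compute_rpcapi, context, instance):
        """Quiesce the guest, thawing it again if the RPC call fails."""
        try:
            compute_rpcapi.quiesce_instance(context, instance)
        except Exception:
            # Blanket catch: any failure here (e.g. an RPC
            # MessagingTimeout) would otherwise leave the guest
            # filesystem frozen, so thaw it before re-raising.
            LOG.exception("quiesce failed, unquiescing instance %s",
                          instance.uuid)
            compute_rpcapi.unquiesce_instance(context, instance)
            raise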
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1754360
Title:
no unquiesce for volume backed on quiesce failure
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) ocata series:
Confirmed
Status in OpenStack Compute (nova) pike series:
In Progress
Status in OpenStack Compute (nova) queens series:
In Progress
Bug description:
Extension of bug #1731986:
The fix for the above bug catches errors that occur during the snapshot
of an instance's volumes. I later discovered that a failure can also
occur during the call to quiesce_instance(), raising an uncaught
exception through snapshot_volume_backed() and leaving the instance
frozen / quiesced.
Replication is tricky; my failures occur during the RPC call to the
compute host, as a MessagingTimeout while waiting for a reply. I have
not found a way to reliably reproduce this. My compute combination is
Nova Mitaka, libvirt 1.3.1, and Ceph Jewel.
Similar to the above bug, this condition was discovered in Mitaka and
the issue remains in Queens.
My proposed patch adds a blanket Exception catch around the call to
rpcapi.quiesce_instance(), logs the caught exception, and issues an
immediate rpcapi.unquiesce_instance() in order to thaw the instance.
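As a rough check of that behaviour, here is a mock-based sketch (Python 3
unittest.mock; all names illustrative) showing that the
_quiesce_for_snapshot helper sketched earlier in this message thaws the
instance when the quiesce RPC call times out:

    from unittest import mock

    import oslo_messaging

    rpcapi = mock.Mock()
    rpcapi.quiesce_instance.side_effect = oslo_messaging.MessagingTimeout(
        "Timed out waiting for a reply")

    try:
        _quiesce_for_snapshot(rpcapi, context=mock.Mock(),
                              instance=mock.Mock(uuid="fake-uuid"))
    except oslo_messaging.MessagingTimeout:
        pass  # the original timeout still propagates to the caller

    # The instance was thawed before the exception escaped.
    rpcapi.unquiesce_instance.assert_called_once_with(mock.ANY, mock.ANY)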
Stack trace from the nova-api-os container, which is responsible for the
quiesce / unquiesce of the instance during snapshot:
[req-6229d689-dcc3-41ca-99b5-3dfc04e1e994 50505ffa89754660b4e6f7ebf69532b5 24bfcdab70714b85b5cb9f5f8270a414 - - -] Unexpected exception in API method
Traceback (most recent call last):
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/extensions.py", line 478, in wrapped
    return f(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/common.py", line 391, in inner
    return f(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 73, in wrapper
    return func(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 73, in wrapper
    return func(*args, **kwargs)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 1108, in _action_create_image
    metadata)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 140, in inner
    return f(self, context, instance, *args, **kw)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 2389, in snapshot_volume_backed
    mapping=None)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 2368, in snapshot_volume_backed
    self.compute_rpcapi.quiesce_instance(context, instance)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/rpcapi.py", line 1041, in quiesce_instance
    return cctxt.call(ctxt, 'quiesce_instance', instance=instance)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
    retry=self.retry)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
    retry=retry)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
    'to message ID %s' % msg_id)
MessagingTimeout: Timed out waiting for a reply to message ID 70ee5f80284b4b68a289bf232b89325c
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1754360/+subscriptions