yahoo-eng-team team mailing list archive

[Bug 1754360] Re: no unquiesce for volume backed on quiesce failure

 

Reviewed:  https://review.openstack.org/550865
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1e77faaa412ab9909dd9491cab4a819b5c84d3e8
Submitter: Zuul
Branch:    master

commit 1e77faaa412ab9909dd9491cab4a819b5c84d3e8
Author: Eric M Gonzalez <eric@xxxxxxxxx>
Date:   Thu Mar 8 09:11:25 2018 -0600

    unquiesce instance after quiesce failure
    
    If the call to compute_rpcapi.quiesce_instance() raises an
    exception, any uncaught exception will propagate out of
    snapshot_volume_backed(). This can leave the instance in a frozen
    state.
    
    This patch adds a blanket Exception catch to the try block and calls
    compute_rpcapi.unquiesce_instance() before reraising.
    
    This has been seen in the wild with RPC timeouts, but a timeout is
    not the only possible source of an unexpected error from
    quiesce_instance().
    
    Change-Id: Idca5998da8bb42b29a8fffdf52b4af3a043c6326
    Closes-Bug: #1754360
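
The shape of the merged change, as a minimal sketch (an illustration of
the pattern described above, not the verbatim nova/compute/api.py diff;
the handling of other expected quiesce errors is elided):

    from oslo_utils import excutils

    from nova import exception

    quiesced = False
    try:
        self.compute_rpcapi.quiesce_instance(context, instance)
        quiesced = True
    except exception.InstanceQuiesceNotSupported:
        # Expected case: the guest cannot be quiesced; the snapshot
        # simply proceeds without freezing the filesystems.
        pass
    except Exception:
        # Unexpected case (e.g. an RPC MessagingTimeout): thaw the
        # guest before re-raising so it is not left frozen.
        with excutils.save_and_reraise_exception():
            self.compute_rpcapi.unquiesce_instance(context, instance,
                                                   mapping=None)

save_and_reraise_exception() re-raises the original exception when the
with block exits, so the API caller still sees the failure while the
instance is thawed.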


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1754360

Title:
  no unquiesce for volume backed on quiesce failure

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) ocata series:
  Confirmed
Status in OpenStack Compute (nova) pike series:
  In Progress
Status in OpenStack Compute (nova) queens series:
  In Progress

Bug description:
  This is an extension of bug #1731986.

  The fix for that bug catches errors that occur during the snapshot of
  an instance's volumes. I later discovered that a failure can also
  occur during the call to quiesce_instance(), raising an uncaught
  exception through snapshot_volume_backed() and leaving the instance
  frozen / quiesced.

  Reproduction is tricky: my failures occur during the RPC call to the
  compute host, as a MessagingTimeout while waiting for a reply. I have
  not found a way to reproduce this on demand. My compute combination
  is Nova Mitaka, Libvirt 1.3.1, and Ceph Jewel.
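
  Though the timeout is hard to reproduce on demand, the failure path
  can be exercised in a unit test by forcing the RPC call to raise. A
  sketch, assuming a compute-API test fixture that stubs the image
  service and provides self.compute_api, self.context and
  self.instance (these names are assumptions, not part of this bug):

    import mock
    import oslo_messaging as messaging

    from nova.compute import rpcapi as compute_rpcapi

    @mock.patch.object(compute_rpcapi.ComputeAPI, 'unquiesce_instance')
    @mock.patch.object(compute_rpcapi.ComputeAPI, 'quiesce_instance',
                       side_effect=messaging.MessagingTimeout())
    def test_unquiesce_after_quiesce_timeout(self, mock_quiesce,
                                             mock_unquiesce):
        # The timeout should still propagate to the API caller ...
        self.assertRaises(messaging.MessagingTimeout,
                          self.compute_api.snapshot_volume_backed,
                          self.context, self.instance, 'snap-1')
        # ... but only after the instance has been thawed.
        mock_unquiesce.assert_called_once_with(self.context,
                                               self.instance,
                                               mapping=None)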

  Similar to the above bug, this condition was discovered in Mitaka and
  the issue remains in Queens.

  My proposed patch adds a blanket Exception catch around the call to
  rpcapi.quiesce_instance(), logs the caught exception, and issues an
  immediate rpcapi.unquiesce_instance() in order to thaw the instance.
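
  In place, the proposal amounts to something like the following (a
  sketch of the described behaviour, not the exact proposed diff; the
  merged fix quoted above achieves the same thing with oslo's
  save_and_reraise_exception() helper instead of an explicit raise):

    try:
        self.compute_rpcapi.quiesce_instance(context, instance)
        quiesced = True
    except Exception as err:
        # Log the unexpected failure, thaw the guest immediately,
        # then re-raise so the API call still reports the error.
        LOG.exception('quiesce_instance failed: %s', err)
        self.compute_rpcapi.unquiesce_instance(context, instance,
                                               mapping=None)
        raise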

  Stack trace from the nova-api-os container, which is responsible for
  the quiesce / unquiesce of the instance during the snapshot:

  [req-6229d689-dcc3-41ca-99b5-3dfc04e1e994 50505ffa89754660b4e6f7ebf69532b5 24bfcdab70714b85b5cb9f5f8270a414 - - -] Unexpected exception in API method
  Traceback (most recent call last):
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/extensions.py", line 478, in wrapped
      return f(*args, **kwargs)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/common.py", line 391, in inner
      return f(*args, **kwargs)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 73, in wrapper
      return func(*args, **kwargs)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/validation/__init__.py", line 73, in wrapper
      return func(*args, **kwargs)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/api/openstack/compute/servers.py", line 1108, in _action_create_image
      metadata)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 140, in inner
      return f(self, context, instance, *args, **kw)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 2389, in snapshot_volume_backed
      mapping=None)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
      self.force_reraise()
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
      six.reraise(self.type_, self.value, self.tb)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/api.py", line 2368, in snapshot_volume_backed
      self.compute_rpcapi.quiesce_instance(context, instance)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/nova/compute/rpcapi.py", line 1041, in quiesce_instance
      return cctxt.call(ctxt, 'quiesce_instance', instance=instance)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
      retry=self.retry)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
      timeout=timeout, retry=retry)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
      retry=retry)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
      result = self._waiter.wait(msg_id, timeout)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
      message = self.waiters.get(msg_id, timeout=timeout)
    File "/openstack/venvs/nova-13.3.7/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
      'to message ID %s' % msg_id)
  MessagingTimeout: Timed out waiting for a reply to message ID 70ee5f80284b4b68a289bf232b89325c

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1754360/+subscriptions

