
[Bug 1423654] [NEW] Nova rescue causes LVM timeouts after moving attachments

 

Public bug reported:

The Nova rescue feature powers off a running instance and boots a
rescue instance, attaching the ephemeral disk of the original instance
to it so that an admin can try to recover the instance.  The problem is
that if a Cinder volume is attached to that instance when we do a
rescue, we don't detach it or do any sort of maintenance on the block
device mapping we have set up for it.  We do check that the mapping
exists and verify the volume is attached, but that's it.
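
For context, the rescue entry point in nova/compute/api.py does roughly
the following today (a paraphrased sketch, not the exact code; names
approximate the current tree and may differ by release):

    bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
        context, instance.uuid)
    for bdm in bdms:
        if bdm.volume_id:
            vol = self.volume_api.get(context, bdm.volume_id)
            # Only verifies the volume is attached -- no detach and no
            # cleanup of the block device mapping happens here.
            self.volume_api.check_attached(context, vol)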

The result is that after the rescue operation, subsequent LVM calls such as lvs and vgs will attempt to open a device file that no longer exists, which takes up to 60 seconds per device.  An example is the current Tempest test:
tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume[gate,negative,volume]

If you look at the Tempest results, you'll notice that this test always
takes in excess of 100 seconds.  That's not just because it's a long
test; it's the blocking LVM calls.
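
The blocking behavior is easy to demonstrate outside of Nova.  A
standalone snippet like the following (illustration only, not Nova
code) times a plain lvs scan; on an affected host each stale device
adds up to 60 seconds:

    import subprocess
    import time

    # Time an `lvs` scan.  With a stale device file left behind after
    # the rescue, LVM blocks trying to open it before reporting.
    start = time.time()
    subprocess.call(['lvs', '--noheadings', '--options', 'lv_name,vg_name'])
    print('lvs completed in %.1f seconds' % (time.time() - start))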

We should detach any Cinder volumes that are attached to an instance during the rescue process.  One concern raised by folks on the Nova team was 'what about boot from volume?'  Rescue of a volume-backed instance is currently an invalid case, as is evident from the code that checks for it and fails here:
https://github.com/openstack/nova/blob/master/nova/compute/api.py#L2822
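
That check looks roughly like this (paraphrased; see the link above for
the exact code):

    if compute_utils.is_volume_backed_instance(context, instance, bdms):
        reason = _("Cannot rescue a volume-backed instance")
        raise exception.InstanceNotRescuable(instance_id=instance.uuid,
                                             reason=reason)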

There's probably no reason we can't automate that as part of rescue in
the future, but for now it's a separate enhancement independent of this
bug.
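
A minimal sketch of the direction proposed here, reusing the BDM loop
from the existing check (detach_volume() is the existing compute API
call, but treat the exact wiring as an assumption; a real fix would
also have to reattach the volumes on unrescue):

    bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
        context, instance.uuid)
    for bdm in bdms:
        if bdm.volume_id:
            volume = self.volume_api.get(context, bdm.volume_id)
            # Tear down the host-side attachment so later lvs/vgs calls
            # don't block on device files that no longer exist.
            self.detach_volume(context, instance, volume)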

** Affects: nova
     Importance: Undecided
     Assignee: John Griffith (john-griffith)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => John Griffith (john-griffith)



