
yahoo-eng-team team mailing list archive

[Bug 1250869] Re: Recover from "stuck" states on compute manager start-up

 

** Changed in: nova
       Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1250869

Title:
  Recover from "stuck" states on compute manager start-up

Status in OpenStack Compute (Nova):
  Invalid

Bug description:
  If a compute manager is stopped / fails during certain operations then
  the instance will be left stuck with a transitional task_state.

  Ideally during compute manager start-up we would identify instances in
  these states and transition them to a logical stable state.

  Already there are two bugs dealing with specific cases of this problem (instances stuck in BUILDING and DELETING):
  https://bugs.launchpad.net/nova/+bug/1247174
  https://bugs.launchpad.net/nova/+bug/1197024

  This bug is to avoid raising more individual bugs for other states
  that require a handling.

  More information (taken from https://etherpad.openstack.org/p/NovaCleaningUpStuckInstances):
  Cleaning up "Stuck" instance state

  What do you mean by "Stuck" ?
  A "stuck" state in this context occurs when an action fails to complete in the compute manager.
  Typically seen on failure / restart

  Why do you care ?
  Since the task_state gates other actions, it stops you from being able to move forwards
  Relying on the user to clean up is a real pain when you want to migrate an instance
  It's confusing for the users (which means we have to spend time diagnosing and helping to fix it)

  Isn't this all going to be fixed by the task manager / clean-shutdown ? 
  Probably - but there are some even quicker wins that also help towards that, and some issues that
  are also going to be relevant to the task manager.

  Basic Premise:  The one time you know there is no running thread in the compute manager is during start-up.
  At that point there are some task states that can be safely cleared / re-processed.  The tricky thing is to 
  disambiguate between an action which has started and failed to complete, and an action which is actually still
  on the message queue (given that the compute manager may have been down for some time)
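  The start-up sweep described above could be sketched roughly as follows. The names (init_host, task_state) follow Nova conventions, but this is an illustration of the idea, not Nova's actual implementation:

  ```python
  def init_host(instances, handlers):
      """Run once at compute-manager start-up, when no other thread can be
      acting on these instances.  Each instance stuck in a transitional
      task_state is handed to a recovery handler; states we cannot safely
      disambiguate (the request may still be on the message queue) simply
      have no handler and are left alone.
      """
      recovered = []
      for inst in instances:
          handler = handlers.get(inst.get("task_state"))
          if handler is not None:
              recovered.append(handler(inst))
      return recovered
  ```

  The handler table makes the safe/unsafe split explicit: only states with a known-safe recovery action appear in it.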

  A bit of history:
  We tried to address all of these and disambiguate the "still queued" case by recording the task_state seen on the compute manager at the
  start of the action, but that was (rightly) blocked because it involved more DB access and is going to be fixed by the task manager.
  We are now re-working some easier cases that don't need the disambiguation.
  https://review.openstack.org/#/c/47836/ 

  Easy cases:    
  Deleting:    It's always safe to go ahead and rerun the delete.

  Building:   Can always be put into an error state.  If the message was
  still on the queue, instance.host won't have been set

  Image_pending_upload / Image_uploading:   Can be cleared - these are
  only set in the compute manager.

  Powering Off:   re-run the power off.  If the VM is already off, or
  the request is in the queue, this is a no-op.

  Powering On:  re-run the power on.  If the VM is already on, or the
  request is in the queue, this is a no-op.

  All accepted as worth doing - submit as separate patches
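  The easy cases above could be written as a single recovery function, sketched here with hypothetical action names; the task-state strings and the instance fields mirror Nova's conventions, but the handler bodies are placeholders:

  ```python
  def recover_easy_case(instance):
      """Return the safe start-up recovery action for one instance, or None."""
      state = instance.get("task_state")
      if state == "deleting":
          return "rerun_delete"        # always safe to re-run the delete
      if state == "building":
          # Safe to error out: if the build request were still on the queue,
          # instance.host would not yet be set, so this host's start-up
          # sweep would not see the instance at all.
          return "set_error"
      if state in ("image_pending_upload", "image_uploading"):
          return "clear_task_state"    # only ever set in the compute manager
      if state in ("powering-off", "powering-on"):
          return "rerun_power_op"      # a no-op if already in that state
      return None                      # anything else: leave alone
  ```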

  Harder cases:
  Image_snapshot:  (Set in API) - could be cleared on start-up and re-asserted on the compute manager at the start
  of the snapshot, to cover the case of a still-queued request

  Rebooting:
      If the VM isn't running - reboot it  (risk is a second reboot)
      If the VM is running - just clear the status (risk is a user needs to make another reboot)
      
  Accepted: add an additional task_state value, set on the compute manager, to disambiguate the queued vs started case
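  The accepted approach for reboots could look like the sketch below. REBOOT_STARTED is the proposed extra task state (a hypothetical name), set by the compute manager when the reboot actually begins, so start-up can tell a request that is still queued from one that started and died:

  ```python
  REBOOTING = "rebooting"            # set in the API; request may still be queued
  REBOOT_STARTED = "reboot_started"  # hypothetical: set when the reboot begins

  def recover_reboot(task_state, vm_running):
      """Return the start-up action for an instance stuck mid-reboot."""
      if task_state == REBOOT_STARTED:
          # The request left the queue and will never be re-delivered:
          # act on the VM's actual power state with no risk.
          return "clear_task_state" if vm_running else "rerun_reboot"
      if task_state == REBOOTING:
          # With the extra state in place, a started reboot would have been
          # marked REBOOT_STARTED, so this request may still be queued;
          # re-running it here would risk a double reboot.
          return "leave_alone"
      return None
  ```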
      
  Even harder:
  Rebuilding:  Would be nice to be able to treat this like Building and go to an error state, but we can't use instance.host to
  disambiguate.   We could do something here if we add an extra task state (Rebuild_started) that is set immediately on the
  compute manager.   Could use the same approach to remove the risk of missed / additional reboots.

  As above

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1250869/+subscriptions