yahoo-eng-team team mailing list archive
Message #06025
[Bug 1250869] Re: Recover from "stuck" states on compute manager start-up
** Changed in: nova
Status: In Progress => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1250869
Title:
Recover from "stuck" states on compute manager start-up
Status in OpenStack Compute (Nova):
Invalid
Bug description:
If a compute manager is stopped / fails during certain operations then
the instance will be left stuck with a transitional task_state.
Ideally during compute manager start-up we would identify instances in
these states and transition them to a logical stable state.
Already there are two bugs dealing with specific cases of this problem (instances stuck in BUILDING and DELETING):
https://bugs.launchpad.net/nova/+bug/1247174
https://bugs.launchpad.net/nova/+bug/1197024
This bug is to avoid raising more individual bugs for other states
that require handling.
More information (taken from https://etherpad.openstack.org/p/NovaCleaningUpStuckInstances):
Cleaning up "Stuck" instance state
What do you mean by "Stuck" ?
"Stuck" state in this context occurs when an action fails to complete in the computer manager.
Typically seen on failure / restart
Why do you care ?
Because the task_state gates some actions, it stops you from being able to move forwards
Relying on the user to clean up is a real pain when you want to migrate an instance
It's confusing for the users (which means we have to spend time diagnosing and helping to fix it)
Isn't this all going to be fixed by the task manager / clean-shutdown ?
Probably - but there are some even quicker wins that also help towards that, and some issues that
are also going to be relevant to the task manager.
Basic Premise: The one time you know there is no running thread in the compute manager is during start-up.
At that point there are some task states that can be safely cleared / re-processed. The tricky thing is to
disambiguate between an action which has started and failed to complete, and an action which is actually still
on the message queue (given that the compute manager may have been down for some time)
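The start-up premise above can be sketched as a one-time scan of the host's instances, dispatching each transitional task_state to a recovery action before any new work is accepted. This is an illustrative sketch only, not nova's actual API: the `Instance` class, `TASK_RECOVERY` table, and `init_host` name here are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Instance:
    uuid: str
    host: Optional[str]
    task_state: Optional[str]

def clear_task_state(instance: Instance) -> str:
    # Safe for states that are only ever set inside the compute manager.
    instance.task_state = None
    return "cleared"

# Dispatch table: transitional task_state -> recovery action. Which states
# belong here, and what each recovery should be, is exactly the per-state
# question worked through in the rest of this bug.
TASK_RECOVERY: Dict[str, Callable[[Instance], str]] = {
    "image_pending_upload": clear_task_state,
    "image_uploading": clear_task_state,
}

def init_host(instances: List[Instance]) -> Dict[str, str]:
    """Run once at compute-manager start-up, before any new work arrives:
    the only moment no other thread can be mid-action on these instances."""
    actions = {}
    for inst in instances:
        handler = TASK_RECOVERY.get(inst.task_state or "")
        if handler is not None:
            actions[inst.uuid] = handler(inst)
    return actions
```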
A bit of history:
We tried to address all of these and disambiguate the "still queued" case by recording the task_state seen on the compute manager at the
start of the action, but that was (rightly) blocked because it involved more DB access and is going to be fixed by the task manager.
We are now re-working some easier cases that don't need the disambiguation.
https://review.openstack.org/#/c/47836/
Easy cases:
Deleting: It's always safe to go ahead and rerun the delete.
Building: Can always be put into an error state. If the message was
still on the queue instance.host won't have been set
Image_pending_upload / Image_uploading: Can be cleared - these are
only set in the compute manager.
Powering Off: re-run the power off. If the VM is already off, or
the request is in the queue this is a no-op.
Powering On: re-run the power on. If the VM is already on, or the
request is in the queue this is a no-op.
All accepted as worth doing - submit as separate patches
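The easy cases above share one property: each can either re-run an idempotent operation or move the instance to a stable state without needing to know whether the original request is still queued. A minimal sketch, under assumed names (the `Instance` fields and `recover_easy_case` are illustrative, not nova's real objects):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instance:
    uuid: str
    host: Optional[str]
    vm_state: str
    task_state: Optional[str]

def recover_easy_case(inst: Instance) -> str:
    ts = inst.task_state
    if ts == "deleting":
        # Delete is idempotent, so re-running it is always safe.
        return "re-run delete"
    if ts == "building":
        # Safe even without disambiguation: if the request were still on
        # the queue, instance.host would not yet be set, so the instance
        # would not appear in this host's start-up scan at all.
        inst.vm_state = "error"
        inst.task_state = None
        return "set to ERROR"
    if ts in ("image_pending_upload", "image_uploading"):
        # Only ever set in the compute manager, so safe to clear.
        inst.task_state = None
        return "cleared"
    if ts == "powering-off":
        # No-op if the VM is already off or the request is still queued.
        return "re-run power off"
    if ts == "powering-on":
        # No-op if the VM is already on or the request is still queued.
        return "re-run power on"
    return "no action"
```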
Harder cases:
Image_snapshot: (Set in API) - could be cleared on start-up and re-asserted on the compute manager at the start
of snapshot to cover the case of a still-queued request
Rebooting:
If the VM isn't running - reboot it (risk is a second reboot)
If the VM is running - just clear the status (risk is a user needs to make another reboot)
Accepted to add additional task_state value to be set on compute manager to disambiguate the queued vs started case
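The rebooting heuristic above picks an action from the VM's actual power state, accepting a small, bounded risk either way. A sketch (the function name is hypothetical; `vm_running` stands in for a real hypervisor power-state query):

```python
def recover_rebooting(vm_running: bool) -> str:
    """Recover an instance found with task_state 'rebooting' at start-up."""
    if not vm_running:
        # Risk: if the original reboot actually completed and the VM is
        # merely slow to come up, this issues a second reboot.
        return "re-run reboot"
    # VM is running: just clear the transitional state.
    # Risk: if the reboot request was lost, the user must ask again.
    return "clear task_state"
```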
Even harder:
Rebuilding: Would be nice to be able to treat this like Building and go to an error state, but we can't use instance.host to
disambiguate. We could do something here if we add an extra task state (Rebuild_started) that is set immediately on the
compute manager. Could use the same approach to remove the risk of missed / additional reboots.
As above
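The extra-task_state idea above can be sketched as follows: the compute manager would set a hypothetical "rebuild_started" state as its very first action, so that at start-up "rebuilding" can only mean the request never began executing (still queued or lost), while "rebuild_started" means work began and was interrupted. Both state names and the function are illustrative, not existing nova values:

```python
def recover_rebuild(task_state: str) -> str:
    """Recover a rebuild found in progress at compute-manager start-up,
    assuming a hypothetical 'rebuild_started' state set first thing on
    the compute manager."""
    if task_state == "rebuilding":
        # The compute manager never started the work: the request is
        # still on the queue (or lost), so leave it to be re-processed.
        return "leave for queued request"
    if task_state == "rebuild_started":
        # Work began but was interrupted; there is no safe way to
        # resume, so move the instance to a stable error state.
        return "set to ERROR"
    return "no action"
```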
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1250869/+subscriptions