yahoo-eng-team team mailing list archive
Message #26214
[Bug 1376933] Re: _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever
** Changed in: nova
Status: Fix Committed => Fix Released
** Changed in: nova
Milestone: None => kilo-1
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1376933
Title:
_poll_unconfirmed_resize timing window causes instance to stay in
verify_resize state forever
Status in OpenStack Compute (Nova):
Fix Released
Status in OpenStack Compute (nova) juno series:
Fix Released
Bug description:
The _poll_unconfirmed_resizes periodic task can run while
nova/compute/manager.py:ComputeManager._finish_resize() is executing,
after the migration record has been updated in the database but before
the instance has been updated.
2014-09-30 16:15:00.897 112868 INFO nova.compute.manager [-] Automatically confirming migration 207 for instance 799f9246-bc05-4ae8-8737-4f358240f586
2014-09-30 16:15:01.109 112868 WARNING nova.compute.manager [-] [instance: 799f9246-bc05-4ae8-8737-4f358240f586] Setting migration 207 to error: In states stopped/resize_finish, not RESIZED/None
This causes _poll_unconfirmed_resizes to see that the VM task_state is
still 'resize_finish' instead of None, and to set the migration record
to error state, which in turn causes the VM to be stuck in resizing
forever.
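The window described above can be reproduced in a toy model. The sketch
below is not nova code: the Instance and Migration classes, the
finish_resize() generator and the poll() function are hypothetical
stand-ins, and the state strings are simplified. It only assumes what
the report states, namely that the migration record is saved before the
instance record and that the unpatched check treats anything other than
RESIZED/None as an error.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instance:
    # Hypothetical stand-in for the instance record; states match the log.
    vm_state: str = "stopped"
    task_state: Optional[str] = "resize_finish"


@dataclass
class Migration:
    # Hypothetical stand-in for the migration record.
    status: str = "migrating"


def finish_resize(migration, instance):
    """Yield after each save so a caller can interleave a poll."""
    migration.status = "finished"   # first save: migration record
    yield "migration saved"
    instance.vm_state = "resized"   # second save: instance record
    instance.task_state = None
    yield "instance saved"


def poll(migration, instance):
    """Unpatched check: anything other than resized/None is an error."""
    if migration.status != "finished":
        return
    if instance.vm_state != "resized" or instance.task_state is not None:
        migration.status = "error"


migration, instance = Migration(), Instance()
steps = finish_resize(migration, instance)
next(steps)                 # migration saved, instance not yet updated
poll(migration, instance)   # periodic task fires inside the window
next(steps)                 # instance save completes too late
print(migration.status, instance.vm_state, instance.task_state)
```

In this model the migration ends up in error while the instance is in
the resized state, so nothing is left that would confirm the resize.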
Two fixes have been proposed for this issue so far but were reverted
because they caused other race conditions. See the following two bugs
for more details.
https://bugs.launchpad.net/nova/+bug/1321298
https://bugs.launchpad.net/nova/+bug/1326778
This timing issue still exists in Juno today in an environment with
periodic tasks set to run once every 60 seconds and with a
resize_confirm_window of 1 second.
Would a possible solution for this be to change the code in
_poll_unconfirmed_resizes() to ignore any VMs with a task state of
'resize_finish' instead of setting the corresponding migration record
to error? This is the task_state the VM should have right before it is
changed to None in finish_resize(). The next time
_poll_unconfirmed_resizes() is called, the migration record will still
be fetched and the VM will be checked again with the updated
vm_state/task_state.
Add the following in _poll_unconfirmed_resizes():

    # This removes a race condition
    if task_state == 'resize_finish':
        continue

prior to:

    elif vm_state != vm_states.RESIZED or task_state is not None:
        reason = (_("In states %(vm_state)s/%(task_state)s, not "
                    "RESIZED/None") %
                  {'vm_state': vm_state,
                   'task_state': task_state})
        _set_migration_to_error(migration, reason,
                                instance=instance)
        continue
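
The effect of the proposed check can be sketched outside nova. The
function below is a simplified, hypothetical model of one polling pass,
not the real ComputeManager method: the dict-based records, the string
constants and the "confirmed" status are assumptions made for
illustration. It shows that a migration skipped because of
'resize_finish' keeps its "finished" status and is picked up again on
the next pass.

```python
# Plain strings standing in for nova's vm_states / task_states constants.
RESIZED = "resized"
RESIZE_FINISH = "resize_finish"


def poll_unconfirmed_resizes(migrations, instances):
    """Run one polling pass; return (confirmed_ids, errored_ids).

    migrations: dict of id -> {"status": ..., "instance_uuid": ...}
    instances:  dict of uuid -> {"vm_state": ..., "task_state": ...}
    """
    confirmed, errored = [], []
    for mig_id, mig in migrations.items():
        if mig["status"] != "finished":
            continue
        inst = instances[mig["instance_uuid"]]
        vm_state, task_state = inst["vm_state"], inst["task_state"]
        # Proposed check: the instance save has not landed yet, so leave
        # the migration untouched and look at it again on the next pass.
        if task_state == RESIZE_FINISH:
            continue
        elif vm_state != RESIZED or task_state is not None:
            mig["status"] = "error"
            errored.append(mig_id)
            continue
        mig["status"] = "confirmed"
        confirmed.append(mig_id)
    return confirmed, errored


migrations = {207: {"status": "finished", "instance_uuid": "u1"}}
instances = {"u1": {"vm_state": "stopped", "task_state": RESIZE_FINISH}}

first = poll_unconfirmed_resizes(migrations, instances)   # inside the window
instances["u1"].update(vm_state=RESIZED, task_state=None)
second = poll_unconfirmed_resizes(migrations, instances)  # next periodic run
print(first, second)
```

The trade-off in this sketch is that an instance left in
'resize_finish' indefinitely would be skipped on every pass rather than
flagged, which is a design point the fix would need to weigh.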
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1376933/+subscriptions