yahoo-eng-team team mailing list archive

[Bug 1376933] Re: _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Alan Pevec <1376933@xxxxxxxxxxxxxxxxxx>
Date: Fri, 05 Dec 2014 08:17:03 -0000
Reply-to: Bug 1376933 <1376933@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

** Changed in: nova/juno
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1376933

Title:
  _poll_unconfirmed_resize timing window causes instance to stay in
  verify_resize state forever

Status in OpenStack Compute (Nova):
  Fix Committed
Status in OpenStack Compute (nova) juno series:
  Fix Released

Bug description:
  The _poll_unconfirmed_resizes periodic task can run while
  nova/compute/manager.py:ComputeManager._finish_resize() is executing,
  after the migration record has been updated in the database but
  before the instance has been updated.

  2014-09-30 16:15:00.897 112868 INFO nova.compute.manager [-] Automatically confirming migration 207 for instance 799f9246-bc05-4ae8-8737-4f358240f586
  2014-09-30 16:15:01.109 112868 WARNING nova.compute.manager [-] [instance: 799f9246-bc05-4ae8-8737-4f358240f586] Setting migration 207 to error: In states stopped/resize_finish, not RESIZED/None

  This causes _poll_unconfirmed_resizes to see that the VM task_state is
  still 'resize_finish' instead of None, and to set the migration record
  to error state, which in turn causes the VM to be stuck in resizing
  forever.
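
  The window can be illustrated with a small stand-alone model. This is
  a sketch, not nova code: FakeMigration, FakeInstance,
  finish_resize_steps and poll_once are hypothetical names, and the
  two-step generator only mimics the ordering described above (migration
  record saved first, instance saved second).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMigration:
    status: str = "migrating"


@dataclass
class FakeInstance:
    vm_state: str = "stopped"
    task_state: Optional[str] = "resize_finish"


def finish_resize_steps(migration, instance):
    """Yield after each write so a caller can interleave a poll between them."""
    migration.status = "finished"   # step 1: migration record saved
    yield "migration saved"
    instance.vm_state = "resized"   # step 2: instance saved
    instance.task_state = None
    yield "instance saved"


def poll_once(migration, instance):
    """Mimic the state check behind the warning in the log above (assumed shape)."""
    if migration.status != "finished":
        return "skipped"
    if instance.vm_state != "resized" or instance.task_state is not None:
        migration.status = "error"
        return "error"
    migration.status = "confirmed"
    return "confirmed"


migration, instance = FakeMigration(), FakeInstance()
steps = finish_resize_steps(migration, instance)

next(steps)                                        # migration saved, instance not yet
result_in_window = poll_once(migration, instance)  # poll lands inside the window
next(steps)                                        # instance becomes resized/None
result_after = poll_once(migration, instance)      # migration is no longer 'finished'
```

  In this model the instance ends up resized with no task in progress,
  but its migration is already in error, so later polls never pick it
  up for confirmation.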

  Two fixes have been proposed for this issue so far but were reverted
  because they caused other race conditions. See the following two bugs
  for more details.

  https://bugs.launchpad.net/nova/+bug/1321298
  https://bugs.launchpad.net/nova/+bug/1326778

  This timing issue still exists in Juno today in an environment with
  periodic tasks set to run once every 60 seconds and with a
  resize_confirm_window of 1 second.

  Would a possible solution for this be to change the code in
  _poll_unconfirmed_resizes() to ignore any VMs with a task state of
  'resize_finish' instead of setting the corresponding migration record
  to error? This is the task_state the VM should have right before it
  is changed to None in finish_resize(). The next time
  _poll_unconfirmed_resizes() is called, the migration record will still
  be fetched and the VM will be checked again with its updated
  vm_state/task_state.

  Add the following in _poll_unconfirmed_resizes():

              # This removes a race condition
              if task_state == 'resize_finish':
                  continue

  prior to:
              elif vm_state != vm_states.RESIZED or task_state is not None:
                  reason = (_("In states %(vm_state)s/%(task_state)s, not "
                             "RESIZED/None") %
                            {'vm_state': vm_state,
                             'task_state': task_state})
                  _set_migration_to_error(migration, reason,
                                          instance=instance)
                  continue
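
  The effect of the proposed guard over two consecutive runs of the
  periodic task can be sketched in a stand-alone form. This is a
  simplified model under stated assumptions, not the nova
  implementation: the dict-based records, the poll_unconfirmed_resizes
  function and the confirm callback are hypothetical stand-ins.

```python
RESIZED = "resized"
RESIZE_FINISH = "resize_finish"


def poll_unconfirmed_resizes(migrations, instances, confirm):
    """Simplified poll: look at every migration still marked 'finished'."""
    for migration in migrations:
        if migration["status"] != "finished":
            continue
        instance = instances[migration["instance_uuid"]]
        vm_state = instance["vm_state"]
        task_state = instance["task_state"]

        # Proposed guard: the instance is still inside the resize finish
        # step, so leave the migration untouched and re-check next run.
        if task_state == RESIZE_FINISH:
            continue

        if vm_state != RESIZED or task_state is not None:
            migration["status"] = "error"
            continue

        confirm(migration, instance)


instances = {"uuid-1": {"vm_state": "stopped", "task_state": RESIZE_FINISH}}
migrations = [{"id": 207, "instance_uuid": "uuid-1", "status": "finished"}]
confirmed = []


def confirm(migration, instance):
    # Hypothetical stand-in for the real confirmation call.
    migration["status"] = "confirmed"
    confirmed.append(migration["id"])


# First run lands inside the window: the guard skips the instance.
poll_unconfirmed_resizes(migrations, instances, confirm)
status_after_first_run = migrations[0]["status"]

# The instance update completes, then the next periodic run confirms it.
instances["uuid-1"].update(vm_state=RESIZED, task_state=None)
poll_unconfirmed_resizes(migrations, instances, confirm)
status_after_second_run = migrations[0]["status"]
```

  One trade-off visible in this sketch: an instance that never leaves
  'resize_finish' would be skipped on every run rather than having its
  migration set to error.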

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1376933/+subscriptions

References

[Bug 1376933] [NEW] _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever
From: Jennifer Mulsow, 2014-10-02