yahoo-eng-team team mailing list archive

[Bug 1376933] [NEW] _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Jennifer Mulsow <jmulsow@xxxxxxxxxx>
Date: Thu, 02 Oct 2014 21:23:53 -0000
Reply-to: Bug 1376933 <1376933@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

A race occurs if the _poll_unconfirmed_resizes periodic task runs while
nova/compute/manager.py:ComputeManager._finish_resize() is executing,
after the migration record has been updated in the database but before
the instance has been updated:

2014-09-30 16:15:00.897 112868 INFO nova.compute.manager [-] Automatically confirming migration 207 for instance 799f9246-bc05-4ae8-8737-4f358240f586
2014-09-30 16:15:01.109 112868 WARNING nova.compute.manager [-] [instance: 799f9246-bc05-4ae8-8737-4f358240f586] Setting migration 207 to error: In states stopped/resize_finish, not RESIZED/None

This causes _poll_unconfirmed_resizes to see that the VM task_state is
still 'resize_finish' instead of None and to set the migration record to
error state, which in turn causes the VM to be stuck in resizing
forever.
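
The interleaving can be reproduced with a toy model. This is only a
sketch: plain dicts stand in for Nova's migration and instance records,
the function and the 'post-migrating'/'confirmed' status strings are
illustrative, and it assumes the poller only picks up migrations whose
status is 'finished'.

```python
# Toy model of the timing window; plain dicts stand in for Nova's
# migration and instance records (all names here are illustrative).

def poll_unconfirmed_resizes(migration, instance):
    """Mimic the check quoted in this report: anything other than
    RESIZED/None marks the migration as failed."""
    if migration["status"] != "finished":
        return  # assumption: only finished migrations are candidates
    if instance["vm_state"] != "resized" or instance["task_state"] is not None:
        migration["status"] = "error"
    else:
        migration["status"] = "confirmed"  # stands in for auto-confirm


migration = {"status": "post-migrating"}
instance = {"vm_state": "stopped", "task_state": "resize_finish"}

# Step 1 of the resize completion: the migration record is saved first.
migration["status"] = "finished"

# The periodic task fires inside the window, before step 2.
poll_unconfirmed_resizes(migration, instance)

# Step 2: the instance record is saved, too late to help.
instance.update(vm_state="resized", task_state=None)

# Later polls ignore the migration because it is no longer 'finished'.
poll_unconfirmed_resizes(migration, instance)
print(migration["status"], instance["vm_state"], instance["task_state"])
```

In this model the migration ends in 'error' even though the instance
reaches resized/None a moment later, so no later poll confirms it.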

Two fixes have been proposed for this issue so far but were reverted
because they caused other race conditions. See the following two bugs
for more details.

https://bugs.launchpad.net/nova/+bug/1321298
https://bugs.launchpad.net/nova/+bug/1326778

This timing issue still exists in Juno today in an environment with
periodic tasks set to run once every 60 seconds and with a
resize_confirm_window of 1 second.

Would a possible solution for this be to change the code in
_poll_unconfirmed_resizes() to ignore any VMs with a task state of
'resize_finish' instead of setting the corresponding migration record to
error? This is the task_state the VM should have right before it is
changed to None in finish_resize(). The next time
_poll_unconfirmed_resizes() is called, the migration record will still
be fetched and the VM will be checked again, this time in its updated
vm_state/task_state.

Add the following in _poll_unconfirmed_resizes():

            # This removes a race condition
            if task_state == 'resize_finish':
                continue

prior to:
            elif vm_state != vm_states.RESIZED or task_state is not None:
                reason = (_("In states %(vm_state)s/%(task_state)s, not "
                           "RESIZED/None") %
                          {'vm_state': vm_state,
                           'task_state': task_state})
                _set_migration_to_error(migration, reason,
                                        instance=instance)
                continue
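
To make the intended behaviour concrete, here is a standalone sketch of
the decision the poller would make with the extra check in place. The
classify() helper is hypothetical and not part of Nova, and the string
literals stand in for task_states.RESIZE_FINISH and vm_states.RESIZED.

```python
def classify(vm_state, task_state):
    """Return the action the poller would take under the proposed change.

    Hypothetical helper for illustration only.
    """
    if task_state == "resize_finish":
        return "skip"     # resize still completing; re-check next cycle
    if vm_state != "resized" or task_state is not None:
        return "error"    # existing behaviour: set migration to error
    return "confirm"      # RESIZED/None: safe to auto-confirm


# A poll that lands inside the window, then one that lands after it.
first_poll = classify("stopped", "resize_finish")
second_poll = classify("resized", None)
print(first_poll, second_poll)
```

A VM in any other unexpected state, for example classify("active",
None), still takes the existing error path, so only the
'resize_finish' window is treated differently.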

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1376933

Title:
  _poll_unconfirmed_resize timing window causes instance to stay in
  verify_resize state forever

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1376933/+subscriptions

Follow ups

[Bug 1376933] Re: _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever
From: Thierry Carrez, 2014-12-18
[Bug 1376933] Re: _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever
From: Alan Pevec, 2014-12-05
[Bug 1376933] Re: _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever
From: Alan Pevec, 2014-12-04
[Bug 1376933] [NEW] _poll_unconfirmed_resize timing window causes instance to stay in verify_resize state forever
From: Jennifer Mulsow, 2014-10-02
