yahoo-eng-team team mailing list archive

[Bug 1855927] [NEW] _poll_unconfirmed_resizes may not retry later if confirm_resize fails in API

 

Public bug reported:

This is based on code inspection. Say I have configured my computes
with resize_confirm_window=3600 to automatically confirm a resized
server after 1 hour, and the source compute service goes down within
that hour.

The periodic task gets the unconfirmed migrations with status='finished'
whose last update is older than the configured window:

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/manager.py#L8793

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/db/sqlalchemy/api.py#L4342
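To make the selection rule concrete, here is a minimal sketch of the filter described above. It is not nova's DB API code: the Migration dataclass and the unconfirmed_migrations helper are hypothetical stand-ins that only model the two conditions (status and age).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Migration:
    # Hypothetical stand-in for a migration record, not nova's object model.
    id: int
    status: str
    updated_at: datetime


def unconfirmed_migrations(migrations, confirm_window, now):
    """Return 'finished' migrations last updated more than confirm_window seconds ago."""
    cutoff = now - timedelta(seconds=confirm_window)
    return [m for m in migrations
            if m.status == 'finished' and m.updated_at < cutoff]


now = datetime(2019, 12, 10, 12, 0, tzinfo=timezone.utc)
records = [
    Migration(1, 'finished', now - timedelta(hours=2)),    # old enough
    Migration(2, 'finished', now - timedelta(minutes=5)),  # still inside the window
    Migration(3, 'confirming', now - timedelta(hours=2)),  # wrong status
]
selected_ids = [m.id for m in unconfirmed_migrations(records, 3600, now)]
print(selected_ids)
```

Only the first record qualifies. Note that the third one is skipped purely because of its status, which is what matters for the rest of this report.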

The periodic task then calls the compute API code to confirm the resize:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L7160

which changes the migration status to 'confirming':

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/api.py#L3684

And casts off to the source compute:

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/rpcapi.py#L600

Now if the source compute is down and that call fails, the periodic task
code in the compute manager will handle the error and say it will retry
later:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L7163

However, because the migration status was changed from 'finished' to
'confirming', the task will not retry: its DB query no longer finds the
migration. Trying to confirm the resize via the API will fail as well,
with MigrationNotFoundByStatus, since the migration status is no longer
'finished':

https://github.com/openstack/nova/blob/5a3ef39539ca112ae0552aef5cbd536338db61b7/nova/compute/api.py#L3681
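The sequence above can be modelled in a few lines. This is a simplified sketch under the report's assumption that the cast to a down source compute raises. The function names, the SourceComputeDown error, and the dict-based migration are all hypothetical. MigrationNotFoundByStatus is redefined locally and only shares its name with the nova exception.

```python
class MigrationNotFoundByStatus(Exception):
    """Local stand-in named after the exception mentioned above."""


class SourceComputeDown(Exception):
    """Hypothetical error for a failed cast to the source compute."""


def api_confirm_resize(migration, source_compute_up):
    # Same order as described above: look up by status, flip the status, cast.
    if migration['status'] != 'finished':
        raise MigrationNotFoundByStatus(migration['status'])
    migration['status'] = 'confirming'
    if not source_compute_up:
        raise SourceComputeDown()


def poll_unconfirmed_resizes(migrations, source_compute_up):
    """One pass of the periodic task; returns the ids it attempted."""
    attempted = []
    for migration in migrations:
        if migration['status'] != 'finished':
            continue  # the query only returns 'finished' migrations
        attempted.append(migration['id'])
        try:
            api_confirm_resize(migration, source_compute_up)
        except SourceComputeDown:
            pass  # "will retry later", but the status stays 'confirming'
    return attempted


migrations = [{'id': 1, 'status': 'finished'}]
first_pass = poll_unconfirmed_resizes(migrations, source_compute_up=False)
second_pass = poll_unconfirmed_resizes(migrations, source_compute_up=True)

# A manual confirm through the API hits the same status check.
try:
    api_confirm_resize(migrations[0], source_compute_up=True)
    manual_result = 'confirmed'
except MigrationNotFoundByStatus:
    manual_result = 'MigrationNotFoundByStatus'

print(first_pass, second_pass, migrations[0]['status'], manual_result)
```

In this model the second pass attempts nothing even though the source compute is back, and the manual confirm is rejected, so the migration stays in 'confirming'.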

The compute manager code should probably mark the migration status as
'finished' again if it's really going to try later, or mark the
migration status as 'error'. Note that the confirm_resize method in the
compute manager doesn't mark the migration status as 'error' if
something fails there either:

https://github.com/openstack/nova/blob/c295e395d/nova/compute/manager.py#L3807
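One possible shape for the suggestion above, as a sketch only and not a proposed patch: wrap the confirm call and put the migration back to 'finished' (or to 'error') when it fails. The helper confirm_with_rollback, its failure_status parameter, and the failing_confirm stub are hypothetical. Whether reverting to 'finished' is safe in nova depends on how far the confirmation got, which this model does not cover.

```python
def confirm_with_rollback(migration, confirm, failure_status='finished'):
    """Call confirm(migration); on failure set failure_status and return False."""
    try:
        confirm(migration)
        return True
    except Exception:
        # Assumption: the confirmation never reached the source compute, so
        # restoring the status (or marking 'error') does not lose any work.
        migration['status'] = failure_status
        return False


def failing_confirm(migration):
    # Stub for the API call: flips the status, then the cast fails.
    migration['status'] = 'confirming'
    raise RuntimeError('source compute unreachable')


retryable = {'id': 1, 'status': 'finished'}
retry_ok = confirm_with_rollback(retryable, failing_confirm)

errored = {'id': 2, 'status': 'finished'}
error_ok = confirm_with_rollback(errored, failing_confirm, failure_status='error')

print(retry_ok, retryable['status'], error_ok, errored['status'])
```

With 'finished' restored, the next periodic pass would find the migration again. With 'error', the failure is at least visible instead of leaving the migration in 'confirming'.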

** Affects: nova
     Importance: Low
         Status: New


** Tags: error-handling migrate resize

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1855927

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1855927/+subscriptions

