yahoo-eng-team team mailing list archive

[Bug 1846262] [NEW] Failed resize claim leaves otherwise active instance in ERROR state

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Tue, 01 Oct 2019 19:43:53 -0000
Reply-to: Bug 1846262 <1846262@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

I noticed this while working on a functional test to recreate a bug
during resize reschedule:

https://review.opendev.org/#/c/686017/

And discussed a bit in IRC:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-10-01.log.html#t2019-10-01T16:33:27

The issue is that we can start a resize (or cold migration) of a stopped
or active (normally active) server and then fail the resize claim in the
compute service, either because of a race or because the claim covers
resources that placement does not handle yet, like NUMA and PCI devices:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4527

That ResourceTracker.resize_claim call can raise
ComputeResourcesUnavailable, which is handled here:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4610

We may try to reschedule, but if rescheduling fails or we don't
reschedule, the instance is set to ERROR state by this context manager:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4592

That will set the instance vm_state to error:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L8809

If we fail a resize claim, there is actually no change in the guest,
just as when a cold migration fails because the scheduler selected the
same host and the virt driver does not support that, see:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4489

If _prep_resize raises InstanceFaultRollback,
_error_out_instance_on_exception handles it differently since
https://review.opendev.org/#/c/633212/: it does not put the instance
into ERROR state but reverts the vm_state to its previous value (active
or stopped).
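
To make the difference between the two paths concrete, here is a toy
model of that context manager. It is a sketch, not the nova code: the
Instance class, the exception classes and the resize_prep_outcome helper
are stand-ins I made up for illustration, and the real method works on
instance objects and persists its changes. It only shows that a plain
exception ends in ERROR while one wrapped in InstanceFaultRollback
restores the previous vm_state.

```python
import contextlib

ACTIVE, STOPPED, ERROR = "active", "stopped", "error"


class ComputeResourcesUnavailable(Exception):
    """Stand-in for the exception raised by a failed resize claim."""


class InstanceFaultRollback(Exception):
    """Stand-in wrapper signalling that the guest was not changed."""

    def __init__(self, inner_exception):
        super().__init__(str(inner_exception))
        self.inner_exception = inner_exception


class Instance:
    """Hypothetical minimal instance: just the two state fields."""

    def __init__(self, vm_state):
        self.vm_state = vm_state
        self.task_state = "resize_prep"


@contextlib.contextmanager
def error_out_instance_on_exception(instance):
    # Remember the state we started in so a rollback can restore it.
    original_vm_state = instance.vm_state
    try:
        yield
    except InstanceFaultRollback as error:
        # Guest untouched: revert vm_state and surface the real failure.
        instance.vm_state = original_vm_state
        instance.task_state = None
        raise error.inner_exception
    except Exception:
        # Any other failure leaves the instance in ERROR.
        instance.vm_state = ERROR
        instance.task_state = None
        raise


def resize_prep_outcome(instance, wrap_in_rollback):
    """Simulate a failed resize claim and return the resulting vm_state."""
    failure = ComputeResourcesUnavailable("resize claim failed")
    try:
        with error_out_instance_on_exception(instance):
            if wrap_in_rollback:
                raise InstanceFaultRollback(failure)
            raise failure
    except ComputeResourcesUnavailable:
        pass  # the caller would record a fault here
    return instance.vm_state
```

In this model, wrapping the claim failure in InstanceFaultRollback is
all it takes to keep an active server active; whether that is the right
behavior for a real resize claim failure is the open question below.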

If the guest is not changed I don't think the instance should be in
ERROR status because of a resize claim failure, but opinions on that
differ, e.g.:

(11:40:45 AM) mriedem: dansmith: ok, but still, the user shouldn't have to stop and then start to get out of that, or hard reboot, when the thing that failed is a resize claim race
(11:41:03 AM) dansmith: mriedem: so maybe it's just stop I'm thinking of.. anyway, I dunno.. it's very annoying as a user to do something, come back later and have it not obvious that the thing has happened, or failed or whatever
(11:41:52 AM) dansmith: mriedem: if you're going to retry the operation for them, I agree. if you're not, then being super obvious about what has happened is best, IMHO

If we aren't going to handle the resize claim failure automatically and
avoid setting the instance to ERROR state, then we should at least have
something in the API reference documentation about post-conditions for
resize and cold migrate actions, such that if the instance is in ERROR
state and there is a fault for the resize claim failure, the user can
stop/start or hard reboot the server to reset its status. I do think we
have some precedent for handling non-error conditions like this, though,
since https://review.opendev.org/#/c/633227/.

This is latent behavior so I'm going to mark it low priority but I
wanted to make sure we have a bug reported for it.

** Affects: nova
     Importance: Low
         Status: Triaged


** Tags: resize

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1846262

Title:
  Failed resize claim leaves otherwise active instance in ERROR state

Status in OpenStack Compute (nova):
  Triaged

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1846262/+subscriptions