yahoo-eng-team team mailing list archive
Message #80228
[Bug 1846262] [NEW] Failed resize claim leaves otherwise active instance in ERROR state
Public bug reported:
I noticed this while working on a functional test to recreate a bug
during resize reschedule:
https://review.opendev.org/#/c/686017/
And discussed a bit in IRC:
http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-10-01.log.html#t2019-10-01T16:33:27
The issue is that we can start a resize (or cold migration) of a stopped
or active (normally active) server and then fail the resize claim in the
compute service, either because of a race or because of resource claims
that placement does not handle yet, such as NUMA and PCI devices:
https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4527
That ResourceTracker.resize_claim call can raise
ComputeResourcesUnavailable, which is handled here:
https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4610
We may try to reschedule but if rescheduling fails, or we don't
reschedule, the instance is set to error state by this context manager:
https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4592
That will set the instance vm_state to error:
https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L8809
If we fail a resize claim, there is actually no change in the guest,
just as when a cold migration fails because the scheduler selected the
same host and the virt driver does not support that, see:
https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4489
If _prep_resize raises InstanceFaultRollback,
_error_out_instance_on_exception has handled it differently since
https://review.opendev.org/#/c/633212/: it does not put the instance
into ERROR state but reverts the vm_state to its previous value (active
or stopped).
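To make the two outcomes concrete, here is a minimal, self-contained sketch of that kind of context manager. It is not the nova code: the exception classes, the Instance class and the resize_with_claim_failure helper are simplified stand-ins I made up for illustration, and only the behavior described above (revert on InstanceFaultRollback, otherwise set vm_state to error) is modeled.

```python
import contextlib

ACTIVE, STOPPED, ERROR = "active", "stopped", "error"


class ComputeResourcesUnavailable(Exception):
    """Stand-in for the exception raised when a resize claim fails."""


class InstanceFaultRollback(Exception):
    """Stand-in: wraps the real failure and signals the guest is unchanged."""

    def __init__(self, inner_exception):
        super().__init__(str(inner_exception))
        self.inner_exception = inner_exception


class Instance:
    """Hypothetical minimal instance record with only the fields used here."""

    def __init__(self, vm_state):
        self.vm_state = vm_state
        self.task_state = None


@contextlib.contextmanager
def error_out_instance_on_exception(instance):
    """Revert vm_state on InstanceFaultRollback, otherwise set it to error."""
    original_vm_state = instance.vm_state
    try:
        yield
    except InstanceFaultRollback as error:
        # Guest untouched: restore the previous state, surface the real cause.
        instance.vm_state = original_vm_state
        instance.task_state = None
        raise error.inner_exception
    except Exception:
        # Any other failure leaves the instance in ERROR.
        instance.vm_state = ERROR
        instance.task_state = None
        raise


def resize_with_claim_failure(instance, rollback):
    """Simulate a failed resize claim and return the resulting vm_state."""
    try:
        with error_out_instance_on_exception(instance):
            instance.task_state = "resize_prep"
            failure = ComputeResourcesUnavailable("claim failed")
            raise InstanceFaultRollback(failure) if rollback else failure
    except ComputeResourcesUnavailable:
        pass
    return instance.vm_state
```

In this sketch, raising the claim failure directly leaves an active instance in error, while wrapping the same failure in InstanceFaultRollback leaves it active, which is the difference this bug is about.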
If the guest is not changed I don't think the instance should be in
ERROR status because of a resize claim failure, but opinions on that
differ, e.g.:
(11:40:45 AM) mriedem: dansmith: ok, but still, the user shouldn't have to stop and then start to get out of that, or hard reboot, when the thing that failed is a resize claim race
(11:41:03 AM) dansmith: mriedem: so maybe it's just stop I'm thinking of.. anyway, I dunno.. it's very annoying as a user to do something, come back later and have it not obvious that the thing has happened, or failed or whatever
(11:41:52 AM) dansmith: mriedem: if you're going to retry the operation for them, I agree. if you're not, then being super obvious about what has happened is best, IMHO
If we aren't going to automatically handle the resize claim failure and
avoid setting the instance to error state, then we should at least have
something in the API reference documentation about post-conditions for
resize and cold migrate actions, such that if the instance is in ERROR
state and there is a fault for the resize claim failure, the user can
stop/start or hard reboot the server to reset its status. I do think we
have some precedent for handling non-error conditions like this though,
since https://review.opendev.org/#/c/633227/.
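As a rough illustration of what such a documented post-condition could look like from the client side, the sketch below inspects a server representation and returns the server action bodies that would reset its status. The recovery_actions helper is hypothetical, and the marker string used to recognize a resize claim fault is an assumption about the fault message text, so treat it as a placeholder rather than a guaranteed value.

```python
# Assumption: text expected in the fault message of a failed resize claim.
CLAIM_FAILURE_MARKER = "Insufficient compute resources"


def recovery_actions(server, marker=CLAIM_FAILURE_MARKER):
    """Return alternative action sequences that reset the status, or []."""
    if server.get("status") != "ERROR":
        return []
    fault = server.get("fault") or {}
    if marker not in fault.get("message", ""):
        # ERROR for some other reason: do not suggest a blind reset.
        return []
    # Either sequence works: stop then start, or a single hard reboot.
    return [
        [{"os-stop": None}, {"os-start": None}],
        [{"reboot": {"type": "HARD"}}],
    ]
```

Each inner list is one alternative; every dict in it would be sent as the body of a server action request, in order.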
This is latent behavior so I'm going to mark it low priority but I
wanted to make sure we have a bug reported for it.
** Affects: nova
Importance: Low
Status: Triaged
** Tags: resize
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1846262
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1846262/+subscriptions