
yahoo-eng-team team mailing list archive

[Bug 1784983] Re: we should not set instance to ERROR state when rebuild_claim failed

 

Reviewed:  https://review.opendev.org/692185
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=26e1d9c7237f7bd97ec5f1fd3e572b3927eea725
Submitter: Zuul
Branch:    master

commit 26e1d9c7237f7bd97ec5f1fd3e572b3927eea725
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date:   Wed Oct 30 12:11:43 2019 -0400

    Reset vm_state to original value if rebuild claim fails
    
    If while evacuating an active or stopped server the rebuild
    resource claim or group affinity policy check fails, the state
    of the server has not actually changed but the vm_state is changed
    to ERROR because of the _error_out_instance_on_exception context
    manager.
    
    This builds on Ie4f9177f4d54cbc7dbcf58bd107fd5f24c60d8bb by
    wrapping the BuildAbortException in InstanceFaultRollback for the
    claim/group policy failures so the vm_state remains unchanged.
    Note that the overall instance action record will still be marked
    as a failure since the BuildAbortException is re-raised and the
    wrap_instance_event decorator will fail the action (this is how the
    user can know the operation failed).
    
    Change-Id: I07fa46690d8f7b846665bc59c5e361873154382b
    Closes-Bug: #1784983
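
For reference, a minimal runnable sketch of the pattern the fix applies (the helper and instance structures here are hypothetical, not the actual nova code): the claim/group-policy failure is re-raised wrapped in InstanceFaultRollback, so the error-handling context manager restores the original vm_state instead of forcing ERROR, while the BuildAbortException still propagates so the operation is reported as failed.

# Simplified illustration of the fix; everything except the exception class
# names BuildAbortException / InstanceFaultRollback is hypothetical.
import contextlib


class BuildAbortException(Exception):
    pass


class InstanceFaultRollback(Exception):
    """Carries an inner exception whose handling must not change vm_state."""
    def __init__(self, inner_exception):
        super().__init__(str(inner_exception))
        self.inner_exception = inner_exception


@contextlib.contextmanager
def error_out_instance_on_exception(instance):
    original_vm_state = instance['vm_state']
    try:
        yield
    except InstanceFaultRollback as exc:
        # Claim/group-policy failures: keep the old state and re-raise the
        # wrapped exception so the instance action still records a failure.
        instance['vm_state'] = original_vm_state
        raise exc.inner_exception
    except Exception:
        # Anything else genuinely broke the instance: go to ERROR.
        instance['vm_state'] = 'error'
        raise


def rebuild_instance(instance, claim_succeeds):
    with error_out_instance_on_exception(instance):
        if not claim_succeeds:
            # With the fix, the abort is wrapped so vm_state is untouched.
            raise InstanceFaultRollback(
                BuildAbortException('Insufficient compute resources'))
        instance['vm_state'] = 'active'


instance = {'vm_state': 'stopped'}
try:
    rebuild_instance(instance, claim_succeeds=False)
except BuildAbortException:
    pass
print(instance['vm_state'])  # stays 'stopped' instead of flipping to 'error'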


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1784983

Title:
  we should not set instance to ERROR state when rebuild_claim failed

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========
  When a compute node is down, we evacuate the instances located on that node. In a concurrent scenario, several instances may select the same destination node and, unfortunately, the destination does not have enough memory for some of them. The destination node then raises a ComputeResourcesUnavailable exception and finally sets those instances to the ERROR state. But I think that on a ComputeResourcesUnavailable exception we should not set the instance to the ERROR state: the instance in fact still remains on the source node.
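
  To illustrate, a minimal self-contained sketch of the reported behaviour (hypothetical names, loosely mirroring the traceback in the Logs section rather than the actual nova code): any exception escaping the rebuild path marks the instance ERROR, even though the claim failed on the destination and the instance is untouched on the source host.

# Simplified, hypothetical sketch of the pre-fix behaviour being reported.
import contextlib


class ComputeResourcesUnavailable(Exception):
    pass


class BuildAbortException(Exception):
    pass


@contextlib.contextmanager
def error_out_instance_on_exception(instance):
    try:
        yield
    except Exception:
        # Pre-fix: any failure during rebuild/evacuate forces ERROR.
        instance['vm_state'] = 'error'
        raise


def rebuild_claim(free_memory_mb, requested_mb):
    # Destination-side memory check, as in the log message below.
    if free_memory_mb < requested_mb:
        raise ComputeResourcesUnavailable(
            'Free memory %.2f MB < requested %d MB'
            % (free_memory_mb, requested_mb))


def evacuate(instance, free_memory_mb, requested_mb):
    with error_out_instance_on_exception(instance):
        try:
            rebuild_claim(free_memory_mb, requested_mb)
        except ComputeResourcesUnavailable as e:
            raise BuildAbortException(
                'Build of instance aborted: Insufficient compute '
                'resources: %s' % e)


instance = {'vm_state': 'active'}  # still intact on the down source host
try:
    evacuate(instance, free_memory_mb=1141.0, requested_mb=2048)
except BuildAbortException:
    pass
print(instance['vm_state'])  # 'error' -- the state this report objects to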

  Steps to reproduce
  ==================
  * Create many instances on one source node, and make sure the destination node has little free resource, such as memory.
  * Power off the compute node or stop the compute service on that node.
  * Concurrently evacuate all instances from the source node, specifying the destination node.
  * You will find that one or more instances end up in the ERROR state.

  
  Expected result
  ===============
  No instance should be put into the ERROR state when there are not enough resources on the destination.

  Actual result
  =============
  Some instances are in the ERROR state.

  Environment
  ===========
  Pike release, but I found that the issue also exists in the master branch.

  
  Logs & Configs
  ==============
  2018-08-01 16:21:45.739 41514 DEBUG nova.notifications.objects.base [req-1710e7e5-9073-47f1-8ae8-1e68c65272c9 855c20651d244348b10c91d907aa59ca - - - -] Defaulting the value of the field 'projects' to None in FlavorPayload due to 'Cannot call _load_projects on orphaned Flavor object' populate_schema /usr/lib/python2.7/site-packages/nova/notifications/objects/base.py:125
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [req-1710e7e5-9073-47f1-8ae8-1e68c65272c9 855c20651d244348b10c91d907aa59ca - - - -] [instance: 5b8ae80d-7e33-4099-8732-905355cee045] Setting instance vm_state to ERROR: BuildAbortException: Build of instance 5b8ae80d-7e33-4099-8732-905355cee045 aborted: Insufficient compute resources: Free memory 1141.00 MB < requested 2048 MB.
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [instance: 5b8ae80d-7e33-4099-8732-905355cee045] Traceback (most recent call last):
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [instance: 5b8ae80d-7e33-4099-8732-905355cee045]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7142, in _error_out_instance_on_exception
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [instance: 5b8ae80d-7e33-4099-8732-905355cee045]     yield
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [instance: 5b8ae80d-7e33-4099-8732-905355cee045]   File "/usr/lib/python2.7/site-packages/nova/fh/compute/manager.py", line 700, in rebuild_instance
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [instance: 5b8ae80d-7e33-4099-8732-905355cee045]     instance_uuid=instance.uuid, reason=e.format_message())
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [instance: 5b8ae80d-7e33-4099-8732-905355cee045] BuildAbortException: Build of instance 5b8ae80d-7e33-4099-8732-905355cee045 aborted: Insufficient compute resources: Free memory 1141.00 MB < requested 2048 MB.
  2018-08-01 16:21:45.747 41514 ERROR nova.compute.manager [instance: 5b8ae80d-7e33-4099-8732-905355cee045]

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1784983/+subscriptions

