yahoo-eng-team team mailing list archive

[Bug 1944759] Re: confirm resize fails with CPUUnpinningInvalid

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1944759@xxxxxxxxxxxxxxxxxx>
Date: Thu, 30 Sep 2021 20:07:31 -0000
Reply-to: Bug 1944759 <1944759@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Reviewed:  https://review.opendev.org/c/openstack/nova/+/810909
Committed: https://opendev.org/openstack/nova/commit/b841e553214be9a732703e2dfed6c97698ef9b71
Submitter: "Zuul (22348)"
Branch:    master

commit b841e553214be9a732703e2dfed6c97698ef9b71
Author: Balazs Gibizer <balazs.gibizer@xxxxxxxx>
Date:   Fri Sep 24 15:17:28 2021 +0200

    Store old_flavor already on source host during resize
    
    During resize, on the source host, in resize_instance(), the
    instance.host and .node are updated to point to the destination host.
    This indicates to the source host's resource tracker that the
    allocation of this instance does not need to be tracked as an instance
    but as an outbound migration instead. However, for the source host's
    resource tracker to do that, it needs to use the instance.old_flavor.
    Unfortunately, the instance.old_flavor is only set during
    finish_resize() on the destination host (resize_instance() casts to
    finish_resize()). So it is possible that a running resize_instance()
    sets the instance.host to point to the destination and then, before
    finish_resize() can set the old_flavor, an update_available_resources
    periodic runs on the source host. As a result, the allocation of this
    instance is not tracked as an instance, because the instance.host
    points to the destination, and it is not tracked as a migration
    either, because the instance.old_flavor is not yet set. So the
    allocation on the source host is simply dropped by the periodic job.
    
    When such a migration is confirmed, confirm_resize() tries to drop
    the same resource allocation again but fails, as the pinned CPUs of
    the instance have already been freed.
    
    When such a migration is reverted instead, the revert succeeds, but
    the source host's tracked resource usage will not include the
    resource allocation of the instance until the next
    update_available_resources periodic runs and corrects it.
    
    This does not affect resources tracked exclusively in placement (e.g.
    VCPU, MEMORY_MB, DISK_GB), but it does affect NUMA-related resources
    that are still tracked in the resource tracker (e.g. huge pages,
    pinned CPUs).
    
    This patch moves the setting of instance.old_flavor to the source
    node, into the same transaction that sets the instance.host to point
    to the destination host, which solves the race condition.
    
    Change-Id: Ic0d6c59147abe5e094e8f13e0d3523b178daeba9
    Closes-Bug: #1944759
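
The race described in the commit message above can be sketched with a
small toy model. This is not nova code: the Instance and SourceTracker
classes, the run_resize() helper and its set_old_flavor_on_source flag
are all hypothetical names invented for illustration, and the model
assumes that a periodic run rebuilds the tracked pinned CPUs from
scratch. Only the exception name and the compute manager method names in
the comments come from the report.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


class CPUUnpinningInvalid(Exception):
    """Toy stand-in for the nova exception of the same name."""


@dataclass
class Instance:
    # Hypothetical, heavily simplified instance record.
    host: str
    flavor: str
    pinned_cpus: Set[int]
    old_flavor: Optional[str] = None


@dataclass
class SourceTracker:
    """Toy model of the source host's tracking of pinned CPUs."""
    host: str
    pinned: Set[int] = field(default_factory=set)

    def periodic(self, instance: Instance) -> None:
        # Assumption: the periodic rebuilds usage from scratch.
        self.pinned = set()
        if instance.host == self.host:
            # Tracked as an instance on this host.
            self.pinned |= instance.pinned_cpus
        elif instance.old_flavor is not None:
            # Tracked as an outbound migration.
            self.pinned |= instance.pinned_cpus
        # Otherwise the allocation is silently dropped.

    def confirm_resize(self, instance: Instance) -> None:
        # Freeing CPUs that are not tracked as pinned is an error.
        if not instance.pinned_cpus <= self.pinned:
            raise CPUUnpinningInvalid("CPUs are not pinned on source")
        self.pinned -= instance.pinned_cpus


def run_resize(set_old_flavor_on_source: bool) -> str:
    inst = Instance(host="src", flavor="small", pinned_cpus={0, 1})
    tracker = SourceTracker(host="src")
    tracker.periodic(inst)

    # resize_instance() on the source host.
    inst.host = "dest"
    if set_old_flavor_on_source:
        # Behaviour after the patch: set together with instance.host.
        inst.old_flavor = inst.flavor

    # The periodic runs in the window before finish_resize().
    tracker.periodic(inst)

    # finish_resize() on the destination host.
    inst.old_flavor = "small"
    inst.flavor = "large"

    try:
        tracker.confirm_resize(inst)
    except CPUUnpinningInvalid:
        return "failed"
    return "confirmed"
```

In this model, run_resize(False) corresponds to the ordering before the
patch and run_resize(True) to the ordering after it. The real resource
tracker handles NUMA topologies, migration records and locking that the
sketch leaves out.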


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1944759

Title:
  confirm resize fails with CPUUnpinningInvalid

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Nova has a race condition between the resize_instance() compute
  manager call and the update_available_resources periodic job. If they
  overlap at the right point, when resize_instance() calls
  finish_resize(), then the periodic job will track neither the
  migration nor the instance on the source host. As a result, the PCPU
  allocation on the source host is dropped in the resource tracker (not
  in placement). Then, when the resize is confirmed, nova tries to free
  the pinned CPUs again on the source host and fails with
  CPUUnpinningInvalid, as they are already freed.

  I've pushed a reproduction test:
  https://review.opendev.org/c/openstack/nova/+/810763

  It is reproducible at least on master, xena, wallaby, and victoria.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1944759/+subscriptions

References

[Bug 1944759] [NEW] confirm resize fails with CPUUnpinningInvalid
From: Balazs Gibizer, 2021-09-23