yahoo-eng-team team mailing list archive

[Bug 1944759] [NEW] confirm resize fails with CPUUnpinningInvalid

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Balazs Gibizer <1944759@xxxxxxxxxxxxxxxxxx>
Date: Thu, 23 Sep 2021 18:11:52 -0000
Reply-to: Bug 1944759 <1944759@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Public bug reported:

Nova has a race condition between the resize_instance() compute manager
call and the update_available_resources periodic job. If they overlap at
the right point, when resize_instance calls finish_resize, the periodic
job tracks neither the migration nor the instance on the source host. As
a result, the PCPU allocation on the source host is dropped in the
resource tracker (but not in placement). Later, when the resize is
confirmed, nova tries to free the pinned CPUs on the source host again
and fails with CPUUnpinningInvalid because they have already been freed.
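
The failure mode described above can be illustrated with a toy model. This
is only a sketch under stated assumptions, not nova's actual code:
ToyResourceTracker, periodic_update() and simulate() are hypothetical names,
and the CPUUnpinningInvalid class below is a local stand-in for the nova
exception of the same name. It assumes the periodic job rebuilds the pinned
CPU set from whatever usages it currently tracks.

```python
class CPUUnpinningInvalid(Exception):
    """Local stand-in for the nova exception of the same name."""


class ToyResourceTracker:
    """Hypothetical, heavily simplified model of a host's pinned CPU state."""

    def __init__(self):
        self.pinned = set()

    def pin(self, cpus):
        self.pinned |= set(cpus)

    def unpin(self, cpus):
        cpus = set(cpus)
        # Freeing CPUs that are not currently pinned is treated as an error.
        if not cpus <= self.pinned:
            raise CPUUnpinningInvalid(
                f"cannot unpin {sorted(cpus)}, pinned: {sorted(self.pinned)}"
            )
        self.pinned -= cpus

    def periodic_update(self, tracked_usages):
        # Assumption: the periodic job rebuilds the pinned set from scratch
        # using only the instances and migrations it currently tracks.
        self.pinned = set()
        for cpus in tracked_usages:
            self.pin(cpus)


def simulate(periodic_runs_in_race_window):
    """Return "ok" or the name of the error raised on confirm resize."""
    source = ToyResourceTracker()
    instance_cpus = {0, 1}
    source.pin(instance_cpus)  # the instance is running on the source host

    if periodic_runs_in_race_window:
        # In the race window the periodic job tracks neither the instance
        # nor the migration, so the source host's pinning is dropped.
        source.periodic_update(tracked_usages=[])

    try:
        source.unpin(instance_cpus)  # confirm resize frees the source CPUs
    except CPUUnpinningInvalid:
        return "CPUUnpinningInvalid"
    return "ok"


if __name__ == "__main__":
    print(simulate(periodic_runs_in_race_window=False))
    print(simulate(periodic_runs_in_race_window=True))
```

In this model the confirm step succeeds when the periodic job does not run
inside the window and fails when it does, because the CPUs are freed twice.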

I've pushed a reproduction test:
https://review.opendev.org/c/openstack/nova/+/810763

It is reproducible at least on master, xena, wallaby, and victoria.

** Affects: nova
     Importance: Medium
     Assignee: Balazs Gibizer (balazs-gibizer)
         Status: New


** Tags: compute numa race-condition resize

** Changed in: nova
     Assignee: (unassigned) => Balazs Gibizer (balazs-gibizer)

** Changed in: nova
   Importance: Undecided => Medium

** Description changed:

  Nova has a race condition between resize_instance() compute manager call
  and the update_available_resources periodic job. If they overlap at the
  right place, when resize_instance calls finish_resize, then periodic job
  will not track the migration nor the instance on the source host. It
  causes that the PCPU allocation on the source host is dropped in the
  resource tracker (not in placement). Then when the resize is confirmed
  nova tries to free the pinned cpus again on the source host and fails
  with CPUUnpinningInvalid as they are already freed.
  
  I will push a reproduction test soon.
+ 
+ It is reproducible at least on master, xena, wallaby, and victoria

** Tags added: compute numa race-condition resize

** Description changed:

  Nova has a race condition between resize_instance() compute manager call
  and the update_available_resources periodic job. If they overlap at the
  right place, when resize_instance calls finish_resize, then periodic job
  will not track the migration nor the instance on the source host. It
  causes that the PCPU allocation on the source host is dropped in the
  resource tracker (not in placement). Then when the resize is confirmed
  nova tries to free the pinned cpus again on the source host and fails
  with CPUUnpinningInvalid as they are already freed.
  
- I will push a reproduction test soon.
+ I've pushed a reproduction test:
+ https://review.opendev.org/c/openstack/nova/+/810763
  
  It is reproducible at least on master, xena, wallaby, and victoria

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1944759

Title:
  confirm resize fails with CPUUnpinningInvalid

Status in OpenStack Compute (nova):
  New

Bug description:
  Nova has a race condition between the resize_instance() compute manager
  call and the update_available_resources periodic job. If they overlap at
  the right point, when resize_instance calls finish_resize, the periodic
  job tracks neither the migration nor the instance on the source host. As
  a result, the PCPU allocation on the source host is dropped in the
  resource tracker (but not in placement). Later, when the resize is
  confirmed, nova tries to free the pinned CPUs on the source host again
  and fails with CPUUnpinningInvalid because they have already been freed.

  I've pushed a reproduction test:
  https://review.opendev.org/c/openstack/nova/+/810763

  It is reproducible at least on master, xena, wallaby, and victoria.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1944759/+subscriptions

Follow ups

[Bug 1944759] Re: [SRU] confirm resize fails with CPUUnpinningInvalid
From: Edward Hope-Morley, 2025-01-06
[Bug 1944759] Re: [SRU] confirm resize fails with CPUUnpinningInvalid
From: Rodrigo Barbieri, 2024-11-27
[Bug 1944759] Re: [SRU] confirm resize fails with CPUUnpinningInvalid
From: Launchpad Bug Tracker, 2024-11-25
[Bug 1944759] Re: [SRU] confirm resize fails with CPUUnpinningInvalid
From: James Page, 2024-06-17
[Bug 1944759] Re: [SRU] confirm resize fails with CPUUnpinningInvalid
From: Rodrigo Barbieri, 2024-05-07
[Bug 1944759] Re: confirm resize fails with CPUUnpinningInvalid
From: OpenStack Infra, 2021-09-30