← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1952915] Re: Resize/migrate intermittently fails to revert for instances with dedicated CPUs

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/820549
Committed: https://opendev.org/openstack/nova/commit/32c1044d86a8d02712c8e3abdf8b3e4cff234a9c
Submitter: "Zuul (22348)"
Branch:    master

commit 32c1044d86a8d02712c8e3abdf8b3e4cff234a9c
Author: Balazs Gibizer <balazs.gibizer@xxxxxxxx>
Date:   Mon Dec 6 17:06:51 2021 +0100

    [rt] Apply migration context for incoming migrations
    
    There is a race condition between an incoming resize and an
    update_available_resource periodic in the resource tracker. The race
    window starts when the resize_instance RPC finishes  and ends when the
    finish_resize compute RPC finally applies the migration context on the
    instance.
    
    In the race window, if the update_available_resource periodic is run on
    the destination node, then it will see the instance as being tracked on
    this host as the instance.node is already pointing to the dest. But the
    instance.numa_topology still points to the source host topology as the
    migration context is not applied yet. This leads to CPU pinning error if
    the source topology does not fit to the dest topology. Also it stops the
    periodic task and leaves the tracker in an inconsistent state. The
    inconsistent state only cleanup up after the periodic is run outside of
    the race window.
    
    This patch applies the migration context temporarily to the specific
    instances during the periodic to keep resource accounting correct.
    
    Change-Id: Icaad155e22c9e2d86e464a0deb741c73f0dfb28a
    Closes-Bug: #1953359
    Closes-Bug: #1952915


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1952915

Title:
  Resize/migrate intermittently fails to revert for instances with
  dedicated CPUs

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  -----------------
  Reverting a cold-migration/resize on an instance with dedicated CPU policy fails intermittently with status 400 and a "fault" message similar to the following:

  "CPU set to unpin [8] must be a subset of pinned CPU set []"

  The message above resembles what was reported on
  https://bugs.launchpad.net/nova/+bug/1879878 and
  https://bugs.launchpad.net/nova/+bug/1944759. A similar race-condition
  between the resize operation and the
  ComputeManager.update_available_resource() periodic task is suspected,
  but at a different point of the operation.

  Steps to Reproduce
  ------------------
  1. Create a flavor with property hw:cpu_policy=dedicated
  2. Create an instance with the new flavor
  3. Issue a cold-migration (or a resize with migration) to the new instance and wait for it to finish
  4. Issue a resize-revert to the migrated instance and check its state

  Expected Behavior
  ------------------
  Instance migration/resize should be reverted successfully.

  Actual Behavior
  ----------------
  The error described occurs intermittently, causing the revert to fail, and leaving the instance on the destination node.

  Seems to occur with small frequency, but may cause the revert
  operation to fail, without the possibility for a retry. This makes it
  more difficult to get the instance back to its previous
  node/configuration (see "Workaround" section). Verified that it
  happened both for cold-migration and resize.

  From the tests performed on a two-node system, with
  update_resources_interval=60, it occurred at approximately 1-3% of the
  trials.

  Environment
  --------------------
  Needs to have at least two compute nodes for migration. Reproduced the issue on a two-node system and on a multi-node system with dedicated storage.

  Tested with branch stable/ussuri. Latest commit:
  6667fcb92bfaf03a8a274dc26806c137aace6b49.

  Also added some custom changes for testing/debugging purposes, listed
  below:

  - To check if the issue still occurred with the fix from https://bugs.launchpad.net/nova/+bug/1944759:
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5655, added:
          instance.old_flavor = instance.flavor

  - Additional log messages for debugging. To give a better context to the "Timestamp/Logs" section, the most relevant ones are listed below:
      (Task update_available_resource started)
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L9841, added:
          LOG.warning("====== Started periodic task ======")

      (Instance saved with host changed to destination host, and old flavor set to original one (due to #1944759 fix))
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5657, added:
          LOG.warning("====== Set instance host=dest and old_flavor=flavor ======")

      (Reached final resize step on destination)
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5869, added:
          LOG.warning("====== Request reached finish_resize on dest compute ======")

      (Instance configuration (such as NUMA topology) updated with values for the destination host)
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/manager.py#L5795, added:
          LOG.warning("====== Set instance old_flavor=flavor, flavor=new_flavor and added _new ======")

      (Migration context created with given original and selected NUMA topologies)
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/resource_tracker.py#L310, added:
          LOG.info("====== Migration old topology: %(old_topology)s, ======", {'old_topology': mig_context.old_numa_topology}, instance=instance)
          LOG.info("====== Migration new topology: %(new_topology)s, ======", {'new_topology': mig_context.new_numa_topology}, instance=instance)

      (Unpinning host CPUs from resource_tracker's inventory)
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/virt/hardware.py#L2247, added:
          LOG.warning(f"===== Unpinning CPUs {pinned_cpus} from {new_cell.pinned_cpus} ======")

      (Pinning host CPUs to resource_tracker's inventory)
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/virt/hardware.py#L2253, added:
          LOG.warning(f"===== Pinning CPUs {pinned_cpus} to {new_cell.pinned_cpus} ======")

      (About to drop move claim for destination host during migrate-revert)
      AFTER https://github.com/openstack/nova/blob/6667fcb92bfaf03a8a274dc26806c137aace6b49/nova/compute/resource_tracker.py#L565, added:
          LOG.warning("====== Migration status reverted, dropping move claim ======")

  Last Pass
  ---------
  Due to intermittency, most revert attempts are successful. Not sure if there was a time when this scenario would always pass. Similar issues are https://bugs.launchpad.net/nova/+bug/1879878 and https://bugs.launchpad.net/nova/+bug/1944759. The fix from the former was present in the branch by the time the issue was observed. The fix from the latter was added manually for testing, and the issue persisted with it.

  Timestamp/Logs
  --------------
  For reference, the logs below were captured for the following timeline of events (on a multi-node system with dedicated storage):
  1. (16:47:10) Instance is created using dedicated CPU policy (with $INSTANCE_ID representing its ID)
  2. (16:48:41) Command "nova migrate --poll $INSTANCE_ID" is issued to migrate the instance from node "compute-0" to "compute-1"
  3. (16:48:59) Migration finishes, and instance moves to "compute-1"
  4. (16:49:01) Command "openstack server resize revert $INSTANCE_ID" is issued to move the instance back to "compute-0"
  5. (16:49:06) Instance moves to "ERROR" state with a "fault" message similar to the one in the description
  6. (16:53:56) Instance moves back to "ACTIVE" state on "compute-1"

  In this timespan, two exceptions occur: one between events 2 and 3, at
  16:48:53; and another between events 4 and 5, at 16:49:06. Some of the
  relevant logs are shown below (some less relevant info, such as
  request IDs, and repeated dates were omitted for clarity):

  ==========
  First exception (on nova_compute for compute-1):

  2021-11-25T16:48:44.810551851Z stdout F 1306408 INFO nova.compute.resource_tracker  [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] ====== Migration old topology: InstanceNUMATopology: instance_uuid: None emulator_threads_policy: None InstanceNUMACell (id: 1) cpus: 0 cpu_pinning: {0: 15} reserved: None memory: 1024 pagesize: 1048576 cpu_topology: VirtCPUTopology(cores=1,sockets=1,threads=1) cpu_policy: dedicated cpu_thread_policy: None
  [...]
  2021-11-25T16:48:44.812188907Z stdout F 1306408 INFO nova.compute.resource_tracker  [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] ====== Migration new topology: InstanceNUMATopology: instance_uuid: None emulator_threads_policy: None InstanceNUMACell (id: 0) cpus: 0 cpu_pinning: {0: 8} reserved: None memory: 1024 pagesize: 1048576 cpu_topology: VirtCPUTopology(cores=1,sockets=1,threads=1) cpu_policy: dedicated cpu_thread_policy: None
  [...]
  2021-11-25T16:48:49.791057254Z stdout F 1182154 WARNING nova.compute.manager  ====== Set instance host=dest and old_flavor=flavor ======
  [...]
  2021-11-25T16:48:49.833988502Z stdout F 1306408 WARNING nova.compute.manager  ====== Request reached finish_resize on dest compute ======
  [...]
  2021-11-25T16:48:51.798085129Z stdout F 1306408 WARNING nova.compute.manager  ====== Started periodic task ======
  [...]
  2021-11-25T16:48:53.267609465Z stdout F 1306408 WARNING nova.virt.hardware  ===== Pinning CPUs {15} to CoercedSet() ======
  2021-11-25T16:48:53.268609833Z stdout F 1306408 ERROR nova.compute.manager  Error updating resources for node compute-1.: nova.exception.CPUPinningUnknown: CPU set to pin [15] must be a subset of known CPU set []
  2021-11-25T16:48:53.26862307Z stdout F 1306408 ERROR nova.compute.manager Traceback (most recent call last):
  2021-11-25T16:48:53.268629118Z stdout F 1306408 ERROR nova.compute.managerFile "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/manager.py", line 9957, in _update_available_resource_for_node
  2021-11-25T16:48:53.268633867Z stdout F 1306408 ERROR nova.compute.managerstartup=startup)
  2021-11-25T16:48:53.26863832Z stdout F 1306408 ERROR nova.compute.managerFile "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 943, in update_available_resource
  [...]
  2021-11-25T16:48:53.268690168Z stdout F 1306408 ERROR nova.compute.managernodename, sign=sign)
  2021-11-25T16:48:53.268694117Z stdout F 1306408 ERROR nova.compute.managerFile "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1400, in _update_usage
  2021-11-25T16:48:53.268698192Z stdout F 1306408 ERROR nova.compute.managerhost_numa_topology, instance_numa_topology, free)._to_json()
  2021-11-25T16:48:53.268702169Z stdout F 1306408 ERROR nova.compute.managerFile "/var/lib/openstack/lib/python3.6/site-packages/nova/virt/hardware.py", line 2397, in numa_usage_from_instance_numa
  2021-11-25T16:48:53.268708832Z stdout F 1306408 ERROR nova.compute.managernew_cell.pin_cpus(pinned_cpus)
  2021-11-25T16:48:53.268712874Z stdout F 1306408 ERROR nova.compute.managerFile "/var/lib/openstack/lib/python3.6/site-packages/nova/objects/numa.py", line 87, in pin_cpus
  2021-11-25T16:48:53.268717207Z stdout F 1306408 ERROR nova.compute.manageravailable=list(self.pcpuset))
  2021-11-25T16:48:53.268721519Z stdout F 1306408 ERROR nova.compute.manager nova.exception.CPUPinningUnknown: CPU set to pin [15] must be a subset of known CPU set []
  [...]
  2021-11-25T16:48:53.365606052Z stdout F 1306408 WARNING nova.compute.manager  ====== Set instance old_flavor=flavor, flavor=new_flavor and added _new ======

  ==========
  Second exception (on nova_compute for compute-1):

  2021-11-25T16:49:03.934849386Z stdout F 1306408 INFO nova.compute.manager  [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Revert resize.
  [...]
  2021-11-25T16:49:06.686180122Z stdout F 1306408 WARNING nova.compute.resource_tracker  ====== Migration status reverted, dropping move claim ======
  [...]
  2021-11-25T16:49:06.71669438Z stdout F 1306408 WARNING nova.virt.hardware  ===== Unpinning CPUs {8} from CoercedSet() ======
  2021-11-25T16:49:06.724405196Z stdout F 1306408 ERROR nova.compute.manager  [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Setting instance vm_state to ERROR: nova.exception.CPUUnpinningInvalid: CPU set to unpin [8] must be a subset of pinned CPU set []
  2021-11-25T16:49:06.724419611Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Traceback (most recent call last):
  2021-11-25T16:49:06.724425341Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/manager.py", line 10198, in _error_out_instance_on_exception
  2021-11-25T16:49:06.724430797Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     yield
  2021-11-25T16:49:06.724435067Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/manager.py", line 4985, in revert_resize
  2021-11-25T16:49:06.724441603Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     self.rt.drop_move_claim_at_dest(context, instance, migration)
  2021-11-25T16:49:06.72444586Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_concurrency/lockutils.py", line 359, in inner
  2021-11-25T16:49:06.724449845Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     return f(*args, **kwargs)
  2021-11-25T16:49:06.724453921Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 619, in drop_move_claim_at_dest
  2021-11-25T16:49:06.724457832Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     prefix='new_')
  2021-11-25T16:49:06.724479578Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 682, in _drop_move_claim
  2021-11-25T16:49:06.724484245Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     self._update_usage(usage, nodename, sign=-1)
  2021-11-25T16:49:06.724488243Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/resource_tracker.py", line 1400, in _update_usage
  2021-11-25T16:49:06.724492348Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     host_numa_topology, instance_numa_topology, free)._to_json()
  2021-11-25T16:49:06.72449618Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/nova/virt/hardware.py", line 2390, in numa_usage_from_instance_numa
  2021-11-25T16:49:06.724505037Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     new_cell.unpin_cpus(pinned_cpus)
  2021-11-25T16:49:06.724509637Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]   File "/var/lib/openstack/lib/python3.6/site-packages/nova/objects/numa.py", line 104, in unpin_cpus
  2021-11-25T16:49:06.724513661Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]     self.pinned_cpus))
  2021-11-25T16:49:06.724517574Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] nova.exception.CPUUnpinningInvalid: CPU set to unpin [8] must be a subset of pinned CPU set []
  2021-11-25T16:49:06.724521412Z stdout F 1306408 ERROR nova.compute.manager [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86]
  2021-11-25T16:49:06.729548497Z stdout F 1306408 ERROR nova.compute.manager  [instance: 49a1eb03-bf66-4f69-b26f-c340fe626e86] Setting instance vm_state to ERROR: nova.exception.CPUUnpinningInvalid: CPU set to unpin [8] must be a subset of pinned CPU set []

  Workaround
  ----------
  For migration: migrate instance back to source host after failure. Original source host may be retrieved from the migration list. Requires being able to specify it in the migrate request (admin-only, API version >= 2.56). After the revert failure, the instance should reboot on the destination host in ACTIVE state after about 3-4 minutes.

  For resize: recreate the instance using the original flavor. The
  original source host may also be passed as a parameter (admin-only,
  API version >= 2.74).

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1952915/+subscriptions



References