
yahoo-eng-team team mailing list archive

[Bug 1917055] Re: the pinned_cpus inventory of a compute node gets out of sync if placement returns HTTP 500 for some GET /resource_providers/<uuid> query

 

*** This bug is a duplicate of bug 1729621 ***
    https://bugs.launchpad.net/bugs/1729621

Hm, it seems to be very similar to
https://bugs.launchpad.net/nova/+bug/1729621, marking this as a
duplicate.

** Changed in: nova
       Status: New => Invalid

** This bug has been marked a duplicate of bug 1729621
   Inconsistent value for vcpu_used

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1917055

Title:
  the pinned_cpus inventory of a compute node gets out of sync if
  placement returns HTTP 500 for some GET /resource_providers/<uuid>
  query

Status in OpenStack Compute (nova):
  Invalid
Status in OpenStack Compute (nova) pike series:
  Triaged

Bug description:
  This issue was detected in a Pike deployment with triplicated control
  plane services. There were 3 placement services running behind a load
  balancer serving the cluster. One of the placement services got
  misconfigured, causing the requests forwarded to it to result in HTTP
  500, while the other placement services worked fine.

  The cluster was actively used to boot VMs with flavors having the
  hw:cpu_policy='dedicated' extra spec (a.k.a. with CPU pinning).

  In the following unlucky situation, the pinned_cpus information in
  the compute_node.numa_topology became empty even though there were
  VMs running on the compute host with pinned CPUs. This allowed the
  next VM boot to select a PCPU that was already used by another VM on
  the same compute.

  Sequence of events:

  1) A VM booted on the compute selected PCPU 18 and the numa_topology pinned_cpus was updated with PCPU id 18.
  2) A periodic update_available_resource job started but failed to finish because placement responded with HTTP 500 when the resource tracker tried to read the compute RP from placement (see the traceback at the end of the bug report).
  3) At this point the pinned_cpus field of the compute_node.numa_topology had already been reset to [] and saved to the DB. As the periodic job was interrupted, the PCPU usage was never updated in the compute_node during this periodic run.
  4) A new VM boot request was received; the scheduler saw that every PCPU on the compute was free and selected PCPU 18 for the new VM.
  5) The next update_available_resource periodic job detected that there are overlapping pinned CPUs but had no way to heal it, so it just logged an error (see the second traceback below). A minimal sketch reproducing this sequence follows.
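
  To make the failure mode concrete, here is a minimal, self-contained
  Python sketch. ToyNUMACell is a toy stand-in, not nova's actual
  NUMACell object; it only mirrors the free-CPU subset check that
  pin_cpus() in nova/objects/numa.py enforces:

    class ToyNUMACell(object):
        def __init__(self, cpuset):
            self.cpuset = set(cpuset)    # all PCPUs of the cell
            self.pinned_cpus = set()     # PCPUs claimed by instances

        @property
        def free_cpus(self):
            return self.cpuset - self.pinned_cpus

        def pin_cpus(self, cpus):
            # Pinning is only valid against the currently free PCPUs.
            if not cpus <= self.free_cpus:
                raise Exception('CPU set to pin %s must be a subset of '
                                'free CPU set %s'
                                % (sorted(cpus), sorted(self.free_cpus)))
            self.pinned_cpus |= cpus

    cell = ToyNUMACell(cpuset=range(20))
    cell.pin_cpus({18})          # 1) first VM pins PCPU 18
    cell.pinned_cpus = set()     # 3) interrupted periodic leaves usage empty
    cell.pin_cpus({18})          # 4) second VM sees PCPU 18 as free, pins it
    # 5) the next periodic re-applies the usage of *both* instances, and
    #    the second application raises, analogous to CPUPinningInvalid:
    cell.pinned_cpus = set()
    cell.pin_cpus({18})
    try:
        cell.pin_cpus({18})
    except Exception as exc:
        print(exc)   # CPU set to pin [18] must be a subset of free CPU set ...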

  Triage on stable/pike:

  The placement failure happened at [1] during the periodic job. But
  before that, the pinned_cpus had already been reset [2] and saved to
  the DB [3] by the periodic job triggering an rt._update() early at
  [6][7]. In the normal case, after [1] there is a whole bunch of usage
  calculation based on the running instances [4] that repopulates
  pinned_cpus, and then there is another compute node DB save [5] that
  persists the final inventory. So if the process is interrupted at
  [1], only the available resources are saved to the DB but the usage
  is not (see the sketch after the reference list below).

  
  [1] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L906
  [2] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L654
  [3] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L901
  [4] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L768
  [5] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L800
  [6] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L751
  [7] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L586
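
  To summarize the control flow, here is a condensed, hypothetical
  sketch of the pike-era periodic (stubbed methods, not the real nova
  source); the bracketed markers match the reference links above:

    class ResourceTrackerSketch(object):
        def _init_compute_node(self, context, resources):
            # Copies the hypervisor-reported resources into the
            # ComputeNode: pinned_cpus is reset to [] ([2]) and the node
            # is saved to the DB ([3]) by an early _update() ([6][7]),
            # which also syncs inventory to placement ([1]). An HTTP 500
            # from placement raises ResourceProviderRetrievalFailed here.
            pass

        def _update_usage_from_instances(self, context, instances):
            pass  # [4] repopulates pinned_cpus from the running instances

        def _update(self, context, compute_node):
            pass  # [5] saves the final inventory, including usage

        def _update_available_resource(self, context, resources):
            self._init_compute_node(context, resources)
            # Not reached if [1] raised: the empty usage saved at [3]
            # is then what remains in the DB until the next periodic.
            self._update_usage_from_instances(context, instances=[])
            self._update(context, compute_node=None)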

  Stack trace at 2)

  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager Traceback (most recent call last):
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 7090, in update_available_resource_for_node
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     periodic=True)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 754, in update_available_resource
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     periodic=periodic)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     return f(*args, **kwargs)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 793, in _update_available_resource
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     self._init_compute_node(context, resources)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 611, in _init_compute_node
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     self._update(context, cn)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 960, in _update
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     inv_data,
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 68, in set_inventory_for_provider
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     inv_data,
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     return getattr(self.instance, __name)(*args, **kwargs)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 797, in set_inventory_for_provider
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     self._ensure_resource_provider(rp_uuid, rp_name)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 522, in _ensure_resource_provider
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     rp = self._get_resource_provider(uuid)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 56, in wrapper
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     return f(self, *a, **k)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 423, in _get_resource_provider
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager     raise exception.ResourceProviderRetrievalFailed(uuid=uuid)
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager ResourceProviderRetrievalFailed: Failed to get resource provider with UUID 5a87bae9-81b4-4a00-bb80-bc0b378b4471
  2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager

  Stack trace at 5)

  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager Traceback (most recent call last):
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 7090, in update_available_resource_for_node
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     periodic=True)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 754, in update_available_resource
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     periodic=periodic)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     return f(*args, **kwargs)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 816, in _update_available_resource
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     startup=startup)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1284, in _update_usage_from_instances
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     require_allocation_refresh=require_allocation_refresh)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1170, in _update_usage_from_instance
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     sign=sign)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 997, in _update_usage
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     cn, usage, free)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1808, in get_host_numa_usage_from_instance
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     host_numa_topology, instance_numa_topology, free=free))
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1657, in numa_usage_from_instances
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     newcell.pin_cpus(pinned_cpus)
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/objects/numa.py", line 91, in pin_cpus
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager     self.pinned_cpus))
  2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager CPUPinningInvalid: CPU set to pin [18] must be a subset of free CPU set [2, 3, 4, 5, 6, 7, 10, 11, 15, 16, 17, 19, 42, 43, 44, 45, 46, 47, 50, 51, 55, 56, 57, 58, 59]

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1917055/+subscriptions

