yahoo-eng-team team mailing list archive: Message #85270
[Bug 1917055] Re: the pinned_cpus inventory of a compute node gets out of sync if placement returns HTTP 500 for some GET /resource_providers/<uuid> query
*** This bug is a duplicate of bug 1729621 ***
https://bugs.launchpad.net/bugs/1729621
Hm, it seems to be very similar to
https://bugs.launchpad.net/nova/+bug/1729621, so I am marking this as a duplicate.
** Changed in: nova
Status: New => Invalid
** This bug has been marked a duplicate of bug 1729621
Inconsistent value for vcpu_used
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1917055
Title:
the pinned_cpus inventory of a compute node gets out of sync if
placement returns HTTP 500 for some GET /resource_providers/<uuid>
query
Status in OpenStack Compute (nova):
Invalid
Status in OpenStack Compute (nova) pike series:
Triaged
Bug description:
This issue was detected in a Pike deployment with triplicated control
plane services. There were 3 placement services running behind a load
balancer serving the cluster. One of the placement services got
misconfigured, causing the requests forwarded to it to result in
HTTP 500, while the other placement services worked fine.
The cluster was actively used to boot VMs with flavors having the
hw:cpu_policy='dedicated' extra spec (a.k.a. with CPU pinning).
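For context, such a flavor is an ordinary flavor carrying one extra spec. The following is only a hypothetical illustration (the flavor name and the dict layout are made up for this example, not taken from the deployment):

# Illustration only; CLI equivalent: openstack flavor set --property hw:cpu_policy=dedicated <flavor>
pinned_flavor = {
    'name': 'pinned.large',            # hypothetical flavor name
    'vcpus': 4,
    'extra_specs': {
        'hw:cpu_policy': 'dedicated',  # each guest vCPU gets a dedicated host PCPU
    },
}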
In the following unlucky situation, the pinned_cpus information in
compute_node.numa_topology became empty even though there were VMs running
on the compute host with pinned CPUs. This allowed the next VM boot to
select a PCPU that was already used by another VM on the same compute.
Sequence of events:
1) A VM booted on the compute selected PCPU 18, and the numa_topology pinned_cpus was updated with PCPU id 18.
2) A periodic update_available_resource job started but failed to finish because placement responded with HTTP 500 when the resource tracker tried to read the compute RP from placement (see the traceback at the end of the bug report).
3) At this point the pinned_cpus field of compute_node.numa_topology had already been reset to [] and saved to the DB. As the periodic job was interrupted, the PCPU usage was never updated in the compute_node during this periodic run.
4) A new VM boot request was received; it saw every PCPU on the compute as free and selected PCPU 18 for the new VM.
5) The next update_available_resource periodic job detected that there were overlapping pinned CPUs but had no way to heal it, so it just logged an error (see below).
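To make the race easier to follow, here is a minimal, self-contained sketch that reproduces the same overlap failure with a plain set. FakeNumaCell and its methods are purely illustrative, not nova's real objects:

class FakeNumaCell(object):
    """Toy stand-in for a host NUMA cell tracking pinned PCPUs."""
    def __init__(self, cpuset):
        self.cpuset = set(cpuset)
        self.pinned_cpus = set()

    @property
    def free_cpus(self):
        return self.cpuset - self.pinned_cpus

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus.issubset(self.free_cpus):
            raise ValueError('CPU set to pin %s must be a subset of free CPU '
                             'set %s' % (sorted(cpus), sorted(self.free_cpus)))
        self.pinned_cpus |= cpus

cell = FakeNumaCell(cpuset=range(24))
cell.pin_cpus({18})        # 1) the first VM pins PCPU 18
cell.pinned_cpus = set()   # 2)-3) the interrupted periodic job leaves usage reset to empty
cell.pin_cpus({18})        # 4) the next boot sees PCPU 18 as free and double-books it
try:
    cell.pin_cpus({18})    # 5) the next periodic rebuild pins 18 again for the other VM
except ValueError as exc:
    print(exc)             # analogous to the CPUPinningInvalid error shown below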
Triage on stable/pike:
The placement failure happened at [1] during the periodic job. But
before that, pinned_cpus had already been reset [2] and saved to the DB [3]
by the periodic job triggering rt._update() early at [6][7]. In the
normal case, after [1] there is a whole bunch of usage calculation
based on the running instances [4] that populates pinned_cpus, and then
there is another compute node DB save [5] that stores the final
inventory. So if the process is interrupted at [1], only the available
resources are saved to the DB but the usage is not (a simplified
sketch of this flow follows the links below).
[1] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L906
[2] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L654
[3] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L901
[4] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L768
[5] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L800
[6] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L751
[7] https://github.com/openstack/nova/blob/4887f10a3124ec0c50ecc7df3c264420b5960aca/nova/compute/resource_tracker.py#L586
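The following self-contained sketch of the flow above (function and variable names are illustrative, not the real resource tracker code paths) shows why an exception at [1] leaves only the reset usage in the DB:

class PlacementDown(Exception):
    """Stands in for ResourceProviderRetrievalFailed."""

def periodic_update(db, running_instances, placement_ok):
    # [6][7] _update() runs early, so the reset usage is persisted first
    db['pinned_cpus'] = set()     # [2] reset; [3] this is what the DB save stores
    if not placement_ok:
        # [1] GET /resource_providers/<uuid> returned HTTP 500
        raise PlacementDown('failed to get resource provider')
    # [4] in the normal case the usage is rebuilt from the running instances ...
    for inst in running_instances:
        db['pinned_cpus'] |= inst['pinned_cpus']
    # [5] ... and the final inventory is saved again

db = {'pinned_cpus': {18}}           # one running VM already pinned to PCPU 18
running = [{'pinned_cpus': {18}}]

try:
    periodic_update(db, running, placement_ok=False)
except PlacementDown:
    pass

print(db['pinned_cpus'])             # set(): the usage of the running VM is lost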
Stack trace at 2)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager Traceback (most recent call last):
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 7090, in update_available_resource_for_node
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager periodic=True)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 754, in update_available_resource
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager periodic=periodic)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager return f(*args, **kwargs)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 793, in _update_available_resource
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager self._init_compute_node(context, resources)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 611, in _init_compute_node
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager self._update(context, cn)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 960, in _update
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager inv_data,
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 68, in set_inventory_for_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager inv_data,
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/__init__.py", line 37, in __run_method
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager return getattr(self.instance, __name)(*args, **kwargs)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 797, in set_inventory_for_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager self._ensure_resource_provider(rp_uuid, rp_name)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 522, in _ensure_resource_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager rp = self._get_resource_provider(uuid)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 56, in wrapper
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager return f(self, *a, **k)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/scheduler/client/report.py", line 423, in _get_resource_provider
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager raise exception.ResourceProviderRetrievalFailed(uuid=uuid)
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager ResourceProviderRetrievalFailed: Failed to get resource provider with UUID 5a87bae9-81b4-4a00-bb80-bc0b378b4471
2021-02-23 19:07:48.449 3569 ERROR nova.compute.manager
Stack trace at 5)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager Traceback (most recent call last):
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 7090, in update_available_resource_for_node
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager periodic=True)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 754, in update_available_resource
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager periodic=periodic)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 271, in inner
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager return f(*args, **kwargs)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 816, in _update_available_resource
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager startup=startup)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1284, in _update_usage_from_instances
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager require_allocation_refresh=require_allocation_refresh)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1170, in _update_usage_from_instance
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager sign=sign)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 997, in _update_usage
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager cn, usage, free)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1808, in get_host_numa_usage_from_instance
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager host_numa_topology, instance_numa_topology, free=free))
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1657, in numa_usage_from_instances
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager newcell.pin_cpus(pinned_cpus)
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/objects/numa.py", line 91, in pin_cpus
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager self.pinned_cpus))
2021-02-23 19:08:49.515 3569 ERROR nova.compute.manager CPUPinningInvalid: CPU set to pin [18] must be a subset of free CPU set [2, 3, 4, 5, 6, 7, 10, 11, 15, 16, 17, 19, 42, 43, 44, 45, 46, 47, 50, 51, 55, 56, 57, 58, 59]
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1917055/+subscriptions