yahoo-eng-team team mailing list archive
Message #90108
[Bug 1995153] [NEW] `socket` PCI NUMA policy doesn't work if another instance is booted first on the same host
Public bug reported:
Disclaimer: I haven't reproduced this in a functional test, but based on the traceback I gathered from a real environment, as well as the fact that the proposed fix actually resolves the issue, I believe my theory is correct.
Description
===========
`socket` PCI NUMA policy doesn't work if another instance is booted first on the same host
Steps to reproduce
==================
1. Boot any instance.
2. Boot an instance with the `socket` PCI NUMA policy on the same host (an example configuration is sketched below).
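For reference, one way to request the `socket` policy for step 2. The alias name (a1) and PCI device IDs below are made up for illustration and would need to match real devices on the compute host:

  # nova.conf on the compute node (hypothetical device IDs)
  [pci]
  passthrough_whitelist = {"vendor_id": "8086", "product_id": "154d"}
  alias = {"vendor_id": "8086", "product_id": "154d", "name": "a1", "numa_policy": "socket"}

  # Flavor used for the second instance, requesting one device via the alias
  $ openstack flavor create --ram 2048 --disk 20 --vcpus 2 pci-socket
  $ openstack flavor set pci-socket --property pci_passthrough:alias=a1:1

The policy can also be requested per flavor via the hw:pci_numa_affinity_policy=socket extra spec rather than in the alias.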
Expected result
===============
`socket` instance boots.
Actual result
=============
Instance creation fails with:
Details: Fault: {'code': 500, 'created': '2022-10-28T20:17:31Z',
'message': 'NotImplementedError'}. Server boot request ID:
req-e3fd15d7-fb79-440f-b2f3-e6b2a5505e56.
Environment
===========
Originally reported as part of QE verification of [1], so stable/wallaby.
Additional info
===============
Playing around with the whitebox test for the socket policy [2] on a Wallaby deployment, I noticed that the `socket` field in the compute.numa_topology column was being switched to `null` and then back to its correct value (0 or 1).
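A rough sketch of one way to watch this directly in the cell database (not what was actually run; it assumes MariaDB/MySQL access via pymysql and the serialized NUMATopology layout used in Wallaby, so the JSON field names may differ on other releases):

  import json
  import time

  import pymysql  # assumption: the nova cell DB is MariaDB/MySQL

  conn = pymysql.connect(host="localhost", user="nova", password="...",
                         database="nova")

  while True:
      with conn.cursor() as cur:
          cur.execute("SELECT hypervisor_hostname, numa_topology "
                      "FROM compute_nodes WHERE deleted = 0")
          for hostname, blob in cur.fetchall():
              if not blob:
                  continue
              cells = json.loads(blob)["nova_object.data"]["cells"]
              sockets = [c["nova_object.data"].get("socket") for c in cells]
              # Healthy output looks like [0, 1]; during the bug it flips
              # to [None, None] and then back.
              print(hostname, sockets)
      time.sleep(2)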
I added logging of the stack trace to the resource tracker's _update() method right before it calls compute_node.save(), and found that `null` was getting saved when an instance was being booted or deleted. Example of a traceback:
File "/usr/lib/python3.9/site-packages/nova/utils.py", line 686, in context_wrapper\n func(*args, **kwargs)\n'
' File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner\n return f(*args, **kwargs)\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2126, in _locked_do_build_and_run_instance\n result = self._do_build_and_run_instance(*args, **kwargs)\n'
' File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 63, in wrapped\n return f(self, context, *args, **kw)\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 154, in decorated_function\n return function(self, context, *args, **kwargs)\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/utils.py", line 1434, in decorated_function\n return function(self, context, *args, **kwargs)\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 200, in decorated_function\n return function(self, context, *args, **kwargs)\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2232, in _do_build_and_run_instance\n self._build_and_run_instance(context, instance, image,\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 2383, in _build_and_run_instance\n with self.rt.instance_claim(context, instance, node, allocs,\n'
' File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner\n return f(*args, **kwargs)\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 197, in instance_claim\n self._update(elevated, cn)\n'
' File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1247, in _update\n LOG.debug(\'artom: %s\', traceback.format_stack())\n'] _update /usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py:1247
Similarly for delete:
2022-10-28 21:57:27.091 2 DEBUG nova.compute.resource_tracker [req-c9fa718c-983e-416c-bc87-9564b8747294 d6d16a793ab74fe6a0b5594d037d3165 599a6777a45d46a09a7e233a926b7675 - default default] artom:
  File "/usr/lib/python3.9/site-packages/eventlet/greenpool.py", line 88, in _spawn_n_impl
    func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/futurist/_green.py", line 71, in __call__
    self.work.run()
  File "/usr/lib/python3.9/site-packages/futurist/_utils.py", line 49, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 154, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/nova/compute/utils.py", line 1434, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 200, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3072, in terminate_instance
    do_terminate_instance(instance, bdms)
  File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3060, in do_terminate_instance
    self._delete_instance(context, instance, bdms)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3024, in _delete_instance
    self._complete_deletion(context, instance)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 828, in _complete_deletion
    self._update_resource_tracker(context, instance)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 596, in _update_resource_tracker
    self.rt.update_usage(context, instance, instance.node)
  File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 658, in update_usage
    self._update(context.elevated(), self.compute_nodes[nodename])
  File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1247, in _update
    LOG.debug('artom: %s', traceback.format_stack())
On the other hand, the resource tracker's periodic resource update task
was saving the socket correctly:
2022-10-28 21:57:59.794 2 DEBUG nova.compute.resource_tracker [req-31329b8b-0de4-4b30-b2a1-dcd4d62369b4 - - - - -] artom:
  File "/usr/lib/python3.9/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/oslo_service/loopingcall.py", line 150, in _run_loop
    result = func(*self.args, **self.kw)
  File "/usr/lib/python3.9/site-packages/nova/service.py", line 307, in periodic_tasks
    return self.manager.periodic_tasks(ctxt, raise_on_error=raise_on_error)
  File "/usr/lib/python3.9/site-packages/nova/manager.py", line 104, in periodic_tasks
    return self.run_periodic_tasks(context, raise_on_error=raise_on_error)
  File "/usr/lib/python3.9/site-packages/oslo_service/periodic_task.py", line 216, in run_periodic_tasks
    task(self, context)
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 10026, in update_available_resource
    self._update_available_resource_for_node(context, nodename,
  File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 9935, in _update_available_resource_for_node
    self.rt.update_available_resource(context, nodename,
  File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 896, in update_available_resource
    self._update_available_resource(context, resources, startup=startup)
  File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
    return f(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1003, in _update_available_resource
    self._update(context, cn, startup=startup)
  File "/usr/lib/python3.9/site-packages/nova/compute/resource_tracker.py", line 1247, in _update
    LOG.debug('artom: %s', traceback.format_stack())
Not included in the above, for brevity, is the log line showing what was actually being saved; you'll just have to trust me on this one ;)
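For completeness, the stack dumps above came from a throwaway debug patch along these lines. This is a sketch of the technique rather than the exact diff that was applied; the tagged LOG.debug() call is the one visible at resource_tracker.py line 1247 in the traces above:

  # Minimal sketch: log the caller's stack right before the ComputeNode is
  # persisted, so each saved numa_topology value can be tied to the code
  # path (instance claim, delete, periodic task) that saved it.
  import logging
  import traceback

  logging.basicConfig(level=logging.DEBUG)  # only needed standalone
  LOG = logging.getLogger(__name__)

  def log_caller_stack(tag='artom'):
      """Record the full call stack of whoever calls this."""
      LOG.debug('%s: %s', tag, traceback.format_stack())

  def save_compute_node(compute_node):
      # In the debug patch, the equivalent LOG.debug() call sat directly
      # above compute_node.save() in ResourceTracker._update().
      log_caller_stack()
      compute_node.save()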
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1883554
[2] https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/851447
** Affects: nova
Importance: Undecided
Status: In Progress
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1995153
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1995153/+subscriptions