Message #94735
[Bug 2084238] [NEW] Cold-Migration fails when pci_request has nulled request_id
Public bug reported:
We run OpenStack 2023.1 deployed via kolla.
After upgrading from Zed -> 2023.1 we are not able to migrate various instances that have PCI devices attached (Nvidia T4 GPU).
nova-scheduler raises this exception during PCI filtering:
Exception during message handling: TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 244, in inner
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 224, in select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server selections = self._select_destinations(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 251, in _select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server selections = self._schedule(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 388, in _schedule
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server hosts = self._get_sorted_hosts(spec_obj, hosts, num)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 672, in _get_sorted_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server filtered_hosts = self.host_manager.get_filtered_hosts(host_states,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/host_manager.py", line 617, in get_filtered_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server return self.filter_handler.get_filtered_objects(self.enabled_filters,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 89, in get_filtered_objects
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server list_objs = list(objs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 44, in filter_all
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server if self._filter_one(obj, spec_obj):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 51, in _filter_one
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server return self.host_passes(obj, spec)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 60, in host_passes
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server good_candidates = self.filter_candidates(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 81, in filter_candidates
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server if filter_func(candidate):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 62, in <lambda>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server lambda candidate: host_state.pci_stats.support_requests(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 775, in support_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server stats.apply_requests(requests, provider_mapping, numa_cells)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 907, in apply_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server rp_uuids = self._get_rp_uuids_for_request(provider_mapping, r)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 871, in _get_rp_uuids_for_request
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server return [
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 874, in <listcomp>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server if group_id.startswith(request.request_id)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server
The problematic code lies here:
https://opendev.org/openstack/nova/src/commit/1d788f11890385658f9485a281c1beeede94a830/nova/pci/stats.py#L874
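The failure can be reproduced outside nova with a minimal sketch. The function names, the mapping contents, and the "-0" group suffix below are illustrative assumptions, not nova's code; only the `group_id.startswith(request.request_id)` pattern is taken from the traceback. The guarded variant is a hypothetical defensive check, not the upstream fix.

```python
from typing import Dict, List, Optional


def rp_uuids_unguarded(provider_mapping: Dict[str, List[str]],
                       request_id: Optional[str]) -> List[str]:
    # Same shape as the failing comprehension in the traceback:
    # str.startswith(None) raises TypeError.
    return [
        rp_uuids[0]
        for group_id, rp_uuids in provider_mapping.items()
        if group_id.startswith(request_id)
    ]


def rp_uuids_guarded(provider_mapping: Dict[str, List[str]],
                     request_id: Optional[str]) -> List[str]:
    # Hypothetical guard (an assumption, not nova's behaviour):
    # a request without an id cannot match any group.
    if request_id is None:
        return []
    return rp_uuids_unguarded(provider_mapping, request_id)


# Illustrative mapping; the provider name is made up.
mapping = {"c277bea1-5c4c-40d1-812f-f8c680689214-0": ["rp-a"]}

try:
    rp_uuids_unguarded(mapping, None)
except TypeError as exc:
    print("unguarded:", exc)

print("guarded:", rp_uuids_guarded(mapping, None))
```

Whether an empty result is the right behaviour for the scheduler is an open question; the sketch only shows where the `None` value triggers the error.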
There are cases where request_id was never populated for instances
with PCI devices:
MariaDB [nova]> select instance_uuid, request_id from pci_devices;
+--------------------------------------+--------------------------------------+
| instance_uuid                        | request_id                           |
+--------------------------------------+--------------------------------------+
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| deeafa5e-86a4-4e4e-9172-574d0a3629fc | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| 80967831-104b-4619-9415-f819e458b307 | NULL                                 |
| d9701926-2e83-4cab-9e37-54ffa0309a22 | NULL                                 |
| af160e3d-a4aa-418f-b9ff-eaa20ec1d947 | c277bea1-5c4c-40d1-812f-f8c680689214 |
| 1dbf0831-e4a6-4073-b501-ce9d9d598937 | ed6eab10-b0e6-48b8-be60-dad8c0553c8b |
The following queries show that request_id is either missing or set to null for an affected instance:
[nova] select pci_requests from instance_extra where instance_uuid='<INSTANCE_UUID>' \G;
[nova_api] select spec from request_specs where instance_uuid='<INSTANCE_UUID>' \G;
Freshly spawned instances do not suffer from a missing pci request_id.
Some of the problematic instances are old, spawned during the Train release.
Instances spawned during the Zed release have request_id set and are able to migrate.
We are able to work around this issue by adding a newly generated
request_id to the corresponding tables.
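A rough sketch of the generation step of this workaround follows. It assumes that the pci_requests column holds a JSON list of objects, each with a "request_id" key; that layout and the function name are assumptions and should be checked against the output of the queries above before anything is written back. The sample blob is invented for illustration.

```python
import json
import uuid
from typing import List, Tuple


def fill_missing_request_ids(pci_requests_json: str) -> Tuple[str, List[str]]:
    """Return the patched JSON and the list of newly generated ids."""
    requests = json.loads(pci_requests_json)
    new_ids = []
    for req in requests:
        # Only touch entries whose request_id is absent or null.
        if req.get("request_id") is None:
            req["request_id"] = str(uuid.uuid4())
            new_ids.append(req["request_id"])
    return json.dumps(requests), new_ids


# Invented sample blob; real rows may carry more fields.
sample = json.dumps([
    {"count": 1, "alias_name": "t4", "request_id": None},
    {"count": 1, "alias_name": "t4",
     "request_id": "c277bea1-5c4c-40d1-812f-f8c680689214"},
])

patched, generated = fill_missing_request_ids(sample)
print(patched)
print(generated)
```

The sketch does not touch the database. The same generated id would presumably have to be used consistently in every table that refers to the request (pci_devices, instance_extra, request_specs), ideally after taking a backup.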
** Affects: nova
Importance: Undecided
Status: New
** Tags: migration
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2084238
Title:
Cold-Migration fails when pci_request has nulled request_id
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2084238/+subscriptions