yahoo-eng-team team mailing list archive

[Bug 2084238] [NEW] Cold-Migration fails when pci_request has nulled request_id

 

Public bug reported:

We run OpenStack 2023.1 deployed via kolla.
After upgrading from Zed to 2023.1 we are no longer able to migrate various instances that have PCI devices attached (Nvidia T4 GPU).

nova-scheduler raises this exception during PCI filtering:


Exception during message handling: TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 244, in inner
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 224, in select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     selections = self._select_destinations(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 251, in _select_destinations
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     selections = self._schedule(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 388, in _schedule
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     hosts = self._get_sorted_hosts(spec_obj, hosts, num)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/manager.py", line 672, in _get_sorted_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     filtered_hosts = self.host_manager.get_filtered_hosts(host_states,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/host_manager.py", line 617, in get_filtered_hosts
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self.filter_handler.get_filtered_objects(self.enabled_filters,
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 89, in get_filtered_objects
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     list_objs = list(objs)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/filters.py", line 44, in filter_all
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if self._filter_one(obj, spec_obj):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 51, in _filter_one
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return self.host_passes(obj, spec)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 60, in host_passes
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     good_candidates = self.filter_candidates(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/__init__.py", line 81, in filter_candidates
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if filter_func(candidate):
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/scheduler/filters/pci_passthrough_filter.py", line 62, in <lambda>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     lambda candidate: host_state.pci_stats.support_requests(
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 775, in support_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     stats.apply_requests(requests, provider_mapping, numa_cells)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 907, in apply_requests
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     rp_uuids = self._get_rp_uuids_for_request(provider_mapping, r)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 871, in _get_rp_uuids_for_request
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     return [
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server   File "/var/lib/kolla/venv/lib/python3.10/site-packages/nova/pci/stats.py", line 874, in <listcomp>
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server     if group_id.startswith(request.request_id)
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server TypeError: startswith first arg must be str or a tuple of str, not NoneType
2024-10-10 22:15:20.729 25 ERROR oslo_messaging.rpc.server


The problematic code lies here:
https://opendev.org/openstack/nova/src/commit/1d788f11890385658f9485a281c1beeede94a830/nova/pci/stats.py#L874
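
The failure can be reproduced outside of nova with a few lines of Python. The sketch below is not nova's code: `rp_uuids_for_request`, its guarded variant, and the shape of `provider_mapping` (a dict of group id to a list of resource provider UUIDs) are assumptions made for illustration. It shows that `str.startswith(None)` raises the TypeError from the traceback, and how a check for a missing request_id would avoid it. Whether an empty result is the right behavior for the scheduler is not established here.

```python
def rp_uuids_for_request(provider_mapping, request_id):
    """Hypothetical stand-in for the list comprehension at stats.py#L874."""
    return [
        rp_uuids[0]
        for group_id, rp_uuids in provider_mapping.items()
        if group_id.startswith(request_id)
    ]


def rp_uuids_for_request_guarded(provider_mapping, request_id):
    """Same lookup, but a missing request_id matches no group."""
    if request_id is None:
        return []
    return rp_uuids_for_request(provider_mapping, request_id)


# Assumed mapping shape: group id -> list of resource provider UUIDs.
mapping = {"c277bea1-5c4c-40d1-812f-f8c680689214-0": ["rp-1"]}

error_message = None
try:
    # A null request_id reproduces the TypeError seen in the scheduler log.
    rp_uuids_for_request(mapping, None)
except TypeError as exc:
    error_message = str(exc)

matched = rp_uuids_for_request(mapping, "c277bea1-5c4c-40d1-812f-f8c680689214")
unmatched = rp_uuids_for_request_guarded(mapping, None)
```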

In some cases request_id was never populated for instances with PCI
devices:

MariaDB [nova]> select instance_uuid, request_id from pci_devices;                                                            
+--------------------------------------+--------------------------------------+
| instance_uuid                        | request_id                           |
+--------------------------------------+--------------------------------------+
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| deeafa5e-86a4-4e4e-9172-574d0a3629fc | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| a6cb03f8-f990-44e0-8eb3-fa4f79a33e17 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| NULL                                 | NULL                                 |
| 80967831-104b-4619-9415-f819e458b307 | NULL                                 |
| d9701926-2e83-4cab-9e37-54ffa0309a22 | NULL                                 |
| af160e3d-a4aa-418f-b9ff-eaa20ec1d947 | c277bea1-5c4c-40d1-812f-f8c680689214 |
| 1dbf0831-e4a6-4073-b501-ce9d9d598937 | ed6eab10-b0e6-48b8-be60-dad8c0553c8b |


The following queries show that request_id is either missing or set to null for an affected instance:
[nova] select pci_requests from instance_extra where instance_uuid='<INSTANCE_UUID>' \G;
[nova_api] select spec from request_specs where instance_uuid='<INSTANCE_UUID>'  \G;
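
To check the output of the first query programmatically, something like the following could be used. It assumes that the `pci_requests` column holds a JSON list of objects with a `request_id` key. That layout, the helper name `requests_missing_id`, and the sample values are assumptions for illustration and are not copied from a real database.

```python
import json


def requests_missing_id(pci_requests_json):
    """Return the indexes of requests whose request_id is absent or null."""
    requests = json.loads(pci_requests_json)
    return [
        index
        for index, request in enumerate(requests)
        if request.get("request_id") is None
    ]


# Made-up sample: one request with a null id, one without the key, one healthy.
sample = json.dumps([
    {"count": 1, "alias_name": "t4", "request_id": None},
    {"count": 1, "alias_name": "t4"},
    {"count": 1, "alias_name": "t4",
     "request_id": "c277bea1-5c4c-40d1-812f-f8c680689214"},
])

missing = requests_missing_id(sample)
```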


Freshly spawned instances do not suffer from a missing pci request_id.
Some of the problematic instances are old, spawned during the Train release.
Instances spawned during the Zed release have request_id set and are able to migrate.

We are able to work around this issue by adding a newly generated
request_id to the corresponding tables.
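
The workaround could be scripted roughly as below. This is a sketch that assumes the stored PCI requests are a JSON list of objects with a `request_id` key. `fill_request_ids` is a hypothetical helper that only transforms a string and does not write to any database. The result would still have to be stored in each affected table, presumably with the same id for the same request.

```python
import json
import uuid


def fill_request_ids(pci_requests_json):
    """Return the JSON with a fresh UUID for every missing request_id."""
    requests = json.loads(pci_requests_json)
    for request in requests:
        if request.get("request_id") is None:
            request["request_id"] = str(uuid.uuid4())
    return json.dumps(requests)


# Made-up input: the first request lacks an id, the second already has one.
before = json.dumps([
    {"count": 1, "request_id": None},
    {"count": 1, "request_id": "ed6eab10-b0e6-48b8-be60-dad8c0553c8b"},
])
after = json.loads(fill_request_ids(before))
```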

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: migration

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2084238

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2084238/+subscriptions