yahoo-eng-team team mailing list archive

[Bug 2022967] [NEW] instance_extra corrupts on N-1 cells upgrade

 

Public bug reported:

We upgraded a large cellsv2 deployment from Train (nova 20.6.1) to Ussuri (nova 21.2.5.dev27). The cell0 control plane is upgraded
and the cell controllers are all on the same nova version.
Only the nova-compute nodes were left running at the prior version, so that we can do an upgrade cell by cell.

But we now see nova-conductor reporting errors like the following:

ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - - -] Error updating resources for node sc9-1-hv329: oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting value: line 1 column 1 (char 0)
Jun 05 13:02:36 sc9-1-hv329 nova-compute[40856]: ['Traceback (most recent call last):\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/conductor/manager.py", line 139, in _object_dispatch\n    return getattr(target, method)(*args, **kwargs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper\n    result = fn(cls, context, *args, **kwargs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1333, in get_by_host_and_node\n    expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1238, in _make_instance_list\n    expected_attrs=expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 441, in _from_db_object\n    expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 502, in _extra_attributes_from_db_object\n    db_inst[\'extra\'].get(\'resources\'))\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1025, in _load_resources\n    jsonutils.loads(db_resources))\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_serialization/jsonutils.py", line 249, in loads\n    return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)\n', '  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads\n    return _default_decoder.decode(s)\n', '  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode\n    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n', '  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode\n    raise JSONDecodeError("Expecting value", s, err.value) from None\n', 'json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n'].

This error now prevents nova-compute from starting instances once they are stopped.
So far we have tracked it down to corruption of the nova.instance_extra table at the individual cell level, visible when comparing a row pre vs post upgrade.
The corruption seems to happen in the keypairs column and the columns following it, which suggests a shift in a Python class/structure.
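
As a rough way to spot affected rows, the check below tries to parse each JSON-bearing column of an instance_extra row and reports the ones that fail. This is only a sketch: the column names are taken from the dumps in this report, while `find_corrupt_columns` and the dict-shaped `row` are hypothetical and not part of nova.

```python
import json

# JSON-bearing columns of nova.instance_extra, as listed in the dumps in this report.
JSON_COLUMNS = (
    "numa_topology", "pci_requests", "flavor", "vcpu_model",
    "migration_context", "keypairs", "device_metadata",
    "trusted_certs", "vpmems", "resources",
)


def find_corrupt_columns(row):
    """Return the names of columns whose non-NULL value is not valid JSON.

    `row` is a hypothetical dict mapping column name to a str (or None for
    SQL NULL), e.g. built from a DB cursor; it is not a nova object.
    """
    bad = []
    for col in JSON_COLUMNS:
        value = row.get(col)
        if value is None:
            continue  # NULL columns are fine
        try:
            json.loads(value)
        except (ValueError, TypeError):
            # JSONDecodeError and UnicodeDecodeError are both ValueErrors
            bad.append(col)
    return bad


if __name__ == "__main__":
    sample = {
        "pci_requests": "[]",
        "keypairs": '{"objects": [}',      # malformed, stands in for a damaged value
        "vpmems": None,
        "resources": 'nceNUMATopology"',   # does not start with a JSON value
    }
    print(find_corrupt_columns(sample))
```

Run against rows fetched before and after stopping an instance, this would show which columns changed from parseable to unparseable.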

Pre Upgrade

MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
       created_at: 2023-06-02 20:42:11
       updated_at: 2023-06-02 20:43:48
       deleted_at: NULL
          deleted: 0
               id: 260958
    instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
    numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
     pci_requests: []
           flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
       vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
migration_context: NULL
         keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]}}
  device_metadata: NULL
    trusted_certs: NULL
           vpmems: NULL
        resources: NULL

Post Cell0 and cell controller upgrade:

After stopping the instance, its instance_extra row got corrupted (the keypairs column and the ones following it), and the instance can no longer be started unless the row is restored to its previous state.
This is after the cell controller upgrade, with nova-compute still running at Train; a restart of the service doesn't change the situation.

MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
*************************** 1. row ***************************
       created_at: 2023-06-02 20:42:11
       updated_at: 2023-06-05 17:19:51
       deleted_at: NULL
          deleted: 0
               id: 260958
    instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
    numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
     pci_requests: []
           flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
       vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
migration_context: NULL
         keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]¤ƒ
  device_metadata: NULL
    trusted_certs: NULL
           vpmems: +‚΂bƒ$=
                            W€ûd      €      ™°EJ‹™°Jô¨€   c6ed9384-dc62-4c88-b9f4-fb3eee03b025{"nova_object.name": "Insta
        resources: nceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nov
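
The content of the resources column above is consistent with the traceback: the text no longer begins with a JSON value, so the decoder fails at the very first character. A minimal sketch using only the standard library (the `fragment` string is copied from the start of the resources value in the dump, and `decode_error` is a hypothetical helper):

```python
import json

# Leading text of the corrupted `resources` column as shown in the dump above.
fragment = 'nceNUMATopology", "nova_object.namespace": "nova"'


def decode_error(text):
    """Return the JSONDecodeError message for `text`, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


message = decode_error(fragment)
print(message)
```

The printed message matches the one nova-conductor reports from `_load_resources`, which suggests the resources column alone is enough to trigger the failure.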

At this point we are accelerating the nova-compute upgrades to see if that fixes it.
If that is the case, then N-1 is not working with respect to a cellsv2 deployment.
So far we haven't found the issue in the code and would appreciate feedback on where to look.

** Affects: nova
     Importance: Undecided
         Status: New

** Description changed:

  We upgraded a large cellsv2 deployment from Train (nova 20.6.1) to Ussuri (nova 21.2.5.dev27) where the cell0 control plane is upgraded
  and the cell controllers are all on the same nova version.
  We only left the nova-compute nodes running at the prior version to do a upgrade cell by cell.
  
  But now we realized we got the nova-conductor reporting errors like
  
- ```
  ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - - -] Error updating resources for node us01odc-sc9-1-hv329.us01-odc.synopsys.com.: oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting value: line 1 column 1 (char 0)
  Jun 05 13:02:36 sc9-1-hv329 nova-compute[40856]: ['Traceback (most recent call last):\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/conductor/manager.py", line 139, in _object_dispatch\n    return getattr(target, method)(*args, **kwargs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper\n    result = fn(cls, context, *args, **kwargs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1333, in get_by_host_and_node\n    expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1238, in _make_instance_list\n    expected_attrs=expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 441, in _from_db_object\n    expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 502, in _extra_attributes_from_db_object\n    db_inst[\'extra\'].get(\'resources\'))\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1025, in _load_resources\n    jsonutils.loads(db_resources))\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_serialization/jsonutils.py", line 249, in loads\n    return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)\n', '  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads\n    return _default_decoder.decode(s)\n', '  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode\n    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n', '  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode\n    raise JSONDecodeError("Expecting value", s, err.value) from None\n', 'json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n'].
- ```
+ 
  
  This error now prevents nova-compute from starting instances once they are stopped.
  So far we tracked it down to a table nova.instance_extra corruption at the individual cell level when looking pre vs post upgrade.
- The corruption seem to happen within the keypairs and following colums of the table indicating a shift in a python class/structure.
+ The corruption seem to happen within the keypairs and following columns of the table indicating a shift in a python class/structure.
  
  Pre Upgrade
  
- ```
+ 
  MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
  *************************** 1. row ***************************
-        created_at: 2023-06-02 20:42:11
-        updated_at: 2023-06-02 20:43:48
-        deleted_at: NULL
-           deleted: 0
-                id: 260958
-     instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
-     numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
-      pci_requests: []
-            flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
-        vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
+        created_at: 2023-06-02 20:42:11
+        updated_at: 2023-06-02 20:43:48
+        deleted_at: NULL
+           deleted: 0
+                id: 260958
+     instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
+     numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
+      pci_requests: []
+            flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
+        vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
  migration_context: NULL
-          keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]}}
-   device_metadata: NULL
-     trusted_certs: NULL
-            vpmems: NULL
-         resources: NULL
- ```
+          keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]}}
+   device_metadata: NULL
+     trusted_certs: NULL
+            vpmems: NULL
+         resources: NULL
+ 
  
  Post Cell0 and cell controller upgrade:
  
- After stop the instance_extra got corrupted (keypairs colums and following) and you can no longer start it unless you fix the table back to the previous state
+ After stop the instance_extra got corrupted (keypairs columns and following) and you can no longer start it unless you fix the table back to the previous state
  This is post cell controller upgrade with running nova-compute at train, a restart of the serivce doesn't change the situation
  
- ```
+ 
  MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
  *************************** 1. row ***************************
-        created_at: 2023-06-02 20:42:11
-        updated_at: 2023-06-05 17:19:51
-        deleted_at: NULL
-           deleted: 0
-                id: 260958
-     instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
-     numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
-      pci_requests: []
-            flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
-        vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
+        created_at: 2023-06-02 20:42:11
+        updated_at: 2023-06-05 17:19:51
+        deleted_at: NULL
+           deleted: 0
+                id: 260958
+     instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
+     numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
+      pci_requests: []
+            flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
+        vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
  migration_context: NULL
-          keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]¤ƒ
-   device_metadata: NULL
-     trusted_certs: NULL
-            vpmems: +‚΂bƒ$=
-                             W€ûd      €      ™°EJ‹™°Jô¨€   c6ed9384-dc62-4c88-b9f4-fb3eee03b025{"nova_object.name": "Insta
-         resources: nceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nov
- ```
+          keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]¤ƒ
+   device_metadata: NULL
+     trusted_certs: NULL
+            vpmems: +‚΂bƒ$=
+                             W€ûd      €      ™°EJ‹™°Jô¨€   c6ed9384-dc62-4c88-b9f4-fb3eee03b025{"nova_object.name": "Insta
+         resources: nceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nov
+ 
  
  At this point we are accelerating the nova-compute upgrades to see if that fixes it
  If that is the case then N-1 is not working with respect to a cellsv2 deployment.
  So far we haven't found the issue in the code yet and would appreciate feedback where to look

** Description changed:

  We upgraded a large cellsv2 deployment from Train (nova 20.6.1) to Ussuri (nova 21.2.5.dev27) where the cell0 control plane is upgraded
  and the cell controllers are all on the same nova version.
  We only left the nova-compute nodes running at the prior version to do a upgrade cell by cell.
  
  But now we realized we got the nova-conductor reporting errors like
  
- ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - - -] Error updating resources for node us01odc-sc9-1-hv329.us01-odc.synopsys.com.: oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting value: line 1 column 1 (char 0)
+ ERROR nova.compute.manager [req-967855b9-6938-4ca0-b7b9-dcf0f5af9402 - - - - -] Error updating resources for node sc9-1-hv329: oslo_messaging.rpc.client.RemoteError: Remote error: JSONDecodeError Expecting value: line 1 column 1 (char 0)
  Jun 05 13:02:36 sc9-1-hv329 nova-compute[40856]: ['Traceback (most recent call last):\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/conductor/manager.py", line 139, in _object_dispatch\n    return getattr(target, method)(*args, **kwargs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_versionedobjects/base.py", line 184, in wrapper\n    result = fn(cls, context, *args, **kwargs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1333, in get_by_host_and_node\n    expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1238, in _make_instance_list\n    expected_attrs=expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 441, in _from_db_object\n    expected_attrs)\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 502, in _extra_attributes_from_db_object\n    db_inst[\'extra\'].get(\'resources\'))\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/nova/objects/instance.py", line 1025, in _load_resources\n    jsonutils.loads(db_resources))\n', '  File "/openstack/venvs/nova-21.2.13.dev6/lib/python3.6/site-packages/oslo_serialization/jsonutils.py", line 249, in loads\n    return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)\n', '  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads\n    return _default_decoder.decode(s)\n', '  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode\n    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n', '  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode\n    raise JSONDecodeError("Expecting value", s, err.value) from None\n', 'json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)\n'].
- 
  
  This error now prevents nova-compute from starting instances once they have been stopped.
  So far we have tracked it down to corruption of the nova.instance_extra table at the individual cell level, found by comparing the same row pre and post upgrade.
  The corruption seems to start within the keypairs column and continue into the columns that follow it, which suggests a shift in a Python class/structure.
  
  Pre Upgrade
- 
  
  MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
  *************************** 1. row ***************************
         created_at: 2023-06-02 20:42:11
         updated_at: 2023-06-02 20:43:48
         deleted_at: NULL
            deleted: 0
                 id: 260958
      instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
      numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
       pci_requests: []
             flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
         vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
  migration_context: NULL
           keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]}}
    device_metadata: NULL
      trusted_certs: NULL
             vpmems: NULL
          resources: NULL
  
- 
  Post Cell0 and cell controller upgrade:
  
  After a stop, the instance_extra row is corrupted (the keypairs column and those following it) and the instance can no longer be started unless the row is restored to its previous state.
  This is after the cell controller upgrade, with nova-compute still running Train; restarting the service does not change the situation.
- 
  
  MariaDB [(none)]> select * from nova.instance_extra where instance_uuid = 'bd3ad637-3291-454b-95e3-d498ce0f81bd'\G
  *************************** 1. row ***************************
         created_at: 2023-06-02 20:42:11
         updated_at: 2023-06-05 17:19:51
         deleted_at: NULL
            deleted: 0
                 id: 260958
      instance_uuid: bd3ad637-3291-454b-95e3-d498ce0f81bd
      numa_topology: {"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_object.name": "InstanceNUMACell", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 0, "cpuset": [0], "memory": 512, "pagesize": null, "cpu_topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "cpu_pinning_raw": {"0": 17}, "cpu_policy": "dedicated", "cpu_thread_policy": "prefer", "cpuset_reserved": null}, "nova_object.changes": ["cpu_topology", "cpuset_reserved", "cpu_pinning_raw", "id", "pagesize"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}
       pci_requests: []
             flavor: {"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 611, "name": "g.tiny.single_core", "memory_mb": 512, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "flavorid": "d3c79ed9-a6d8-4fb9-88a0-c70739c90c36", "swap": 0, "rxtx_factor": 1.0, "vcpu_weight": 0, "disabled": false, "is_public": true, "extra_specs": {"aggregate_instance_extra_specs:selection": "batch.x86_64.2x2", "hw:cpu_policy": "dedicated", "hw:cpu_threads": "1", "hw:cpu_thread_policy": "prefer"}, "description": null, "created_at": "2021-04-27T21:26:16Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}
         vcpu_model: {"nova_object.name": "VirtCPUModel", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"arch": null, "vendor": null, "topology": {"nova_object.name": "VirtCPUTopology", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"sockets": 1, "cores": 1, "threads": 1}, "nova_object.changes": ["threads", "cores", "sockets"]}, "features": [], "mode": "host-passthrough", "model": null, "match": "exact"}, "nova_object.changes": ["features", "vendor", "topology", "model", "mode", "match", "arch"]}
  migration_context: NULL
           keypairs: {"nova_object.name": "KeyPairList", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"objects": [{"nova_object.name": "KeyPair", "nova_object.namespace": "nova", "nova_object.version": "1.4", "nova_object.data": {"id": 10, "name": "rpc_support", "user_id": "5f1bf3f91c2d4ab7b46c13441dc0952f", "fingerprint": "c5:79:2a:70:6a:f9:32:65:16:39:d4:45:9f:d1:86:21", "public_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCelh7W66McTxWeCM+eqRlxtRse8sTHLA+6vzmeNX4b+dyVwvuhVFt4xd0nr42CVx8pz7dfZUgeVUSLoURFvvTNPpt2TTn1gITFHUga0hwiGkVWtpz0y4pVzyNDQZUUMHbuLGU+E++8RHVkxTplhclTD57+fhGZdu8VV1Rh8ZL+UStKqlY1YUDP1NubJ8kMhUbllYXeCa3pC5L+vA0svHVe/Or1hV2Ls7xtYVFdlgrwKmJ8lNi4yJZOW02f/b3YcsFTjAe+ic2RK2HGhDOGxD11ALBFT8SF419mMq+m14eXiOfG6jbavzWCrMBGXTi/gwBqRHslNAqpu7TcsvCyIIP7 root@mcp-ctrl01\n", "type": "ssh", "created_at": "2019-11-13T02:10:53Z", "updated_at": null, "deleted_at": null, "deleted": false}}]¤ƒ
    device_metadata: NULL
      trusted_certs: NULL
             vpmems: +‚΂bƒ$=
                              W€ûd      €      ™°EJ‹™°Jô¨€   c6ed9384-dc62-4c88-b9f4-fb3eee03b025{"nova_object.name": "Insta
          resources: nceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nov
  
- 
  At this point we are accelerating the nova-compute upgrades to see if that fixes it.
  If it does, then N-1 is not working with respect to a cellsv2 deployment.
  So far we have not found the issue in the code and would appreciate feedback on where to look.
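
For anyone comparing rows the same way, here is a minimal sketch of a check that reports which JSON columns of an instance_extra row no longer decode. It assumes the row has already been fetched into a plain dict. The names `JSON_COLUMNS` and `undecodable_columns` are hypothetical helpers for illustration and are not part of nova. The sample values are shortened stand-ins, not the real column contents.

```python
import json

# Columns of nova.instance_extra that hold serialized JSON, taken from the
# query output in this report.
JSON_COLUMNS = (
    "numa_topology", "pci_requests", "flavor", "vcpu_model",
    "migration_context", "keypairs", "device_metadata",
    "trusted_certs", "vpmems", "resources",
)


def undecodable_columns(row):
    """Return {column: error message} for values that are not valid JSON.

    `row` maps column names to str, bytes or None. A SQL NULL (None) is
    treated as healthy because the column is simply unset.
    """
    bad = {}
    for column in JSON_COLUMNS:
        value = row.get(column)
        if value is None:
            continue
        if isinstance(value, bytes):
            # Corrupted rows may contain bytes that are not valid UTF-8.
            value = value.decode("utf-8", errors="replace")
        try:
            json.loads(value)
        except json.JSONDecodeError as exc:
            bad[column] = str(exc)
    return bad


if __name__ == "__main__":
    # Shortened stand-ins for the post-upgrade row shown above.
    row = {
        "pci_requests": "[]",
        "keypairs": '{"nova_object.name": "KeyPairList", "objects": []',
        "vpmems": None,
        "resources": 'nceNUMATopology", "nova_object.namespace": "nova"',
    }
    for column, error in undecodable_columns(row).items():
        print(column, "->", error)
```

A value that begins partway through another column's JSON, like the `resources` stand-in here, fails with the same "Expecting value: line 1 column 1 (char 0)" message that appears in the traceback.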

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2022967

Title:
  instance_extra corrupts on N-1 cells upgrade

Status in OpenStack Compute (nova):
  New
