yahoo-eng-team team mailing list archive

[Bug 2049030] [NEW] request_spec out of sync with instance details

 

Public bug reported:

Description
===========

We run OpenStack Zed, based on the Ubuntu Cloud Archive packages. We
regularly live-migrate instances between hypervisors for host
maintenance, but suddenly had issues with many instances. Nova
correctly refused to migrate an instance from an AMD CPU to an Intel
CPU. We have different flavors for different CPU kinds and scheduler
filters that limit these flavors to specific hosts, but here the
scheduler failed.

I picked one failing instance and started debugging. The flavor was
`a2.medium` (ID: 982), a flavor running on AMD CPUs, but our scheduler
filter was called with a request spec containing an `e2.medium` (ID:
831) flavor (Intel CPU). It therefore filtered for the wrong hosts, and
Nova aborted the live migration because of an invalid target.

The request spec the filter received looked like this:

    RequestSpec(
        availability_zone=None,
        flavor=Flavor(831),
        force_hosts=None,
        force_nodes=None,
        id=485269,
        ignore_hosts=[...],
        image=ImageMeta(35c8d5e3-2791-4565-9c19-291869fde98d),
        instance_group=None,
        instance_uuid=982c3ead-59b1-4acd-876b-d55166d8e7f0,
        is_bfv=False,
        limits=SchedulerLimits,
        network_metadata=NetworkMetadata,
        num_instances=1,
        numa_topology=None,
        pci_requests=InstancePCIRequests,
        project_id='16e980bb63b4415694dd2130f5977b8b',
        request_level_params=RequestLevelParams,
        requested_destination=Destination,
        requested_networks=NetworkRequestList,
        requested_resources=[],
        retry=None,
        scheduler_hints={},
        security_groups=SecurityGroupList,
        user_id='d7451e969b3f4229bd1869ed9ad591f3'
    )

Despite that, the instance itself is listed with the correct flavor:

    $ openstack server show 982c3ead-59b1-4acd-876b-d55166d8e7f0 | grep flavor
    | flavor | disk='80', ephemeral='0', original_name='a2.medium', ram='8192', swap='0', vcpus='4'

Here is an excerpt of our flavors:

    MariaDB [novaapi]> select id,flavorid,name,vcpus,memory_mb,root_gb from flavors where name LIKE 'a2.%' OR name LIKE 'e2.%';
    +------+----------+------------+-------+-----------+---------+
    | id   | flavorid | name       | vcpus | memory_mb | root_gb |
    +------+----------+------------+-------+-----------+---------+
    |  807 | 022030   | e2.micro   |     2 |      2048 |      25 |
    |  813 | 022010   | e2.nano    |     1 |       512 |      10 |
    |  819 | 022060   | e2.large   |     4 |     12288 |      40 |
    |  822 | 022020   | e2.tiny    |     2 |      1024 |      20 |
    |  825 | 022070   | e2.xlarge  |     4 |     16384 |      60 |
    |  828 | 022080   | e2.2xlarge |     8 |     32768 |      80 |
    |  831 | 022050   | e2.medium  |     4 |      8192 |      25 |
    |  834 | 022040   | e2.small   |     2 |      4096 |      25 |
    |  980 | 022090   | e2.4xlarge |    16 |     65536 |     160 |
    |  982 | 026050   | a2.medium  |     4 |      8192 |      80 |
    | 1003 | 026040   | a2.small   |     2 |      4096 |      40 |
    | 1005 | 026060   | a2.large   |     6 |     12288 |     120 |
    | 1008 | 026070   | a2.xlarge  |     8 |     16384 |     160 |
    | 1011 | 026090   | a2.4xlarge |    32 |     65536 |     640 |
    | 1014 | 026080   | a2.2xlarge |    16 |     32768 |     320 |
    +------+----------+------------+-------+-----------+---------+

The instance also has the correct flavor listed in Horizon, in the CLI
and in the database (`instance_type_id`). According to both the database
and the actual state, it is running on the correct compute host
(compute-a2b1).

    MariaDB [nova]> select * from instances where uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
    *************************** 1. row ***************************
        ...
             launched_on: compute-t2a3
        instance_type_id: 982  <==========================================
                    uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
        ...
                    node: compute-a2b1.cld.domain.tld

Yet nova-scheduler runs the filter with the wrong request_spec object
shown above. I traced the `request_spec` via source code and print
statements from nova-scheduler through nova-conductor back to nova-api,
where it is loaded in
https://github.com/openstack/nova/blob/stable/zed/nova/compute/api.py#L5496.
At that point, the request spec is already loaded with the wrong flavor
ID (831 instead of 982). A look at the database confirmed that:

    MariaDB [novaapi]> select * from request_specs where instance_uuid = '7f83337d-88a9-4f49-a4b0-cc0495ea698a' \G
    *************************** 1. row ***************************
        created_at: 2024-01-10 11:19:30
        updated_at: NULL
                id: 499312
     instance_uuid: 7f83337d-88a9-4f49-a4b0-cc0495ea698a
              spec: {
                      "nova_object.name": "RequestSpec",
                      "nova_object.namespace": "nova",
                      "nova_object.version": "1.14",
                      "nova_object.data": {
                        "image": {
                          "nova_object.name": "ImageMeta",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.8",
                          "nova_object.data": {
                            "id": "4cd37a9e-7bd6-443d-83f5-1b96f7ff005d",
                            "name": "ubuntu-22.04",
                            "status": "active",
                            "checksum": "f4a9b90d378d90fdbf66b2ad3afe4da7",
                            "owner": "b18c2da2dbfa45138fb6077eafb2aa51",
                            "size": 2361393152,
                            "container_format": "bare",
                            "disk_format": "raw",
                            "created_at": "2023-12-12T02:06:42Z",
                            "updated_at": "2023-12-12T02:11:15Z",
                            "min_ram": 128,
                            "min_disk": 5,
                            "properties": {
                              "nova_object.name": "ImageMetaProps",
                              "nova_object.namespace": "nova",
                              "nova_object.version": "1.31",
                              "nova_object.data": {
                                "hw_architecture": "x86_64",
                                "hw_disk_bus": "scsi",
                                "hw_firmware_type": "uefi",
                                "hw_qemu_guest_agent": true,
                                "hw_scsi_model": "virtio-scsi",
                                "hw_vm_mode": "hvm",
                                "img_hv_type": "kvm",
                                "os_admin_user": "ubuntu",
                                "os_distro": "ubuntu",
                                "os_require_quiesce": true,
                                "os_type": "linux"
                              },
                              "nova_object.changes": [
                                "hw_disk_bus",
                                "hw_architecture",
                                "hw_vm_mode",
                                "os_type",
                                "hw_qemu_guest_agent",
                                "os_admin_user",
                                "os_distro",
                                "hw_firmware_type",
                                "hw_scsi_model",
                                "os_require_quiesce",
                                "img_hv_type"
                              ]
                            }
                          },
                          "nova_object.changes": [
                            "updated_at",
                            "min_ram",
                            "size",
                            "id",
                            "properties",
                            "status",
                            "disk_format",
                            "created_at",
                            "name",
                            "owner",
                            "checksum",
                            "min_disk",
                            "container_format"
                          ]
                        },
                        "numa_topology": null,
                        "pci_requests": {
                          "nova_object.name": "InstancePCIRequests",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.1",
                          "nova_object.data": { "requests": [] },
                          "nova_object.changes": ["requests"]
                        },
                        "project_id": "61454faef1234faa86673d8b7760938a",
                        "user_id": "13ef2628df2b4eba875934d148d2cd26",
                        "availability_zone": null,
                        "flavor": {
                          "nova_object.name": "Flavor",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.2",
                          "nova_object.data": {
                            "id": 982,
                            "name": "a2.medium",
                            "memory_mb": 8192,
                            "vcpus": 4,
                            "root_gb": 80,
                            "ephemeral_gb": 0,
                            "flavorid": "026050",
                            "swap": 0,
                            "rxtx_factor": 1.0,
                            "vcpu_weight": 0,
                            "disabled": false,
                            "is_public": true,
                            "extra_specs": {
                              "hw:cpu_max_sockets": "1",
                              "hw:cpu_policy": "shared",
                              "os:secure_boot": "disabled",
                              "quota:cpu_shares": "400"
                            },
                            "description": null,
                            "created_at": "2023-03-29T15:06:03Z",
                            "updated_at": null,
                            "deleted_at": null,
                            "deleted": false
                          },
                          "nova_object.changes": ["extra_specs"]
                        },
                        "num_instances": 1,
                        "ignore_hosts": null,
                        "force_hosts": null,
                        "force_nodes": null,
                        "requested_destination": null,
                        "retry": null,
                        "limits": {
                          "nova_object.name": "SchedulerLimits",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.0",
                          "nova_object.data": {
                            "numa_topology": null,
                            "vcpu": null,
                            "disk_gb": null,
                            "memory_mb": null
                          },
                          "nova_object.changes": ["vcpu", "numa_topology", "disk_gb", "memory_mb"]
                        },
                        "instance_group": null,
                        "scheduler_hints": {},
                        "instance_uuid": "7f83337d-88a9-4f49-a4b0-cc0495ea698a",
                        "security_groups": {
                          "nova_object.name": "SecurityGroupList",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.1",
                          "nova_object.data": {
                            "objects": [
                              {
                                "nova_object.name": "SecurityGroup",
                                "nova_object.namespace": "nova",
                                "nova_object.version": "1.2",
                                "nova_object.data": {
                                  "uuid": "5d94b29d-fdb1-4633-92df-8fa47e79864b"
                                },
                                "nova_object.changes": ["uuid"]
                              }
                            ]
                          },
                          "nova_object.changes": ["objects"]
                        },
                        "is_bfv": false,
                        "requested_resources": []
                      },
                      "nova_object.changes": [
                        "image",
                        "is_bfv",
                        "requested_destination",
                        "security_groups",
                        "force_nodes",
                        "num_instances",
                        "retry",
                        "numa_topology",
                        "instance_group",
                        "limits",
                        "instance_uuid",
                        "availability_zone",
                        "user_id",
                        "requested_resources",
                        "force_hosts",
                        "ignore_hosts",
                        "project_id",
                        "pci_requests",
                        "scheduler_hints",
                        "flavor"
                      ]
                    }
    1 row in set (0.000 sec)

I also found the `instance_extra` data to be out of sync, with a "new"
flavor still present:

    MariaDB [nova]> select * from instance_extra where instance_uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
    *************************** 1. row ***************************
        created_at: 2023-11-06 14:44:23
        updated_at: 2024-01-10 11:26:50
        deleted_at: NULL
           deleted: 0
                id: 442795
     instance_uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
     numa_topology: NULL
      pci_requests: []
            flavor: {
                      "cur": {
                        "nova_object.name": "Flavor",
                        "nova_object.namespace": "nova",
                        "nova_object.version": "1.2",
                        "nova_object.data": {
                          "id": 982,
                          "name": "a2.medium",
                          "memory_mb": 8192,
                          "vcpus": 4,
                          "root_gb": 80,
                          "ephemeral_gb": 0,
                          "flavorid": "026050",
                          "swap": 0,
                          "rxtx_factor": 1.0,
                          "vcpu_weight": 0,
                          "disabled": false,
                          "is_public": true,
                          "extra_specs": {
                            "hw:cpu_max_sockets": "1",
                            "hw:cpu_policy": "shared",
                            "os:secure_boot": "disabled",
                            "quota:cpu_shares": "400"
                          },
                          "description": null,
                          "created_at": "2023-03-29T15:06:03Z",
                          "updated_at": null,
                          "deleted_at": null,
                          "deleted": false
                        },
                        "nova_object.changes": ["extra_specs"]
                      },
                      "old": null,
                      "new": {
                        "nova_object.name": "Flavor",
                        "nova_object.namespace": "nova",
                        "nova_object.version": "1.2",
                        "nova_object.data": {
                          "id": 831,
                          "name": "e2.medium",
                          "memory_mb": 8192,
                          "vcpus": 4,
                          "root_gb": 25,
                          "ephemeral_gb": 0,
                          "flavorid": "022050",
                          "swap": 0,
                          "rxtx_factor": 1.0,
                          "vcpu_weight": 0,
                          "disabled": false,
                          "is_public": true,
                          "extra_specs": {
                            "hw:cpu_max_sockets": "1",
                            "hw:cpu_policy": "shared",
                            "quota:cpu_shares": "400"
                          },
                          "description": null,
                          "created_at": "2022-08-03T11:11:49Z",
                          "updated_at": null,
                          "deleted_at": null,
                          "deleted": false
                        },
                        "nova_object.changes": ["extra_specs"]
                      }
                    }
    device_metadata: NULL
      trusted_certs: NULL
             vpmems: NULL
          resources: NULL
    1 row in set (0.000 sec)
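
Spotting such leftovers can be scripted. The following is a minimal
sketch, not an existing Nova tool: it assumes the `flavor` column of
`instance_extra` has already been read as a JSON string, and the function
names `flavor_slots` and `has_leftover_resize_flavor` are made up for
illustration. As far as the resize workflow goes, "old" and "new" are
expected to be populated while a resize is in progress or awaiting
confirmation, so a hit only points to stale data if no resize is pending.

```python
import json


def flavor_slots(flavor_json):
    """Map each slot ("cur", "old", "new") of an instance_extra.flavor
    value to the flavor ID stored there, or None if the slot is empty."""
    slots = json.loads(flavor_json)
    result = {}
    for slot in ("cur", "old", "new"):
        obj = slots.get(slot)
        result[slot] = obj["nova_object.data"]["id"] if obj else None
    return result


def has_leftover_resize_flavor(flavor_json):
    """Return True if an "old" or "new" flavor is still stored."""
    ids = flavor_slots(flavor_json)
    return ids["old"] is not None or ids["new"] is not None


# Trimmed-down copy of the row shown above (only the fields used here).
sample = json.dumps({
    "cur": {"nova_object.name": "Flavor",
            "nova_object.data": {"id": 982, "name": "a2.medium"}},
    "old": None,
    "new": {"nova_object.name": "Flavor",
            "nova_object.data": {"id": 831, "name": "e2.medium"}},
})

print(flavor_slots(sample))                # {'cur': 982, 'old': None, 'new': 831}
print(has_leftover_resize_flavor(sample))  # True
```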

It appears that a user previously tried to resize the instance, which
failed (no idea why yet), and neither the `instance_extra` nor the
`request_spec` data was reverted correctly:

    $ openstack server migration list --server 982c3ead-59b1-4acd-876b-d55166d8e7f0
    +--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+
    |     Id | UUID                      | Source Node               | Dest Node                 | Source Compute | Dest Compute | Dest Host    | Status    | Server UUID               | Old Flavor | New Flavor | Type           | Created At                | Updated At                 |
    +--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+
    | 138991 | 7d0f464b-2fea-49b0-87e1-  | None                      | None                      | compute-a2b1   | None         | None         | error     | 982c3ead-59b1-4acd-876b-  |        982 |        982 | live-migration | 2024-01-                  | 2024-01-09T09:56:50.000000 |
    |        | 596624ca03cc              |                           |                           |                |              |              |           | d55166d8e7f0              |            |            |                | 09T09:56:45.000000        |                            |
    | 138931 | e1eeef36-f4e9-4f2a-adc6-  | compute-                  | compute-                  | compute-a2b1   | compute-t2b2 | XXXXXXXXXXXX | error     | 982c3ead-59b1-4acd-876b-  |        982 |        831 | resize         | 2024-01-                  | 2024-01-08T14:11:14.000000 |
    |        | 985413cea4ed              | a2b1.cld.domain.tld       | t2b2.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 08T14:11:13.000000        |                            |
    | 138062 | c12a2ae0-969f-454f-b6d0-  | compute-                  | compute-                  | compute-t2a3   | compute-a2b1 | XXXXXXXXXXXX | confirmed | 982c3ead-59b1-4acd-876b-  |        831 |        982 | resize         | 2023-12-                  | 2023-12-04T21:18:04.000000 |
    |        | fa79381ba29f              | t2a3.cld.domain.tld       | a2b1.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 04T21:17:44.000000        |                            |
    | 137966 | 12e3cffe-caeb-44ce-       | compute-                  | compute-                  | compute-t2c1   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-12-                  | 2023-12-02T16:35:19.000000 |
    |        | ac5a-baa0aa17d6e1         | t2c1.cld.domain.tld       | t2a3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 02T16:14:13.000000        |                            |
    | 137732 | 2ce96b71-143e-46cf-a7b2-  | compute-                  | compute-                  | compute-t2a3   | compute-t2c1 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-12-                  | 2023-12-01T09:24:13.000000 |
    |        | 822b2308f60a              | t2a3.cld.domain.tld       | t2c1.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 01T09:23:32.000000        |                            |
    | 137286 | 1c6c0e7d-cf33-4522-9710-  | compute-                  | compute-                  | compute-t2c3   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-30T00:34:06.000000 |
    |        | e372392f3dad              | t2c3.cld.domain.tld       | t2a3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 29T23:42:59.000000        |                            |
    | 137013 | 4afff1ec-08fd-4995-8642-  | compute-                  | compute-                  | compute-t2a3   | compute-t2c3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-28T09:55:35.000000 |
    |        | b8341a169efb              | t2a3.cld.domain.tld       | t2c3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 28T09:54:53.000000        |                            |
    | 135478 | cf28ff61-87d7-49d8-97b5-  | compute-                  | compute-                  | compute-t2c3   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-09T06:52:46.000000 |
    |        | 10382163d158              | t2c3.cld.domain.tld       | t2a3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 09T06:48:44.000000        |                            |
    | 135244 | 30f87d2e-ffc5-43ea-       | compute-                  | compute-                  | compute-t2a3   | compute-t2c3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-08T20:49:47.000000 |
    |        | ae16-ade2c9a553b3         | t2a3.cld.domain.tld       | t2c3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 08T20:49:01.000000        |                            |
    +--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+

Yet even the live migration attempted afterwards lists the correct
flavor ID.

My concern is not so much the bug that leaves the data inconsistent,
especially after failures. We know that this happens regularly with
every OpenStack release, and we have had to fix the database many times
before.

Our problem here is the complexity of fixing the inconsistencies,
because most of the affected data is stored as serialized Python
objects.

Are there any tools or commands, automatic or manual, to check and fix
these request spec data inconsistencies? Maybe something similar to the
`nova-manage placement heal_allocations` command?

At the moment, I cannot even tell how many instances are affected. We
also use scheduler filters to isolate users and projects on separate
hosts, even where the hosts are technically compatible. In those cases,
inconsistent data like this would not make a live migration fail, but
would silently violate our security and isolation boundaries. I need to
check each instance manually.
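
Such a per-instance check could be scripted along the following lines.
This is only a sketch under stated assumptions, not an existing Nova
command: it assumes the rows were collected beforehand from
`instances` (`uuid`, `instance_type_id`) in the `nova` database and from
`request_specs` (`instance_uuid`, `spec`) in the `novaapi` database, and
that `instance_type_id` and the flavor `id` inside the spec refer to the
same ID space, as they do in the rows shown in this report. The function
names and the second UUID are hypothetical. It only reports differences
and does not repair anything; instances with a resize in progress may
legitimately differ.

```python
import json


def spec_flavor_id(spec_json):
    """Return the flavor ID embedded in a request_specs.spec value."""
    spec = json.loads(spec_json)
    flavor = spec["nova_object.data"].get("flavor")
    return flavor["nova_object.data"]["id"] if flavor else None


def find_mismatches(rows):
    """rows: iterable of (instance_uuid, instance_type_id, spec_json).

    Returns a list of (uuid, instance_type_id, spec_flavor_id) for every
    instance whose request spec carries a different flavor ID.
    """
    mismatches = []
    for uuid, instance_type_id, spec_json in rows:
        spec_id = spec_flavor_id(spec_json)
        if spec_id != instance_type_id:
            mismatches.append((uuid, instance_type_id, spec_id))
    return mismatches


def make_spec(flavor_id):
    """Build a heavily trimmed spec string, for this example only."""
    return json.dumps({
        "nova_object.name": "RequestSpec",
        "nova_object.data": {
            "flavor": {
                "nova_object.name": "Flavor",
                "nova_object.data": {"id": flavor_id},
            },
        },
    })


rows = [
    # The instance from this report: instances says 982, spec says 831.
    ("982c3ead-59b1-4acd-876b-d55166d8e7f0", 982, make_spec(831)),
    # Hypothetical consistent instance, for contrast.
    ("00000000-0000-0000-0000-000000000001", 982, make_spec(982)),
]

for uuid, instance_flavor, spec_flavor in find_mismatches(rows):
    print(f"{uuid}: instances has {instance_flavor}, "
          f"request spec has {spec_flavor}")
```

Running the same comparison over all rows would at least give a count of
affected instances before any manual repair.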

Steps to reproduce
==================

I cannot provide commands that produce this data inconsistency yet, but
when manually introduced to a system, live migrations can fail because
the scheduler makes wrong decisions.

Expected result
===============

All places where nova stores the flavor details should be kept in sync,
or at least be fixable/resyncable on failures.

Actual result
=============

The scheduler runs with wrong data, violating scheduling constraints
such as compatibility, security and isolation boundaries. In the case of
compatibility, operations such as live migrations will fail. In other
cases, no apparent error may occur at all.

Environment
===========

1. Exact version of OpenStack you are running. See the following
   list for all releases: http://docs.openstack.org/releases/

    ii  nova-common                      3:25.2.1-0ubuntu1                                    all          OpenStack Compute - common files
    ii  nova-conductor                   3:25.2.1-0ubuntu1                                    all          OpenStack Compute - conductor service
    ii  nova-scheduler                   3:25.2.1-0ubuntu1                                    all          OpenStack Compute - virtual machine scheduler
    ii  nova-spiceproxy                  3:25.2.1-0ubuntu1                                    all          OpenStack Compute - spice html5 proxy
    ii  python3-nova                     3:25.2.1-0ubuntu1                                    all          OpenStack Compute Python 3 libraries

2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   Ceph 17.2.7-1focal, local qcow2 disks

4. Which networking type did you use?
   Neutron ML2/LXB

Logs & Configs
==============

The tool *sosreport* has support for some OpenStack projects.
It's worth having a look at it. For example, if you want to collect
the logs of a compute node you would execute:

   $ sudo sosreport -o openstack_nova --batch

on that compute node. Attach the logs to this bug report. Please
consider that these logs need to be collected in "DEBUG" mode.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2049030

Title:
  request_spec out of sync with instance details

Status in OpenStack Compute (nova):
  New

                                  "os_require_quiesce": true,
                                  "os_type": "linux"
                                },
                                "nova_object.changes": [
                                  "hw_disk_bus",
                                  "hw_architecture",
                                  "hw_vm_mode",
                                  "os_type",
                                  "hw_qemu_guest_agent",
                                  "os_admin_user",
                                  "os_distro",
                                  "hw_firmware_type",
                                  "hw_scsi_model",
                                  "os_require_quiesce",
                                  "img_hv_type"
                                ]
                              }
                            },
                            "nova_object.changes": [
                              "updated_at",
                              "min_ram",
                              "size",
                              "id",
                              "properties",
                              "status",
                              "disk_format",
                              "created_at",
                              "name",
                              "owner",
                              "checksum",
                              "min_disk",
                              "container_format"
                            ]
                          },
                          "numa_topology": null,
                          "pci_requests": {
                            "nova_object.name": "InstancePCIRequests",
                            "nova_object.namespace": "nova",
                            "nova_object.version": "1.1",
                            "nova_object.data": { "requests": [] },
                            "nova_object.changes": ["requests"]
                          },
                          "project_id": "61454faef1234faa86673d8b7760938a",
                          "user_id": "13ef2628df2b4eba875934d148d2cd26",
                          "availability_zone": null,
                          "flavor": {
                            "nova_object.name": "Flavor",
                            "nova_object.namespace": "nova",
                            "nova_object.version": "1.2",
                            "nova_object.data": {
                              "id": 982,
                              "name": "a2.medium",
                              "memory_mb": 8192,
                              "vcpus": 4,
                              "root_gb": 80,
                              "ephemeral_gb": 0,
                              "flavorid": "026050",
                              "swap": 0,
                              "rxtx_factor": 1.0,
                              "vcpu_weight": 0,
                              "disabled": false,
                              "is_public": true,
                              "extra_specs": {
                                "hw:cpu_max_sockets": "1",
                                "hw:cpu_policy": "shared",
                                "os:secure_boot": "disabled",
                                "quota:cpu_shares": "400"
                              },
                              "description": null,
                              "created_at": "2023-03-29T15:06:03Z",
                              "updated_at": null,
                              "deleted_at": null,
                              "deleted": false
                            },
                            "nova_object.changes": ["extra_specs"]
                          },
                          "num_instances": 1,
                          "ignore_hosts": null,
                          "force_hosts": null,
                          "force_nodes": null,
                          "requested_destination": null,
                          "retry": null,
                          "limits": {
                            "nova_object.name": "SchedulerLimits",
                            "nova_object.namespace": "nova",
                            "nova_object.version": "1.0",
                            "nova_object.data": {
                              "numa_topology": null,
                              "vcpu": null,
                              "disk_gb": null,
                              "memory_mb": null
                            },
                            "nova_object.changes": ["vcpu", "numa_topology", "disk_gb", "memory_mb"]
                          },
                          "instance_group": null,
                          "scheduler_hints": {},
                          "instance_uuid": "7f83337d-88a9-4f49-a4b0-cc0495ea698a",
                          "security_groups": {
                            "nova_object.name": "SecurityGroupList",
                            "nova_object.namespace": "nova",
                            "nova_object.version": "1.1",
                            "nova_object.data": {
                              "objects": [
                                {
                                  "nova_object.name": "SecurityGroup",
                                  "nova_object.namespace": "nova",
                                  "nova_object.version": "1.2",
                                  "nova_object.data": {
                                    "uuid": "5d94b29d-fdb1-4633-92df-8fa47e79864b"
                                  },
                                  "nova_object.changes": ["uuid"]
                                }
                              ]
                            },
                            "nova_object.changes": ["objects"]
                          },
                          "is_bfv": false,
                          "requested_resources": []
                        },
                        "nova_object.changes": [
                          "image",
                          "is_bfv",
                          "requested_destination",
                          "security_groups",
                          "force_nodes",
                          "num_instances",
                          "retry",
                          "numa_topology",
                          "instance_group",
                          "limits",
                          "instance_uuid",
                          "availability_zone",
                          "user_id",
                          "requested_resources",
                          "force_hosts",
                          "ignore_hosts",
                          "project_id",
                          "pci_requests",
                          "scheduler_hints",
                          "flavor"
                        ]
                      }
      1 row in set (0.000 sec)

  I further identified the `instance_extra` data to be out of sync, with
  a "new" flavor present:

      MariaDB [nova]> select * from instance_extra where instance_uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
      *************************** 1. row ***************************
          created_at: 2023-11-06 14:44:23
          updated_at: 2024-01-10 11:26:50
          deleted_at: NULL
             deleted: 0
                  id: 442795
       instance_uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
       numa_topology: NULL
        pci_requests: []
              flavor: {
                        "cur": {
                          "nova_object.name": "Flavor",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.2",
                          "nova_object.data": {
                            "id": 982,
                            "name": "a2.medium",
                            "memory_mb": 8192,
                            "vcpus": 4,
                            "root_gb": 80,
                            "ephemeral_gb": 0,
                            "flavorid": "026050",
                            "swap": 0,
                            "rxtx_factor": 1.0,
                            "vcpu_weight": 0,
                            "disabled": false,
                            "is_public": true,
                            "extra_specs": {
                              "hw:cpu_max_sockets": "1",
                              "hw:cpu_policy": "shared",
                              "os:secure_boot": "disabled",
                              "quota:cpu_shares": "400"
                            },
                            "description": null,
                            "created_at": "2023-03-29T15:06:03Z",
                            "updated_at": null,
                            "deleted_at": null,
                            "deleted": false
                          },
                          "nova_object.changes": ["extra_specs"]
                        },
                        "old": null,
                        "new": {
                          "nova_object.name": "Flavor",
                          "nova_object.namespace": "nova",
                          "nova_object.version": "1.2",
                          "nova_object.data": {
                            "id": 831,
                            "name": "e2.medium",
                            "memory_mb": 8192,
                            "vcpus": 4,
                            "root_gb": 25,
                            "ephemeral_gb": 0,
                            "flavorid": "022050",
                            "swap": 0,
                            "rxtx_factor": 1.0,
                            "vcpu_weight": 0,
                            "disabled": false,
                            "is_public": true,
                            "extra_specs": {
                              "hw:cpu_max_sockets": "1",
                              "hw:cpu_policy": "shared",
                              "quota:cpu_shares": "400"
                            },
                            "description": null,
                            "created_at": "2022-08-03T11:11:49Z",
                            "updated_at": null,
                            "deleted_at": null,
                            "deleted": false
                          },
                          "nova_object.changes": ["extra_specs"]
                        }
                      }
      device_metadata: NULL
        trusted_certs: NULL
               vpmems: NULL
            resources: NULL
      1 row in set (0.000 sec)
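
  The check I did by eye on this row could be sketched roughly as
  follows. This is only a minimal sketch, not an existing Nova tool: the
  function names `flavor_ids` and `has_leftover_new_flavor` are made up
  for illustration, and the code assumes the `flavor` column always has
  the "cur"/"old"/"new" layout shown above. A "new" flavor is also
  present during a legitimate resize that is still in progress, so
  instances in that state would have to be excluded separately.

```python
import json


def flavor_ids(flavor_json):
    """Return the flavor IDs stored under "cur", "old" and "new".

    A slot that is null (or missing) in the JSON maps to None.
    """
    doc = json.loads(flavor_json)
    ids = {}
    for slot in ("cur", "old", "new"):
        entry = doc.get(slot)
        ids[slot] = entry["nova_object.data"]["id"] if entry else None
    return ids


def has_leftover_new_flavor(flavor_json):
    """True if a "new" flavor is recorded and differs from "cur"."""
    ids = flavor_ids(flavor_json)
    return ids["new"] is not None and ids["new"] != ids["cur"]


# Trimmed-down version of the instance_extra.flavor value shown above.
sample = json.dumps({
    "cur": {"nova_object.data": {"id": 982, "name": "a2.medium"}},
    "old": None,
    "new": {"nova_object.data": {"id": 831, "name": "e2.medium"}},
})

print(flavor_ids(sample))
print(has_leftover_new_flavor(sample))
```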

  It appears that a user tried to resize the instance earlier. That
  resize failed (no idea why yet), and neither `instance_extra` nor the
  `request_spec` data was reverted correctly:

      $ openstack server migration list --server 982c3ead-59b1-4acd-876b-d55166d8e7f0
      +--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+
      |     Id | UUID                      | Source Node               | Dest Node                 | Source Compute | Dest Compute | Dest Host    | Status    | Server UUID               | Old Flavor | New Flavor | Type           | Created At                | Updated At                 |
      +--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+
      | 138991 | 7d0f464b-2fea-49b0-87e1-  | None                      | None                      | compute-a2b1   | None         | None         | error     | 982c3ead-59b1-4acd-876b-  |        982 |        982 | live-migration | 2024-01-                  | 2024-01-09T09:56:50.000000 |
      |        | 596624ca03cc              |                           |                           |                |              |              |           | d55166d8e7f0              |            |            |                | 09T09:56:45.000000        |                            |
      | 138931 | e1eeef36-f4e9-4f2a-adc6-  | compute-                  | compute-                  | compute-a2b1   | compute-t2b2 | XXXXXXXXXXXX | error     | 982c3ead-59b1-4acd-876b-  |        982 |        831 | resize         | 2024-01-                  | 2024-01-08T14:11:14.000000 |
      |        | 985413cea4ed              | a2b1.cld.domain.tld       | t2b2.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 08T14:11:13.000000        |                            |
      | 138062 | c12a2ae0-969f-454f-b6d0-  | compute-                  | compute-                  | compute-t2a3   | compute-a2b1 | XXXXXXXXXXXX | confirmed | 982c3ead-59b1-4acd-876b-  |        831 |        982 | resize         | 2023-12-                  | 2023-12-04T21:18:04.000000 |
      |        | fa79381ba29f              | t2a3.cld.domain.tld       | a2b1.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 04T21:17:44.000000        |                            |
      | 137966 | 12e3cffe-caeb-44ce-       | compute-                  | compute-                  | compute-t2c1   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-12-                  | 2023-12-02T16:35:19.000000 |
      |        | ac5a-baa0aa17d6e1         | t2c1.cld.domain.tld       | t2a3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 02T16:14:13.000000        |                            |
      | 137732 | 2ce96b71-143e-46cf-a7b2-  | compute-                  | compute-                  | compute-t2a3   | compute-t2c1 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-12-                  | 2023-12-01T09:24:13.000000 |
      |        | 822b2308f60a              | t2a3.cld.domain.tld       | t2c1.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 01T09:23:32.000000        |                            |
      | 137286 | 1c6c0e7d-cf33-4522-9710-  | compute-                  | compute-                  | compute-t2c3   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-30T00:34:06.000000 |
      |        | e372392f3dad              | t2c3.cld.domain.tld       | t2a3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 29T23:42:59.000000        |                            |
      | 137013 | 4afff1ec-08fd-4995-8642-  | compute-                  | compute-                  | compute-t2a3   | compute-t2c3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-28T09:55:35.000000 |
      |        | b8341a169efb              | t2a3.cld.domain.tld       | t2c3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 28T09:54:53.000000        |                            |
      | 135478 | cf28ff61-87d7-49d8-97b5-  | compute-                  | compute-                  | compute-t2c3   | compute-t2a3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-09T06:52:46.000000 |
      |        | 10382163d158              | t2c3.cld.domain.tld       | t2a3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 09T06:48:44.000000        |                            |
      | 135244 | 30f87d2e-ffc5-43ea-       | compute-                  | compute-                  | compute-t2a3   | compute-t2c3 | None         | completed | 982c3ead-59b1-4acd-876b-  |        831 |        831 | live-migration | 2023-11-                  | 2023-11-08T20:49:47.000000 |
      |        | ae16-ade2c9a553b3         | t2a3.cld.domain.tld       | t2c3.cld.domain.tld       |                |              |              |           | d55166d8e7f0              |            |            |                | 08T20:49:01.000000        |                            |
      +--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+

  Yet even the later live-migration attempt lists the correct flavor ID.

  My concern is not so much the bug that the data is inconsistent,
  especially after failures. We know that this happens often with every
  OpenStack version, and we have had to fix the database many times
  before.

  Our problem here is the complexity of fixing the inconsistencies,
  because most of the affected data is stored as serialized Python
  objects.

  Are there any tools or commands, automatic or manual, to check and fix
  these request spec data inconsistencies? Maybe something similar to
  the `nova-manage placement heal_allocations` command?

  At the moment, I cannot even tell how many instances are affected. We
  also use scheduler filters to isolate users and projects between
  hosts, even where the hosts are technically compatible. Inconsistent
  data like this would therefore not fail a live migration but would
  break our security and isolation boundaries. I need to check each
  instance manually.
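
  The per-instance check could be scripted along these lines. Again,
  this is only a sketch under assumptions, not a supported tool: the
  helper names are hypothetical, the inputs are assumed to have been
  exported beforehand (uuid and instance_type_id from `nova.instances`,
  instance_uuid and spec from the API database's `request_specs`), and
  it assumes that the flavor "id" inside the serialized spec is the same
  number as `instance_type_id`, as it is in the rows above.

```python
import json


def spec_flavor_id(spec_json):
    """Return the flavor ID embedded in a serialized RequestSpec, or None."""
    spec = json.loads(spec_json)
    flavor = spec.get("nova_object.data", {}).get("flavor")
    if not flavor:
        return None
    return flavor.get("nova_object.data", {}).get("id")


def find_flavor_mismatches(instance_type_ids, specs):
    """Compare instance flavor IDs with the ones stored in request specs.

    instance_type_ids: dict mapping instance UUID -> instance_type_id
    specs:             dict mapping instance UUID -> request spec JSON
    Returns a sorted list of (uuid, instance_type_id, spec_flavor_id).
    """
    mismatches = []
    for uuid, spec_json in specs.items():
        expected = instance_type_ids.get(uuid)
        found = spec_flavor_id(spec_json)
        if expected is None or found is None:
            continue  # nothing to compare against
        if expected != found:
            mismatches.append((uuid, expected, found))
    return sorted(mismatches)


# Illustrative data modelled on the failing instance in this report.
uuid = "982c3ead-59b1-4acd-876b-d55166d8e7f0"
instance_type_ids = {uuid: 982}
specs = {
    uuid: json.dumps({
        "nova_object.name": "RequestSpec",
        "nova_object.data": {
            "instance_uuid": uuid,
            "flavor": {"nova_object.data": {"id": 831, "name": "e2.medium"}},
        },
    }),
}

for row in find_flavor_mismatches(instance_type_ids, specs):
    print("instance %s: instance_type_id=%s, request_spec flavor=%s" % row)
```

  Such a script would only report the mismatches. Repairing the
  serialized objects is a separate and riskier step.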

  Steps to reproduce
  ==================

  I cannot provide commands that produce this data inconsistency yet,
  but when manually introduced to a system, live migrations can fail
  because the scheduler makes wrong decisions.

  Expected result
  ===============

  All places where nova stores the flavor details should be kept in
  sync, or at least be fixable/resyncable on failures.

  Actual result
  =============

  The scheduler is run with wrong data, violating scheduling constraints
  such as compatibility, security, and isolation boundaries. In the case
  of compatibility, other operations, such as live migrations, will
  fail. In other cases, there may be no apparent error.

  Environment
  ===========

  1. Exact version of OpenStack you are running. See the following
     list for all releases: http://docs.openstack.org/releases/

      ii  nova-common                      3:25.2.1-0ubuntu1                                    all          OpenStack Compute - common files
      ii  nova-conductor                   3:25.2.1-0ubuntu1                                    all          OpenStack Compute - conductor service
      ii  nova-scheduler                   3:25.2.1-0ubuntu1                                    all          OpenStack Compute - virtual machine scheduler
      ii  nova-spiceproxy                  3:25.2.1-0ubuntu1                                    all          OpenStack Compute - spice html5 proxy
      ii  python3-nova                     3:25.2.1-0ubuntu1                                    all          OpenStack Compute Python 3 libraries

  2. Which hypervisor did you use?
     Libvirt + KVM

  3. Which storage type did you use?
     Ceph 17.2.7-1focal, local qcow2 disks

  4. Which networking type did you use?
     Neutron ML2/LXB

  Logs & Configs
  ==============

  The tool *sosreport* has support for some OpenStack projects.
  It's worth having a look at it. For example, if you want to collect
  the logs of a compute node you would execute:

     $ sudo sosreport -o openstack_nova --batch

  on that compute node. Attach the logs to this bug report. Please
  consider that these logs need to be collected in "DEBUG" mode.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049030/+subscriptions