[Bug 2049030] [NEW] request_spec out of sync with instance details
Public bug reported:
Description
===========
We run OpenStack Zed based on the Ubuntu Cloud Archive packages. We
regularly live-migrate instances between hypervisors for host
maintenance, but suddenly had issues with many instances. Nova
correctly refused to migrate an instance from an AMD CPU to an Intel
CPU. We have different flavors for each CPU kind and scheduler filters
that limit these flavors to specific hosts, but here the scheduler
failed.
I picked one failing instance and started debugging. The flavor was
`a2.medium` (ID: 982), a flavor running on AMD CPUs, but our scheduler
filter was called with a request spec containing an `e2.medium` (ID:
831) flavor (Intel CPU). It therefore filtered for the wrong hosts, and
Nova aborted the live migration because of an invalid target.
The request spec the filter received looked like this:
RequestSpec(
availability_zone=None,
flavor=Flavor(831),
force_hosts=None,
force_nodes=None,
id=485269,
ignore_hosts=[...],
image=ImageMeta(35c8d5e3-2791-4565-9c19-291869fde98d),
instance_group=None,
instance_uuid=982c3ead-59b1-4acd-876b-d55166d8e7f0,
is_bfv=False,
limits=SchedulerLimits,
network_metadata=NetworkMetadata,
num_instances=1,
numa_topology=None,
pci_requests=InstancePCIRequests,
project_id='16e980bb63b4415694dd2130f5977b8b',
request_level_params=RequestLevelParams,
requested_destination=Destination,
requested_networks=NetworkRequestList,
requested_resources=[],
retry=None,
scheduler_hints={},
security_groups=SecurityGroupList,
user_id='d7451e969b3f4229bd1869ed9ad591f3'
)
Despite that, the instance itself is listed with the correct flavor:
$ openstack server show 982c3ead-59b1-4acd-876b-d55166d8e7f0 | grep flavor
| flavor | disk='80', ephemeral='0', original_name='a2.medium', ram='8192', swap='0', vcpus='4'
Here is an excerpt of our flavors:
MariaDB [novaapi]> select id,flavorid,name,vcpus,memory_mb,root_gb from flavors where name LIKE 'a2.%' OR name LIKE 'e2.%';
+------+----------+------------+-------+-----------+---------+
| id | flavorid | name | vcpus | memory_mb | root_gb |
+------+----------+------------+-------+-----------+---------+
| 807 | 022030 | e2.micro | 2 | 2048 | 25 |
| 813 | 022010 | e2.nano | 1 | 512 | 10 |
| 819 | 022060 | e2.large | 4 | 12288 | 40 |
| 822 | 022020 | e2.tiny | 2 | 1024 | 20 |
| 825 | 022070 | e2.xlarge | 4 | 16384 | 60 |
| 828 | 022080 | e2.2xlarge | 8 | 32768 | 80 |
| 831 | 022050 | e2.medium | 4 | 8192 | 25 |
| 834 | 022040 | e2.small | 2 | 4096 | 25 |
| 980 | 022090 | e2.4xlarge | 16 | 65536 | 160 |
| 982 | 026050 | a2.medium | 4 | 8192 | 80 |
| 1003 | 026040 | a2.small | 2 | 4096 | 40 |
| 1005 | 026060 | a2.large | 6 | 12288 | 120 |
| 1008 | 026070 | a2.xlarge | 8 | 16384 | 160 |
| 1011 | 026090 | a2.4xlarge | 32 | 65536 | 640 |
| 1014 | 026080 | a2.2xlarge | 16 | 32768 | 320 |
+------+----------+------------+-------+-----------+---------+
The instance also shows the correct flavor in Horizon, the CLI, and
the MySQL database (instance_type_id). It is running (according to the
database, and in fact) on the correct compute host (compute-a2b1).
MariaDB [nova]> select * from instances where uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
*************************** 1. row ***************************
...
launched_on: compute-t2a3
instance_type_id: 982 <==========================================
uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
...
node: compute-a2b1.cld.domain.tld
Yet nova-scheduler runs the filter with the wrong request_spec object
shown above. I followed the `request_spec` via source code and print
statements through nova-scheduler and nova-conductor to nova-api, where
it is loaded in
https://github.com/openstack/nova/blob/stable/zed/nova/compute/api.py#L5496.
Here, the request spec is already loaded with the wrong flavor ID (831
instead of 982). A look at the database confirmed that:
MariaDB [novaapi]> select * from request_specs where instance_uuid = '7f83337d-88a9-4f49-a4b0-cc0495ea698a' \G
*************************** 1. row ***************************
created_at: 2024-01-10 11:19:30
updated_at: NULL
id: 499312
instance_uuid: 7f83337d-88a9-4f49-a4b0-cc0495ea698a
spec: {
"nova_object.name": "RequestSpec",
"nova_object.namespace": "nova",
"nova_object.version": "1.14",
"nova_object.data": {
"image": {
"nova_object.name": "ImageMeta",
"nova_object.namespace": "nova",
"nova_object.version": "1.8",
"nova_object.data": {
"id": "4cd37a9e-7bd6-443d-83f5-1b96f7ff005d",
"name": "ubuntu-22.04",
"status": "active",
"checksum": "f4a9b90d378d90fdbf66b2ad3afe4da7",
"owner": "b18c2da2dbfa45138fb6077eafb2aa51",
"size": 2361393152,
"container_format": "bare",
"disk_format": "raw",
"created_at": "2023-12-12T02:06:42Z",
"updated_at": "2023-12-12T02:11:15Z",
"min_ram": 128,
"min_disk": 5,
"properties": {
"nova_object.name": "ImageMetaProps",
"nova_object.namespace": "nova",
"nova_object.version": "1.31",
"nova_object.data": {
"hw_architecture": "x86_64",
"hw_disk_bus": "scsi",
"hw_firmware_type": "uefi",
"hw_qemu_guest_agent": true,
"hw_scsi_model": "virtio-scsi",
"hw_vm_mode": "hvm",
"img_hv_type": "kvm",
"os_admin_user": "ubuntu",
"os_distro": "ubuntu",
"os_require_quiesce": true,
"os_type": "linux"
},
"nova_object.changes": [
"hw_disk_bus",
"hw_architecture",
"hw_vm_mode",
"os_type",
"hw_qemu_guest_agent",
"os_admin_user",
"os_distro",
"hw_firmware_type",
"hw_scsi_model",
"os_require_quiesce",
"img_hv_type"
]
}
},
"nova_object.changes": [
"updated_at",
"min_ram",
"size",
"id",
"properties",
"status",
"disk_format",
"created_at",
"name",
"owner",
"checksum",
"min_disk",
"container_format"
]
},
"numa_topology": null,
"pci_requests": {
"nova_object.name": "InstancePCIRequests",
"nova_object.namespace": "nova",
"nova_object.version": "1.1",
"nova_object.data": { "requests": [] },
"nova_object.changes": ["requests"]
},
"project_id": "61454faef1234faa86673d8b7760938a",
"user_id": "13ef2628df2b4eba875934d148d2cd26",
"availability_zone": null,
"flavor": {
"nova_object.name": "Flavor",
"nova_object.namespace": "nova",
"nova_object.version": "1.2",
"nova_object.data": {
"id": 982,
"name": "a2.medium",
"memory_mb": 8192,
"vcpus": 4,
"root_gb": 80,
"ephemeral_gb": 0,
"flavorid": "026050",
"swap": 0,
"rxtx_factor": 1.0,
"vcpu_weight": 0,
"disabled": false,
"is_public": true,
"extra_specs": {
"hw:cpu_max_sockets": "1",
"hw:cpu_policy": "shared",
"os:secure_boot": "disabled",
"quota:cpu_shares": "400"
},
"description": null,
"created_at": "2023-03-29T15:06:03Z",
"updated_at": null,
"deleted_at": null,
"deleted": false
},
"nova_object.changes": ["extra_specs"]
},
"num_instances": 1,
"ignore_hosts": null,
"force_hosts": null,
"force_nodes": null,
"requested_destination": null,
"retry": null,
"limits": {
"nova_object.name": "SchedulerLimits",
"nova_object.namespace": "nova",
"nova_object.version": "1.0",
"nova_object.data": {
"numa_topology": null,
"vcpu": null,
"disk_gb": null,
"memory_mb": null
},
"nova_object.changes": ["vcpu", "numa_topology", "disk_gb", "memory_mb"]
},
"instance_group": null,
"scheduler_hints": {},
"instance_uuid": "7f83337d-88a9-4f49-a4b0-cc0495ea698a",
"security_groups": {
"nova_object.name": "SecurityGroupList",
"nova_object.namespace": "nova",
"nova_object.version": "1.1",
"nova_object.data": {
"objects": [
{
"nova_object.name": "SecurityGroup",
"nova_object.namespace": "nova",
"nova_object.version": "1.2",
"nova_object.data": {
"uuid": "5d94b29d-fdb1-4633-92df-8fa47e79864b"
},
"nova_object.changes": ["uuid"]
}
]
},
"nova_object.changes": ["objects"]
},
"is_bfv": false,
"requested_resources": []
},
"nova_object.changes": [
"image",
"is_bfv",
"requested_destination",
"security_groups",
"force_nodes",
"num_instances",
"retry",
"numa_topology",
"instance_group",
"limits",
"instance_uuid",
"availability_zone",
"user_id",
"requested_resources",
"force_hosts",
"ignore_hosts",
"project_id",
"pci_requests",
"scheduler_hints",
"flavor"
]
}
1 row in set (0.000 sec)
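For a quick check of a single instance, the same comparison can also be
done through Nova's own object layer instead of raw SQL, which takes
care of deserializing the spec. This is a minimal, untested sketch, not
an official tool; it assumes a controller node with a full
/etc/nova/nova.conf (API and cell database access) and a single-cell
deployment:
    # compare_spec_flavor.py - hypothetical helper, run on a controller node
    from nova import config, context, objects

    config.parse_args([], default_config_files=['/etc/nova/nova.conf'])
    objects.register_all()

    ctxt = context.get_admin_context()
    uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0'

    # The RequestSpec lives in the API database, the Instance in the cell DB.
    spec = objects.RequestSpec.get_by_instance_uuid(ctxt, uuid)
    inst = objects.Instance.get_by_uuid(ctxt, uuid, expected_attrs=['flavor'])

    print('request_spec flavor:', spec.flavor.flavorid, spec.flavor.name)
    print('instance flavor:    ', inst.flavor.flavorid, inst.flavor.name)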
I further found the `instance_extra` data to be out of sync, with a
stale "new" flavor present:
MariaDB [nova]> select * from instance_extra where instance_uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0' \G
*************************** 1. row ***************************
created_at: 2023-11-06 14:44:23
updated_at: 2024-01-10 11:26:50
deleted_at: NULL
deleted: 0
id: 442795
instance_uuid: 982c3ead-59b1-4acd-876b-d55166d8e7f0
numa_topology: NULL
pci_requests: []
flavor: {
"cur": {
"nova_object.name": "Flavor",
"nova_object.namespace": "nova",
"nova_object.version": "1.2",
"nova_object.data": {
"id": 982,
"name": "a2.medium",
"memory_mb": 8192,
"vcpus": 4,
"root_gb": 80,
"ephemeral_gb": 0,
"flavorid": "026050",
"swap": 0,
"rxtx_factor": 1.0,
"vcpu_weight": 0,
"disabled": false,
"is_public": true,
"extra_specs": {
"hw:cpu_max_sockets": "1",
"hw:cpu_policy": "shared",
"os:secure_boot": "disabled",
"quota:cpu_shares": "400"
},
"description": null,
"created_at": "2023-03-29T15:06:03Z",
"updated_at": null,
"deleted_at": null,
"deleted": false
},
"nova_object.changes": ["extra_specs"]
},
"old": null,
"new": {
"nova_object.name": "Flavor",
"nova_object.namespace": "nova",
"nova_object.version": "1.2",
"nova_object.data": {
"id": 831,
"name": "e2.medium",
"memory_mb": 8192,
"vcpus": 4,
"root_gb": 25,
"ephemeral_gb": 0,
"flavorid": "022050",
"swap": 0,
"rxtx_factor": 1.0,
"vcpu_weight": 0,
"disabled": false,
"is_public": true,
"extra_specs": {
"hw:cpu_max_sockets": "1",
"hw:cpu_policy": "shared",
"quota:cpu_shares": "400"
},
"description": null,
"created_at": "2022-08-03T11:11:49Z",
"updated_at": null,
"deleted_at": null,
"deleted": false
},
"nova_object.changes": ["extra_specs"]
}
}
device_metadata: NULL
trusted_certs: NULL
vpmems: NULL
resources: NULL
1 row in set (0.000 sec)
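The stale resize state is also visible through the object layer, since
`expected_attrs=['flavor']` loads the cur/old/new flavor triple from
`instance_extra`. A sketch under the same assumptions as the snippet
above:
    # show_stale_flavor.py - hypothetical helper, same assumptions as above
    from nova import config, context, objects

    config.parse_args([], default_config_files=['/etc/nova/nova.conf'])
    objects.register_all()

    ctxt = context.get_admin_context()
    inst = objects.Instance.get_by_uuid(
        ctxt, '982c3ead-59b1-4acd-876b-d55166d8e7f0',
        expected_attrs=['flavor'])

    print('cur:', inst.flavor.flavorid)                           # 026050
    print('old:', inst.old_flavor and inst.old_flavor.flavorid)   # None
    print('new:', inst.new_flavor and inst.new_flavor.flavorid)   # stale 022050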
It appears that a user previously tried to resize the instance, which
failed (no idea why yet), and neither the `instance_extra` nor the
`request_spec` data was reverted correctly:
$ openstack server migration list --server 982c3ead-59b1-4acd-876b-d55166d8e7f0
+--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+
| Id | UUID | Source Node | Dest Node | Source Compute | Dest Compute | Dest Host | Status | Server UUID | Old Flavor | New Flavor | Type | Created At | Updated At |
+--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+
| 138991 | 7d0f464b-2fea-49b0-87e1- | None | None | compute-a2b1 | None | None | error | 982c3ead-59b1-4acd-876b- | 982 | 982 | live-migration | 2024-01- | 2024-01-09T09:56:50.000000 |
| | 596624ca03cc | | | | | | | d55166d8e7f0 | | | | 09T09:56:45.000000 | |
| 138931 | e1eeef36-f4e9-4f2a-adc6- | compute- | compute- | compute-a2b1 | compute-t2b2 | XXXXXXXXXXXX | error | 982c3ead-59b1-4acd-876b- | 982 | 831 | resize | 2024-01- | 2024-01-08T14:11:14.000000 |
| | 985413cea4ed | a2b1.cld.domain.tld | t2b2.cld.domain.tld | | | | | d55166d8e7f0 | | | | 08T14:11:13.000000 | |
| 138062 | c12a2ae0-969f-454f-b6d0- | compute- | compute- | compute-t2a3 | compute-a2b1 | XXXXXXXXXXXX | confirmed | 982c3ead-59b1-4acd-876b- | 831 | 982 | resize | 2023-12- | 2023-12-04T21:18:04.000000 |
| | fa79381ba29f | t2a3.cld.domain.tld | a2b1.cld.domain.tld | | | | | d55166d8e7f0 | | | | 04T21:17:44.000000 | |
| 137966 | 12e3cffe-caeb-44ce- | compute- | compute- | compute-t2c1 | compute-t2a3 | None | completed | 982c3ead-59b1-4acd-876b- | 831 | 831 | live-migration | 2023-12- | 2023-12-02T16:35:19.000000 |
| | ac5a-baa0aa17d6e1 | t2c1.cld.domain.tld | t2a3.cld.domain.tld | | | | | d55166d8e7f0 | | | | 02T16:14:13.000000 | |
| 137732 | 2ce96b71-143e-46cf-a7b2- | compute- | compute- | compute-t2a3 | compute-t2c1 | None | completed | 982c3ead-59b1-4acd-876b- | 831 | 831 | live-migration | 2023-12- | 2023-12-01T09:24:13.000000 |
| | 822b2308f60a | t2a3.cld.domain.tld | t2c1.cld.domain.tld | | | | | d55166d8e7f0 | | | | 01T09:23:32.000000 | |
| 137286 | 1c6c0e7d-cf33-4522-9710- | compute- | compute- | compute-t2c3 | compute-t2a3 | None | completed | 982c3ead-59b1-4acd-876b- | 831 | 831 | live-migration | 2023-11- | 2023-11-30T00:34:06.000000 |
| | e372392f3dad | t2c3.cld.domain.tld | t2a3.cld.domain.tld | | | | | d55166d8e7f0 | | | | 29T23:42:59.000000 | |
| 137013 | 4afff1ec-08fd-4995-8642- | compute- | compute- | compute-t2a3 | compute-t2c3 | None | completed | 982c3ead-59b1-4acd-876b- | 831 | 831 | live-migration | 2023-11- | 2023-11-28T09:55:35.000000 |
| | b8341a169efb | t2a3.cld.domain.tld | t2c3.cld.domain.tld | | | | | d55166d8e7f0 | | | | 28T09:54:53.000000 | |
| 135478 | cf28ff61-87d7-49d8-97b5- | compute- | compute- | compute-t2c3 | compute-t2a3 | None | completed | 982c3ead-59b1-4acd-876b- | 831 | 831 | live-migration | 2023-11- | 2023-11-09T06:52:46.000000 |
| | 10382163d158 | t2c3.cld.domain.tld | t2a3.cld.domain.tld | | | | | d55166d8e7f0 | | | | 09T06:48:44.000000 | |
| 135244 | 30f87d2e-ffc5-43ea- | compute- | compute- | compute-t2a3 | compute-t2c3 | None | completed | 982c3ead-59b1-4acd-876b- | 831 | 831 | live-migration | 2023-11- | 2023-11-08T20:49:47.000000 |
| | ae16-ade2c9a553b3 | t2a3.cld.domain.tld | t2c3.cld.domain.tld | | | | | d55166d8e7f0 | | | | 08T20:49:01.000000 | |
+--------+---------------------------+---------------------------+---------------------------+----------------+--------------+--------------+-----------+---------------------------+------------+------------+----------------+---------------------------+----------------------------+
Yet even the live migration attempted later lists the correct flavor
ID.
My concern is not so much the bug that data becomes inconsistent,
especially on failures; we have seen this with every OpenStack version
and have had to fix the database many times before. Our problem here is
the complexity of fixing the inconsistencies, because most of the data
is stored as serialized Python objects.
Are there any tools or commands, automatic or manual, to check and fix
these request spec inconsistencies? Maybe something similar to the
`nova-manage placement heal_allocations` command?
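Lacking such a tool, the closest workaround I can think of is writing
the instance's current flavor back through the object layer, which
handles the serialization. To be clear, this is only an untested idea,
not a verified procedure; take a database backup first:
    # heal_request_spec.py - hypothetical repair sketch, NOT a verified tool
    from nova import config, context, objects

    config.parse_args([], default_config_files=['/etc/nova/nova.conf'])
    objects.register_all()

    ctxt = context.get_admin_context()
    uuid = '982c3ead-59b1-4acd-876b-d55166d8e7f0'

    inst = objects.Instance.get_by_uuid(ctxt, uuid, expected_attrs=['flavor'])
    spec = objects.RequestSpec.get_by_instance_uuid(ctxt, uuid)

    # Copy the instance's real flavor back into the stored request spec ...
    spec.flavor = inst.flavor
    spec.save()

    # ... and drop the stale new_flavor left over from the failed resize.
    if inst.new_flavor is not None:
        inst.new_flavor = None
        inst.save()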
At the moment, I cannot even tell how many instances are affected. We
also use scheduler filters to isolate users and projects on separate
hosts, even where hosts are technically compatible. Inconsistent data
like this would therefore not fail a live migration, but would silently
violate our security and isolation boundaries. I need to manually check
each instance.
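To at least estimate the blast radius, the single-instance check above
could be looped over all instances. A hedged, read-only sketch under
the same assumptions (controller node, single cell):
    # audit_request_specs.py - hypothetical read-only audit sketch
    from nova import config, context, exception, objects

    config.parse_args([], default_config_files=['/etc/nova/nova.conf'])
    objects.register_all()

    ctxt = context.get_admin_context()
    for inst in objects.InstanceList.get_all(ctxt, expected_attrs=['flavor']):
        try:
            spec = objects.RequestSpec.get_by_instance_uuid(ctxt, inst.uuid)
        except exception.RequestSpecNotFound:
            continue  # e.g. instances predating request specs
        if spec.flavor.flavorid != inst.flavor.flavorid:
            print(inst.uuid, 'spec:', spec.flavor.name,
                  'instance:', inst.flavor.name)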
Steps to reproduce
==================
I cannot yet provide commands that reproduce this data inconsistency,
but once it is present in a system, live migrations can fail because
the scheduler makes wrong decisions.
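The inconsistency itself can be injected deliberately to reproduce the
wrong scheduling, though obviously only against a throwaway instance in
a test environment. A hedged sketch:
    # inject_wrong_flavor.py - hypothetical reproducer, test environments only
    from nova import config, context, objects

    config.parse_args([], default_config_files=['/etc/nova/nova.conf'])
    objects.register_all()

    ctxt = context.get_admin_context()
    uuid = '<test-instance-uuid>'  # placeholder

    # Overwrite the stored request spec with a flavor the instance doesn't have.
    spec = objects.RequestSpec.get_by_instance_uuid(ctxt, uuid)
    spec.flavor = objects.Flavor.get_by_flavor_id(ctxt, '022050')  # e2.medium
    spec.save()

    # A subsequent live migration should now be scheduled against hosts that
    # match e2.medium instead of the instance's real flavor.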
Expected result
===============
All places where Nova stores the flavor details should be kept in
sync, or should at least be fixable/resyncable after failures.
Actual result
=============
The scheduler runs its filters with wrong data, violating scheduling
constraints such as compatibility, security, and isolation boundaries.
In the case of compatibility, other operations, such as live
migrations, will fail; in other cases, there may be no apparent error
at all.
Environment
===========
1. Exact version of OpenStack you are running. See the following
list for all releases: http://docs.openstack.org/releases/
ii nova-common 3:25.2.1-0ubuntu1 all OpenStack Compute - common files
ii nova-conductor 3:25.2.1-0ubuntu1 all OpenStack Compute - conductor service
ii nova-scheduler 3:25.2.1-0ubuntu1 all OpenStack Compute - virtual machine scheduler
ii nova-spiceproxy 3:25.2.1-0ubuntu1 all OpenStack Compute - spice html5 proxy
ii python3-nova 3:25.2.1-0ubuntu1 all OpenStack Compute Python 3 libraries
2. Which hypervisor did you use?
Libvirt + KVM
3. Which storage type did you use?
Ceph 17.2.7-1focal, local qcow2 disks
4. Which networking type did you use?
Neutron ML2/LXB
Logs & Configs
==============
The tool *sosreport* has support for some OpenStack projects.
It's worth having a look at it. For example, if you want to collect
the logs of a compute node you would execute:
$ sudo sosreport -o openstack_nova --batch
on that compute node. Attach the logs to this bug report. Please
consider that these logs need to be collected in "DEBUG" mode.
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2049030
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049030/+subscriptions