yahoo-eng-team team mailing list archive: Message #78643
[Bug 1830747] Re: Error 500 trying to migrate an instance after wrong request_spec
This might explain what's happening during a cold migration.
Conductor creates a legacy filter_properties dict here:
https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/tasks/migrate.py#L172
If the spec has an instance_group it will call here:
https://github.com/openstack/nova/blob/stable/rocky/nova/objects/request_spec.py#L397
and _to_legacy_group_info sets these values in the filter_properties
dict:
    return {'group_updated': True,
            'group_hosts': set(self.instance_group.hosts),
            'group_policies': set([self.instance_group.policy]),
            'group_members': set(self.instance_group.members)}
Note there is no group_uuid.
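For an anti-affinity group like the one in this bug, the resulting legacy
dict therefore looks roughly like this (illustrative values only):

    # Illustration only, values made up: the legacy filter_properties
    # carry hosts/policies/members but no group_uuid key, so the group's
    # uuid cannot be recovered from this dict later.
    filter_properties = {
        'group_updated': True,
        'group_hosts': {'compute-1', 'compute-2'},
        'group_policies': {'anti-affinity'},
        'group_members': {'2098b550-c749-460a-a44e-5932535993a9'},
        # ...plus whatever other legacy keys (retry, limits, ...) are set
    }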
Those filter_properties are passed to the prep_resize method on the dest
compute:
https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/tasks/migrate.py#L304
zigo said he hit this:
https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L4272
(10:03:07 AM) zigo: 2019-05-28 15:02:35.534 30706 ERROR
nova.compute.manager [instance: ae6f8afe-9c64-4aaf-90e8-be8175fee8e4]
nova.exception.UnableToMigrateToSelf: Unable to migrate instance
(ae6f8afe-9c64-4aaf-90e8-be8175fee8e4) to current host
(clint1-compute-5.infomaniak.ch).
which will trigger a reschedule here:
https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L4348
The _reschedule_resize_or_reraise method will set up the parameters for
the resize_instance compute task RPC API (conductor) method:
https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L4378-L4379
Note that in Rocky the RequestSpec is not passed back to conductor on
the reschedule, only the filter_properties:
https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L1452
We only started passing the RequestSpec from compute to conductor on
reschedule in Stein: https://review.opendev.org/#/c/582417/
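Very roughly, the Rocky reschedule path looks like this (a simplified
sketch, not the exact signatures):

    # Conceptual sketch only; the real logic lives in
    # _reschedule_resize_or_reraise / _reschedule in nova/compute/manager.py.
    # The point is that only the legacy filter_properties travel back to
    # conductor, wrapped in a scheduler_hint dict.
    scheduler_hint = {'filter_properties': filter_properties}
    self.compute_task_api.resize_instance(
        context, instance,
        extra_instance_updates={},
        scheduler_hint=scheduler_hint,
        flavor=instance_type)
    # Note: no request_spec is passed here in Rocky; Stein added that.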
Without the request spec we get here in conductor:
https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/manager.py#L307
Note that we pass in the filter_properties but no instance_group to
RequestSpec.from_components.
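The call is approximately this (argument order from memory, treat it as
a sketch rather than the literal source):

    # Conductor-side rebuild when no RequestSpec came back with the
    # reschedule: the instance_group argument is None and the only group
    # information left is whatever sits inside filter_properties.
    request_spec = objects.RequestSpec.from_components(
        context, instance.uuid, image, instance.flavor,
        instance.numa_topology, instance.pci_requests,
        filter_properties, None,  # <- no instance_group
        instance.availability_zone)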
And because there is no instance_group but there are filter_properties,
we call _populate_group_info here:
https://github.com/openstack/nova/blob/stable/rocky/nova/objects/request_spec.py#L442
Which means we get into this block that sets the
RequestSpec.instance_group with no uuid:
https://github.com/openstack/nova/blob/stable/rocky/nova/objects/request_spec.py#L228
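Paraphrased, that legacy branch does roughly this (not the literal
Rocky source, but the gist):

    # Paraphrase of the group_updated branch in
    # RequestSpec._populate_group_info: the InstanceGroup is rebuilt purely
    # from the legacy group_* keys, so nothing is ever assigned to its
    # uuid field.
    if filter_properties.get('group_updated') is True:
        policies = list(filter_properties.get('group_policies'))
        self.instance_group = objects.InstanceGroup(
            policy=policies[0],
            hosts=list(filter_properties.get('group_hosts')),
            members=list(filter_properties.get('group_members')))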
Then we eventually RPC cast to prep_resize on the next host to retry
the cold migration, and save the request_spec changes here:
https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/manager.py#L356
Which is how later attempts to use that request spec to migrate the
instance blow up when loading it from the DB because
spec.instance_group.uuid is not set.
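For anyone wondering why an unset uuid blows up instead of just reading
as None: with oslo.versionedobjects, reading a declared field that was
never set (and has no lazy-load hook) raises. A minimal, generic
illustration of that mechanism (this is not nova's InstanceGroup):

    # Generic oslo.versionedobjects example with made-up names (FakeGroup
    # is hypothetical). Reading the unset 'uuid' field raises instead of
    # returning None, which is the same mechanism that breaks the rebuilt
    # RequestSpec.instance_group.
    from oslo_versionedobjects import base, fields

    @base.VersionedObjectRegistry.register
    class FakeGroup(base.VersionedObject):
        fields = {
            'uuid': fields.UUIDField(),
            'policy': fields.StringField(),
        }

    group = FakeGroup(policy='anti-affinity')  # uuid deliberately left unset
    group.policy  # -> 'anti-affinity'
    group.uuid    # NotImplementedError: Cannot load 'uuid' in the base class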
** Changed in: nova
Importance: Undecided => High
** Also affects: nova/queens
Importance: Undecided
Status: New
** Also affects: nova/stein
Importance: Undecided
Status: New
** Also affects: nova/ocata
Importance: Undecided
Status: New
** Also affects: nova/rocky
Importance: Undecided
Status: New
** Also affects: nova/pike
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1830747
Title:
Error 500 trying to migrate an instance after wrong request_spec
Status in OpenStack Compute (nova):
In Progress
Status in OpenStack Compute (nova) ocata series:
New
Status in OpenStack Compute (nova) pike series:
New
Status in OpenStack Compute (nova) queens series:
New
Status in OpenStack Compute (nova) rocky series:
New
Status in OpenStack Compute (nova) stein series:
New
Bug description:
We started an instance last Wednesday, and the compute node where it ran
failed (maybe a hardware issue?). Since the networking looked wrong (i.e.
missing network interfaces), I tried to migrate the instance.
According to Matt, it looks like the request_spec entry for the
instance is wrong:
<mriedem> my guess is something like this happened: 1. create server in a group, 2. cold migrate the server which fails on host A and does a reschedule to host B which maybe also fails (would be good to know if previous cold migration attempts failed with reschedules), 3. try to cold migrate again which fails with the instance_group.uuid thing
<mriedem> the reschedule might be the key b/c like i said conductor has to rebuild a request spec and i think that's probably where we're doing a partial build of the request spec but missing the group uuid
Here's what I had in my nova_api DB:
{
"nova_object.name": "RequestSpec",
"nova_object.version": "1.11",
"nova_object.data": {
"ignore_hosts": null,
"requested_destination": null,
"instance_uuid": "2098b550-c749-460a-a44e-5932535993a9",
"num_instances": 1,
"image": {
"nova_object.name": "ImageMeta",
"nova_object.version": "1.8",
"nova_object.data": {
"min_disk": 40,
"disk_format": "raw",
"min_ram": 0,
"container_format": "bare",
"properties": {
"nova_object.name": "ImageMetaProps",
"nova_object.version": "1.20",
"nova_object.data": {},
"nova_object.namespace": "nova"
}
},
"nova_object.namespace": "nova",
"nova_object.changes": [
"properties",
"min_ram",
"container_format",
"disk_format",
"min_disk"
]
},
"availability_zone": "AZ3",
"flavor": {
"nova_object.name": "Flavor",
"nova_object.version": "1.2",
"nova_object.data": {
"id": 28,
"name": "cpu2-ram6-disk40",
"is_public": true,
"rxtx_factor": 1,
"deleted_at": null,
"root_gb": 40,
"vcpus": 2,
"memory_mb": 6144,
"disabled": false,
"extra_specs": {},
"updated_at": null,
"flavorid": "e29f3ee9-3f07-46d2-b2e2-efa4950edc95",
"deleted": false,
"swap": 0,
"description": null,
"created_at": "2019-02-07T07:48:21Z",
"vcpu_weight": 0,
"ephemeral_gb": 0
},
"nova_object.namespace": "nova"
},
"force_hosts": null,
"retry": null,
"instance_group": {
"nova_object.name": "InstanceGroup",
"nova_object.version": "1.11",
"nova_object.data": {
"members": null,
"hosts": null,
"policy": "anti-affinity"
},
"nova_object.namespace": "nova",
"nova_object.changes": [
"policy",
"members",
"hosts"
]
},
"scheduler_hints": {
"group": [
"295c99ea-2db6-469a-877f-454a3903a8d8"
]
},
"limits": {
"nova_object.name": "SchedulerLimits",
"nova_object.version": "1.0",
"nova_object.data": {
"disk_gb": null,
"numa_topology": null,
"memory_mb": null,
"vcpu": null
},
"nova_object.namespace": "nova",
"nova_object.changes": [
"disk_gb",
"vcpu",
"memory_mb",
"numa_topology"
]
},
"force_nodes": null,
"project_id": "1bf4dbb3d2c746658f462bf8e59ec6be",
"user_id": "255cca4584c24b16a684e3e8322b436b",
"numa_topology": null,
"is_bfv": false,
"pci_requests": {
"nova_object.name": "InstancePCIRequests",
"nova_object.version": "1.1",
"nova_object.data": {
"instance_uuid": "2098b550-c749-460a-a44e-5932535993a9",
"requests": []
},
"nova_object.namespace": "nova"
}
},
"nova_object.namespace": "nova",
"nova_object.changes": [
"ignore_hosts",
"requested_destination",
"num_instances",
"image",
"availability_zone",
"instance_uuid",
"flavor",
"scheduler_hints",
"pci_requests",
"instance_group",
"limits",
"project_id",
"user_id",
"numa_topology",
"is_bfv",
"retry"
]
}
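The tell-tale sign in that dump is that the serialized InstanceGroup has
no "uuid" key in its nova_object.data at all. A quick way to spot such
broken specs (illustrative snippet, not part of nova; it assumes you
dumped the spec column of the nova_api request_specs table to a file):

    # Check a serialized RequestSpec for an instance_group missing its uuid.
    import json, sys

    with open(sys.argv[1]) as f:
        spec = json.load(f)

    group = spec['nova_object.data'].get('instance_group')
    if group and 'uuid' not in group['nova_object.data']:
        print('broken: instance_group has no uuid')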
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1830747/+subscriptions