yahoo-eng-team team mailing list archive

[Bug 1830747] Re: Error 500 trying to migrate an instance after wrong request_spec

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Tue, 28 May 2019 15:52:34 -0000
Reply-to: Bug 1830747 <1830747@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
This might explain what's happening during a cold migration.

Conductor creates a legacy filter_properties dict here:

https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/tasks/migrate.py#L172

If the spec has an instance_group it will call here:

https://github.com/openstack/nova/blob/stable/rocky/nova/objects/request_spec.py#L397

and _to_legacy_group_info sets these values in the filter_properties
dict:

        return {'group_updated': True,
                'group_hosts': set(self.instance_group.hosts),
                'group_policies': set([self.instance_group.policy]),
                'group_members': set(self.instance_group.members)}

Note there is no group_uuid.
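
To make that concrete, here is a minimal standalone sketch. It is not the nova code: FakeGroup, to_legacy_group_info as a free function, and the host/member/uuid values are all made up for illustration. It only shows that the four legacy keys carry nothing that identifies the group:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeGroup:
    """Hypothetical stand-in for an InstanceGroup; not the nova object."""
    uuid: str
    policy: str
    hosts: List[str] = field(default_factory=list)
    members: List[str] = field(default_factory=list)


def to_legacy_group_info(group: FakeGroup) -> dict:
    # Same four keys as the snippet quoted above; the group uuid is dropped.
    return {'group_updated': True,
            'group_hosts': set(group.hosts),
            'group_policies': set([group.policy]),
            'group_members': set(group.members)}


# All values below are invented sample data.
group = FakeGroup(uuid='example-group-uuid',
                  policy='anti-affinity',
                  hosts=['host-a'],
                  members=['instance-1'])
legacy = to_legacy_group_info(group)

print(sorted(legacy))          # the keys that survive the conversion
print('group_uuid' in legacy)  # whether the uuid made it across
```

Anything that later rebuilds a group from this dict alone has no way to recover the uuid.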

Those filter_properties are passed to the prep_resize method on the dest
compute:

https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/tasks/migrate.py#L304

zigo said he hit this:

https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L4272

(10:03:07 AM) zigo: 2019-05-28 15:02:35.534 30706 ERROR
nova.compute.manager [instance: ae6f8afe-9c64-4aaf-90e8-be8175fee8e4]
nova.exception.UnableToMigrateToSelf: Unable to migrate instance
(ae6f8afe-9c64-4aaf-90e8-be8175fee8e4) to current host
(clint1-compute-5.infomaniak.ch).

which will trigger a reschedule here:

https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L4348

The _reschedule_resize_or_reraise method will set up the parameters for
the resize_instance compute task RPC API (conductor) method:

https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L4378-L4379

Note that in Rocky the RequestSpec is not passed back to conductor on
the reschedule, only the filter_properties:

https://github.com/openstack/nova/blob/stable/rocky/nova/compute/manager.py#L1452

We only started passing the RequestSpec from compute to conductor on
reschedule in Stein: https://review.opendev.org/#/c/582417/

Without the request spec we get here in conductor:

https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/manager.py#L307

Note that we pass in the filter_properties but no instance_group to
RequestSpec.from_components.

And because there is no instance_group but there are filter_properties,
we call _populate_group_info here:

https://github.com/openstack/nova/blob/stable/rocky/nova/objects/request_spec.py#L442

Which means we get into this block that sets the
RequestSpec.instance_group with no uuid:

https://github.com/openstack/nova/blob/stable/rocky/nova/objects/request_spec.py#L228

Then we eventually RPC cast off to prep_resize on the next host to try
for the cold migration and save the request_spec changes here:

https://github.com/openstack/nova/blob/stable/rocky/nova/conductor/manager.py#L356

Which is how later attempts to use that request spec to migrate the
instance blow up when loading it from the DB because
spec.instance_group.uuid is not set.
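
The whole sequence can be sketched roughly as below. This is a simplified simulation under stated assumptions, not the nova implementation: populate_group_info and load_group are hypothetical helpers, the request spec is modelled as a plain dict, the ValueError is a placeholder for whatever the real load path raises, and the host and member names are invented.

```python
def populate_group_info(filter_properties):
    """Rebuild group data from legacy filter_properties (simplified sketch)."""
    if not filter_properties.get('group_updated'):
        return None
    policies = list(filter_properties.get('group_policies', ()))
    return {
        'policy': policies[0] if policies else None,
        'hosts': sorted(filter_properties.get('group_hosts', ())),
        'members': sorted(filter_properties.get('group_members', ())),
        # No 'uuid' key: the legacy dict never carried one.
    }


def load_group(saved_spec):
    """Hypothetical loader that, like a later migrate, needs the group uuid."""
    group = saved_spec.get('instance_group')
    if group is not None and 'uuid' not in group:
        raise ValueError('instance_group.uuid is not set')
    return group


# On reschedule, only filter_properties come back to conductor in this
# sketch; there is no RequestSpec to reuse.
filter_properties = {'group_updated': True,
                     'group_hosts': {'host-a'},
                     'group_policies': {'anti-affinity'},
                     'group_members': {'instance-1'}}

# Conductor rebuilds the spec from the legacy dict and saves it.
saved_spec = {'instance_group': populate_group_info(filter_properties)}

# A later migration loads the saved spec and trips over the missing uuid.
try:
    load_group(saved_spec)
    failed = False
except ValueError as exc:
    failed = True
    print('later migrate fails:', exc)
```

The point of the sketch is that the damage is done at save time during the reschedule, while the failure only shows up on the next migrate attempt.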

** Changed in: nova
   Importance: Undecided => High

** Also affects: nova/queens
   Importance: Undecided
       Status: New

** Also affects: nova/stein
   Importance: Undecided
       Status: New

** Also affects: nova/ocata
   Importance: Undecided
       Status: New

** Also affects: nova/rocky
   Importance: Undecided
       Status: New

** Also affects: nova/pike
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1830747

Title:
  Error 500 trying to migrate an instance after wrong request_spec

Status in OpenStack Compute (nova):
  In Progress
Status in OpenStack Compute (nova) ocata series:
  New
Status in OpenStack Compute (nova) pike series:
  New
Status in OpenStack Compute (nova) queens series:
  New
Status in OpenStack Compute (nova) rocky series:
  New
Status in OpenStack Compute (nova) stein series:
  New

Bug description:
  We started an instance last Wednesday, and the compute node where it
  ran failed (maybe a hardware issue?). Since the networking looked wrong
  (i.e. missing network interfaces), I tried to migrate the instance.

  According to Matt, it looked like the request_spec entry for the
  instance is wrong:

  <mriedem> my guess is something like this happened: 1. create server in a group, 2. cold migrate the server which fails on host A and does a reschedule to host B which maybe also fails (would be good to know if previous cold migration attempts failed with reschedules), 3. try to cold migrate again which fails with the instance_group.uuid thing
  <mriedem> the reschedule might be the key b/c like i said conductor has to rebuild a request spec and i think that's probably where we're doing a partial build of the request spec but missing the group uuid

  Here's what I had in my novaapidb:

  {
    "nova_object.name": "RequestSpec",
    "nova_object.version": "1.11",
    "nova_object.data": {
      "ignore_hosts": null,
      "requested_destination": null,
      "instance_uuid": "2098b550-c749-460a-a44e-5932535993a9",
      "num_instances": 1,
      "image": {
        "nova_object.name": "ImageMeta",
        "nova_object.version": "1.8",
        "nova_object.data": {
          "min_disk": 40,
          "disk_format": "raw",
          "min_ram": 0,
          "container_format": "bare",
          "properties": {
            "nova_object.name": "ImageMetaProps",
            "nova_object.version": "1.20",
            "nova_object.data": {},
            "nova_object.namespace": "nova"
          }
        },
        "nova_object.namespace": "nova",
        "nova_object.changes": [
          "properties",
          "min_ram",
          "container_format",
          "disk_format",
          "min_disk"
        ]
      },
      "availability_zone": "AZ3",
      "flavor": {
        "nova_object.name": "Flavor",
        "nova_object.version": "1.2",
        "nova_object.data": {
          "id": 28,
          "name": "cpu2-ram6-disk40",
          "is_public": true,
          "rxtx_factor": 1,
          "deleted_at": null,
          "root_gb": 40,
          "vcpus": 2,
          "memory_mb": 6144,
          "disabled": false,
          "extra_specs": {},
          "updated_at": null,
          "flavorid": "e29f3ee9-3f07-46d2-b2e2-efa4950edc95",
          "deleted": false,
          "swap": 0,
          "description": null,
          "created_at": "2019-02-07T07:48:21Z",
          "vcpu_weight": 0,
          "ephemeral_gb": 0
        },
        "nova_object.namespace": "nova"
      },
      "force_hosts": null,
      "retry": null,
      "instance_group": {
        "nova_object.name": "InstanceGroup",
        "nova_object.version": "1.11",
        "nova_object.data": {
          "members": null,
          "hosts": null,
          "policy": "anti-affinity"
        },
        "nova_object.namespace": "nova",
        "nova_object.changes": [
          "policy",
          "members",
          "hosts"
        ]
      },
      "scheduler_hints": {
        "group": [
          "295c99ea-2db6-469a-877f-454a3903a8d8"
        ]
      },
      "limits": {
        "nova_object.name": "SchedulerLimits",
        "nova_object.version": "1.0",
        "nova_object.data": {
          "disk_gb": null,
          "numa_topology": null,
          "memory_mb": null,
          "vcpu": null
        },
        "nova_object.namespace": "nova",
        "nova_object.changes": [
          "disk_gb",
          "vcpu",
          "memory_mb",
          "numa_topology"
        ]
      },
      "force_nodes": null,
      "project_id": "1bf4dbb3d2c746658f462bf8e59ec6be",
      "user_id": "255cca4584c24b16a684e3e8322b436b",
      "numa_topology": null,
      "is_bfv": false,
      "pci_requests": {
        "nova_object.name": "InstancePCIRequests",
        "nova_object.version": "1.1",
        "nova_object.data": {
          "instance_uuid": "2098b550-c749-460a-a44e-5932535993a9",
          "requests": []
        },
        "nova_object.namespace": "nova"
      }
    },
    "nova_object.namespace": "nova",
    "nova_object.changes": [
      "ignore_hosts",
      "requested_destination",
      "num_instances",
      "image",
      "availability_zone",
      "instance_uuid",
      "flavor",
      "scheduler_hints",
      "pci_requests",
      "instance_group",
      "limits",
      "project_id",
      "user_id",
      "numa_topology",
      "is_bfv",
      "retry"
    ]
  }
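
For anyone wanting to check whether a stored spec is in this state, a rough sketch follows. It assumes the spec has already been pulled out of the API database as a JSON string (how to do that is not covered here), and it assumes a healthy spec keeps the group's uuid under the instance_group "nova_object.data" key. group_uuid_missing is a hypothetical helper, and the sample documents are trimmed versions of the dump above.

```python
import json


def group_uuid_missing(spec_json: str) -> bool:
    """Return True if the spec has an instance_group without a uuid."""
    spec = json.loads(spec_json)
    data = spec.get('nova_object.data', {})
    group = data.get('instance_group')
    if not group:
        # No group at all: nothing to be missing.
        return False
    group_data = group.get('nova_object.data', {})
    return not group_data.get('uuid')


# Trimmed version of the spec shown above: a group with no uuid.
broken = json.dumps({
    'nova_object.name': 'RequestSpec',
    'nova_object.data': {
        'instance_group': {
            'nova_object.name': 'InstanceGroup',
            'nova_object.data': {'members': None,
                                 'hosts': None,
                                 'policy': 'anti-affinity'},
        },
    },
})

# A spec for a server that is not in any group.
no_group = json.dumps({
    'nova_object.name': 'RequestSpec',
    'nova_object.data': {'instance_group': None},
})

print(group_uuid_missing(broken))
print(group_uuid_missing(no_group))
```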

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1830747/+subscriptions
References

[Bug 1830747] [NEW] Error 500 trying to migrate an instance after wrong request_spec
From: Thomas Goirand, 2019-05-28