yahoo-eng-team team mailing list archive

[Bug 2097359] Re: InstanceNUMACell ovo version is incorrect in the instance_extra table

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/940953
Committed: https://opendev.org/openstack/nova/commit/9507b7b92f9bcfff755ebef9fc6b115ef0b1167f
Submitter: "Zuul (22348)"
Branch:    master

commit 9507b7b92f9bcfff755ebef9fc6b115ef0b1167f
Author: Balazs Gibizer <gibi@xxxxxxxxxx>
Date:   Mon Feb 3 11:07:34 2025 +0100

    Update InstanceNUMACell version in more cases
    
    The data migration of InstanceNUMACell 1.4 to 1.5 only moved the data to
    the new pcpuset field but did not update the ovo version string of the
    object in the DB. The previous patch added the missing version update
    logic. However, it only fixes the issue if the data is not already "half"
    migrated to the new structure. So this patch adds logic to also do the
    right thing if the wrong data migration already happened.
    
    In the end the solution needs to consider multiple scenarios:
    * data was never migrated to the new schema, so the new code needs
      to migrate it and update the version string to match the new schema.
      (done by the previous patch)
    * data is half migrated by the buggy code and the new code needs to
      finish the migration by stamping the version in the DB.
    * data is half migrated and then further modified to use the new 1.6
      feature cpu_policy mixed.
    * data version in the DB is older than we can meaningfully upgrade.
    
    Closes-Bug: #2097359
    Change-Id: I10ecfa7841b15637dea3e4736e90faa5f33ddff3
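
The scenarios listed in the commit message can be sketched as a small
classifier over the serialized cell. This is only an illustration and not
the code from the patch: classify_cell, parse_version, the returned labels
and the oldest_upgradable threshold are all hypothetical names and values
chosen for this sketch.

```python
def parse_version(version):
    """Turn an ovo version string such as "1.4" into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def classify_cell(cell, oldest_upgradable=(1, 4)):
    """Classify one serialized InstanceNUMACell primitive (a dict).

    oldest_upgradable is an assumed threshold for this sketch; the commit
    message does not say which versions are too old to upgrade.
    """
    version = parse_version(cell["nova_object.version"])
    data = cell["nova_object.data"]

    if version < oldest_upgradable:
        return "too_old"
    if version >= (1, 5):
        return "up_to_date"
    # From here on the stored version string predates 1.5.
    if "pcpuset" not in data:
        # Old version string and old layout: a full migration is needed.
        return "never_migrated"
    if data.get("cpu_policy") == "mixed":
        # Old version string, new layout, and a 1.6-only policy value.
        return "half_migrated_with_mixed_policy"
    # Old version string but new layout: only the version stamp is missing.
    return "half_migrated"
```

A real fix would act on each label (migrate, stamp the version, or refuse);
the sketch only shows how the cases can be told apart from the stored data.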


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2097359

Title:
  InstanceNUMACell ovo version is incorrect in the instance_extra table

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  In Victoria the InstanceNUMACell ovo got a new field `pcpuset` (ovo
  version 1.5) along with a data migration that moves the value of the
  pre-existing `cpuset` field to the new `pcpuset` field for instances
  with cpu_policy `dedicated`. The migration runs when the
  InstanceNUMATopology ovo is loaded from the numa_topology field of the
  instance_extra table of the cell DB.

  If the nova-conductor is Victoria or newer (supporting ovo version
  1.6) and there is a nova-compute that is older than Victoria
  (supporting ovo version 1.4), the nova-compute service gets a wrong
  InstanceNUMACell ovo via RPC when loading the instance, e.g. in
  _init_instance during the nova-compute startup.

  The root cause of the problem is that the data migration logic only
  moves the data between the fields but does not bump the ovo version in
  the DB. So the DB will contain a data structure in 1.6 format but with
  a version field set to 1.4.

  {
    "nova_object.name": "InstanceNUMATopology",
    "nova_object.namespace": "nova",
    "nova_object.version": "1.3",
    "nova_object.data": {
      "cells": [
        {
          "nova_object.name": "InstanceNUMACell",
          "nova_object.namespace": "nova",
          "nova_object.version": "1.4",  <------- !!!
          "nova_object.data": {
            "id": 0,
            "cpuset": [],
            "pcpuset": [      <------ !!!
              0
            ],
            "cpuset_reserved": null,
            "memory": 512,
            "pagesize": null,
            "cpu_pinning_raw": {
              "0": 1
            },
            "cpu_policy": "dedicated",
            "cpu_thread_policy": null
          },
          "nova_object.changes": [
            "cpuset_reserved",
            "id",
            "cpuset",
            "cpu_pinning_raw",
            "pcpuset"
          ]
        }
      ],
      "emulator_threads_policy": null
    },
    "nova_object.changes": [
      "emulator_threads_policy",
      "cells"
    ]
  }
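
  The mismatch marked above can be detected mechanically. The following is
  a minimal sketch, assuming the numa_topology value is available as a
  clean JSON string (without the arrow annotations used in this report).
  find_stale_cells and SAMPLE are hypothetical names for illustration and
  are not part of nova.

```python
import json


def find_stale_cells(topology_json):
    """Return ids of cells stamped older than 1.5 that already carry pcpuset."""
    topology = json.loads(topology_json)
    stale = []
    for cell in topology["nova_object.data"]["cells"]:
        version = tuple(int(p) for p in cell["nova_object.version"].split("."))
        data = cell["nova_object.data"]
        # pcpuset only exists from version 1.5 on, so finding it under an
        # older version string means the version was never bumped.
        if version < (1, 5) and "pcpuset" in data:
            stale.append(data["id"])
    return stale


# Trimmed copy of the structure shown in this report.
SAMPLE = json.dumps({
    "nova_object.name": "InstanceNUMATopology",
    "nova_object.version": "1.3",
    "nova_object.data": {
        "cells": [{
            "nova_object.name": "InstanceNUMACell",
            "nova_object.version": "1.4",
            "nova_object.data": {
                "id": 0,
                "cpuset": [],
                "pcpuset": [0],
                "cpu_policy": "dedicated",
            },
        }],
    },
})

print(find_stale_cells(SAMPLE))  # prints [0]
```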

  
  This results in multiple issues:

  1. When the nova-compute gets this data it only sees the cpuset field
  and not the pcpuset field, as pcpuset is not part of the ovo 1.4
  version it understands. But because the version field indicates
  version 1.4, the compute does not request backlevelling of the ovo
  from the conductor, as the ovo is not considered too new. Instead the
  compute tries to use the object as is, with the empty cpuset field. If
  the compute is configured to restart the instance at nova-compute
  startup with the resume_guests_state_on_host_boot config option, or if
  the user tries to reboot the instance via the API, then the
  nova-compute will generate invalid XML based on the empty cpuset field.

  <cputune>
  <shares>2048</shares>
  <emulatorpin cpuset=""/>
  </cputune>

  Then libvirt rejects such XML with

  Failed to start libvirt guest: libvirt.libvirtError: invalid argument:
  Failed to parse bitmap ''

  as the emulatorpin cpuset cannot be empty. So the reboot of the
  instance fails and the instance is put into ERROR state.

  2. During 1. the compute sets the instance to ERROR state and saves
  the new instance state back to the DB. As part of this it sends the
  incorrect InstanceNUMACell ovo data back to the conductor, which
  blindly persists it into the DB. So the DB will now contain
  inconsistent data. The cpuset is empty and the pcpuset field is lost:

  
  {
    "nova_object.name": "InstanceNUMATopology",
    "nova_object.namespace": "nova",
    "nova_object.version": "1.3",
    "nova_object.data": {
      "cells": [
        {
          "nova_object.name": "InstanceNUMACell",
          "nova_object.namespace": "nova",
          "nova_object.version": "1.4",
          "nova_object.data": {
            "id": 0,
            "cpuset": [],  <------------------------------- empty!!!
            "cpuset_reserved": null,
            "memory": 4096,
            "pagesize": 1048576,
            "cpu_pinning_raw": {
              "0": 63,
              "1": 7
            },
            "cpu_policy": "dedicated",
            "cpu_thread_policy": null
          },
          "nova_object.changes": [
            "id",
            "cpu_pinning_raw",
            "cpuset_reserved",
            "cpuset",
            "pagesize"
          ]
        }
      ],
      "emulator_threads_policy": null
    },
    "nova_object.changes": [
      "emulator_threads_policy",
      "cells"
    ]
  }

  Any subsequent instance lifecycle operation will fail due to the empty
  cpuset field.
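
  How an empty cpuset turns into the rejected emulatorpin element can be
  sketched in a few lines. This is not nova's libvirt driver code:
  format_cpuset and build_cputune are hypothetical helpers, and the strict
  guard is only an illustration of failing early, not what the fix does.

```python
import xml.etree.ElementTree as ET


def format_cpuset(cpus):
    """Render CPU ids as a comma separated list, e.g. {0, 2} -> "0,2"."""
    return ",".join(str(cpu) for cpu in sorted(cpus))


def build_cputune(cpus, shares=2048, strict=False):
    """Build a <cputune> fragment that pins the emulator to the given CPUs."""
    if strict and not cpus:
        # Fail before handing libvirt a bitmap it cannot parse.
        raise ValueError("refusing to emit an emulatorpin with an empty cpuset")
    cputune = ET.Element("cputune")
    ET.SubElement(cputune, "shares").text = str(shares)
    ET.SubElement(cputune, "emulatorpin", cpuset=format_cpuset(cpus))
    return ET.tostring(cputune, encoding="unicode")


# An empty set silently produces cpuset="", the value libvirt rejects.
print(build_cputune(set()))
# A populated set produces a usable pinning.
print(build_cputune({0, 2}))
```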

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2097359/+subscriptions


