yahoo-eng-team team mailing list archive

[Bug 2097359] [NEW] InstanceNUMACell ovo version is incorrect in the instance_extra table

 

Public bug reported:

In Victoria the InstanceNUMACell ovo gained a new field, `pcpuset` (ovo
version 1.5), together with a data migration. The migration moves the
value of the pre-existing `cpuset` field to the new `pcpuset` field for
instances with cpu_policy `dedicated`, and it runs while the
InstanceNUMATopology ovo is loaded from the numa_topology field of the
instance_extra table of the cell DB.

If the nova-conductor is Victoria or newer (supporting ovo version 1.6)
and there is a nova-compute that is older than Victoria (supporting ovo
version 1.4), the nova-compute service receives an incorrect
InstanceNUMACell ovo via RPC when loading the instance, e.g. in
_init_instance during the nova-compute startup.

The root cause of the problem is that the data migration logic only
moves the data between the fields but does not bump the ovo version in
the DB. So the DB contains a data structure in 1.6 format whose version
field is still set to 1.4:

{
  "nova_object.name": "InstanceNUMATopology",
  "nova_object.namespace": "nova",
  "nova_object.version": "1.3",
  "nova_object.data": {
    "cells": [
      {
        "nova_object.name": "InstanceNUMACell",
        "nova_object.namespace": "nova",
        "nova_object.version": "1.4",  <------- !!!
        "nova_object.data": {
          "id": 0,
          "cpuset": [],
          "pcpuset": [      <------ !!!
            0
          ],
          "cpuset_reserved": null,
          "memory": 512,
          "pagesize": null,
          "cpu_pinning_raw": {
            "0": 1
          },
          "cpu_policy": "dedicated",
          "cpu_thread_policy": null
        },
        "nova_object.changes": [
          "cpuset_reserved",
          "id",
          "cpuset",
          "cpu_pinning_raw",
          "pcpuset"
        ]
      }
    ],
    "emulator_threads_policy": null
  },
  "nova_object.changes": [
    "emulator_threads_policy",
    "cells"
  ]
}
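
The mismatch above can be detected mechanically. The following is only a
minimal sketch, not nova code: it assumes the numa_topology blob has
already been read from instance_extra as a JSON string, and the helper
names (parse_version, find_mismatched_cells) are hypothetical. It flags
every cell that carries a `pcpuset` key while still claiming a version
older than 1.5, the version in which the field was introduced.

```python
import json

# From the report: `pcpuset` first appears in InstanceNUMACell 1.5.
PCPUSET_MIN_VERSION = (1, 5)


def parse_version(version):
    """Turn a version string such as "1.4" into a comparable tuple (1, 4)."""
    return tuple(int(part) for part in version.split("."))


def find_mismatched_cells(numa_topology_json):
    """Return the ids of cells that hold `pcpuset` but claim a version < 1.5.

    Hypothetical helper for illustration; it only inspects the serialized
    blob and does not touch any nova object.
    """
    topology = json.loads(numa_topology_json)
    mismatched = []
    for cell in topology["nova_object.data"]["cells"]:
        version = parse_version(cell["nova_object.version"])
        data = cell["nova_object.data"]
        if "pcpuset" in data and version < PCPUSET_MIN_VERSION:
            mismatched.append(data["id"])
    return mismatched


# Trimmed copy of the blob shown above, with the annotations removed.
SAMPLE = json.dumps({
    "nova_object.name": "InstanceNUMATopology",
    "nova_object.version": "1.3",
    "nova_object.data": {
        "cells": [{
            "nova_object.name": "InstanceNUMACell",
            "nova_object.version": "1.4",
            "nova_object.data": {
                "id": 0,
                "cpuset": [],
                "pcpuset": [0],
                "cpu_policy": "dedicated",
            },
        }],
    },
})

print(find_mismatched_cells(SAMPLE))
```

With the sample blob the cell with id 0 is reported, since it contains
`pcpuset` while its version field still says 1.4.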


This results in multiple issues:

1. When the nova-compute gets this data it only sees the cpuset field
and not the pcpuset field, as pcpuset is not part of the ovo 1.4 version
it understands. But because the version field indicates version 1.4, the
compute does not request backlevelling of the ovo from the conductor, as
the ovo is not considered too new. Instead the compute tries to use the
object as is, with the empty cpuset field. If the compute is configured
to restart the instance at nova-compute startup with the
resume_guests_state_on_host_boot config option, or if the user tries to
reboot the instance via the API, then the nova-compute generates an
invalid XML based on the empty cpuset field.

<cputune>
<shares>2048</shares>
<emulatorpin cpuset=""/>
</cputune>

libvirt then rejects this XML with

Failed to start libvirt guest: libvirt.libvirtError: invalid argument:
Failed to parse bitmap ''

as the emulatorpin cpuset cannot be empty. So the reboot of the instance
fails and the instance is put into ERROR state.

2. During 1. the compute sets the instance to ERROR state and saves the
new instance state back to the DB. As part of this it sends the
incorrect InstanceNUMACell ovo data back to the conductor, which blindly
persists it into the DB. So the DB now contains inconsistent data. The
cpuset is empty and the pcpuset field is lost:


{
  "nova_object.name": "InstanceNUMATopology",
  "nova_object.namespace": "nova",
  "nova_object.version": "1.3",
  "nova_object.data": {
    "cells": [
      {
        "nova_object.name": "InstanceNUMACell",
        "nova_object.namespace": "nova",
        "nova_object.version": "1.4",
        "nova_object.data": {
          "id": 0,
          "cpuset": [],  <------------------------------- empty!!!
          "cpuset_reserved": null,
          "memory": 4096,
          "pagesize": 1048576,
          "cpu_pinning_raw": {
            "0": 63,
            "1": 7
          },
          "cpu_policy": "dedicated",
          "cpu_thread_policy": null
        },
        "nova_object.changes": [
          "id",
          "cpu_pinning_raw",
          "cpuset_reserved",
          "cpuset",
          "pagesize"
        ]
      }
    ],
    "emulator_threads_policy": null
  },
  "nova_object.changes": [
    "emulator_threads_policy",
    "cells"
  ]
}

Any subsequent instance lifecycle operation will fail due to the empty
cpuset field.
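
Both issues can be reproduced in isolation with a small model of an old
consumer. This is a hedged sketch and not the nova or
oslo.versionedobjects implementation: the field list is taken from the
blobs in this report, and all helper names (needs_backport, load_as_1_4,
save_as_1_4, format_cpuset) are hypothetical. It shows that a blob
labelled 1.4 does not trigger a backport request, that the unknown
`pcpuset` key is dropped on load, that the empty `cpuset` formats to an
empty string, and that `pcpuset` is missing once the object is written
back.

```python
# Fields an InstanceNUMACell 1.4 consumer knows, per the blobs in this report.
FIELDS_1_4 = {
    "id", "cpuset", "cpuset_reserved", "memory", "pagesize",
    "cpu_pinning_raw", "cpu_policy", "cpu_thread_policy",
}
SUPPORTED_VERSION = (1, 4)


def needs_backport(cell_primitive):
    """Only a version newer than the supported one triggers a backport."""
    parts = cell_primitive["nova_object.version"].split(".")
    return tuple(int(p) for p in parts) > SUPPORTED_VERSION


def load_as_1_4(cell_primitive):
    """Keep only the known fields; unknown keys are silently ignored."""
    data = cell_primitive["nova_object.data"]
    return {name: value for name, value in data.items() if name in FIELDS_1_4}


def save_as_1_4(cell):
    """Serialize the in-memory cell the way the old consumer would."""
    return {
        "nova_object.name": "InstanceNUMACell",
        "nova_object.namespace": "nova",
        "nova_object.version": "1.4",
        "nova_object.data": dict(cell),
    }


def format_cpuset(cpus):
    """Render CPU ids as a comma separated string for the guest XML."""
    return ",".join(str(cpu) for cpu in sorted(cpus))


# What the migrated DB row hands to the old compute: 1.6 data, 1.4 label.
stored = {
    "nova_object.name": "InstanceNUMACell",
    "nova_object.namespace": "nova",
    "nova_object.version": "1.4",
    "nova_object.data": {
        "id": 0,
        "cpuset": [],
        "pcpuset": [0],
        "cpuset_reserved": None,
        "memory": 512,
        "pagesize": None,
        "cpu_pinning_raw": {"0": 1},
        "cpu_policy": "dedicated",
        "cpu_thread_policy": None,
    },
}

cell = load_as_1_4(stored)
written_back = save_as_1_4(cell)

print(needs_backport(stored))                         # no backport requested
print(repr(format_cpuset(cell["cpuset"])))            # empty emulatorpin cpuset
print("pcpuset" in written_back["nova_object.data"])  # pinned CPUs are gone
```

In this model nothing fails until the empty string reaches libvirt,
which matches the observation that the compute uses the object as is.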

** Affects: nova
     Importance: Undecided
     Assignee: Balazs Gibizer (balazs-gibizer)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => Balazs Gibizer (balazs-gibizer)

** Summary changed:

- InstanceNUMATopology ovo version is incorrect in the instance_extra table
+ InstanceNUMACell ovo version is incorrect in the instance_extra table

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2097359

Title:
  InstanceNUMACell ovo version is incorrect in the instance_extra table

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2097359/+subscriptions


