yahoo-eng-team team mailing list archive

[Bug 1999814] Re: [SRU] Allow for specifying common baseline CPU model with disabled feature

 

The patch is present in Ubuntu Mantic and later.

$ git grep skip_cpu_compare_at_startup pkg/ubuntu/oracular-devel -- nova/conf/workarounds.py
pkg/ubuntu/oracular-devel:nova/conf/workarounds.py:    cfg.BoolOpt('skip_cpu_compare_at_startup',

$ git grep skip_cpu_compare_at_startup pkg/ubuntu/noble-devel -- nova/conf/workarounds.py
pkg/ubuntu/noble-devel:nova/conf/workarounds.py:    cfg.BoolOpt('skip_cpu_compare_at_startup',

$ git grep skip_cpu_compare_at_startup pkg/ubuntu/mantic-devel -- nova/conf/workarounds.py
pkg/ubuntu/mantic-devel:nova/conf/workarounds.py:    cfg.BoolOpt('skip_cpu_compare_at_startup',

$ git grep skip_cpu_compare_at_startup pkg/ubuntu/jammy-devel -- nova/conf/workarounds.py
$

$ git grep skip_cpu_compare_at_startup pkg/ubuntu/focal-devel -- nova/conf/workarounds.py
$

$ git grep skip_cpu_compare_at_startup pkg/ubuntu/bionic-devel -- nova/conf/workarounds.py
$
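
The per-series checks above could be scripted along these lines. This is only a sketch: it assumes it is run inside a clone that carries the pkg/ubuntu/<series>-devel refs shown above, and the helper names are hypothetical.

```python
import subprocess

OPTION = "skip_cpu_compare_at_startup"
PATH = "nova/conf/workarounds.py"
SERIES = ["oracular", "noble", "mantic", "jammy", "focal", "bionic"]


def grep_command(series):
    """Build the same git grep invocation as above for one Ubuntu series."""
    return ["git", "grep", OPTION, f"pkg/ubuntu/{series}-devel", "--", PATH]


def has_option(series):
    """Return True if the option appears on that series' packaging branch."""
    # git grep exits 0 when it finds a match and 1 when it finds none.
    # Any other status (for example an unknown ref) is also reported as False.
    result = subprocess.run(grep_command(series), capture_output=True, text=True)
    return result.returncode == 0
```

Calling has_option(series) for each entry in SERIES from inside the repository would reproduce the present/absent pattern listed above.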


** Also affects: nova (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Changed in: nova (Ubuntu)
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1999814

Title:
  [SRU] Allow for specifying common baseline CPU model with disabled
  feature

Status in OpenStack Compute (nova):
  Expired
Status in OpenStack Compute (nova) ussuri series:
  New
Status in OpenStack Compute (nova) victoria series:
  New
Status in OpenStack Compute (nova) wallaby series:
  New
Status in OpenStack Compute (nova) xena series:
  New
Status in OpenStack Compute (nova) yoga series:
  New
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Bionic:
  New
Status in nova source package in Focal:
  New
Status in nova source package in Jammy:
  New

Bug description:
  ******** SRU TEMPLATE AT THE BOTTOM *******

  Hello,

  This is very similar to pad.lv/1852437 (and the related blueprint at
  https://blueprints.launchpad.net/nova/+spec/allow-disabling-cpu-flags),
  but there is a very different and important nuance.

  A customer I'm working with has two classes of blades that they're
  trying to use.  Their existing ones are Cascade Lake-based; they are
  presently using the Cascadelake-Server-noTSX CPU model via
  libvirt.cpu_model in nova.conf.  Their new blades are based on Ice
  Lake, a newer processor that would normally be able to run with the
  Cascade Lake feature set - except that these Ice Lake processors lack
  the MPX feature defined in the Cascadelake-Server-noTSX model.

  The result of this is evident when I try to start nova on the new
  blades with the Ice Lake CPUs.  Even if I specify the following in my
  nova.conf:

  [libvirt]
  cpu_mode = custom
  cpu_model = Cascadelake-Server-noTSX
  cpu_model_extra_flags = -mpx

  That is not enough to allow Nova to start; it fails in the libvirt
  driver in the _check_cpu_compatibility function:

  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service Traceback (most recent call last):
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 771, in _check_cpu_compatibility
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     self._compare_cpu(cpu, self._get_cpu_info(), None)
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 8817, in _compare_cpu
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     raise exception.InvalidCPUInfo(reason=m % {'ret': ret, 'u': u})
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service 0
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service During handling of the above exception, another exception occurred:
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service Traceback (most recent call last):
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/oslo_service/service.py", line 810, in run_service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     service.start()
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/service.py", line 173, in start
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     self.manager.init_host()
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 1404, in init_host
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     self.driver.init_host(host=self.host)
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 743, in init_host
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     self._check_cpu_compatibility()
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service   File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 777, in _check_cpu_compatibility
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service     raise exception.InvalidCPUInfo(msg)
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service nova.exception.InvalidCPUInfo: Configured CPU model: Cascadelake-Server-noTSX is not compatible with host CPU. Please correct your config and try again. Unacceptable CPU info: CPU doesn't have compatibility.
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service 0
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
  2022-12-15 17:20:59.562 1836708 ERROR oslo_service.service
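
  The bare "0" in the log above is the result code of libvirt's CPU
  comparison.  A small sketch of how to read it, using the
  virCPUCompareResult values from the libvirt documentation linked in
  the log.  The enum class and helper are hypothetical names, not nova
  code, and the rule that only positive results are accepted is an
  assumption drawn from the traceback.

```python
from enum import IntEnum


class CPUCompareResult(IntEnum):
    """Values of libvirt's virCPUCompareResult."""
    ERROR = -1
    INCOMPATIBLE = 0
    IDENTICAL = 1
    SUPERSET = 2


def is_acceptable(ret):
    """Assumed rule: only IDENTICAL or SUPERSET let the service start."""
    return CPUCompareResult(ret) > CPUCompareResult.INCOMPATIBLE


# The value logged above:
print(CPUCompareResult(0).name, is_acceptable(0))
```

  Read this way, the log says libvirt judged the configured model
  incompatible with the host CPU.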

  If I make a custom libvirt CPU map file which removes the
  "<feature name='mpx'/>" element and specify that as the cpu_model
  instead, I am able to make Nova start - so that single feature does
  indeed seem to be what is blocking me.  However, editing the libvirt
  CPU mapping files is probably not the right way to fix this, which is
  why I'm filing this bug: to discuss how to support cases like this.

  The only "proper" workaround I'm currently aware of is to fall back
  to a Broadwell-based configuration, which lacks the "mpx" feature,
  as a common baseline.  But that's a much older configuration than
  Cascade Lake and would mean missing out on all the other features
  that Cascade Lake and Ice Lake have in common.  I would prefer a way
  to use the Cascade Lake settings and simply remove the "mpx" feature.

  ----

  Steps to reproduce
  ==================

  On an Ice Lake system lacking the MPX feature (e.g. /proc/cpuinfo
  reporting a model of "Intel(R) Xeon(R) Gold 5318Y"), specify the
  following in the [libvirt] section of nova.conf:

  [libvirt]
  cpu_mode = custom
  cpu_model = Cascadelake-Server-noTSX
  cpu_model_extra_flags = -mpx

  Then try to start nova.

  Expected result
  ===============

  Nova should start, since Cascadelake-Server-noTSX with the "mpx"
  feature removed is a subset of what the Ice Lake hosts support, thus
  allowing the use of Cascadelake-Server-noTSX as a common baseline
  model for both Cascade Lake and Ice Lake servers.
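
  The expectation can be illustrated as a set comparison.  This is a
  minimal sketch only: the feature sets below are heavily abbreviated
  placeholders rather than the real libvirt CPU map, the helper names
  are hypothetical, and the "-" prefix handling simply mirrors the
  "-mpx" notation used in this report.

```python
# Illustrative, abbreviated feature sets; NOT the real libvirt CPU map.
CASCADELAKE_NOTSX = {"avx512f", "avx512vnni", "mpx", "pku"}
ICELAKE_HOST = {"avx512f", "avx512vnni", "pku", "sha-ni", "gfni"}


def apply_extra_flags(model_features, extra_flags):
    """Apply "+flag" / "-flag" style entries to a model's feature set."""
    features = set(model_features)
    for flag in extra_flags:
        name = flag.lstrip("+-").lower()
        if flag.startswith("-"):
            features.discard(name)  # feature explicitly disabled
        else:
            features.add(name)      # feature explicitly enabled
    return features


def missing_on_host(requested, host):
    """Return the requested features that the host does not provide."""
    return sorted(requested - host)


# Without the extra flag, the baseline asks for something the host lacks.
print(missing_on_host(CASCADELAKE_NOTSX, ICELAKE_HOST))
# With "-mpx" applied first, nothing is missing.
print(missing_on_host(apply_extra_flags(CASCADELAKE_NOTSX, ["-mpx"]), ICELAKE_HOST))
```

  In this toy model the check would pass once the disabled flag is
  taken into account before comparing, which is the behavior being
  asked for.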

  Actual result
  =============

  Nova refuses to start, claiming the specified CPU model is
  incompatible.  The "cpu_model_extra_flags = -mpx" config option does
  not help.

  Environment
  ===========

  Nova/OpenStack version: OpenStack Ussuri running on Ubuntu Focal.
  Specifically, nova packages are at version 2:21.2.4-0ubuntu2.

  Hypervisor: libvirt + KVM

  Other relevant notes
  ====================

  There are some other open related bugs.  The removal of the MPX
  feature in some Ice Lake processors has manifested in other ways as
  well.  These bugs primarily concern the missing MPX feature breaking
  how Ice Lake processors are detected, so the nuance is somewhat
  different - however, they may be worth reviewing as well.

  * https://gitlab.com/libvirt/libvirt/-/issues/304: bug regarding the
  Icelake CPU maps in libvirt not working to detect certain Ice Lakes,
  instead detecting them as Broadwell-noTSX-IBRS according to "virsh
  capabilities" due to lacking the MPX feature.  (I've personally tested
  that removing the mpx feature from the associated CPU mapping files
  allows for detecting as Ice Lake, but that's not the correct way to
  fix this.)

  There is also an interesting comment on this bug at
  https://gitlab.com/libvirt/libvirt/-/issues/304#note_1065798706.  It
  basically implies that rather than looking at "virsh capabilities",
  "virsh domcapabilities" should be used instead as it seems to more
  correctly identify the CPU model even if there are disabled flags like
  MPX.

  * https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1978064:
  Launchpad-side bug regarding the above issue as encountered in Ubuntu.

  
  ===============
  SRU Description
  ===============

  [Impact]

  When using IceLake CPUs alongside CascadeLake CPUs, Nova does not
  start because the CPU model comparison fails before the flags are
  even compared. Unfortunately, IceLake CPUs are detected as being
  compatible with Broadwell, not CascadeLake. Using Broadwell as a
  common denominator disables many modern features. The libvirt
  upstream team will not add specific support for IceLake [1]. The fix
  [2] in Nova is to skip the CPU check (as a configurable workaround)
  and let libvirt handle the added/removed flags, which is assumed to
  work for this specific case.
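
  A sketch of what enabling the workaround might look like in
  nova.conf, parsed here with Python's configparser purely for
  illustration (nova itself reads its configuration through
  oslo.config).  The option name comes from the grep output earlier in
  this message; placing it under a [workarounds] section is inferred
  from the file it is defined in, nova/conf/workarounds.py, and the
  helper names are hypothetical.

```python
import configparser

# Example nova.conf fragment: the settings from this report plus the
# workaround option (assumed to live in the [workarounds] section).
NOVA_CONF = """
[libvirt]
cpu_mode = custom
cpu_model = Cascadelake-Server-noTSX
cpu_model_extra_flags = -mpx

[workarounds]
skip_cpu_compare_at_startup = True
"""


def load(text):
    """Parse an INI-style configuration string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def skips_cpu_compare(parser):
    """Return the workaround setting, treating an unset option as disabled."""
    return parser.getboolean(
        "workarounds", "skip_cpu_compare_at_startup", fallback=False
    )


print(skips_cpu_compare(load(NOVA_CONF)))
```

  The fallback of False reflects the SRU text below, which describes
  the option as disabled by default.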

  [Test case]

  Because no Icelake and Cascadelake CPUs are available in the lab to
  test this specific scenario, the test case for this SRU is to run
  the charmed-openstack-tester [1] against an environment containing
  the upgraded package (essentially as would be done for a point
  release SRU) and expect the tests to pass. Test run evidence will be
  attached to LP.

  [Regression Potential]

  One new behavior is introduced and one existing behavior is changed.
  The new behavior is gated by a new config option that must be
  explicitly enabled. The changed behavior is the one taken when the
  config option is left at its default (disabled) value, and is not (in
  theory) the code path intended to address the bug. If we were able to
  test the bug and fix in the lab, we could minimize risk by
  introducing only the config option and making no further changes. On
  the other hand, the code being backported to Yoga is exactly the same
  as the code currently in master (Caracal+), which suggests that no
  issues have been found with it across 4 releases.

  [Other Info]

  [1] https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1978064
  [2] https://review.opendev.org/c/openstack/nova/+/871969

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1999814/+subscriptions


