[Bug 1864665] [NEW] Circular reference error during re-schedule
Public bug reported:
Description
===========
Server cold migration fails after re-schedule.
Steps to reproduce
==================
* create a devstack with two compute hosts with libvirt driver
* set allow_resize_to_same_host=True on both computes
* set up cells v2 without cell conductor / rabbit separation so that the re-schedule logic can call back to the super conductor / scheduler
* enable the NUMATopologyFilter and make sure both computes have NUMA resources
* create a flavor with hw:cpu_policy='dedicated' extra spec
* boot a server with the flavor and check which compute it is placed on (let's call it host1)
* boot enough servers on host2 that the next scheduling request can still be fulfilled by both computes, but host1 is preferred by the weighers
* cold migrate the pinned server (a rough CLI sketch of these steps follows this list)
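For reference, a rough sketch of these steps with the openstack CLI; the
flavor, image, network, and server names below are made up for
illustration, and host2 still has to be filled up between the boot and
the migration:

  # on both computes, in nova.conf:
  #   [DEFAULT] allow_resize_to_same_host = True
  #   [filter_scheduler] enabled_filters = <defaults>,NUMATopologyFilter

  # flavor with dedicated (pinned) CPUs
  openstack flavor create --vcpus 2 --ram 512 --disk 1 \
      --property hw:cpu_policy=dedicated pinned.small

  # boot the victim server and check where it landed (host1)
  openstack server create --flavor pinned.small --image cirros \
      --network private --wait pinned-server
  openstack server show pinned-server -c OS-EXT-SRV-ATTR:host

  # ... boot filler servers on host2, then trigger the cold migration
  openstack server migrate pinned-server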
Expected result
===============
* the scheduler selects host1 first, but that host fails with an UnableToMigrateToSelf exception as the libvirt driver does not have that capability
* re-schedule happens
* scheduler selects host2 where the server spawns successfully
Actual result
=============
* during the re-schedule, when the conductor sends the prep_resize RPC to host2, the json serialization of the request spec fails with a 'Circular reference detected' error
Environment
===========
* two node devstack with libvirt driver
* stable/pike nova. The bug is expected to be reproducible on newer branches as well, but not since Stein; see the Triage section below
Triage
======
The json serialization blows up in the migrate conductor task [1]. Debugging shows that the infinite loop happens when jsonutils.to_primitive tries to serialize a VirtCPUTopology instance.
The problematic piece of code has been removed by
I4244f7dd8fe74565180f73684678027067b4506e in Stein.
[1]
https://github.com/openstack/nova/blob/4224a61b4f3a8b910dcaa498f9663479d61a6060/nova/conductor/tasks/migrate.py#L87
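For illustration only, a minimal stand-alone sketch of the failure class
(plain json and a made-up class, not nova code): when the encoder's
default hook keeps handing back an object graph that refers to itself,
the circular reference detection fires with the error seen above.

  import json

  class FakeCPUTopology(object):
      """Made-up stand-in for the object that carries the cycle."""
      def __init__(self):
          self.cells = []

  topo = FakeCPUTopology()
  topo.cells.append(topo)  # back-reference standing in for the real cycle

  def naive_default(obj):
      # Mimics a default= hook that exposes instance attributes and
      # thereby re-introduces the same containers to the encoder.
      return obj.__dict__

  json.dumps(topo, default=naive_default)
  # ValueError: Circular reference detected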
** Affects: nova
Importance: Medium
Assignee: Balazs Gibizer (balazs-gibizer)
Status: Invalid
** Affects: nova/ocata
Importance: Undecided
Status: New
** Affects: nova/pike
Importance: Medium
Assignee: Balazs Gibizer (balazs-gibizer)
Status: Triaged
** Affects: nova/queens
Importance: Undecided
Status: New
** Affects: nova/rocky
Importance: Undecided
Status: New
** Tags: stable-only
** Tags added: stable-only
** Changed in: nova
Assignee: (unassigned) => Balazs Gibizer (balazs-gibizer)
** Changed in: nova
Status: New => Triaged
** Changed in: nova
Importance: Undecided => Medium
** Also affects: nova/pike
Importance: Undecided
Status: New
** Also affects: nova/rocky
Importance: Undecided
Status: New
** Also affects: nova/queens
Importance: Undecided
Status: New
** Also affects: nova/ocata
Importance: Undecided
Status: New
** Changed in: nova
Status: Triaged => Invalid
** Changed in: nova/pike
Status: New => Triaged
** Changed in: nova/pike
Importance: Undecided => Medium
** Changed in: nova/pike
Assignee: (unassigned) => Balazs Gibizer (balazs-gibizer)
--
https://bugs.launchpad.net/bugs/1864665