yahoo-eng-team team mailing list archive

[Bug 1821594] Re: Error in confirm_migration leaves stale allocations and 'confirming' migration state

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Mon, 25 Mar 2019 15:56:30 -0000
Reply-to: Bug 1821594 <1821594@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

This goes back to Pike as noted in the bug description. Before Pike the
ResourceTracker.update_available_resource code would at least correct
the allocations based on the instance.flavor and instance.host.

** Also affects: nova/rocky
   Importance: Undecided
       Status: New

** Also affects: nova/pike
   Importance: Undecided
       Status: New

** Also affects: nova/stein
   Importance: Undecided
       Status: New

** Also affects: nova/queens
   Importance: Undecided
       Status: New

** Changed in: nova/pike
       Status: New => Triaged

** Changed in: nova/rocky
       Status: New => Triaged

** Changed in: nova/stein
       Status: New => Triaged

** Changed in: nova/rocky
   Importance: Undecided => Medium

** Changed in: nova/pike
   Importance: Undecided => Medium

** Changed in: nova/queens
       Status: New => Triaged

** Changed in: nova/queens
   Importance: Undecided => Medium

** Changed in: nova/stein
   Importance: Undecided => Medium

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821594

Title:
  Error in confirm_migration leaves stale allocations and 'confirming'
  migration state

Status in OpenStack Compute (nova):
  Triaged
Status in OpenStack Compute (nova) pike series:
  Triaged
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged
Status in OpenStack Compute (nova) stein series:
  Triaged

Bug description:
  
  Description:

  When performing a cold migration, if the driver raises an exception
  during confirm_migration (which runs on the source node), the
  migration record is stuck in the "confirming" state and the
  allocations against the source node are not removed.

  The instance is fine at the destination at this stage, but the source
  host holds allocations that cannot be cleaned up without going to the
  database or invoking the Placement API via curl. After several
  migration attempts that fail at the same spot, the source node fills
  up with these allocations, which prevent new instances from being
  created on, or migrated to, this node.
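
  The manual cleanup mentioned above can be sketched as follows. This is
  a minimal illustration, not a supported procedure: it only builds
  requests for the Placement API's /allocations/{consumer_uuid}
  resource and does not send them. The endpoint URL, the token and the
  helper name build_allocation_request are assumptions for
  illustration, and the report does not say which consumer UUID holds
  the stale allocation in a given release, so that has to be checked
  with a GET first.

```python
import urllib.request


def build_allocation_request(placement_url, token, consumer_uuid, method="GET"):
    """Build (but do not send) a Placement request for one consumer.

    Hypothetical helper. GET lists the consumer's allocations; DELETE
    removes every allocation held by that consumer, so inspect the GET
    result before deleting anything.
    """
    url = f"{placement_url.rstrip('/')}/allocations/{consumer_uuid}"
    headers = {
        "X-Auth-Token": token,  # Keystone token, obtained separately
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers, method=method)


# Example values below are placeholders, not real endpoints or IDs.
inspect_req = build_allocation_request(
    "http://placement.example/placement", "TOKEN", "CONSUMER_UUID"
)
delete_req = build_allocation_request(
    "http://placement.example/placement", "TOKEN", "CONSUMER_UUID", method="DELETE"
)
# Sending would be: urllib.request.urlopen(inspect_req)
```

  Building the request separately from sending it keeps the destructive
  DELETE explicit and easy to review before it is issued.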

  When confirm_migration fails at this stage, the migrating instance can
  be recovered through a hard reboot or a reset-state to active.

  Steps to reproduce:

  Unfortunately, I don't have logs of the real root cause of the problem
  inside driver.confirm_migration with the libvirt driver. However, the
  stale allocations and migration status problem can be easily
  reproduced by raising an exception in the libvirt driver's
  confirm_migration method, and it would affect any driver.
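
  The failure mode can be illustrated with a toy model. This is a
  sketch under stated assumptions, not nova code: FailingDriver,
  confirm_resize and the dict-based migration and allocation records
  are hypothetical stand-ins, and the ordering of steps is simplified.
  It shows only that cleanup placed after the driver call is skipped
  when the driver raises.

```python
# Toy model only: every name here is hypothetical and is not nova code.


class FailingDriver:
    """Stands in for a virt driver whose confirm_migration raises."""

    def confirm_migration(self, migration):
        raise RuntimeError("simulated driver failure")


def confirm_resize(driver, migration, allocations):
    """Confirm a cold migration on the source node (simplified).

    Cleanup runs only after the driver call returns, so a driver
    exception skips both the allocation removal and the status update.
    """
    migration["status"] = "confirming"
    driver.confirm_migration(migration)
    # Not reached when the driver raises:
    allocations.pop(migration["source_node"], None)
    migration["status"] = "confirmed"


def reproduce():
    """Run the confirm step with a failing driver and return the state."""
    migration = {"status": "finished", "source_node": "src", "dest_node": "dst"}
    allocations = {
        "src": {"VCPU": 2, "MEMORY_MB": 4096},
        "dst": {"VCPU": 2, "MEMORY_MB": 4096},
    }
    try:
        confirm_resize(FailingDriver(), migration, allocations)
    except RuntimeError:
        pass  # the caller sees an error; the records are left as they were
    return migration, allocations


migration, allocations = reproduce()
print(migration["status"])   # status left by the failed confirm
print(sorted(allocations))   # nodes that still hold allocations
```

  In this model the migration stays in "confirming" and the source
  allocation survives, matching the symptoms described in this report.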

  Expected results:

  Discussed this issue with efried and mriedem over #openstack-nova on
  March 25th, 2019. They confirmed that allocations not being cleared up
  is a bug.

  Actual results:

  Instance is fine at the destination after a reset-state. Source node
  has stale allocations that prevent new instances from being
  created/migrated to the source node. Migration record is stuck in
  "confirming" state.

  Environment:

  I verified this bug on the pike, queens and stein branches, running
  the libvirt KVM driver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1821594/+subscriptions

References

[Bug 1821594] [NEW] Error in confirm_migration leaves stale allocations and 'confirming' migration state
From: Rodrigo Barbieri, 2019-03-25