yahoo-eng-team team mailing list archive

[Bug 1924123] [NEW] If source compute node is overcommitted instances can't be migrated

 

Public bug reported:

I'm facing an issue similar to "https://bugs.launchpad.net/nova/+bug/1918419",
but different enough that I'm opening a new bug.

I'm giving some context to this bug to better explain how this affects
operations. Here's the story...

When a compute node needs a hardware intervention we have an automated
process that the repair team uses (they don't have access to OpenStack
APIs) to live migrate all the instances before starting the repair. The
motivation is to minimize the impact on users.

However, instances can't be live migrated if the compute node becomes
overcommitted!

It happens that if a DIMM fails in a compute node that has all the
memory allocated to VMs, it's not possible to move those VMs.

"No valid host was found. Unable to replace instance claim on source
(HTTP 400)"

The compute node becomes overcommitted (because the DIMM is not visible
anymore) and placement can't create the migration allocation in the
source.
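To illustrate the arithmetic, here is a minimal sketch of a placement-style
capacity check, assuming capacity is computed as
(total - reserved) * allocation_ratio. The function name and the numbers
(a 256 GiB node with eight 32 GiB instances, losing one 32 GiB DIMM) are
hypothetical, and this is a simplified model, not the actual placement code.

```python
def allocation_fits(total_mb, reserved_mb, allocation_ratio,
                    used_by_others_mb, requested_mb):
    """Simplified capacity check for a single resource class (MEMORY_MB).

    Hypothetical helper: models the rule that usage by other consumers plus
    the allocation being written must not exceed the provider's capacity.
    """
    capacity_mb = int((total_mb - reserved_mb) * allocation_ratio)
    return used_by_others_mb + requested_mb <= capacity_mb


VM_MB = 32768              # one 32 GiB instance (assumed size)
OTHERS_MB = 7 * VM_MB      # the other seven instances on the node

# Healthy node: 256 GiB visible, no overcommit. Re-creating the allocation
# for the migration on the source fits exactly.
healthy = allocation_fits(262144, 0, 1.0, OTHERS_MB, VM_MB)

# After a 32 GiB DIMM fails only 224 GiB is visible. The same allocation
# no longer fits, although nothing new is being placed on the source.
degraded = allocation_fits(229376, 0, 1.0, OTHERS_MB, VM_MB)

print(healthy, degraded)
```

In this model the request on the degraded node is rejected purely because of
the source's reduced inventory, regardless of how much room the destination
has.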

An operator can work around this by "tuning" the memory overcommit for the
affected compute node, but that requires investigation and manual
intervention by an operator, which defeats automation and delegation to
other teams. This is extremely complicated in large deployments.
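As a rough sketch of what that tuning involves, the smallest memory
allocation ratio that would cover the existing allocations can be estimated
as used / (total - reserved). The helper name and the numbers below are
hypothetical (the same assumed 256 GiB node that lost a 32 GiB DIMM), and how
the ratio is actually applied (for example through nova's
ram_allocation_ratio option) depends on the deployment.

```python
def min_allocation_ratio(total_mb, reserved_mb, used_mb):
    """Smallest ratio for which used_mb fits within (total - reserved) * ratio.

    Hypothetical helper for illustration only.
    """
    usable_mb = total_mb - reserved_mb
    if usable_mb <= 0:
        raise ValueError("no usable memory left on the node")
    return used_mb / usable_mb


# 256 GiB allocated to instances, but only 224 GiB visible after the failure.
ratio = min_allocation_ratio(total_mb=229376, reserved_mb=0, used_mb=262144)
print(f"minimum allocation ratio: {ratio:.3f}")
```

The value differs for every node and every failure, which is why this step
is hard to fold into an automated repair workflow.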

I don't believe this behaviour is correct. 
If there are available resources to host the instances in a different compute node, placement shouldn't block the live migration because the source is overcommitted.

+++

Using Nova Stein.
From what I checked, this still appears to be the behaviour in recent releases.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1924123

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1924123/+subscriptions