yahoo-eng-team team mailing list archive

[Bug 1683858] [NEW] Allocation records do not contain overhead information

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Tue, 18 Apr 2017 15:52:29 -0000
Reply-to: Bug 1683858 <1683858@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

Some virt drivers report additional overhead per instance for memory and
disk usage on a compute node. That is not reported in the allocations
records for a given instance on a resource provider (compute node),
however:

https://github.com/openstack/nova/blob/15.0.0/nova/scheduler/client/report.py#L157

It is used as part of the claim test on the compute when creating an
instance or moving an instance. For creating an instance, that's done
here:

https://github.com/openstack/nova/blob/15.0.0/nova/compute/resource_tracker.py#L144-L156

https://github.com/openstack/nova/blob/15.0.0/nova/compute/claims.py#L165

Where Claim.memory_mb is the instance.flavor.memory_mb + overhead:

https://github.com/openstack/nova/blob/15.0.0/nova/compute/claims.py#L106

So ultimately what we claim on the compute node is not what we report to
placement as allocations for that instance. This matters because when
the filter scheduler asks placement for a list of resource providers
that can fit a given request's memory_mb and disk_gb, placement relies
on the inventory for the compute node resource provider and the existing
usage (allocations) for that provider, and we aren't reporting the full
story to placement.

This could lead to placement telling the filter scheduler there is room
to place an instance on a given compute node when in fact the request
could fail the claim once we get to the host, which would result in a
retry of the build on another host (which can be expensive).
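
To make the mismatch concrete, here is a rough sketch with made-up
numbers. It is not nova code: the Instance class, the helper functions,
and the 512 MB overhead are all hypothetical, and it ignores allocation
ratios and reserved memory. It only shows how a request can fit
according to placement's view of usage but fail the claim on the host.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    flavor_memory_mb: int  # memory requested by the flavor
    overhead_mb: int       # hypothetical per-instance overhead from the virt driver


def placement_usage(instances):
    """Usage as placement sees it: flavor memory only, no overhead."""
    return sum(i.flavor_memory_mb for i in instances)


def compute_usage(instances):
    """Usage as the compute node claim sees it: flavor memory plus overhead."""
    return sum(i.flavor_memory_mb + i.overhead_mb for i in instances)


def placement_says_fits(total_mb, instances, new):
    """Would placement report this provider as having room for the new instance?"""
    return placement_usage(instances) + new.flavor_memory_mb <= total_mb


def claim_succeeds(total_mb, instances, new):
    """Would the claim on the host pass once overhead is counted?"""
    needed = new.flavor_memory_mb + new.overhead_mb
    return compute_usage(instances) + needed <= total_mb


TOTAL_MB = 4096  # assumed memory inventory of the compute node
existing = [Instance(1024, 512), Instance(1024, 512)]
new = Instance(1536, 512)

print(placement_says_fits(TOTAL_MB, existing, new))  # True: 2048 + 1536 <= 4096
print(claim_succeeds(TOTAL_MB, existing, new))       # False: 3072 + 2048 > 4096
```

With these numbers placement sees 2048 MB used while the host has
claimed 3072 MB, so the scheduler picks the host and the claim fails.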

Also, when we start having multi-cell support with a top-level conductor
that the computes can't reach, we won't have build retries anymore, so
you'd just fail the claim and the build would be done and the instance
would go to ERROR state. So it's critical that the placement service has
the proper information for making the correct decision on the first try.

** Affects: nova
     Importance: High
         Status: Triaged


** Tags: placement resource-tracker scheduler

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1683858

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1683858/+subscriptions
Follow ups

[Bug 1683858] Re: Allocation records do not contain overhead information
From: Chris Dent, 2018-06-27