[Bug 2011127] [NEW] Nova scheduler stacks allocations in heterogeneous environments

Public bug reported:

Our OpenStack clouds consist of different hypervisor hardware
configurations, all of which are members of the same cell.

What we have observed is that many of the weighers in Nova will
encourage "stacking" of allocations instead of "spreading". That is to
say, the weighers will preferentially keep assigning greater weights to
the hypervisors with more resources until said hypervisors are
objectively over-provisioned compared to the hypervisors with fewer
resources.

Suppose for example that some of these hypervisors have 1/4th the amount
of RAM and physical CPU cores compared to others. What we observe is
that, assuming all hypervisors start empty, the hypervisors with 1/4th
the amount of RAM will not have a *single* instance assigned to them
even when others can have 1/2 or more of their resources allocated.

We dug into why, and landed upon this commit from 2013, which normalized the weights:
https://github.com/openstack/nova/commit/e5ba8494374a1b049eae257fe05b10c5804049ae

The normalization on the surface seems correct:
"weight = w1_multiplier * norm(w1) + w2_multiplier * norm(w2) + ..."

However, the values computed for w1, w2, etc. by the CPUWeigher,
RAMWeigher, and so on are objectively *not* correct anymore. The commit
mentions that all weighers should fall under one of two cases:

   Case 1: Use of a percentage instead of absolute values (for example, % of free RAM).
   Case 2: Use of absolute values.

However, if we look at the current implementation, it does neither of
these things. In the case of the RAMWeigher, it returns the free RAM as
an absolute value. The normalization occurs with respect to the
hypervisor which has the most free RAM at the point in time of
scheduling -- this is not % free RAM per hypervisor, and it has some
hidden implications.
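To make the distinction concrete, here is a minimal Python sketch of
the two behaviors (illustrative only, not Nova's actual code; the host
names and RAM figures are invented):

    hosts = {"small": (2.0, 2.0), "big": (10.0, 10.0)}  # (free, total) RAM

    # Case 1 from the commit message: a true percentage, computed per host.
    pct_free = {h: free / total for h, (free, total) in hosts.items()}
    print(pct_free)  # {'small': 1.0, 'big': 1.0} -- two empty hosts tie

    # What the current code effectively does: absolute free RAM, normalized
    # against whichever host has the most free RAM at scheduling time.
    top = max(free for free, _total in hosts.values())
    rel_free = {h: free / top for h, (free, _total) in hosts.items()}
    print(rel_free)  # {'small': 0.2, 'big': 1.0} -- the big host always wins

Under Case 1, two empty hosts tie; under the current normalization, the
larger host wins every scheduling round until the absolute free-RAM
figures converge.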

Suppose we take a fictitious example of two hypervisors, one ("HypA")
with 2 units of RAM and one ("HypB") with 10 units of RAM, and assume
VMs of 0.25 units of RAM each are allocated:

Upon the first allocation, we compute these weights:
HypA: 2 units of free RAM, normalized weight = 0.2 (2/10)
HypB: 10 units of free RAM, normalized weight = 1.0 (10/10)

And the second:
HypA: 2 units of free RAM, normalized weight = 0.20512820512820512 (2/9.75)
HypB: 9.75 units of free RAM, normalized weight = 1.0 (9.75/9.75)

And the third:
HypA: 2 units of free RAM, normalized weight = 0.21052631578947367 (2/9.5)
HypB: 9.5 units of free RAM, normalized weight = 1.0 (9.5/9.5)

etc...

Thus the RAMWeigher continues stacking instances on HypB until HypB has
2 units of free RAM remaining, at which point it holds 32 instances of
0.25 units of RAM each. After this point, it begins spreading across
both hypervisors in lockstep fashion. But up until this point, it
stacks.
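The whole progression can be reproduced with a short simulation (again
illustrative only; normalize() mirrors the max-relative normalization
described above, and ties break toward HypA purely by dict order):

    def normalize(free_by_host):
        top = max(free_by_host.values())
        return {host: free / top for host, free in free_by_host.items()}

    free = {"HypA": 2.0, "HypB": 10.0}   # units of free RAM
    placements = {"HypA": 0, "HypB": 0}

    for _ in range(40):                  # schedule 40 VMs of 0.25 units each
        weights = normalize(free)
        winner = max(weights, key=weights.get)
        free[winner] -= 0.25
        placements[winner] += 1

    print(placements)  # {'HypA': 4, 'HypB': 36} -- the first 32 VMs all
                       # land on HypB before HypA sees a single one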

This same problem occurs with the CPUWeigher, but it's even more
pernicious there because the CPUWeigher acts on vCPUs with respect to
operator-supplied CPU allocation ratios.

For example: let's suppose an operator configures Nova with
cpu_allocation_ratio = 3.0. In this case, a hypervisor with 2x as many
cores as another will have its cores over-provisioned (that is, more
than 1 vCPU allocated per physical CPU core) before the other hypervisor
gets a single instance!

This is because the value returned to the normalization function is the
number of free vCPUs, where the total vCPU capacity is (# physical CPU
cores * cpu_allocation_ratio). In this way, stacking occurs on the
hypervisor with twice the CPU cores up until its physical CPU cores are
over-provisioned at 1.5 vCPUs per physical CPU core.
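A back-of-the-envelope check with hypothetical core counts (16 physical
cores for HypA vs. 32 for HypB) shows where that crossover lands:

    ratio = 3.0
    pcpus = {"HypA": 16, "HypB": 32}                  # physical cores
    vcpus = {h: n * ratio for h, n in pcpus.items()}  # HypA: 48, HypB: 96

    # HypB keeps winning until its free vCPUs fall to HypA's 48, i.e.
    # until 96 - 48 = 48 vCPUs have landed on HypB's 32 physical cores:
    print((vcpus["HypB"] - vcpus["HypA"]) / pcpus["HypB"])  # 1.5

So HypB reaches 1.5 vCPUs per physical core before HypA is even
considered.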

The documentation does not precisely define "even spreading", as it is
referred to -- but this behavior certainly does not seem correct.

** Affects: nova
     Importance: Undecided
     Assignee: Tyler Stachecki (tstachecki)
         Status: In Progress

** Changed in: nova
     Assignee: (unassigned) => Tyler Stachecki (tstachecki)

** Changed in: nova
       Status: New => In Progress
