
yahoo-eng-team team mailing list archive

[Bug 2011127] Re: Nova scheduler stacks allocations in heterogeneous environments

 

** Changed in: nova
       Status: In Progress => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2011127

Title:
  Nova scheduler stacks allocations in heterogeneous environments

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  Our OpenStack clouds consist of different hypervisor hardware
  configurations, all of which are members of the same cell.

  What we have observed is that many of the weighers in Nova
  encourage "stacking" of allocations instead of "spreading". That is
  to say, the weighers keep assigning greater weights to the
  hypervisors with more resources until those hypervisors are
  objectively over-provisioned compared to the hypervisors with fewer
  resources.

  Suppose, for example, that some of these hypervisors have 1/4th the
  amount of RAM and physical CPU cores compared to others. What we
  observe is that, assuming all hypervisors start empty, the
  hypervisors with 1/4th the amount of RAM will not have a *single*
  instance assigned to them even when the others already have 1/2 or
  more of their resources allocated.

  We dug into why, and landed upon this commit from 2013 which normalized the weights:
  https://github.com/openstack/nova/commit/e5ba8494374a1b049eae257fe05b10c5804049ae

  The normalization on the surface seems correct:
  "weight = w1_multiplier * norm(w1) + w2_multiplier * norm(w2) + ..."

  However, the values computed for w1 by the CPUWeigher, RAMWeigher,
  etc. are objectively *not* correct anymore. The commit mentions that
  all weighers should fall under two cases:

     Case 1: Use of a percentage instead of absolute values (for example, % of free RAM).
     Case 2: Use of absolute values.
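
  For reference, here is a rough sketch of that combination step. It is
  illustrative only, not the actual nova.weights code; it models the
  behaviour described in this report, where each weigher's raw value is
  scaled against the current maximum across the candidate hosts:

    def normalize(values):
        # Scale each raw weigher value against the largest value among
        # the candidate hosts (the behaviour described in this report).
        maximum = max(values)
        return [v / maximum if maximum else 0.0 for v in values]

    def combined_weights(raw_values_per_weigher, multipliers):
        # raw_values_per_weigher: one list of per-host raw values per
        # weigher; multipliers: one multiplier per weigher (for example
        # ram_weight_multiplier, cpu_weight_multiplier).
        totals = [0.0] * len(raw_values_per_weigher[0])
        for raw, mult in zip(raw_values_per_weigher, multipliers):
            for i, norm_value in enumerate(normalize(raw)):
                totals[i] += mult * norm_value
        return totals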

  However, if we look at the current implementation, case 2 has hidden
  implications for some weighers. In the context of the RAMWeigher, for
  example, this is because the normalization occurs with respect to the
  hypervisor which has the most free RAM at the point in time of
  scheduling -- this is not % free RAM per hypervisor, so:

  Suppose we take a fictitious example of two hypervisors, one
  ("HypA") with 2 units of RAM and one ("HypB") with 10 units of RAM,
  and assume that VMs of 0.25 units of RAM each are allocated:

  Upon the first allocation, we compute these weights:
  HypA: 2 units of free RAM, normalized weight = 0.2 (2/10)
  HypB: 10 units of free RAM, normalized weight = 1.0 (10/10)

  And the second:
  HypA: 2 units of free RAM, normalized weight = 0.20512820512820512 (2/9.75)
  HypB: 9.75 units of free RAM, normalized weight = 1.0 (9.75/9.75)

  And the third:
  HypA: 2 units of free RAM, normalized weight = 0.21052631578947367 (2/9.5)
  HypB: 9.5 units of free RAM, normalized weight = 1.0 (9.5/9.5)

  etc...

  Thus the RAMWeigher continues stacking instances on HypB until HypB
  has 2 units of free RAM remaining, at which point it holds 32
  instances of 0.25 units of RAM each. After this point, it begins
  spreading across both hypervisors in lockstep fashion. But up until
  this point, it stacks.
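
  A small simulation of the fictitious example above reproduces this.
  It is illustrative only; it models the divide-by-maximum behaviour
  described in this report rather than the exact Nova code, and it
  breaks ties in HypA's favour just to make the lockstep phase visible:

    def schedule(num_vms=40, vm_ram=0.25):
        free = {"HypA": 2.0, "HypB": 10.0}
        placed = {"HypA": 0, "HypB": 0}
        for _ in range(num_vms):
            most_free = max(free.values())
            weights = {h: f / most_free for h, f in free.items()}
            # Place on the host with the highest normalized weight.
            target = max(sorted(free), key=lambda h: weights[h])
            free[target] -= vm_ram
            placed[target] += 1
        return placed

    print(schedule())
    # -> {'HypA': 4, 'HypB': 36}: HypB takes the first 32 instances
    #    before HypA gets a single one.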

  This same problem occurs with the CPUWeigher, but it is even more
  pernicious in that case because the CPUWeigher acts on vCPUs with
  respect to the operator-supplied CPU allocation ratio.

  For example: let's suppose an operator configures Nova with
  cpu_allocation_ratio = 3.0. In this case, a hypervisor with 2x as
  many cores as another will have its cores over-provisioned (that is,
  more than 1 vCPU allocated per physical CPU core) before the other
  hypervisor gets a single instance!

  This is because the value fed into the normalization is the number
  of free vCPUs, where total vCPUs = # physical CPU cores *
  cpu_allocation_ratio. In this way, stacking occurs on the hypervisor
  with twice the CPU cores up until its physical CPU cores are over-
  provisioned at 1.5 vCPUs per physical CPU core.
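
  To make the arithmetic concrete (hypothetical core counts, not taken
  from a real deployment): with cpu_allocation_ratio = 3.0, a 16-core
  host exposes 48 schedulable vCPUs and a 32-core host exposes 96.
  Stacking on the larger host continues until its free vCPU count drops
  to 48, i.e. until 48 vCPUs are allocated on its 32 physical cores:

    cpu_allocation_ratio = 3.0
    small_pcpus, large_pcpus = 16, 32   # hypothetical physical core counts

    small_vcpus = small_pcpus * cpu_allocation_ratio   # 48 schedulable vCPUs
    large_vcpus = large_pcpus * cpu_allocation_ratio   # 96 schedulable vCPUs

    # The larger host keeps winning until its free vCPUs drop to the
    # smaller host's free vCPUs, i.e. until it has allocated:
    allocated_before_spreading = large_vcpus - small_vcpus   # 48.0 vCPUs
    overcommit = allocated_before_spreading / large_pcpus    # 1.5 vCPUs per pCPU core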

  The documentation refers to "even spreading" but never defines it;
  whatever the intent, this behaviour certainly does not seem correct.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2011127/+subscriptions


