yahoo-eng-team team mailing list archive
Message #32847
[Bug 1454451] [NEW] simultaneous boot of multiple instances leads to cpu pinning overlap
Public bug reported:
I'm running into an issue with kilo-3 that I think is present in current
trunk.
I think there is a race between the claimed CPUs of an instance being
persisted to the DB, and the resource audit scanning the DB for
instances and subtracting pinned CPUs from the list of available CPUs.
The problem only shows up when the following sequence happens:
1) instance A (with dedicated cpus) boots on a compute node
2) resource audit runs on that compute node
3) instance B (with dedicated cpus) boots on the same compute node
So to hit this you need to be booting many instances, limiting the valid
compute nodes (via host aggregates or server groups), or running a small
cluster.
The nitty-gritty view looks like this:
When booting up an instance we hold the COMPUTE_RESOURCE_SEMAPHORE in
compute.resource_tracker.ResourceTracker.instance_claim() and that
covers updating the resource usage on the compute node. But we don't
persist the instance numa topology to the database until after
instance_claim() returns, in
compute.manager.ComputeManager._build_instance(). Note that this is
done *after* we've given up the semaphore, so there is no longer any
sort of ordering guarantee.
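That ordering can be sketched like this (a simplified toy model, not the
actual Nova source; the class and function bodies here are hypothetical
stand-ins that only mimic the lock/persist ordering described above):

```python
import threading

COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()

class Tracker:
    def __init__(self):
        # In-memory view of pCPUs claimed on this compute node.
        self.pinned_cpus = set()

    def instance_claim(self, instance):
        # Resource usage on the compute node is updated while holding
        # the semaphore...
        with COMPUTE_RESOURCE_SEMAPHORE:
            self.pinned_cpus |= instance["cpus"]

def build_instance(tracker, db_instances, instance):
    tracker.instance_claim(instance)
    # ...but the instance (and its numa topology) is only persisted
    # *after* the semaphore has been released, leaving a window where
    # the DB doesn't know about the claim.
    db_instances.append(instance)
```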
compute.resource_tracker.ResourceTracker.update_available_resource()
then acquires COMPUTE_RESOURCE_SEMAPHORE, queries the database for a list
of instances and uses that to calculate a new view of what resources are
available. If the numa topology of the most recent instance hasn't been
persisted yet, then the new view of resources won't include any pCPUs
pinned by that instance.
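In other words, the audit rebuilds its view purely from what the database
returns at that moment, along these lines (a sketch under the same
simplified model; `all_cpus` is a made-up stand-in for the host's pCPU set):

```python
import threading

COMPUTE_RESOURCE_SEMAPHORE = threading.Lock()

def update_available_resource(db_instances, all_cpus):
    # Recompute free pCPUs from scratch, based only on instances
    # currently visible in the database.
    with COMPUTE_RESOURCE_SEMAPHORE:
        used = set()
        for inst in db_instances:
            used |= inst.get("pinned_cpus", set())
        # A claim that hasn't been persisted yet is silently dropped:
        # its pCPUs show up as available again.
        return all_cpus - used
```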
compute.manager.ComputeManager._build_instance() runs for the next
instance and based on the new view of available resources it allocates
the same pCPU(s) used by the earlier instance. Boom, overlapping pinned
pCPUs.
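Putting the three steps together, the interleaving can be reproduced with a
toy model (entirely hypothetical code that only mimics the ordering described
above, not Nova's real data structures):

```python
db = []                  # instances visible to the resource audit
all_cpus = {0, 1, 2, 3}  # pCPUs on the compute node
available = set(all_cpus)

def claim(n):
    # Pick n pCPUs from the tracker's current in-memory view.
    picked = set(sorted(available)[:n])
    available.difference_update(picked)
    return picked

def audit():
    # Rebuild the available set from the DB only.
    used = set()
    for inst in db:
        used |= inst["pinned_cpus"]
    available.clear()
    available.update(all_cpus - used)

# 1) instance A claims 2 dedicated CPUs; claim not yet persisted
a = {"pinned_cpus": claim(2)}
# 2) the audit runs before A's numa topology hits the DB
audit()
db.append(a)  # persisted too late
# 3) instance B claims 2 dedicated CPUs from the refreshed view
b = {"pinned_cpus": claim(2)}

print(a["pinned_cpus"] & b["pinned_cpus"])  # prints {0, 1}: overlapping pins
```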
** Affects: nova
Importance: Undecided
Status: New
** Tags: compute
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1454451
Title:
simultaneous boot of multiple instances leads to cpu pinning overlap
Status in OpenStack Compute (Nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1454451/+subscriptions