yahoo-eng-team team mailing list archive
Message #74381
[Bug 1729621] Re: Inconsistent value for vcpu_used
Reviewed: https://review.openstack.org/520024
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c9b74bcfa09d11c2046ce1bfb6dd8463b3a2f3b0
Submitter: Zuul
Branch: master
commit c9b74bcfa09d11c2046ce1bfb6dd8463b3a2f3b0
Author: Maciej Józefczyk <maciej.jozefczyk@xxxxxxxxxxxx>
Date: Thu Nov 16 14:49:42 2017 +0100
Update resources once in update_available_resource
This change ensures that resources are updated only once per
update_available_resource() call.
Compute resources were previously updated during host
object initialization and at the end of
update_available_resource(). This could cause inconsistencies
in resource tracking between the compute host and the DB for a couple
of seconds, until the final _update() at the end of
update_available_resource() is called.
For example: nova-api shows that a host uses 10GB of RAM, but
in fact it uses 12GB, because the DB doesn't have the resources that
belong to shut-down instances.
Because of that, nova-scheduler (CachingScheduler) could
choose (based on incomplete information) a host which is already full.
For more information please see the related bug: #1729621
Change-Id: I120a98cc4c11772f24099081ef3ac44a50daf71d
Closes-Bug: #1729621
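The ordering problem described in this commit message can be sketched as a toy model. This is not nova's actual code: `FakeDB`, `ToyTracker`, and the `write_on_init` flag are hypothetical names, and the figures 117 and 120 are borrowed from the bug report's sample output. The sketch assumes the driver reports usage from running guests only, and that the tracker adds stopped instances afterwards.

```python
class FakeDB:
    """Stands in for the compute_nodes row; records every value written."""

    def __init__(self, vcpus_used):
        self.vcpus_used = vcpus_used
        self.history = [vcpus_used]

    def save(self, vcpus_used):
        self.vcpus_used = vcpus_used
        self.history.append(vcpus_used)


class ToyTracker:
    """Hypothetical tracker; write_on_init=True models the old behaviour."""

    def __init__(self, db, write_on_init):
        self.db = db
        self.write_on_init = write_on_init

    def update_available_resource(self, driver_vcpus_used, stopped_vcpus):
        usage = driver_vcpus_used
        if self.write_on_init:
            # Old behaviour: the incomplete figure reaches the DB first.
            self.db.save(usage)
        # Usage of stopped instances is only added after that point.
        usage += stopped_vcpus
        # Final write at the end of the update cycle.
        self.db.save(usage)


def run(write_on_init):
    db = FakeDB(vcpus_used=120)
    ToyTracker(db, write_on_init).update_available_resource(117, 3)
    return db.history


print(run(True))   # [120, 117, 120]
print(run(False))  # [120, 120]
```

With a single write per cycle, a reader of the DB never observes the intermediate value.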
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1729621
Title:
Inconsistent value for vcpu_used
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) ocata series:
New
Status in OpenStack Compute (nova) pike series:
New
Bug description:
Description
===========
Nova updates hypervisor resources using a function called
./nova/compute/resource_tracker.py:update_available_resource().
In the case of *shut-down* instances, this can result in inconsistent
values for resources like vcpu_used.
Resources are taken from the function self.driver.get_available_resource():
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L617
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/virt/libvirt/driver.py#L5766
This function calculates allocated vCPUs based on the function _get_vcpu_total().
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/virt/libvirt/driver.py#L5352
As we can see, the _get_vcpu_total() function calls
*self._host.list_guests()* without the "only_running=False" parameter, so
it does not count shut-down instances.
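To illustrate the effect of that filter, here is a minimal sketch. Only the `only_running` parameter name comes from the report; the `Guest` class and both helper functions are hypothetical stand-ins, not the libvirt driver's real implementation.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    name: str
    vcpus: int
    running: bool


def list_guests(guests, only_running=True):
    """Return running guests only, unless only_running is False."""
    return [g for g in guests if g.running or not only_running]


def vcpus_used(guests, only_running=True):
    """Sum the vCPUs of the guests selected by list_guests()."""
    return sum(g.vcpus for g in list_guests(guests, only_running))


guests = [
    Guest("a", 2, True),
    Guest("b", 1, True),
    Guest("c", 4, False),  # shut down, but still holds its allocation
]

print(vcpus_used(guests))                      # 3
print(vcpus_used(guests, only_running=False))  # 7
```

The gap between the two sums is the amount by which the driver's figure understates usage while instances are stopped.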
At the end of the resource update process, the function _update_available_resource() is called:
> /opt/stack/nova/nova/compute/resource_tracker.py(733)
677 @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
678 def _update_available_resource(self, context, resources):
679
681 # initialize the compute node object, creating it
682 # if it does not already exist.
683 self._init_compute_node(context, resources)
It initializes the compute node object with resources that are calculated
without shut-down instances. If the compute node object already exists, it
*UPDATES* its fields - *for a while nova-api reports resource
values that differ from the real ones.*
731 # update the compute_node
732 self._update(context, cn)
The inconsistency is corrected automatically by code that runs later:
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L709
But on heavily loaded hypervisors (for example 100 active instances and 30
shut-down instances) it leaves wrong information in the nova database
for about 4-5 seconds (in my use case). This can cause other issues,
such as spawning on an already full hypervisor (because the scheduler has
wrong information about hypervisor usage).
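The scheduling consequence can be sketched with a simplified capacity check. `host_passes` is a hypothetical function written for this illustration, not one of nova's scheduler filters, and the capacity of 120 vCPUs with an allocation ratio of 1.0 is an assumption chosen to match the sample values in this report.

```python
def host_passes(vcpus_total, vcpus_used, requested_vcpus, allocation_ratio=1.0):
    """Return True if the request fits under the host's vCPU limit."""
    limit = vcpus_total * allocation_ratio
    return vcpus_used + requested_vcpus <= limit


# Value read from the DB during the inconsistent window.
stale_used = 117
# Value once stopped instances are accounted for again.
actual_used = 120

print(host_passes(120, stale_used, 2))   # True: host looks like it has room
print(host_passes(120, actual_used, 2))  # False: host is already full
```

A scheduler that reads the stale value accepts a host that the correct value would reject.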
Steps to reproduce
==================
1) Start devstack
2) Create 120 instances
3) Stop some instances
4) Watch blinking values in nova hypervisor-show
nova hypervisor-show e6dfc16b-7914-48fb-a235-6fe3a41bb6db
Expected result
===============
Returned values should stay the same during the test.
Actual result
=============
while true; do echo -n "$(date) "; echo "select hypervisor_hostname, vcpus_used from compute_nodes where hypervisor_hostname='example.compute.node.com';" | mysql nova_cell1; sleep 0.3; done
Thu Nov 2 14:50:09 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:10 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:10 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:10 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:11 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:11 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:11 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:11 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:12 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:12 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:12 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:13 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:13 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:13 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:14 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:14 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:14 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:15 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:15 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:15 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:16 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:16 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:16 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:17 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:17 UTC 2017 example.compute.node.com 117
Thu Nov 2 14:50:17 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:17 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:18 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:18 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:18 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:19 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:19 UTC 2017 example.compute.node.com 120
Thu Nov 2 14:50:19 UTC 2017 example.compute.node.com 120
Bad values were stored in the nova DB for about 5 seconds. During this
time nova-scheduler could select this host.
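The length of the inconsistent window can be measured from such samples with a short script. This is a sketch under assumptions: `parse` and `dip_window` are hypothetical helpers, the lines are taken from the output above, and the timestamp format assumes an English (C) locale for day and month names.

```python
from datetime import datetime

# Matches the `date` output above, e.g. "Thu Nov 2 14:50:09 UTC 2017".
FMT = "%a %b %d %H:%M:%S UTC %Y"


def parse(line):
    """Split one sample line into (timestamp, hostname, vcpus_used)."""
    *stamp, host, value = line.split()
    return datetime.strptime(" ".join(stamp), FMT), host, int(value)


def dip_window(lines, expected):
    """Seconds between the first and last sample that differ from expected."""
    times = [t for t, _, v in map(parse, lines) if v != expected]
    if not times:
        return None
    return (max(times) - min(times)).total_seconds()


samples = [
    "Thu Nov 2 14:50:11 UTC 2017 example.compute.node.com 120",
    "Thu Nov 2 14:50:12 UTC 2017 example.compute.node.com 117",
    "Thu Nov 2 14:50:17 UTC 2017 example.compute.node.com 117",
    "Thu Nov 2 14:50:17 UTC 2017 example.compute.node.com 120",
]

print(dip_window(samples, expected=120))  # 5.0
```

Timestamps here have one-second resolution, so the result is only accurate to about a second.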
Environment
===========
Devstack master (f974e3c3566f379211d7fdc790d07b5680925584).
Releases down to Newton are certainly impacted.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1729621/+subscriptions