yahoo-eng-team team mailing list archive

[Bug 1729621] [NEW] Inconsistent value for vcpu_used

 

Public bug reported:

Description
===========

Nova updates hypervisor resources via the function
update_available_resource() in nova/compute/resource_tracker.py.

When *shut down* instances are present, this can lead to temporarily
inconsistent values for resources such as vcpus_used.

Resources are taken from function self.driver.get_available_resource():
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L617
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/virt/libvirt/driver.py#L5766

Allocated vCPUs are calculated by the function _get_vcpu_used():
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/virt/libvirt/driver.py#L5352

As we can see, _get_vcpu_used() calls *self._host.list_guests()*
without the "only_running=False" parameter, so it does not take
shut-down instances into account.
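A minimal illustration (plain Python, not nova code; the Guest class and helper below are hypothetical stand-ins) of why skipping non-running guests undercounts the used vCPUs:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a libvirt guest as seen by the driver.
@dataclass
class Guest:
    vcpus: int
    running: bool

guests = [Guest(vcpus=4, running=True),
          Guest(vcpus=4, running=True),
          Guest(vcpus=3, running=False)]  # a shut-down instance

def vcpus_used(guests, only_running=True):
    # Mirrors the reported behavior: with the default only_running=True,
    # shut-down guests are skipped, so their vCPUs are not counted.
    return sum(g.vcpus for g in guests if g.running or not only_running)

print(vcpus_used(guests))                      # 8: shut-down guest ignored
print(vcpus_used(guests, only_running=False))  # 11: all guests counted
```

The shut-down guest's 3 vCPUs are still allocated on the host, but the default filter drops them from the sum.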

At the end of the resource-update process, _update_available_resource() is called:
> /opt/stack/nova/nova/compute/resource_tracker.py(733)

 677         @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
 678         def _update_available_resource(self, context, resources):
 679
 681             # initialize the compute node object, creating it
 682             # if it does not already exist.
 683             self._init_compute_node(context, resources)

It initializes the compute node object with resources calculated
without the shut-down instances. If the compute node object already
exists, it *UPDATES* its fields - *for a while, nova-api reports
resource values that differ from the real ones.*

 731             # update the compute_node
 732             self._update(context, cn)

The inconsistency is corrected automatically later in the same update:
https://github.com/openstack/nova/blob/f974e3c3566f379211d7fdc790d07b5680925584/nova/compute/resource_tracker.py#L709

But on heavily loaded hypervisors (e.g. 100 active instances and 30
shut-down instances) the wrong values stay in the nova database for
about 4-5 seconds (in my use case). This can trigger other issues,
such as spawning on an already full hypervisor, because the scheduler
sees wrong information about hypervisor usage.
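The transient window can be sketched as a simplified model (hypothetical code with made-up numbers, not the actual nova implementation): the first write stores totals derived from running guests only, and a later step restores the correct value, so anything reading the database in between sees the lower number.

```python
# Simplified model of the reported race (hypothetical, not nova code).
db = {"vcpus_used": 120}  # correct steady-state value

def update_available_resource(db):
    # Step 1: the driver reports usage from running guests only
    # (three shut-down instances with 1 vCPU each are missing).
    db["vcpus_used"] = 117
    snapshot = db["vcpus_used"]  # what the scheduler would read here
    # Step 2: usage is re-applied from the instance list, fixing the value.
    db["vcpus_used"] = 120
    return snapshot

seen_during_window = update_available_resource(db)
print(seen_during_window)  # 117: the wrong value visible in between
print(db["vcpus_used"])    # 120: correct again once the update finishes
```

The longer step 2 takes (it scales with the number of instances on the host), the wider the window in which a scheduler query returns the wrong value.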

Steps to reproduce
==================

1) Start devstack
2) Create 120 instances
3) Stop some instances
4) Watch the values flapping in nova hypervisor-show:
nova hypervisor-show e6dfc16b-7914-48fb-a235-6fe3a41bb6db

Expected result
===============
Returned values should stay constant for the duration of the test.

Actual result
=============
while true; do echo -n "$(date) "; echo "select hypervisor_hostname, vcpus_used from compute_nodes where hypervisor_hostname='example.compute.node.com';" | mysql nova_cell1; sleep 0.3; done

Thu Nov  2 14:50:09 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:10 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:10 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:10 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:11 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:11 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:11 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:11 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:12 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:12 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:12 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:13 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:13 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:13 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:14 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:14 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:14 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:15 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:15 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:15 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:16 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:16 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:16 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:17 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:17 UTC 2017 example.compute.node.com  117
Thu Nov  2 14:50:17 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:17 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:18 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:18 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:18 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:19 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:19 UTC 2017 example.compute.node.com  120
Thu Nov  2 14:50:19 UTC 2017 example.compute.node.com  120

The bad values stayed in the nova DB for about 5 seconds. During this
time nova-scheduler could pick this host.

Environment
===========
Devstack master (f974e3c3566f379211d7fdc790d07b5680925584).
Releases down to Newton are certainly affected.

** Affects: nova
     Importance: Undecided
         Status: New


-- 
https://bugs.launchpad.net/bugs/1729621

