yahoo-eng-team team mailing list archive
Message #55883
[Bug 1545675] Re: Resizing a pinned VM results in inconsistent state
** Changed in: nova/mitaka
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1545675
Title:
Resizing a pinned VM results in inconsistent state
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) mitaka series:
Fix Released
Bug description:
It appears that executing certain resize operations on an instance
with pinned CPUs results in inconsistencies in the "state machine"
that Nova uses to track instances. This was identified using Tempest
and manifests itself as failures in follow-up shelve/unshelve
operations.
---
# Steps
Testing was conducted on a host running a single-node, Fedora
23-based (4.3.5-300.fc23.x86_64) OpenStack deployment (built with
DevStack). Nova commit '12d224e' was used. The Tempest tests
(commit 'e913b82') were run using modified flavors, as seen below:
nova flavor-create m1.small_nfv 420 2048 0 2
nova flavor-create m1.medium_nfv 840 4096 0 4
nova flavor-key 420 set "hw:numa_nodes=2"
nova flavor-key 840 set "hw:numa_nodes=2"
nova flavor-key 420 set "hw:cpu_policy=dedicated"
nova flavor-key 840 set "hw:cpu_policy=dedicated"
cd $TEMPEST_DIR
cp etc/tempest.conf etc/tempest.conf.orig
sed -i "s/flavor_ref = .*/flavor_ref = 420/" etc/tempest.conf
sed -i "s/flavor_ref_alt = .*/flavor_ref_alt = 840/" etc/tempest.conf
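The two sed edits above can also be expressed in Python, which is handy when the configuration is rewritten from a test harness. This is a minimal sketch; the function name `set_flavor_refs` is hypothetical and not part of Tempest. It operates on the file contents as a string and mirrors the sed patterns exactly, including the fact that they do not anchor to the start of a line.

```python
import re


def set_flavor_refs(conf_text, flavor_ref, flavor_ref_alt):
    """Return conf_text with flavor_ref and flavor_ref_alt rewritten.

    Mirrors the sed expressions from the report: everything after
    "flavor_ref = " (or "flavor_ref_alt = ") up to the end of the
    line is replaced.
    """
    conf_text = re.sub(r"flavor_ref = .*",
                       "flavor_ref = %d" % flavor_ref, conf_text)
    conf_text = re.sub(r"flavor_ref_alt = .*",
                       "flavor_ref_alt = %d" % flavor_ref_alt, conf_text)
    return conf_text


# Example with an in-memory config fragment (values are illustrative).
sample = "[compute]\nflavor_ref = 1\nflavor_ref_alt = 2\n"
print(set_flavor_refs(sample, 420, 840))
```

The first pattern does not touch the `flavor_ref_alt` line because it requires a space directly after `flavor_ref`.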
Tests were run in the order given below.
1. tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
2. tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server
3. tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
4. tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
5. tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server
Like so:
./run_tempest.sh -- tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
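To reproduce the sequence without retyping each command, the five runs can be scripted. The sketch below only assumes what the report shows: that `./run_tempest.sh -- <test>` is invoked once per test from the Tempest directory. The helper names `build_commands` and `run_in_order` are hypothetical.

```python
import subprocess

# Test names in the order given in the report.
TESTS = [
    "tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance",
    "tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server",
    "tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert",
    "tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance",
    "tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server",
]


def build_commands(tests=TESTS):
    """Return one argv list per test, matching the command in the report."""
    return [["./run_tempest.sh", "--", test] for test in tests]


def run_in_order(tempest_dir):
    """Run each test in sequence from tempest_dir; return the exit codes."""
    return [subprocess.call(cmd, cwd=tempest_dir) for cmd in build_commands()]
```

Calling `run_in_order` with the path to the Tempest checkout runs the tests one after another; the order matters because the failures only appear after the resize test has run.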
# Expected Result
The tests should pass.
# Actual Result
+---+--------------------------------------+--------+
| # | test id                              | status |
+---+--------------------------------------+--------+
| 1 | 1164e700-0af0-4a4c-8792-35909a88743c | ok     |
| 2 | 77eba8e0-036e-4635-944b-f7a8f3b78dc9 | ok     |
| 3 | c03aab19-adb1-44f5-917d-c419577e9e68 | ok     |
| 4 | 1164e700-0af0-4a4c-8792-35909a88743c | FAIL   |
| 5 | c03aab19-adb1-44f5-917d-c419577e9e68 | ok*    |
+---+--------------------------------------+--------+
* This test is reported as passing but actually generates errors. Bad
test! :)
One test fails while the other "passes" but raises errors. The
failures, where raised, are CPUPinningInvalid exceptions:
CPUPinningInvalid: Cannot pin/unpin cpus [1] from the following pinned set [0, 25]
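The shape of this failure can be shown with a toy model of a NUMA cell that tracks pinned CPUs. This is a sketch under an assumption: that unpinning is rejected whenever the requested CPUs are not all in the cell's pinned set. The `Cell` class is hypothetical and is not Nova's implementation; only the exception name and message format are taken from the output above.

```python
class CPUPinningInvalid(Exception):
    """Stand-in exception using the message format seen in the report."""

    def __init__(self, requested, pinned):
        super().__init__(
            "Cannot pin/unpin cpus %s from the following pinned set %s"
            % (sorted(requested), sorted(pinned)))


class Cell:
    """Toy host NUMA cell (hypothetical; not Nova's NUMACell)."""

    def __init__(self, cpuset):
        self.cpuset = set(cpuset)
        self.pinned_cpus = set()

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        # Assumed rule: CPUs must exist and must not already be pinned.
        if not cpus <= self.cpuset or cpus & self.pinned_cpus:
            raise CPUPinningInvalid(cpus, self.pinned_cpus)
        self.pinned_cpus |= cpus

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        # Assumed rule: only currently pinned CPUs can be unpinned.
        if not cpus <= self.pinned_cpus:
            raise CPUPinningInvalid(cpus, self.pinned_cpus)
        self.pinned_cpus -= cpus


cell = Cell(range(40))
cell.pin_cpus([0, 25])
try:
    # The tracker believes CPUs 0 and 25 are pinned, but the instance
    # being deleted claims CPU 1, so the bookkeeping no longer matches.
    cell.unpin_cpus([1])
except CPUPinningInvalid as exc:
    print(exc)
```

In this model a failed unpin leaves `pinned_cpus` unchanged, so whatever was recorded as pinned stays recorded.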
**NOTE:** I also think there are issues with the non-reverted resize
test, though I've yet to investigate this:
* tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_resize_server_confirm
What's worse, this error "snowballs" on successive runs. Because of
the nature of the failure (a failure to pin/unpin CPUs), we're left
with a list of CPUs that Nova believes are pinned but which are no
longer actually in use. This is reflected by the resource tracker.
$ openstack server list
$ cat /opt/stack/logs/screen/n-cpu.log | grep 'Total usable vcpus' | tail -1
*snip* INFO nova.compute.resource_tracker [*snip*] Total usable vcpus: 40, total allocated vcpus: 8
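The same check can be scripted so the allocated vCPU count is easy to track across runs. This is a minimal sketch that relies only on the log message format shown above; the function name `last_vcpu_usage` is hypothetical.

```python
import re

# Matches the resource tracker message format shown in the report.
USAGE_RE = re.compile(
    r"Total usable vcpus: (\d+), total allocated vcpus: (\d+)")


def last_vcpu_usage(log_lines):
    """Return (usable, allocated) from the last matching line, or None."""
    result = None
    for line in log_lines:
        match = USAGE_RE.search(line)
        if match:
            result = (int(match.group(1)), int(match.group(2)))
    return result


line = ("*snip* INFO nova.compute.resource_tracker [*snip*] "
        "Total usable vcpus: 40, total allocated vcpus: 8")
print(last_vcpu_usage([line]))
```

To use it against a real log, pass an open file object, for example `last_vcpu_usage(open("/opt/stack/logs/screen/n-cpu.log"))`, and compare the allocated count with the vCPUs of the instances that are actually running.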
The error messages for both are given below, along with examples of
this "snowballing" CPU list:
{0} tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance [36.713046s] ... FAILED
Setting instance vm_state to ERROR
Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 2474, in do_terminate_instance
    self._delete_instance(context, instance, bdms, quotas)
  File "/opt/stack/nova/nova/hooks.py", line 149, in inner
    rv = f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2437, in _delete_instance
    quotas.rollback()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 2432, in _delete_instance
    self._update_resource_tracker(context, instance)
  File "/opt/stack/nova/nova/compute/manager.py", line 751, in _update_resource_tracker
    rt.update_usage(context, instance)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
    return f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/resource_tracker.py", line 376, in update_usage
    self._update_usage_from_instance(context, instance)
  File "/opt/stack/nova/nova/compute/resource_tracker.py", line 863, in _update_usage_from_instance
    self._update_usage(instance, sign=sign)
  File "/opt/stack/nova/nova/compute/resource_tracker.py", line 705, in _update_usage
    self.compute_node, usage, free)
  File "/opt/stack/nova/nova/virt/hardware.py", line 1441, in get_host_numa_usage_from_instance
    host_numa_topology, instance_numa_topology, free=free))
  File "/opt/stack/nova/nova/virt/hardware.py", line 1307, in numa_usage_from_instances
    newcell.unpin_cpus(pinned_cpus)
  File "/opt/stack/nova/nova/objects/numa.py", line 93, in unpin_cpus
    pinned=list(self.pinned_cpus))
CPUPinningInvalid: Cannot pin/unpin cpus [0] from the following pinned set [1]
{0} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server [29.131132s] ... ok
Traceback (most recent call last):
  File "/opt/stack/nova/nova/compute/manager.py", line 2474, in do_terminate_instance
    self._delete_instance(context, instance, bdms, quotas)
  File "/opt/stack/nova/nova/hooks.py", line 149, in inner
    rv = f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2437, in _delete_instance
    quotas.rollback()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 2432, in _delete_instance
    self._update_resource_tracker(context, instance)
  File "/opt/stack/nova/nova/compute/manager.py", line 751, in _update_resource_tracker
    rt.update_usage(context, instance)
  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
    return f(*args, **kwargs)
  File "/opt/stack/nova/nova/compute/resource_tracker.py", line 376, in update_usage
    self._update_usage_from_instance(context, instance)
  File "/opt/stack/nova/nova/compute/resource_tracker.py", line 863, in _update_usage_from_instance
    self._update_usage(instance, sign=sign)
  File "/opt/stack/nova/nova/compute/resource_tracker.py", line 705, in _update_usage
    self.compute_node, usage, free)
  File "/opt/stack/nova/nova/virt/hardware.py", line 1441, in get_host_numa_usage_from_instance
    host_numa_topology, instance_numa_topology, free=free))
  File "/opt/stack/nova/nova/virt/hardware.py", line 1307, in numa_usage_from_instances
    newcell.unpin_cpus(pinned_cpus)
  File "/opt/stack/nova/nova/objects/numa.py", line 93, in unpin_cpus
    pinned=list(self.pinned_cpus))
CPUPinningInvalid: Cannot pin/unpin cpus [1] from the following pinned set [0, 25]
The nth run (n ~= 6):
CPUPinningInvalid: Cannot pin/unpin cpus [24] from the following pinned set [0, 1, 9, 8, 25]
The (n+1)th run:
CPUPinningInvalid: Cannot pin/unpin cpus [27] from the following pinned set [0, 1, 24, 25, 8, 9]
The (n+2)th run:
CPUPinningInvalid: Cannot pin/unpin cpus [2] from the following pinned set [0, 1, 24, 25, 8, 9, 27]
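The growth of the stale pinned set across runs can be extracted from the logs mechanically. The sketch below parses the CPUPinningInvalid messages quoted above and reports, for each error, the CPUs requested and the size of the pinned set. The function name `parse_pinning_errors` is hypothetical, and the pattern tolerates the message being wrapped across lines.

```python
import re

# Whitespace between words is matched loosely so wrapped messages parse too.
PIN_RE = re.compile(
    r"Cannot pin/unpin cpus \[([^\]]*)\]\s+from\s+the\s+following"
    r"\s+pinned\s+set\s+\[([^\]]*)\]")


def _ints(csv):
    """Convert a comma-separated string such as '0, 1, 9' to a list of ints."""
    return [int(tok) for tok in csv.split(",") if tok.strip()]


def parse_pinning_errors(text):
    """Return a list of (requested_cpus, pinned_set) pairs found in text."""
    return [(_ints(req), _ints(pinned)) for req, pinned in PIN_RE.findall(text)]


log = """
CPUPinningInvalid: Cannot pin/unpin cpus [24] from the following
pinned set [0, 1, 9, 8, 25]
CPUPinningInvalid: Cannot pin/unpin cpus [27] from the following
pinned set [0, 1, 24, 25, 8, 9]
CPUPinningInvalid: Cannot pin/unpin cpus [2] from the following pinned
set [0, 1, 24, 25, 8, 9, 27]
"""
for requested, pinned in parse_pinning_errors(log):
    print(requested, len(pinned))
```

For the three messages quoted above, the pinned set grows by one CPU per run, and each newly requested CPU shows up in the pinned set of the following run.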
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1545675/+subscriptions