[Bug 1545675] Re: Resizing a pinned VM results in inconsistent state

Reviewed:  https://review.openstack.org/281483
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c7a6673fd5621d1c121c20376634ec49644fae59
Submitter: Jenkins
Branch:    master

commit c7a6673fd5621d1c121c20376634ec49644fae59
Author: Nikola Dipanov <ndipanov@xxxxxxxxxx>
Date:   Wed Feb 17 19:27:36 2016 +0000

    RT: aborting claims clears instance host and NUMA info
    
    When the claim is aborted, this information is no longer correct for the
    instance, so we clear it to avoid inconsistencies.
    
    Change-Id: I83a5f06adb22c21392d5fc867728181ea4b0454d
    Resolves-bug: 1545675
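
In other words, the fix makes the resource tracker's claim-abort path
wipe the instance's host/node assignment and NUMA topology, so that a
later usage update cannot act on stale pinning data. A minimal sketch
of that shape (illustrative names and helpers, not the verbatim Nova
change):

    # Sketch only: the tracker/instance helpers used here are assumed,
    # not the exact Nova implementation.
    def abort_instance_claim(tracker, context, instance):
        # Remove the aborted claim's usage from the tracker's totals.
        tracker.update_usage_from_instance(context, instance, sign=-1)

        # The instance never landed on this host, so the host/node and
        # NUMA pinning data recorded during the claim is now wrong;
        # clear it so a later usage update cannot try to unpin CPUs
        # that were never (or are no longer) pinned for this instance.
        instance.numa_topology = None
        instance.host = None
        instance.node = None
        instance.save()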


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1545675

Title:
  Resizing a pinned VM results in inconsistent state

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  It appears that executing certain resize operations on a pinned
  instance results in inconsistencies in the "state machine" that Nova
  uses to track instances. This was identified using Tempest and
  manifests itself as failures in follow-up shelve/unshelve operations.

  ---

  # Steps

  Testing was conducted on a host running a single-node, Fedora
  23-based (kernel 4.3.5-300.fc23.x86_64) OpenStack deployment (built
  with DevStack). Nova was at commit '12d224e', and the Tempest tests
  (commit 'e913b82') were run using modified flavors, as seen below:

      nova flavor-create m1.small_nfv 420 2048 0 2
      nova flavor-create m1.medium_nfv 840 4096 0 4
      nova flavor-key 420 set "hw:numa_nodes=2"
      nova flavor-key 840 set "hw:numa_nodes=2"
      nova flavor-key 420 set "hw:cpu_policy=dedicated"
      nova flavor-key 840 set "hw:cpu_policy=dedicated"

      cd $TEMPEST_DIR
      cp etc/tempest.conf etc/tempest.conf.orig
      sed -i "s/flavor_ref = .*/flavor_ref = 420/" etc/tempest.conf
      sed -i "s/flavor_ref_alt = .*/flavor_ref_alt = 840/" etc/tempest.conf

  Tests were run in the order given below.

  1. tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
  2. tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server
  3. tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_revert
  4. tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
  5. tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server

  Like so:

      ./run_tempest.sh -- tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance

  # Expected Result

  The tests should pass.

  # Actual Result

      +---+--------------------------------------+--------+
      | # |                 test id              | status |
      +---+--------------------------------------+--------+
      | 1 | 1164e700-0af0-4a4c-8792-35909a88743c |   ok   |
      | 2 | 77eba8e0-036e-4635-944b-f7a8f3b78dc9 |   ok   |
      | 3 | c03aab19-adb1-44f5-917d-c419577e9e68 |   ok   |
      | 4 | 1164e700-0af0-4a4c-8792-35909a88743c |  FAIL  |
      | 5 | c03aab19-adb1-44f5-917d-c419577e9e68 |   ok*  |
      +---+--------------------------------------+--------+

  * this test reports as passing but is actually generating errors. Bad
  test! :)

  One test fails while the other "passes" but raises errors. The
  failures, where raised, are CPUPinningInvalid exceptions:

      CPUPinningInvalid: Cannot pin/unpin cpus [1] from the following pinned set [0, 25]
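
  The exception boils down to a simple subset check: CPUs can only be
  unpinned if the NUMA cell currently believes they are pinned. A
  self-contained sketch of that check (hypothetical standalone code
  that mirrors the behaviour seen in the tracebacks below, not the
  verbatim nova/objects/numa.py):

      class CPUPinningInvalid(Exception):
          pass

      def unpin_cpus(pinned_cpus, cpus_to_unpin):
          # Every CPU being unpinned must currently be in the pinned
          # set; otherwise the tracker's view and the instance's view
          # of the host have diverged.
          if not cpus_to_unpin <= pinned_cpus:
              raise CPUPinningInvalid(
                  'Cannot pin/unpin cpus %s from the following pinned set %s'
                  % (sorted(cpus_to_unpin), sorted(pinned_cpus)))
          return pinned_cpus - cpus_to_unpin

      # The failing case from this bug: the tracker believes {0, 25}
      # are pinned, but the deleted instance still claims CPU 1.
      unpin_cpus({0, 25}, {1})  # raises CPUPinningInvalid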

  **NOTE:** I also think there are issues with the non-reverted resize
  test, though I've yet to investigate this:

  * tempest.scenario.test_server_advanced_ops.TestServerAdvancedOps.test_resize_server_confirm

  What's worse, this error "snowballs" on successive runs. Because of
  the nature of the failure (a failure to pin/unpin CPUs), we're left
  with a list of CPUs that Nova thinks are pinned but which are no
  longer actually in use. This is reflected in the resource tracker's
  accounting:

      $ openstack server list

      $ cat /opt/stack/logs/screen/n-cpu.log | grep 'Total usable vcpus' | tail -1
      *snip* INFO nova.compute.resource_tracker [*snip*] Total usable vcpus: 40, total allocated vcpus: 8

  The error messages for both are given below, along with examples of
  this "snowballing" CPU list:

  {0} tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance [36.713046s] ... FAILED

   Setting instance vm_state to ERROR
   Traceback (most recent call last):
     File "/opt/stack/nova/nova/compute/manager.py", line 2474, in do_terminate_instance
       self._delete_instance(context, instance, bdms, quotas)
     File "/opt/stack/nova/nova/hooks.py", line 149, in inner
       rv = f(*args, **kwargs)
     File "/opt/stack/nova/nova/compute/manager.py", line 2437, in _delete_instance
       quotas.rollback()
     File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
       self.force_reraise()
     File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
       six.reraise(self.type_, self.value, self.tb)
     File "/opt/stack/nova/nova/compute/manager.py", line 2432, in _delete_instance
       self._update_resource_tracker(context, instance)
     File "/opt/stack/nova/nova/compute/manager.py", line 751, in _update_resource_tracker
       rt.update_usage(context, instance)
     File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
       return f(*args, **kwargs)
     File "/opt/stack/nova/nova/compute/resource_tracker.py", line 376, in update_usage
       self._update_usage_from_instance(context, instance)
     File "/opt/stack/nova/nova/compute/resource_tracker.py", line 863, in _update_usage_from_instance
       self._update_usage(instance, sign=sign)
     File "/opt/stack/nova/nova/compute/resource_tracker.py", line 705, in _update_usage
       self.compute_node, usage, free)
     File "/opt/stack/nova/nova/virt/hardware.py", line 1441, in get_host_numa_usage_from_instance
       host_numa_topology, instance_numa_topology, free=free))
     File "/opt/stack/nova/nova/virt/hardware.py", line 1307, in numa_usage_from_instances
       newcell.unpin_cpus(pinned_cpus)
     File "/opt/stack/nova/nova/objects/numa.py", line 93, in unpin_cpus
       pinned=list(self.pinned_cpus))
   CPUPinningInvalid: Cannot pin/unpin cpus [0] from the following pinned set [1]

  {0} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_shelve_unshelve_server [29.131132s] ... ok

   Traceback (most recent call last):
     File "/opt/stack/nova/nova/compute/manager.py", line 2474, in do_terminate_instance
       self._delete_instance(context, instance, bdms, quotas)
     File "/opt/stack/nova/nova/hooks.py", line 149, in inner
       rv = f(*args, **kwargs)
     File "/opt/stack/nova/nova/compute/manager.py", line 2437, in _delete_instance
       quotas.rollback()
     File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
       self.force_reraise()
     File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
       six.reraise(self.type_, self.value, self.tb)
     File "/opt/stack/nova/nova/compute/manager.py", line 2432, in _delete_instance
       self._update_resource_tracker(context, instance)
     File "/opt/stack/nova/nova/compute/manager.py", line 751, in _update_resource_tracker
       rt.update_usage(context, instance)
     File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
       return f(*args, **kwargs)
     File "/opt/stack/nova/nova/compute/resource_tracker.py", line 376, in update_usage
       self._update_usage_from_instance(context, instance)
     File "/opt/stack/nova/nova/compute/resource_tracker.py", line 863, in _update_usage_from_instance
       self._update_usage(instance, sign=sign)
     File "/opt/stack/nova/nova/compute/resource_tracker.py", line 705, in _update_usage
       self.compute_node, usage, free)
     File "/opt/stack/nova/nova/virt/hardware.py", line 1441, in get_host_numa_usage_from_instance
       host_numa_topology, instance_numa_topology, free=free))
     File "/opt/stack/nova/nova/virt/hardware.py", line 1307, in numa_usage_from_instances
       newcell.unpin_cpus(pinned_cpus)
     File "/opt/stack/nova/nova/objects/numa.py", line 93, in unpin_cpus
       pinned=list(self.pinned_cpus))
   CPUPinningInvalid: Cannot pin/unpin cpus [1] from the following pinned set [0, 25]

  The nth run (n ~= 6):

      CPUPinningInvalid: Cannot pin/unpin cpus [24] from the following pinned set [0, 1, 9, 8, 25]

  The nth+1 run:

      CPUPinningInvalid: Cannot pin/unpin cpus [27] from the following pinned set [0, 1, 24, 25, 8, 9]

  The nth+2 run:

      CPUPinningInvalid: Cannot pin/unpin cpus [2] from the following pinned set [0, 1, 24, 25, 8, 9, 27]
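
  The growth pattern is exactly what you would expect if each run pins
  a fresh set of CPUs but the failed unpin at teardown never releases
  them. A toy illustration (hypothetical CPU numbers, not taken from
  the runs above):

      pinned = set()
      for run_cpus in ({0, 1}, {24, 25}, {8, 9}):
          pinned |= run_cpus  # the new instance's claim pins its CPUs
          # ... resize/shelve leaves the instance's NUMA topology
          # pointing at different CPUs, so the unpin at delete raises
          # CPUPinningInvalid and nothing is ever removed ...
      print(sorted(pinned))  # grows every run: [0, 1, 8, 9, 24, 25]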

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1545675/+subscriptions

