
yahoo-eng-team team mailing list archive

[Bug 1582278] Re: [SR-IOV][CPU Pinning] nova compute can try to boot VM with CPUs from one NUMA node and PCI device from another NUMA node.

 

As we use the "direct-release" model in Nova, we no longer use the
"Fix Committed" status for merged bug fixes. I'm setting
this manually to "Fix Released" to be consistent. [1]

[1] "[openstack-dev] [release][all] bugs will now close automatically
    when patches merge"; Doug Hellmann; 2015-12-07;
    http://lists.openstack.org/pipermail/openstack-dev/2015-December/081612.html


** Changed in: nova
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1582278

Title:
  [SR-IOV][CPU Pinning] nova compute can try to boot VM with CPUs from
  one NUMA node and PCI device from another NUMA node.

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Environment:
  Two NUMA nodes on compute host (node-0 and node-1).
  One SR-IOV PCI device associated with NUMA node-1.

  Steps to reproduce:
   1) Deploy an environment with SR-IOV and CPU pinning enabled
   2) Create a new flavor with CPU pinning:
  nova flavor-show m1.small.performance
  +----------------------------+------------------------------------------------------+
  | Property                   | Value                                                |
  +----------------------------+------------------------------------------------------+
  | OS-FLV-DISABLED:disabled   | False                                                |
  | OS-FLV-EXT-DATA:ephemeral  | 0                                                    |
  | disk                       | 20                                                   |
  | extra_specs                | {"hw:cpu_policy": "dedicated", "hw:numa_nodes": "1"} |
  | id                         | 7b0e5ee0-0bf7-4a46-9653-9279a947c650                 |
  | name                       | m1.small.performance                                 |
  | os-flavor-access:is_public | True                                                 |
  | ram                        | 2048                                                 |
  | rxtx_factor                | 1.0                                                  |
  | swap                       |                                                      |
  | vcpus                      | 1                                                    |
  +----------------------------+------------------------------------------------------+
   3) Download an Ubuntu image
   4) Create an SR-IOV port and boot a VM on this port with the m1.small.performance flavor:
  NODE_1='node-4.test.domain.local'
  NODE_2='node-5.test.domain.local'
  NET_ID_1=$(neutron net-list | grep net_EW_2 | awk '{print$2}')
  neutron port-create $NET_ID_1 --binding:vnic-type direct --device_owner nova-compute --name sriov_23
  port_id=$(neutron port-list | grep 'sriov_23' | awk '{print$2}')
  nova boot vm23 --flavor m1.small.performance --image ubuntu_image --availability-zone nova:$NODE_1 --nic port-id=$port_id --key-name vm_key

  Expected result:
   The VM is in the ACTIVE state
  Actual result:
   In most cases the state is ERROR, with the following logs:

  2016-05-13 08:25:56.598 29097 ERROR nova.pci.stats [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] Failed to allocate PCI devices for instance. Unassigning devices back to pools. This should not happen, since the scheduler should have accurate information, and allocation during claims is controlled via a hold on the compute node semaphore
  2016-05-13 08:25:57.502 29097 INFO nova.virt.libvirt.driver [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Creating image
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] Instance failed network setup after 1 attempt(s)
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager Traceback (most recent call last):
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1570, in _allocate_network_async
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager     bind_host_id=bind_host_id)
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 666, in allocate_for_instance
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager     self._delete_ports(neutron, instance, created_port_ids)
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager     self.force_reraise()
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager     six.reraise(self.type_, self.value, self.tb)
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 645, in allocate_for_instance
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager     bind_host_id=bind_host_id)
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 738, in _populate_neutron_extension_values
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager     port_req_body)
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 709, in _populate_neutron_binding_profile
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager     instance, pci_request_id).pop()
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager IndexError: pop from empty list
  2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Instance failed to spawn
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Traceback (most recent call last):
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2218, in _build_resources
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     yield resources
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2064, in _build_and_run_instance
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     block_device_info=block_device_info)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2761, in spawn
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     admin_pass=admin_password)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3287, in _create_image
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     instance, network_info, admin_pass, files, suffix)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3066, in _inject_data
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     network_info, libvirt_virt_type=CONF.libvirt.virt_type)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/virt/netutils.py", line 78, in get_injected_network_template
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     if not (network_info and template):
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 517, in __len__
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     return self._sync_wrapper(fn, *args, **kwargs)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 504, in _sync_wrapper
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     self.wait()
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 536, in wait
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     self[:] = self._gt.wait()
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 175, in wait
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     return self._exit_event.wait()
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 125, in wait
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     current.throw(*self._exc)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     result = function(*args, **kwargs)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 1145, in context_wrapper
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     return func(*args, **kwargs)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1587, in _allocate_network_async
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     six.reraise(*exc_info)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1570, in _allocate_network_async
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     bind_host_id=bind_host_id)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 666, in allocate_for_instance
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     self._delete_ports(neutron, instance, created_port_ids)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     self.force_reraise()
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     six.reraise(self.type_, self.value, self.tb)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 645, in allocate_for_instance
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     bind_host_id=bind_host_id)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 738, in _populate_neutron_extension_values
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     port_req_body)
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]   File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 709, in _populate_neutron_binding_profile
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]     instance, pci_request_id).pop()
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] IndexError: pop from empty list
  2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]
  2016-05-13 08:25:57.939 29097 INFO nova.compute.manager [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Terminating instance

  The problem is in nova/compute/resource_tracker.py, in the
  instance_claim() method:

  claim = claims.Claim(context, instance_ref, self, self.compute_node,
                       overhead=overhead, limits=limits)
  if self.pci_tracker:
      self.pci_tracker.claim_instance(context, instance_ref)

  instance_ref.numa_topology = claim.claimed_numa_topology
  self._set_instance_host_and_node(instance_ref)

  1) Here Nova creates a claim with the correct NUMA node for CPU pinning and PCI devices (node-1 in our case).
  2) Nova calls pci_tracker.claim_instance() with instance_ref, but instance_ref does not yet contain information about the required NUMA node, so claim_instance() chooses node-0. The requested PCI devices then cannot be associated with the instance, because they belong to node-1.
  3) Nova sets the correct NUMA topology (node-1, from step 1) on instance_ref.
  4) The result is an instance without PCI devices.
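
The ordering problem described in steps 1-4 can be illustrated with a small stand-alone model. This is only a sketch and not Nova code: the classes PciDevice, Instance and PciTracker, the set_topology_first flag, the PCI address, and the fallback to node 0 when the instance carries no NUMA information are all assumptions made for illustration, based on the description above. It does not represent the fix that was merged.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PciDevice:
    address: str            # hypothetical PCI address, for illustration only
    numa_node: int          # NUMA node the device is attached to
    owner: Optional[str] = None


@dataclass
class Instance:
    uuid: str
    # NUMA node chosen for the instance; None until the claim result is copied over.
    numa_node: Optional[int] = None
    pci_devices: List[PciDevice] = field(default_factory=list)


class PciTracker:
    """Toy PCI tracker: hands out a free device on the instance's NUMA node."""

    def __init__(self, devices):
        self.devices = list(devices)

    def claim_instance(self, instance, default_node=0):
        # Assumption: with no NUMA information on the instance, fall back to node 0.
        node = instance.numa_node if instance.numa_node is not None else default_node
        for dev in self.devices:
            if dev.owner is None and dev.numa_node == node:
                dev.owner = instance.uuid
                instance.pci_devices.append(dev)
                return True
        return False


def instance_claim(instance, tracker, claimed_node, set_topology_first):
    """Model of the claim sequence; claimed_node plays the role of step 1."""
    if set_topology_first:
        # Alternative ordering: record the NUMA node before claiming PCI devices.
        instance.numa_node = claimed_node
    tracker.claim_instance(instance)      # step 2
    instance.numa_node = claimed_node     # step 3
    return instance


if __name__ == "__main__":
    for first in (False, True):
        tracker = PciTracker([PciDevice("0000:81:00.1", numa_node=1)])
        vm = instance_claim(Instance("vm23"), tracker, claimed_node=1,
                            set_topology_first=first)
        print(first, vm.numa_node, [d.address for d in vm.pci_devices])
```

With set_topology_first=False the model ends up like step 4: the instance is on node 1 but holds no PCI device. With set_topology_first=True the device on node 1 is claimed.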

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1582278/+subscriptions

