yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #59337
[Bug 1582278] Re: [SR-IOV][CPU Pinning] nova compute can try to boot VM with CPUs from one NUMA node and PCI device from another NUMA node.
Marking as fix released for xenial/mitaka since 13.1.2 is now in xenial-
updates and mitaka-updates.
** Also affects: cloud-archive
Importance: Undecided
Status: New
** Also affects: cloud-archive/mitaka
Importance: Undecided
Status: New
** Changed in: cloud-archive
Status: New => Invalid
** Changed in: nova (Ubuntu Xenial)
Status: Triaged => Fix Released
** Changed in: cloud-archive/mitaka
Status: New => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1582278
Title:
[SR-IOV][CPU Pinning] nova compute can try to boot VM with CPUs from
one NUMA node and PCI device from another NUMA node.
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive mitaka series:
Fix Released
Status in OpenStack Compute (nova):
Fix Released
Status in nova package in Ubuntu:
Fix Released
Status in nova source package in Xenial:
Fix Released
Status in nova source package in Yakkety:
Fix Released
Bug description:
Environment:
Two NUMA nodes on compute host (node-0 and node-1).
One SR-IOV PCI device associated with NUMA node-1.
Steps to reproduce:
Steps to reproduce:
1) Deploy env with SR-IOV and CPU pinning enable
2) Create new flavor with cpu pinning:
nova flavor-show m1.small.performance
+----------------------------+-------------------------------------------------------------------------------------------------------+
| Property | Value |
+----------------------------+-------------------------------------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled | False |
| OS-FLV-EXT-DATA:ephemeral | 0 |
| disk | 20 |
| extra_specs | {"hw:cpu_policy": "dedicated", "hw:numa_nodes": "1"} |
| id | 7b0e5ee0-0bf7-4a46-9653-9279a947c650 |
| name | m1.small.performance |
| os-flavor-access:is_public | True |
| ram | 2048 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 1 |
+----------------------------+--------------------------------------------------------------------------------
3) download ubuntu image
4) create sr-iov port and boot vm on this port with m1.small.performance flavor:
NODE_1='node-4.test.domain.local'
NODE_2='node-5.test.domain.local'
NET_ID_1=$(neutron net-list | grep net_EW_2 | awk '{print$2}')
neutron port-create $NET_ID_1 --binding:vnic-type direct --device_owner nova-compute --name sriov_23
port_id=$(neutron port-list | grep 'sriov_23' | awk '{print$2}')
nova boot vm23 --flavor m1.small.performance --image ubuntu_image --availability-zone nova:$NODE_1 --nic port-id=$port_id --key-name vm_key
Expected results:
VM is an ACTIVE state
Actual result:
In most cases the state is ERROR with following logs:
2016-05-13 08:25:56.598 29097 ERROR nova.pci.stats [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] Failed to allocate PCI devices for instance. Unassigning devices back to pools. This should not happen, since the scheduler should have accurate information, and allocation during claims is controlled via a hold on the compute node semaphore
2016-05-13 08:25:57.502 29097 INFO nova.virt.libvirt.driver [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Creating image
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] Instance failed network setup after 1 attempt(s)
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager Traceback (most recent call last):
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1570, in _allocate_network_async
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager bind_host_id=bind_host_id)
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 666, in allocate_for_instance
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager self._delete_ports(neutron, instance, created_port_ids)
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager self.force_reraise()
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager six.reraise(self.type_, self.value, self.tb)
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 645, in allocate_for_instance
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager bind_host_id=bind_host_id)
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 738, in _populate_neutron_extension_values
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager port_req_body)
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 709, in _populate_neutron_binding_profile
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager instance, pci_request_id).pop()
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager IndexError: pop from empty list
2016-05-13 08:25:57.664 29097 ERROR nova.compute.manager
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Instance failed to spawn
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Traceback (most recent call last):
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2218, in _build_resources
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] yield resources
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2064, in _build_and_run_instance
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] block_device_info=block_device_info)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 2761, in spawn
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] admin_pass=admin_password)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3287, in _create_image
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] instance, network_info, admin_pass, files, suffix)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 3066, in _inject_data
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] network_info, libvirt_virt_type=CONF.libvirt.virt_type)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/virt/netutils.py", line 78, in get_injected_network_template
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] if not (network_info and template):
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 517, in __len__
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] return self._sync_wrapper(fn, *args, **kwargs)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 504, in _sync_wrapper
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] self.wait()
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/network/model.py", line 536, in wait
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] self[:] = self._gt.wait()
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 175, in wait
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] return self._exit_event.wait()
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 125, in wait
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] current.throw(*self._exc)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] result = function(*args, **kwargs)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 1145, in context_wrapper
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] return func(*args, **kwargs)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1587, in _allocate_network_async
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] six.reraise(*exc_info)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 1570, in _allocate_network_async
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] bind_host_id=bind_host_id)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 666, in allocate_for_instance
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] self._delete_ports(neutron, instance, created_port_ids)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] self.force_reraise()
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] six.reraise(self.type_, self.value, self.tb)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 645, in allocate_for_instance
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] bind_host_id=bind_host_id)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 738, in _populate_neutron_extension_values
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] port_req_body)
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] File "/usr/lib/python2.7/dist-packages/nova/network/neutronv2/api.py", line 709, in _populate_neutron_binding_profile
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] instance, pci_request_id).pop()
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] IndexError: pop from empty list
2016-05-13 08:25:57.937 29097 ERROR nova.compute.manager [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566]
2016-05-13 08:25:57.939 29097 INFO nova.compute.manager [req-26138c0b-fa55-4ff8-8f3a-aad980e3c815 d864c4308b104454b7b46fb652f4f377 9322dead0b5d440986b12596d9cbff5b - - -] [instance: 4e691469-893d-4b24-a0a8-00bbee0fa566] Terminating instance
The problem is in nova/compute/resource_tracker.py. In method
instance_claim():
claim = claims.Claim(context, instance_ref, self, self.compute_node,
overhead=overhead, limits=limits)
if self.pci_tracker:
self.pci_tracker.claim_instance(context, instance_ref)
instance_ref.numa_topology = claim.claimed_numa_topology
self._set_instance_host_and_node(instance_ref)
1) here nova create a claim with correct NUMA node with CPU pinning and PCI devices (in our case it is node-1)
2) nova call pci_tracker.claim_instance() with instance_ref BUT instance_ref does not contain information about needed NUMA node. That is why in claim_instance we choose node-0. In this case we can't associate requested PCI devices with the instance because these devices are associated with node-1.
3) nova put to the instance_ref correct numa node-1 from step 1.
4) we got an instance without PCI devices.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1582278/+subscriptions
References