← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1784484] [NEW] intermittent issue getting assigned MACs for SRIOV nics, causes nova timeout

 

Public bug reported:

We're seeing an intermittent issue with the Pike release where a VM with SRIOV NIC gets stuck in the "rebuild_spawning" task state after a "nova evacuate" operation (which rebuilds the instance on a different compute
node).

The specific issue we're seeing is that occasionally neutron seems to get "stuck".  Nova is down in nova.virt.libvirt.driver.LibvirtDriver._create_domain_and_network(), and five minutes later
it hits the eventlet.timeout.Timeout exception clause at https://github.com/openstack/nova/blob/stable/pike/nova/virt/libvirt/driver.py#L5451


We think that maybe there's a race condition between nova/libvirt setting the MAC for the VF and the VF actually being usable/visible to the system.    Closest mention of this upstream is this:

https://www.spinics.net/lists/kvm/msg165297.html

To test this we modified get_assigned_macs()  at https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/mech_sriov/agent/pci_lib.py#L75 to wait 2 seconds if vf_lines
is empty,then call _get_vf_link_show() and get vf_lines again (max 3 tries).
After a few instance evacuates, the retry code was exercised after message "Cannot find vfs" was produced, and 1 retry was executed, and this time vf_lines had data in it.


Details:

The initial nova log looks like this:
2018-03-17 10:19:41.344 58584 WARNING nova.compute.manager [req-5283f1b9-23d3-472c-80df-4b8d73f8523b 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-0ac89834-0220-48ef-8559-e48f65fe4efb for instance
2018-03-17 10:19:42.770 58584 INFO nova.network.neutronv2.api [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Updating port 0ac89834-0220-48ef-8559-e48f65fe4efb with attributes {'binding:profile': {}, 'binding:host_id': 'compute-1'}
2018-03-17 10:19:44.032 58584 WARNING nova.compute.manager [req-a1f9f8fb-cdec-40c1-be2c-0b052305e331 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance
2018-03-17 10:19:44.048 58584 INFO nova.network.neutronv2.api [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Updating port 59ce6131-07ea-4b6c-ac62-9c27e99149b8 with attributes {'binding:profile': {u'physical_network': u'group0-data0b', u'pci_request_id': u'b6ccb9b3-1551-4be7-812f-6108786b2b0a', u'pci_slot': u'0000:81:0f.7', u'pci_vendor_info': u'8086:154c'}, 'binding:host_id': 'compute-1'}
2018-03-17 10:19:44.438 58584 WARNING nova.compute.manager [req-7478c956-46cd-4651-9ea7-e477f2a99747 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-plugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance
2018-03-17 10:19:45.475 58584 INFO nova.compute.manager [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Detaching volume d4cb0ea9-a689-4203-8e38-40f16a448915
2018-03-17 10:19:46.091 58584 WARNING nova.compute.manager [req-b1d1c529-d32f-4d99-8b45-c2bea4ada9b3 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance


Then the neutron log shows "Trigger reload_allocations" a few times, and the last one looks like:

2018-03-17 10:19:45.032 56804 INFO neutron.agent.dhcp.agent [req-5240cf46-0dff-4344-9d57-9e8f133559b0 1188dcf2254945e493726a4245e1a676 4a7e8af978b043c1a0a26eafd52e0301 - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=compute-1, binding:profile=pci_request_id=b6ccb9b3-1551-4be7-812f-6108786b2b0a, pci_slot=0000:81:0f.7, pci_vendor_info=8086:154c, physical_network=group0-data0b, binding:vif_details=port_filter=False, vlan=2389, binding:vif_type=hw_veb, binding:vnic_type=direct, created_at=2018-03-17T10:18:29Z, description=, device_id=983e3130-cb8f-4715-9e22-4d373fe4bb77, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'ip_address': u'10.1.1.24'}], id=59ce6131-07ea-4b6c-ac62-9c27e99149b8, mac_address=fa:16:3e:8a:c2:d2, name=, network_id=a201980f-83e9-4fdd-9933-2fdfd58b8089, project_id=37fad54aace64d13b8d03bfa83563d59, revision_number=11, security_groups=[], status=DOWN, tags=[], tenant_id=37fad54aace64d13b8d03bfa83563d59, updated_at=2018-03-17T10:19:44Z, wrs-binding:mac_filtering=False, wrs-binding:mtu=9216, wrs-binding:vif_model=pci-sriov
2018-03-17 10:19:53.860 57003 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-4b2db1f7-06a8-42c5-9dab-1bbf7a937da3 - - - - -] Port fa:16:3e:8a:c2:d2 updated. Details: {u'allowed_address_pairs': [], u'device_owner': u'compute:nova', u'port_security_enabled': False, u'mtu': 9216, u'port_id': u'59ce6131-07ea-4b6c-ac62-9c27e99149b8', u'network': {u'provider:physical_network': u'group0-data0b', u'updated_at': u'2018-03-16T20:58:50Z', u'revision_number': 2, u'mtu': 9216, u'id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'description': u'', u'router:external': False, u'availability_zone_hints': [], u'availability_zones': [u'nova'], u'ipv4_address_scope': None, u'shared': True, u'project_id': u'78e01109154c49aab9960640407d96af', u'status': u'ACTIVE', u'subnets': [u'a7062041-37ab-468d-a1eb-f33d3b3621ee'], u'wrs-tm:qos': u'b1ec2429-3158-4716-8c5e-1848aac9c5fb', u'tags': [], u'ipv6_address_scope': None, u'provider:segmentation_id': 2389, u'name': u'internal0-net0-1', u'admin_state_up': True, u'tenant_id': u'78e01109154c49aab9960640407d96af', u'created_at': u'2018-03-16T20:58:36Z', u'provider:network_type': u'vlan', u'vlan_transparent': True}, u'fixed_ips': [{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'ip_address': u'10.1.1.24'}], u'physical_network': u'group0-data0b', u'trunk_port': False, u'mac_address': u'fa:16:3e:8a:c2:d2', u'profile': {u'pci_request_id': u'b6ccb9b3-1551-4be7-812f-6108786b2b0a', u'pci_slot': u'0000:81:0f.7', u'physical_network': u'group0-data0b', u'pci_vendor_info': u'8086:154c'}, u'subnets': [{u'subnet': {u'ipv6_ra_mode': None, u'allocation_pools': [{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'last_ip': u'10.1.1.254', u'id': u'55b74d0a-0e02-4026-8964-5e254d8d2566', u'first_ip': u'10.1.1.17'}], u'standard_attr_id': 88, u'ipv6_address_mode': None, u'id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'vlan_id': 0, u'dns_nameservers': [], u'network_standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-03-16T20:58:36.000000', u'updated_at': u'2018-03-16T20:58:50.000000', u'revision_number': 2, u'id': 49, u'resource_type': u'networks'}, u'standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-03-16T20:58:50.000000', u'updated_at': u'2018-03-16T20:58:50.000000', u'revision_number': 0, u'id': 88, u'resource_type': u'subnets'}, u'gateway_ip': None, u'project_id': u'78e01109154c49aab9960640407d96af', u'subnetpool': None, u'routes': [], u'cidr': u'10.1.1.0/24', u'subnetpool_id': None, u'service_types': [], u'name': u'internal0-subnet0-1', u'enable_dhcp': True, u'segment_id': None, u'network_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'ip_version': 4, u'rbac_entries': [{u'action': u'access_as_shared', u'project_id': u'78e01109154c49aab9960640407d96af', u'id': u'66bc32fc-99ca-468b-a1de-d55c5c79b3bc', u'target_tenant': u'*', u'object_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089'}]}, u'physical_network': u'group0-data0b', u'network_type': u'vlan', u'segmentation_id': 2389}], u'network_qos_policy_id': None, u'segmentation_id': 2389, u'port_qos_policy': None, u'device': u'fa:16:3e:8a:c2:d2', u'qos_policy_id': None, u'admin_state_up': True, u'network_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'tenant_id': u'37fad54aace64d13b8d03bfa83563d59', u'network_qos_policy': {u'scheduler': {u'weight': u'4'}}, u'network_type': u'vlan'}


Then the nova log shows libvirt timing out after 5 minutes:

2018-03-17 10:24:54.058 58584 WARNING nova.virt.libvirt.driver [req-
f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4
78e01109154c49aab9960640407d96af - default default] [instance:
983e3130-cb8f-4715-9e22-4d373fe4bb77] Timeout waiting for vif plugging
callback for instance with vm_state error and task_state
rebuild_spawning.


Repeating the experiment with debug logs enabled for neutron gave the
following:

2018-04-23 18:14:19.561 73750 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Agent loop found changes! {'current': set([('00:00:00:00:00:00', '0000:81:0f.5'), ('00:00:00:00:00:00', '0000:81:0f.4'), ('00:00:00:00:00:00', '0000:81:0e.6'), ('00:00:00:00:00:00', '0000:81:0e.7'), ('00:00:00:00:00:00', '0000:81:0f.6'), ('00:00:00:00:00:00', '0000:81:0e.4'), ('00:00:00:00:00:00', '0000:81:0e.5'), ('00:00:00:00:00:00', '0000:81:0e.2'), ('00:00:00:00:00:00', '0000:81:0e.3'), ('00:00:00:00:00:00', '0000:81:0f.1'), ('fa:16:3e:d6:01:a7', '0000:81:0f.7'), ('00:00:00:00:00:00', '0000:81:0e.0'), ('00:00:00:00:00:00', '0000:81:0f.0'), ('00:00:00:00:00:00', '0000:81:0e.1'), ('00:00:00:00:00:00', '0000:81:0f.3'), ('00:00:00:00:00:00', '0000:81:0f.2')]), 'removed': set([('00:00:00:00:00:00', '0000:81:0f.7')]), 'added': set([('fa:16:3e:d6:01:a7', '0000:81:0f.7')]), 'updated': set([])} daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:388
2018-04-23 18:14:19.562 73750 INFO neutron.agent.securitygroups_rpc [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Skipping method prepare_devices_filter as firewall is disabled or configured as NoopFirewallDriver.
2018-04-23 18:14:19.828 73750 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port with MAC address fa:16:3e:d6:01:a7 is added treat_devices_added_updated /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:294 
2018-04-23 18:14:19.829 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port fa:16:3e:d6:01:a7 updated. Details: {u'allowed_address_pairs': [], u'device_owner': u'compute:nova', u'port_security_enabled': False, u'mtu': 9216, u'port_id': u'f90e58b4-bbc7-4eb2-991b-46406d3d6470', u'network': {u'provider:physical_network': u'group0-data0b', u'updated_at': u'2018-04-23T16:48:51Z', u'revision_number': 2, u'mtu': 9216, u'id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'description': u'', u'router:external': False, u'availability_zone_hints': [], u'availability_zones': [u'nova'], u'ipv4_address_scope': None, u'shared': True, u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'status': u'ACTIVE', u'subnets': [u'f452f5c3-a686-45cd-8098-0d092bd5e21e'], u'wrs-tm:qos': u'7d2f03f3-b9c6-4183-9859-878bbb8141e7', u'tags': [], u'ipv6_address_scope': None, u'provider:segmentation_id': 2285, u'name': u'internal0-net0', u'admin_state_up': True, u'tenant_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'created_at': u'2018-04-23T16:48:46Z', u'provider:network_type': u'vlan', u'vlan_transparent': True}, u'fixed_ips': [{u'subnet_id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'ip_address': u'10.0.0.132'}], u'physical_network': u'group0-data0b', u'trunk_port': False, u'mac_address': u'fa:16:3e:d6:01:a7', u'profile': {u'pci_request_id': u'89135f77-a4df-4744-b228-d396b7c0e876', u'pci_slot': u'0000:81:0f.7', u'physical_network': u'group0-data0b', u'pci_vendor_info': u'8086:154c'}, u'subnets': [{u'subnet': {u'ipv6_ra_mode': None, u'allocation_pools': [{u'subnet_id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'last_ip': u'10.0.0.254', u'id': u'86242517-d9dc-4c59-a3d6-17881ee0d42c', u'first_ip': u'10.0.0.128'}], u'standard_attr_id': 11, u'ipv6_address_mode': None, u'id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'vlan_id': 0, u'dns_nameservers': [], u'network_standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-04-23T16:48:46.000000', u'updated_at': u'2018-04-23T16:48:51.000000', u'revision_number': 2, u'id': 9, u'resource_type': u'networks'}, u'standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-04-23T16:48:51.000000', u'updated_at': u'2018-04-23T16:48:51.000000', u'revision_number': 0, u'id': 11, u'resource_type': u'subnets'}, u'gateway_ip': None, u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'subnetpool': None, u'routes': [], u'cidr': u'10.0.0.0/24', u'subnetpool_id': None, u'service_types': [], u'name': u'internal0-subnet0', u'enable_dhcp': True, u'segment_id': None, u'network_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'ip_version': 4, u'rbac_entries': [{u'action': u'access_as_shared', u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'id': u'9c699b99-a98c-48ce-bdee-19e6e8483429', u'target_tenant': u'*', u'object_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8'}]}, u'physical_network': u'group0-data0b', u'network_type': u'vlan', u'segmentation_id': 2285}], u'network_qos_policy_id': None, u'segmentation_id': 2285, u'port_qos_policy': None, u'device': u'fa:16:3e:d6:01:a7', u'qos_policy_id': None, u'admin_state_up': True, u'network_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'tenant_id': u'34a91443256b4db1aa2ba9d305079d41', u'network_qos_policy': {u'scheduler': {u'weight': u'4'}}, u'network_type': u'vlan'} 
2018-04-23 18:14:19.830 73750 DEBUG neutron.agent.linux.utils [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Running command: ['sudo', 'ip', 'link', 'show'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92 
2018-04-23 18:14:19.850 73750 DEBUG neutron.agent.linux.utils [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Running command: ['sudo', 'ip', 'link', 'show', 'ens787f3'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92 
2018-04-23 18:14:19.872 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Cannot find vfs [15] in device ens787f3 
2018-04-23 18:14:19.873 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.eswitch_manager [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] device pci mismatch: fa:16:3e:d6:01:a7 - 0000:81:0f.7 
2018-04-23 18:14:19.874 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] No device with MAC fa:16:3e:d6:01:a7 defined on agent. 
2018-04-23 18:14:19.880 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Removing device with MAC address 00:00:00:00:00:00 and PCI slot 0000:81:0f.7 
2018-04-23 18:14:19.881 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] port_id to device with MAC 00:00:00:00:00:00 not found 
2018-04-23 18:14:19.964 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port with MAC 00:00:00:00:00:00 and PCI slot 0000:81:0f.7 updated.


It seems like the VF's MAC address was programmed okay, but we don't see
the VF being removed even though there's a "Cannot find vfs" log.


There's a host kernel log that makes us wonder if there's some sort of delay on this hardware:
2018-04-23T18:14:19.377 compute-1 kernel: info [  418.459004] i40e 0000:81:00.3: Setting MAC fa:16:3e:d6:01:a7 on VF 15 
2018-04-23T18:14:19.472 compute-1 kernel: info [  418.553081] i40e 0000:81:00.3: Reload the VF driver to make this change effective. 
2018-04-23T18:14:19.472 compute-1 kernel: info [  418.553138] i40e 0000:81:00.3: Setting VLAN 2285, QOS 0x0 on VF 15 
2018-04-23T18:14:19.851 compute-1 kernel: info [  418.930504] vfio-pci 0000:81:0f.7: enabling device (0000 -> 0002) 
2018-04-23T18:14:19.869 compute-1 kernel: err [  418.944122] i40e 0000:81:00.3: VF 15 still in reset. Try again.

** Affects: neutron
     Importance: Undecided
         Status: New

** Summary changed:

- intermittent issue getting assigned MACs for SRIOV nics
+ intermittent issue getting assigned MACs for SRIOV nics, causes nova timeout

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1784484

Title:
  intermittent issue getting assigned MACs for SRIOV nics, causes nova
  timeout

Status in neutron:
  New

Bug description:
  We're seeing an intermittent issue with the Pike release where a VM with SRIOV NIC gets stuck in the "rebuild_spawning" task state after a "nova evacuate" operation (which rebuilds the instance on a different compute
  node).

  The specific issue we're seeing is that occasionally neutron seems to get "stuck".  Nova is down in nova.virt.libvirt.driver.LibvirtDriver._create_domain_and_network(), and five minutes later
  it hits the eventlet.timeout.Timeout exception clause at https://github.com/openstack/nova/blob/stable/pike/nova/virt/libvirt/driver.py#L5451

  
  We think that maybe there's a race condition between nova/libvirt setting the MAC for the VF and the VF actually being usable/visible to the system.    Closest mention of this upstream is this:

  https://www.spinics.net/lists/kvm/msg165297.html

  To test this we modified get_assigned_macs()  at https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/mech_sriov/agent/pci_lib.py#L75 to wait 2 seconds if vf_lines
  is empty,then call _get_vf_link_show() and get vf_lines again (max 3 tries).
  After a few instance evacuates, the retry code was exercised after message "Cannot find vfs" was produced, and 1 retry was executed, and this time vf_lines had data in it.



  Details:

  The initial nova log looks like this:
  2018-03-17 10:19:41.344 58584 WARNING nova.compute.manager [req-5283f1b9-23d3-472c-80df-4b8d73f8523b 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-0ac89834-0220-48ef-8559-e48f65fe4efb for instance
  2018-03-17 10:19:42.770 58584 INFO nova.network.neutronv2.api [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Updating port 0ac89834-0220-48ef-8559-e48f65fe4efb with attributes {'binding:profile': {}, 'binding:host_id': 'compute-1'}
  2018-03-17 10:19:44.032 58584 WARNING nova.compute.manager [req-a1f9f8fb-cdec-40c1-be2c-0b052305e331 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance
  2018-03-17 10:19:44.048 58584 INFO nova.network.neutronv2.api [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Updating port 59ce6131-07ea-4b6c-ac62-9c27e99149b8 with attributes {'binding:profile': {u'physical_network': u'group0-data0b', u'pci_request_id': u'b6ccb9b3-1551-4be7-812f-6108786b2b0a', u'pci_slot': u'0000:81:0f.7', u'pci_vendor_info': u'8086:154c'}, 'binding:host_id': 'compute-1'}
  2018-03-17 10:19:44.438 58584 WARNING nova.compute.manager [req-7478c956-46cd-4651-9ea7-e477f2a99747 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-plugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance
  2018-03-17 10:19:45.475 58584 INFO nova.compute.manager [req-f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4 78e01109154c49aab9960640407d96af - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Detaching volume d4cb0ea9-a689-4203-8e38-40f16a448915
  2018-03-17 10:19:46.091 58584 WARNING nova.compute.manager [req-b1d1c529-d32f-4d99-8b45-c2bea4ada9b3 3893b5675ba84cb5b6610dc0370d820b 4a7e8af978b043c1a0a26eafd52e0301 - default default] [instance: 983e3130-cb8f-4715-9e22-4d373fe4bb77] Received unexpected event network-vif-unplugged-59ce6131-07ea-4b6c-ac62-9c27e99149b8 for instance

  
  Then the neutron log shows "Trigger reload_allocations" a few times, and the last one looks like:

  2018-03-17 10:19:45.032 56804 INFO neutron.agent.dhcp.agent [req-5240cf46-0dff-4344-9d57-9e8f133559b0 1188dcf2254945e493726a4245e1a676 4a7e8af978b043c1a0a26eafd52e0301 - - -] Trigger reload_allocations for port admin_state_up=True, allowed_address_pairs=[], binding:host_id=compute-1, binding:profile=pci_request_id=b6ccb9b3-1551-4be7-812f-6108786b2b0a, pci_slot=0000:81:0f.7, pci_vendor_info=8086:154c, physical_network=group0-data0b, binding:vif_details=port_filter=False, vlan=2389, binding:vif_type=hw_veb, binding:vnic_type=direct, created_at=2018-03-17T10:18:29Z, description=, device_id=983e3130-cb8f-4715-9e22-4d373fe4bb77, device_owner=compute:nova, extra_dhcp_opts=[], fixed_ips=[{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'ip_address': u'10.1.1.24'}], id=59ce6131-07ea-4b6c-ac62-9c27e99149b8, mac_address=fa:16:3e:8a:c2:d2, name=, network_id=a201980f-83e9-4fdd-9933-2fdfd58b8089, project_id=37fad54aace64d13b8d03bfa83563d59, revision_number=11, security_groups=[], status=DOWN, tags=[], tenant_id=37fad54aace64d13b8d03bfa83563d59, updated_at=2018-03-17T10:19:44Z, wrs-binding:mac_filtering=False, wrs-binding:mtu=9216, wrs-binding:vif_model=pci-sriov
  2018-03-17 10:19:53.860 57003 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-4b2db1f7-06a8-42c5-9dab-1bbf7a937da3 - - - - -] Port fa:16:3e:8a:c2:d2 updated. Details: {u'allowed_address_pairs': [], u'device_owner': u'compute:nova', u'port_security_enabled': False, u'mtu': 9216, u'port_id': u'59ce6131-07ea-4b6c-ac62-9c27e99149b8', u'network': {u'provider:physical_network': u'group0-data0b', u'updated_at': u'2018-03-16T20:58:50Z', u'revision_number': 2, u'mtu': 9216, u'id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'description': u'', u'router:external': False, u'availability_zone_hints': [], u'availability_zones': [u'nova'], u'ipv4_address_scope': None, u'shared': True, u'project_id': u'78e01109154c49aab9960640407d96af', u'status': u'ACTIVE', u'subnets': [u'a7062041-37ab-468d-a1eb-f33d3b3621ee'], u'wrs-tm:qos': u'b1ec2429-3158-4716-8c5e-1848aac9c5fb', u'tags': [], u'ipv6_address_scope': None, u'provider:segmentation_id': 2389, u'name': u'internal0-net0-1', u'admin_state_up': True, u'tenant_id': u'78e01109154c49aab9960640407d96af', u'created_at': u'2018-03-16T20:58:36Z', u'provider:network_type': u'vlan', u'vlan_transparent': True}, u'fixed_ips': [{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'ip_address': u'10.1.1.24'}], u'physical_network': u'group0-data0b', u'trunk_port': False, u'mac_address': u'fa:16:3e:8a:c2:d2', u'profile': {u'pci_request_id': u'b6ccb9b3-1551-4be7-812f-6108786b2b0a', u'pci_slot': u'0000:81:0f.7', u'physical_network': u'group0-data0b', u'pci_vendor_info': u'8086:154c'}, u'subnets': [{u'subnet': {u'ipv6_ra_mode': None, u'allocation_pools': [{u'subnet_id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'last_ip': u'10.1.1.254', u'id': u'55b74d0a-0e02-4026-8964-5e254d8d2566', u'first_ip': u'10.1.1.17'}], u'standard_attr_id': 88, u'ipv6_address_mode': None, u'id': u'a7062041-37ab-468d-a1eb-f33d3b3621ee', u'vlan_id': 0, u'dns_nameservers': [], u'network_standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-03-16T20:58:36.000000', u'updated_at': u'2018-03-16T20:58:50.000000', u'revision_number': 2, u'id': 49, u'resource_type': u'networks'}, u'standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-03-16T20:58:50.000000', u'updated_at': u'2018-03-16T20:58:50.000000', u'revision_number': 0, u'id': 88, u'resource_type': u'subnets'}, u'gateway_ip': None, u'project_id': u'78e01109154c49aab9960640407d96af', u'subnetpool': None, u'routes': [], u'cidr': u'10.1.1.0/24', u'subnetpool_id': None, u'service_types': [], u'name': u'internal0-subnet0-1', u'enable_dhcp': True, u'segment_id': None, u'network_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'ip_version': 4, u'rbac_entries': [{u'action': u'access_as_shared', u'project_id': u'78e01109154c49aab9960640407d96af', u'id': u'66bc32fc-99ca-468b-a1de-d55c5c79b3bc', u'target_tenant': u'*', u'object_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089'}]}, u'physical_network': u'group0-data0b', u'network_type': u'vlan', u'segmentation_id': 2389}], u'network_qos_policy_id': None, u'segmentation_id': 2389, u'port_qos_policy': None, u'device': u'fa:16:3e:8a:c2:d2', u'qos_policy_id': None, u'admin_state_up': True, u'network_id': u'a201980f-83e9-4fdd-9933-2fdfd58b8089', u'tenant_id': u'37fad54aace64d13b8d03bfa83563d59', u'network_qos_policy': {u'scheduler': {u'weight': u'4'}}, u'network_type': u'vlan'}

  
  Then the nova log shows libvirt timing out after 5 minutes:

  2018-03-17 10:24:54.058 58584 WARNING nova.virt.libvirt.driver [req-
  f13b904e-3711-445d-8dcd-22d122d5791e ff8718b957f6454680b820ee0d3a8cb4
  78e01109154c49aab9960640407d96af - default default] [instance:
  983e3130-cb8f-4715-9e22-4d373fe4bb77] Timeout waiting for vif plugging
  callback for instance with vm_state error and task_state
  rebuild_spawning.



  Repeating the experiment with debug logs enabled for neutron gave the
  following:

  2018-04-23 18:14:19.561 73750 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Agent loop found changes! {'current': set([('00:00:00:00:00:00', '0000:81:0f.5'), ('00:00:00:00:00:00', '0000:81:0f.4'), ('00:00:00:00:00:00', '0000:81:0e.6'), ('00:00:00:00:00:00', '0000:81:0e.7'), ('00:00:00:00:00:00', '0000:81:0f.6'), ('00:00:00:00:00:00', '0000:81:0e.4'), ('00:00:00:00:00:00', '0000:81:0e.5'), ('00:00:00:00:00:00', '0000:81:0e.2'), ('00:00:00:00:00:00', '0000:81:0e.3'), ('00:00:00:00:00:00', '0000:81:0f.1'), ('fa:16:3e:d6:01:a7', '0000:81:0f.7'), ('00:00:00:00:00:00', '0000:81:0e.0'), ('00:00:00:00:00:00', '0000:81:0f.0'), ('00:00:00:00:00:00', '0000:81:0e.1'), ('00:00:00:00:00:00', '0000:81:0f.3'), ('00:00:00:00:00:00', '0000:81:0f.2')]), 'removed': set([('00:00:00:00:00:00', '0000:81:0f.7')]), 'added': set([('fa:16:3e:d6:01:a7', '0000:81:0f.7')]), 'updated': set([])} daemon_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:388
  2018-04-23 18:14:19.562 73750 INFO neutron.agent.securitygroups_rpc [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Skipping method prepare_devices_filter as firewall is disabled or configured as NoopFirewallDriver.
  2018-04-23 18:14:19.828 73750 DEBUG neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port with MAC address fa:16:3e:d6:01:a7 is added treat_devices_added_updated /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py:294 
  2018-04-23 18:14:19.829 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port fa:16:3e:d6:01:a7 updated. Details: {u'allowed_address_pairs': [], u'device_owner': u'compute:nova', u'port_security_enabled': False, u'mtu': 9216, u'port_id': u'f90e58b4-bbc7-4eb2-991b-46406d3d6470', u'network': {u'provider:physical_network': u'group0-data0b', u'updated_at': u'2018-04-23T16:48:51Z', u'revision_number': 2, u'mtu': 9216, u'id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'description': u'', u'router:external': False, u'availability_zone_hints': [], u'availability_zones': [u'nova'], u'ipv4_address_scope': None, u'shared': True, u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'status': u'ACTIVE', u'subnets': [u'f452f5c3-a686-45cd-8098-0d092bd5e21e'], u'wrs-tm:qos': u'7d2f03f3-b9c6-4183-9859-878bbb8141e7', u'tags': [], u'ipv6_address_scope': None, u'provider:segmentation_id': 2285, u'name': u'internal0-net0', u'admin_state_up': True, u'tenant_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'created_at': u'2018-04-23T16:48:46Z', u'provider:network_type': u'vlan', u'vlan_transparent': True}, u'fixed_ips': [{u'subnet_id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'ip_address': u'10.0.0.132'}], u'physical_network': u'group0-data0b', u'trunk_port': False, u'mac_address': u'fa:16:3e:d6:01:a7', u'profile': {u'pci_request_id': u'89135f77-a4df-4744-b228-d396b7c0e876', u'pci_slot': u'0000:81:0f.7', u'physical_network': u'group0-data0b', u'pci_vendor_info': u'8086:154c'}, u'subnets': [{u'subnet': {u'ipv6_ra_mode': None, u'allocation_pools': [{u'subnet_id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'last_ip': u'10.0.0.254', u'id': u'86242517-d9dc-4c59-a3d6-17881ee0d42c', u'first_ip': u'10.0.0.128'}], u'standard_attr_id': 11, u'ipv6_address_mode': None, u'id': u'f452f5c3-a686-45cd-8098-0d092bd5e21e', u'vlan_id': 0, u'dns_nameservers': [], u'network_standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-04-23T16:48:46.000000', u'updated_at': u'2018-04-23T16:48:51.000000', u'revision_number': 2, u'id': 9, u'resource_type': u'networks'}, u'standard_attr': {u'description': u'', u'tags': [], u'created_at': u'2018-04-23T16:48:51.000000', u'updated_at': u'2018-04-23T16:48:51.000000', u'revision_number': 0, u'id': 11, u'resource_type': u'subnets'}, u'gateway_ip': None, u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'subnetpool': None, u'routes': [], u'cidr': u'10.0.0.0/24', u'subnetpool_id': None, u'service_types': [], u'name': u'internal0-subnet0', u'enable_dhcp': True, u'segment_id': None, u'network_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'ip_version': 4, u'rbac_entries': [{u'action': u'access_as_shared', u'project_id': u'198ed4d2d0ec4440a98729ffc7bc0635', u'id': u'9c699b99-a98c-48ce-bdee-19e6e8483429', u'target_tenant': u'*', u'object_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8'}]}, u'physical_network': u'group0-data0b', u'network_type': u'vlan', u'segmentation_id': 2285}], u'network_qos_policy_id': None, u'segmentation_id': 2285, u'port_qos_policy': None, u'device': u'fa:16:3e:d6:01:a7', u'qos_policy_id': None, u'admin_state_up': True, u'network_id': u'a156b9b9-167c-447c-9ee0-f209f7ebb3f8', u'tenant_id': u'34a91443256b4db1aa2ba9d305079d41', u'network_qos_policy': {u'scheduler': {u'weight': u'4'}}, u'network_type': u'vlan'} 
  2018-04-23 18:14:19.830 73750 DEBUG neutron.agent.linux.utils [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Running command: ['sudo', 'ip', 'link', 'show'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92 
  2018-04-23 18:14:19.850 73750 DEBUG neutron.agent.linux.utils [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Running command: ['sudo', 'ip', 'link', 'show', 'ens787f3'] create_process /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92 
  2018-04-23 18:14:19.872 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.pci_lib [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Cannot find vfs [15] in device ens787f3 
  2018-04-23 18:14:19.873 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.eswitch_manager [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] device pci mismatch: fa:16:3e:d6:01:a7 - 0000:81:0f.7 
  2018-04-23 18:14:19.874 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] No device with MAC fa:16:3e:d6:01:a7 defined on agent. 
  2018-04-23 18:14:19.880 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Removing device with MAC address 00:00:00:00:00:00 and PCI slot 0000:81:0f.7 
  2018-04-23 18:14:19.881 73750 WARNING neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] port_id to device with MAC 00:00:00:00:00:00 not found 
  2018-04-23 18:14:19.964 73750 INFO neutron.plugins.ml2.drivers.mech_sriov.agent.sriov_nic_agent [req-293f2f36-0257-4056-9898-5938f71e1e9e - - - - -] Port with MAC 00:00:00:00:00:00 and PCI slot 0000:81:0f.7 updated.


  It seems like the VF's MAC address was programmed okay, but we don't
  see the VF being removed even though there's a "Cannot find vfs" log.

  
  There's a host kernel log that makes us wonder if there's some sort of delay on this hardware:
  2018-04-23T18:14:19.377 compute-1 kernel: info [  418.459004] i40e 0000:81:00.3: Setting MAC fa:16:3e:d6:01:a7 on VF 15 
  2018-04-23T18:14:19.472 compute-1 kernel: info [  418.553081] i40e 0000:81:00.3: Reload the VF driver to make this change effective. 
  2018-04-23T18:14:19.472 compute-1 kernel: info [  418.553138] i40e 0000:81:00.3: Setting VLAN 2285, QOS 0x0 on VF 15 
  2018-04-23T18:14:19.851 compute-1 kernel: info [  418.930504] vfio-pci 0000:81:0f.7: enabling device (0000 -> 0002) 
  2018-04-23T18:14:19.869 compute-1 kernel: err [  418.944122] i40e 0000:81:00.3: VF 15 still in reset. Try again.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1784484/+subscriptions


Follow ups