yahoo-eng-team team mailing list archive

[Bug 1644042] [NEW] Race condition with audit on migrate revert with PCI devices

 

Public bug reported:

Description
===========

This bug was found while stress testing cold migration revert for an
instance with an SR-IOV PCI device: a cold migrate followed by a revert
is executed in a loop for the same guest.

The exact problem was found to be a race condition with the resource
tracker periodic audit.

When the audit is triggered after revert_resize() has executed on the
destination (but before finish_revert_resize() has executed on the source
node), the resource tracker on the source gets the list of all instances
and migrations running on that node.  It then calls the PCI manager's
clean_usage() to free any PCI resources allocated to an instance that was
not found to be running (and for which no migration is in progress).  So
what happens in this case is that the PCI devices for the instance on the
source node are freed before the revert has finished executing.
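
To illustrate the window, here is a minimal sketch of the kind of cleanup
the audit performs; the names are hypothetical and this is not the actual
nova code, only the shape of the decision that frees the devices:

# Hypothetical sketch (not the actual nova implementation) of the PCI
# cleanup performed by the periodic audit: any device whose owner is
# neither running on the node nor covered by an in-progress migration
# is returned to the pool.
def clean_usage_sketch(tracked_devices, instances, migrations):
    running = {inst['uuid'] for inst in instances}
    migrating = {mig['instance_uuid'] for mig in migrations}
    for dev in tracked_devices:
        owner = dev.get('instance_uuid')
        if owner and owner not in running and owner not in migrating:
            # In the race described above, the source node briefly sees
            # the reverting instance as neither running nor migrating,
            # so its VFs are freed while the revert is still in flight.
            dev['status'] = 'available'
            dev['instance_uuid'] = None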

Following this condition, the revert actually succeeds, but the PCI
devices are removed from instance.pci_devices because of the
clean_usage() call. [1]

On the following migration, the instance fails in finish_resize()
because the neutron port cannot be correlated:
migration_context.old_pci_devices is empty [2].
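
The failure follows from the way the port binding update correlates old
and new PCI devices.  A rough sketch of that correlation (hypothetical
helpers, not the exact nova code) shows that with an empty
old_pci_devices list the mapping comes out empty, so the pci_slot still
recorded in the port's binding:profile cannot be resolved:

# Hypothetical sketch (not the exact nova code) of how the port binding
# update correlates the PCI device used on the source with the one
# claimed on the destination after a migration.
def get_pci_mapping_sketch(old_pci_devices, new_pci_devices):
    # Map the old device address (still recorded in the port's
    # binding:profile) to the newly claimed device on the destination.
    return {old['address']: new
            for old, new in zip(old_pci_devices, new_pci_devices)}

def update_port_binding_sketch(port, old_pci_devices, new_pci_devices):
    mapping = get_pci_mapping_sketch(old_pci_devices, new_pci_devices)
    pci_slot = port['binding:profile']['pci_slot']  # e.g. '0000:85:11.7'
    if pci_slot not in mapping:
        # With old_pci_devices emptied by the earlier clean_usage(), the
        # mapping is {} and the lookup fails -- this is where nova raises
        # PortUpdateFailed ("Unable to correlate PCI slot ...").
        raise LookupError('Unable to correlate PCI slot %s' % pci_slot)
    return mapping[pci_slot]['address']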

Steps to reproduce
==================

1) Configure a multi-node (two compute nodes) setup with SR-IOV enabled on both computes
2) Boot a guest with an SR-IOV device
3) In a loop (a minimal automation sketch follows this list), do:
    a) Cold migrate the guest
    b) Revert resize the guest
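
For reference, a minimal sketch of how step 3 can be automated, assuming
an authenticated python-novaclient Client is available (setup elided,
polling simplified):

# Illustrative reproduction loop; 'nova' is assumed to be an
# authenticated python-novaclient Client (setup elided).
import time

def wait_for_status(nova, server_id, status, timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        server = nova.servers.get(server_id)
        if server.status == status:
            return server
        if server.status == 'ERROR':
            raise RuntimeError('instance went to ERROR')
        time.sleep(5)
    raise RuntimeError('timed out waiting for status %s' % status)

def migrate_revert_loop(nova, server_id, iterations=50):
    for _ in range(iterations):
        nova.servers.get(server_id).migrate()        # a) cold migrate
        wait_for_status(nova, server_id, 'VERIFY_RESIZE')
        nova.servers.get(server_id).revert_resize()  # b) revert resize
        wait_for_status(nova, server_id, 'ACTIVE')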

Expected result
===============

It should be possible to cold migrate and revert a guest repeatedly
(multiple times in a row) without failure.

Actual result
=============

After several iterations, the next cold migration fails in
finish_resize() on the destination with "PortUpdateFailed: Port update
failed for port ...: Unable to correlate PCI slot 0000:85:11.7", and the
instance is set to ERROR (see the logs in [2] below).

Environment
===========

commit 633c817de5a67e798d8610d0df1135e5a568fd8a
Author: Matt Riedemann <mriedem@xxxxxxxxxx>
Date:   Sat Nov 12 11:59:13 2016 -0500

    api-ref: fix server_id in metadata docs
    
    The api-ref was saying that the server_id was in the body of the
    server metadata requests but it's actually in the path for all
    of the requests.
    
    Change-Id: Icdecd980767f89ee5fcc5bdd4802b2c263268a26
    Closes-Bug: #1641331

Logs & Configs
==============

In the output below, all the log lines containing '***' come from
instrumentation I added in order to troubleshoot the issue.

[1] On source node (last successful revert migrate):

2016-11-22 17:35:11.875 ERROR nova.compute.resource_tracker [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** _update_available_resource
2016-11-22 17:35:11.936 ERROR nova.compute.resource_tracker [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** _update_usage_from_instance
2016-11-22 17:35:11.971 ERROR nova.compute.resource_tracker [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** _update_usage_from_migrations
2016-11-22 17:35:11.972 ERROR nova.pci.manager [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** clean_usage
2016-11-22 17:35:11.973 ERROR nova.pci.stats [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** Adding device 0000:85:11.7
2016-11-22 17:35:11.973 ERROR nova.pci.stats [req-fa15f1be-db06-46c2-b861-7347be7f3e03 None None] *** Pool is now: 
[vendor:8086, product:10ed, numa:1, physnet:default, dev_type:type-VF]: ['0000:85:10.1', '0000:85:10.3', '0000:85:10.5', '0000:85:11.1', '0000:85:11.3', '0000:85:11.7']

2016-11-22 17:35:33.817 ERROR nova.compute.manager [req-fba89cf1-21bf-482f-9b1d-3d634f5b3d73 admin admin] *** finish_revert_resize
2016-11-22 17:35:35.903 INFO nova.compute.manager [req-7d27d388-0613-451b-a9da-34dde9d5d0f2 None None] [instance: e99dc7ad-3c1c-4cd0-96d1-00e6cbcb8725] VM Resumed (Lifecycle Event)
2016-11-22 17:35:35.999 INFO nova.compute.manager [req-7d27d388-0613-451b-a9da-34dde9d5d0f2 None None] [instance: e99dc7ad-3c1c-4cd0-96d1-00e6cbcb8725] VM Started (Lifecycle Event)

[2] On destination node (next migration):

2016-11-22 17:41:31.739 ERROR nova.compute.manager [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** finish_resize - uuid: 19d615e1-04d3-42ea-b30f-f6cfc25f0397
2016-11-22 17:41:31.740 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** _update_port_binding_for_instance
2016-11-22 17:41:31.793 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** port: {u'allowed_address_pairs': [], u'extra_dhcp_opts': [], u'updated_at': u'2016-11-22T22:41:31Z', u'device_owner': u'compute:None', u'revision_number': 5011, u'port_security_enabled': True, u'binding:profile': {u'pci_slot': u'0000:85:11.7', u'physical_network': u'default', u'pci_vendor_info': u'8086:10ed'}, u'fixed_ips': [{u'subnet_id': u'd9c807e8-f5d3-4eb8-9d1e-dee0edc922b7', u'ip_address': u'10.0.0.6'}, {u'subnet_id': u'd1a618f1-217b-4a53-9bd4-c89cb8e55316', u'ip_address': u'2001:db8:8000:0:f816:3eff:fee7:b3d'}], u'id': u'd55c021a-d45c-4cfa-9607-11cf572b687e', u'security_groups': [u'fa65d63f-7466-47c0-aa56-bd827077fed9'], u'binding:vif_details': {u'port_filter': False, u'vlan': u'468'}, u'binding:vif_type': u'hw_veb', u'mac_address': u'fa:16:3e:e7:0b:3d', u'project_id': u'caa3cffc0402480ba611bd4ccb87492a', u'status': u'DOWN', u'binding:host_id': u'IronPass-3', u'description': u'', u'device_id': u'19d615e1-04d3-42ea-b30f-f6cfc25f0397', u'name': u'sriov_port_4', u'admin_state_up': True, u'network_id': u'3a4e3f83-7b84-4af8-be90-288f944f8fa7', u'tenant_id': u'caa3cffc0402480ba611bd4ccb87492a', u'created_at': u'2016-11-21T17:10:33Z', u'binding:vnic_type': u'direct'}
2016-11-22 17:41:31.793 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** _get_pci_mapping_for_migration
2016-11-22 17:41:31.794 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** migration_context: MigrationContext(created_at=<?>,deleted=<?>,deleted_at=<?>,instance_uuid=19d615e1-04d3-42ea-b30f-f6cfc25f0397,migration_id=2127,new_numa_topology=None,new_pci_devices=PciDeviceList,new_pci_requests=InstancePCIRequests,old_numa_topology=None,old_pci_devices=PciDeviceList,old_pci_requests=InstancePCIRequests,updated_at=<?>)
2016-11-22 17:41:31.794 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** migration_context.old_pci_devices.objects: []
2016-11-22 17:41:31.795 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** migration_context.new_pci_devices.objects: [PciDevice(address='0000:05:11.5',compute_node_id=1,created_at=2016-11-14T15:44:04Z,deleted=False,deleted_at=None,dev_id='pci_0000_05_11_5',dev_type='type-VF',extra_info={},id=7,instance_uuid='19d615e1-04d3-42ea-b30f-f6cfc25f0397',label='label_8086_10ed',numa_node=0,parent_addr='0000:05:00.1',product_id='10ed',request_id='06dd5913-5960-4fc9-ac1c-4f0bdfe344a7',status='claimed',updated_at=2016-11-22T22:40:17Z,vendor_id='8086')]
2016-11-22 17:41:31.795 ERROR nova.network.neutronv2.api [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] *** pci_mapping: {}
2016-11-22 17:41:31.795 ERROR nova.compute.manager [req-f276b624-1304-42dd-902d-eba34afab2f2 admin admin] [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397] Setting instance vm_state to ERROR
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397] Traceback (most recent call last):
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/compute/manager.py", line 4003, in finish_resize
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     disk_info, image_meta)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/compute/manager.py", line 3935, in _finish_resize
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     migration_p)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/network/neutronv2/api.py", line 2033, in migrate_instance_finish
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     migration=migration)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]   File "/opt/stack/nova/nova/network/neutronv2/api.py", line 2424, in _update_port_binding_for_instance
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]     pci_slot)
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397] PortUpdateFailed: Port update failed for port d55c021a-d45c-4cfa-9607-11cf572b687e: Unable to correlate PCI slot 0000:85:11.7
2016-11-22 17:41:31.795 TRACE nova.compute.manager [instance: 19d615e1-04d3-42ea-b30f-f6cfc25f0397]

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1644042

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1644042/+subscriptions