
[Bug 1896463] Related fix merged to nova (master)

 

Reviewed:  https://review.opendev.org/754100
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3f348602ae4a40c52c7135b2cb48deaa6052c488
Submitter: Zuul
Branch:    master

commit 3f348602ae4a40c52c7135b2cb48deaa6052c488
Author: Balazs Gibizer <balazs.gibizer@xxxxxxxx>
Date:   Thu Sep 24 15:04:21 2020 +0200

    Reproduce bug 1896463 in func env
    
    There is a race condition between the rebuild and the
    _update_available_resource periodic task on the compute. This patch adds
    a reproducer functional test. Unfortunately it needs some injected sleep
    to make the race happen in a stable way. This is suboptimal but only
    adds 3 seconds of slowness to the test execution.
    
    Change-Id: Id0577bceed9808b52da4acc352cf9c204f6c8861
    Related-Bug: #1896463
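
The injected sleep mentioned in the commit message is a common way to make such a race deterministic in a test. A minimal sketch of the technique follows (not the actual patch; the class, method name and delay below are purely illustrative):

    import time
    from unittest import mock


    def add_delay(original, delay=3):
        """Wrap a method so it sleeps before returning, widening the race
        window so a periodic task can slip in between two steps."""
        def wrapper(*args, **kwargs):
            result = original(*args, **kwargs)
            time.sleep(delay)  # injected sleep makes the race reproducible
            return result
        return wrapper


    class Manager:             # stand-in for the racing component
        def racing_step(self):
            return 'claimed'


    # Patch the racing step for the duration of the test:
    with mock.patch.object(Manager, 'racing_step',
                           add_delay(Manager.racing_step)):
        Manager().racing_step()  # now takes ~3s, leaving room for the race

Wrapping the racing step like this keeps the delay confined to the reproducer test instead of the production code path.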


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1896463

Title:
  evacuation failed: Port update failed : Unable to correlate PCI slot

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in OpenStack Compute (nova) stein series:
  Triaged
Status in OpenStack Compute (nova) train series:
  Triaged
Status in OpenStack Compute (nova) ussuri series:
  Triaged
Status in OpenStack Compute (nova) victoria series:
  In Progress

Bug description:
  Description
  ===========
  If _update_available_resource() of the resource tracker runs between _do_rebuild_instance_with_claim() and instance.save() while a VM instance is being evacuated onto the destination host,

  nova/compute/manager.py

  2931     def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
  2932 +-- 84 lines: injected_files, new_pass, orig_sys_metadata,-------------------------------------------------------------------
  3016                 claim_ctxt = rebuild_claim(
  3017                     context, instance, scheduled_node,
  3018                     limits=limits, image_meta=image_meta,
  3019                     migration=migration)
  3020                 self._do_rebuild_instance_with_claim(
  3021 +-- 47 lines: claim_ctxt, context, instance, orig_image_ref,-----------------------------------------------------------------
  3068                 instance.apply_migration_context()
  3069                 # NOTE (ndipanov): This save will now update the host and node
  3070                 # attributes making sure that next RT pass is consistent since
  3071                 # it will be based on the instance and not the migration DB
  3072                 # entry.
  3073                 instance.host = self.host
  3074                 instance.node = scheduled_node
  3075                 instance.save()
  3076                 instance.drop_migration_context()

  the instance is not yet treated as an instance managed by the destination
  host, because its host and node fields have not been saved to the DB yet.

  2020-09-19 07:27:36.321 8 WARNING nova.compute.resource_tracker [req-
  b35d5b9a-0786-4809-bd81-ad306cdda8d5 - - - - -] Instance
  22f6ca0e-f964-4467-83a3-f2bf12bb05ae is not being actively managed by
  this compute host but has allocations referencing this compute host:
  {u'resources': {u'MEMORY_MB': 12288, u'VCPU': 2, u'DISK_GB': 10}}.
  Skipping heal of allocation because we do not know what to do.

  As a result, the SR-IOV port's PCI device (the VF) is freed by
  clean_usage(), even though the VM already has that VF port attached.

   743     def _update_available_resource(self, context, resources):
   744 +-- 45 lines: # initialize the compute node object, creating it--------------------------------------------------------------
   789         self.pci_tracker.clean_usage(instances, migrations, orphans)
   790         dev_pools_obj = self.pci_tracker.stats.to_device_pools_obj()
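
  In other words (a simplified illustration, not the actual resource tracker
  code), the periodic task only accounts for instances whose DB record
  already points at this host, so during the race window the evacuated
  instance is filtered out and its PCI/VF usage is cleaned up:

   # Simplified illustration only; the real resource tracker fetches the
   # instance list by host/node from the DB before calling clean_usage().
   def instances_tracked_by(all_instances, host):
       return [inst for inst in all_instances if inst['host'] == host]

   # During the race window the evacuated VM's DB record still points at
   # the source host 'com1', even though it is already running on 'com2':
   instances = [
       {'uuid': '22f6ca0e-f964-4467-83a3-f2bf12bb05ae', 'host': 'com1'},
   ]
   assert instances_tracked_by(instances, 'com2') == []
   # clean_usage() then runs with a list that omits this VM, so the VF/PCI
   # device it is using on com2 is released.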

  After that, when this VM is evacuated to another compute host again, the
  error below is raised.


  Steps to reproduce
  ==================
  1. create a VM on com1 with SRIOV VF ports.
  2. stop and disable nova-compute service on com1
  3. wait 60 sec (nova-compute reporting interval)
  4. evacuate the VM to com2
  5. wait the VM is active on com2
  6. enable and start nova-compute on com1
  7. wait 60 sec (nova-compute reporting interval)
  8. stop and disable nova-compute service on com2
  9. wait 60 sec (nova-compute reporting interval)
  10. evacuate the VM to com1
  11. wait the VM is active on com1
  12. enable and start nova-compute on com2
  13. wait 60 sec (nova-compute reporting interval)
  14. go to step 2 and repeat (a scripted sketch of this loop follows).
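
  A scripted version of the loop might look like the sketch below (untested;
  the server and host names are placeholders, and actually stopping and
  starting the nova-compute service on each host is environment specific,
  e.g. via systemd, so only the disable/enable part is shown):

   # Hypothetical reproduction loop (sketch only). Assumes admin credentials
   # are loaded in the environment and the VM already exists with VF ports.
   import subprocess
   import time

   SERVER = 'sriov-vm'        # VM from step 1 (placeholder name)
   HOSTS = ['com1', 'com2']
   REPORT_INTERVAL = 60       # nova-compute reporting interval, seconds

   def run(*cmd):
       subprocess.run(cmd, check=True)

   def wait_active(server):
       while True:
           status = subprocess.run(
               ['openstack', 'server', 'show', server,
                '-f', 'value', '-c', 'status'],
               check=True, capture_output=True, text=True).stdout.strip()
           if status == 'ACTIVE':
               return
           time.sleep(5)

   src, dst = HOSTS
   while True:
       # steps 2-3: disable nova-compute on the source host (stopping the
       # service itself is done out of band) and wait one reporting interval
       run('openstack', 'compute', 'service', 'set',
           '--disable', src, 'nova-compute')
       time.sleep(REPORT_INTERVAL)
       # steps 4-5: evacuate the VM to the other host and wait until ACTIVE
       run('nova', 'evacuate', SERVER, dst)
       wait_active(SERVER)
       # steps 6-7: bring the old source host back and wait one interval
       run('openstack', 'compute', 'service', 'set',
           '--enable', src, 'nova-compute')
       time.sleep(REPORT_INTERVAL)
       # steps 8-14: swap roles and repeat in the other direction
       src, dst = dst, src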

  Expected result
  ===============
  Evacuation should be done without errors.

  Actual result
  =============
  Evacuation failed with "Port update failed"

  Environment
  ===========
  openstack-nova-compute-18.0.1-1 is used with SR-IOV ports; the libvirt driver is used.

  Logs & Configs
  ==============
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [req-38dd0be2-7223-4a59-8073-dd1b072125c5 c424fbb3d41f444bb7a025266fda36da 6255a6910b9b4d3ba34a93624fe7fb22 - default default] [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] Setting instance vm_state to ERROR: PortUpdateFailed: Port update failed for port 76dc33dc-5b3b-4c45-b2cb-fd59025a4dbd: Unable to correlate PCI slot 0000:05:12.2
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] Traceback (most recent call last):
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7993, in _error_out_instance_on_exception
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]     yield
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3025, in rebuild_instance
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]     migration, request_spec)
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3087, in _do_rebuild_instance_with_claim
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]     self._do_rebuild_instance(*args, **kwargs)
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3190, in _do_rebuild_instance
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]     context, instance, self.host, migration)
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 2953, in setup_instance_network_on_host
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]     migration)
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 3058, in _update_port_binding_for_instance
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]     pci_slot)
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae] PortUpdateFailed: Port update failed for port 76dc33dc-5b3b-4c45-b2cb-fd59025a4dbd: Unable to correlate PCI slot 0000:05:12.2
  2020-09-19 07:34:22.670 8 ERROR nova.compute.manager [instance: 22f6ca0e-f964-4467-83a3-f2bf12bb05ae]

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1896463/+subscriptions

