yahoo-eng-team team mailing list archive

[Bug 1944619] Re: Instances with hardware offloaded ovs ports lose access after failed live migrations

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/815324
Committed: https://opendev.org/openstack/nova/commit/63ffba7496182f6f6f49a380f3c639fc3ded9772
Submitter: "Zuul (22348)"
Branch:    master

commit 63ffba7496182f6f6f49a380f3c639fc3ded9772
Author: Erlon R. Cruz <erlon@xxxxxxxxxxxxx>
Date:   Tue Dec 7 17:39:58 2021 -0300

    Fix pre_live_migration rollback
    
    During the pre live migration process, Nova performs most of the
    tasks related to the creation and operation of the VM in the destination
    host. That is done without interrupting any of the hardware in the source
    host. If the pre_live_migration fails, those same operations should be
    rolled back.
    
    Currently nova is sharing the _rollback_live_migration for both
    live and pre_live migration rollbacks, and that is causing the source
    host to try to re-attach network interfaces on the source host where
    they weren't actually de-attached.
    
    This patch fixes that by adding a conditional to allow nova to do
    different paths for migration and pre_live_migration rollbacks.
    
    Closes-bug: #1944619
    Change-Id: I784190ac356695dd508e0ad8ec31d8eaa3ebee56
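
The conditional described in the commit message can be illustrated with a
minimal sketch. This is not Nova's code: the function name, the flag
pre_live_migration_failed, and the step names are all hypothetical, and the
destination cleanup steps are an assumption based on the bug description.
The only point is that the source-side re-attach is skipped when the failure
happened before anything was detached.

```python
def rollback_steps(pre_live_migration_failed):
    """Return the ordered cleanup steps for a failed migration.

    All step names are hypothetical labels, not Nova method names.
    """
    steps = []
    if not pre_live_migration_failed:
        # The migration proper had started, so source interfaces may have
        # been detached and need to be re-attached.
        steps.append("reattach_source_interfaces")
    # Assumed common cleanup: remove what was prepared on the destination.
    steps.append("cleanup_destination_instance_files")
    steps.append("delete_destination_port_bindings")
    return steps


if __name__ == "__main__":
    print(rollback_steps(pre_live_migration_failed=True))
    print(rollback_steps(pre_live_migration_failed=False))
```

With the flag set, the source host is left alone, which is the behavior the
patch aims for when only pre_live_migration has run.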


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1944619

Title:
  Instances with hardware offloaded ovs ports lose access after failed
  live migrations

Status in neutron:
  Incomplete
Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  If a live migration fails during the '_pre_live_migration' hook for an
  instance with an SRIOV port, the instance loses access to the network
  and duplicated port bindings are left behind in the database.

  The instance regains connectivity on the source host after a reboot
  (we don't know if there is another way to restore connectivity). As a
  side effect of this behavior, the pre-live migration cleanup hook also
  fails with:

  PCI device 0000:3b:10.0 is in use by driver QEMU

  [How to reproduce]

  - Create an environment with SRIOV (our case uses switchdev[1])
  - Create 1 VM
  - Provoke a failure in the _pre_live_migration process (for example, by creating a directory /var/lib/nova/instances/<instance id>)
  - Check the VM's connectivity
  - Check the logs for: libvirt.libvirtError: Requested operation is not valid: PCI device 0000:03:04.1 is in use by driver QEMU, domain instance-00000001
  Full stack trace: [2]
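
  The log check in the last step can be scripted. The following is a minimal
  sketch, assuming the log lines have already been read into a list of
  strings; the function name find_pci_in_use is hypothetical, and the pattern
  is derived only from the two error messages quoted in this report.

```python
import re

# Matches the libvirt error quoted above; the ", domain ..." suffix is
# optional because the shorter message in this report omits it.
PCI_IN_USE = re.compile(
    r"PCI device (?P<address>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r" is in use by driver QEMU(?:, domain (?P<domain>\S+))?"
)


def find_pci_in_use(lines):
    """Return (pci_address, domain_or_None) for each matching log line."""
    hits = []
    for line in lines:
        match = PCI_IN_USE.search(line)
        if match:
            hits.append((match.group("address"), match.group("domain")))
    return hits


if __name__ == "__main__":
    sample = [
        "libvirt.libvirtError: Requested operation is not valid: "
        "PCI device 0000:03:04.1 is in use by driver QEMU, "
        "domain instance-00000001",
    ]
    print(find_pci_in_use(sample))
```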

  [Expected]

  VM connectivity is restored, even if there is a brief disconnection
  As in non-SRIOV scenarios, no leftovers remain after a failure (port bindings and instance path files)

  [Observed]
  VM loses connectivity, which is restored only after the VM status is set to ERROR and the VM is power cycled
  Port bindings are not removed
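
  Leftover bindings of this kind can be spotted by comparing each binding's
  host against the source host. The sketch below is illustrative only: it
  assumes the bindings for a port have already been fetched as a list of
  dicts with a "host" key (an assumed shape), and the function name and host
  names are hypothetical.

```python
def leftover_bindings(bindings, source_host):
    """Return bindings that point at any host other than the source.

    After a failed migration has been rolled back, the expectation in this
    report is that only the source host binding remains.
    """
    return [b for b in bindings if b.get("host") != source_host]


if __name__ == "__main__":
    # Hypothetical data: one binding per host for the same port.
    bindings = [
        {"host": "compute-src", "status": "ACTIVE"},
        {"host": "compute-dst", "status": "INACTIVE"},
    ]
    print(leftover_bindings(bindings, "compute-src"))
```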

  [Environment]
  Focal Ussuri with Mellanox ConnectX-5 cards

  [1] https://paste.ubuntu.com/p/PzBM7y6Dbr/
  [2] https://paste.ubuntu.com/p/ThQmDYtdSS/

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1944619/+subscriptions


