yahoo-eng-team team mailing list archive
Message #88577
[Bug 1944619] Re: Instances with hardware offloaded ovs ports lose access after failed live migrations
Reviewed: https://review.opendev.org/c/openstack/nova/+/815324
Committed: https://opendev.org/openstack/nova/commit/63ffba7496182f6f6f49a380f3c639fc3ded9772
Submitter: "Zuul (22348)"
Branch: master
commit 63ffba7496182f6f6f49a380f3c639fc3ded9772
Author: Erlon R. Cruz <erlon@xxxxxxxxxxxxx>
Date: Tue Dec 7 17:39:58 2021 -0300
Fix pre_live_migration rollback
During the pre live migration process, Nova performs most of the
tasks related to the creation and operation of the VM in the destination
host. That is done without interrupting any of the hardware in the source
host. If the pre_live_migration fails, those same operations should be
rolled back.
Currently Nova shares _rollback_live_migration between live migration
and pre_live_migration rollbacks. As a result, the source host tries
to re-attach network interfaces that were never detached from it.
This patch fixes that by adding a conditional so that Nova takes
different paths for live migration and pre_live_migration rollbacks.
Closes-bug: #1944619
Change-Id: I784190ac356695dd508e0ad8ec31d8eaa3ebee56
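The conditional described in the commit message can be pictured with a minimal sketch. This is not the code from the patch: plan_rollback, RollbackPlan and the step names are hypothetical, and the sketch assumes that the only difference between the two paths is whether the source interfaces are re-attached.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RollbackPlan:
    """Hypothetical record of the cleanup steps chosen for a rollback."""
    steps: List[str] = field(default_factory=list)


def plan_rollback(pre_live_migration_failed: bool) -> RollbackPlan:
    """Choose rollback steps depending on where the migration failed.

    Illustrative only: the step names are made up and do not correspond
    to Nova methods.
    """
    plan = RollbackPlan()
    # Destination-side resources are created during pre_live_migration,
    # so they are cleaned up on either path.
    plan.steps.append("cleanup_destination")
    if not pre_live_migration_failed:
        # Assumption: only a migration that got past pre_live_migration
        # can have detached interfaces on the source host.
        plan.steps.append("reattach_source_interfaces")
    return plan


print(plan_rollback(pre_live_migration_failed=True).steps)
print(plan_rollback(pre_live_migration_failed=False).steps)
```

The point of the sketch is that the failure stage is passed in explicitly, so the rollback never tries to re-attach interfaces that are still attached on the source host.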
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1944619
Title:
Instances with hardware offloaded ovs ports lose access after failed
live migrations
Status in neutron:
Incomplete
Status in OpenStack Compute (nova):
Fix Released
Bug description:
If a live migration fails for an instance with an SRIOV port during
the '_pre_live_migration' hook, the instance loses access to the
network and leaves behind duplicated port bindings in the database.
The instance regains connectivity on the source host after a reboot
(we don't know whether there is another way to restore connectivity). As a side
effect of this behavior, the pre-live migration cleanup hook also
fails with:
PCI device 0000:3b:10.0 is in use by driver QEMU
[How to reproduce]
- Create an environment with SRIOV (our case uses switchdev[1])
- Create 1 VM
- Provoke a failure in the _pre_live_migration process (for example, by creating a directory /var/lib/nova/instances/<instance id>)
- Check the VM's connectivity
- Check the logs for: libvirt.libvirtError: Requested operation is not valid: PCI device 0000:03:04.1 is in use by driver QEMU, domain instance-00000001
Full stack trace: [2]
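The log check in the last step can be scripted. The following is a minimal sketch, assuming the error text always has the form quoted in this report. The helper name find_pci_in_use and the regular expression are illustrative and not part of Nova or libvirt.

```python
import re

# Matches the libvirt message quoted in this report. The ", domain ..."
# suffix is optional because one of the quoted messages omits it.
PCI_IN_USE = re.compile(
    r"PCI device (?P<address>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) "
    r"is in use by driver QEMU(?:, domain (?P<domain>\S+))?"
)


def find_pci_in_use(lines):
    """Return (pci_address, domain_or_None) for every matching log line."""
    hits = []
    for line in lines:
        match = PCI_IN_USE.search(line)
        if match:
            hits.append((match.group("address"), match.group("domain")))
    return hits


sample = [
    "libvirt.libvirtError: Requested operation is not valid: "
    "PCI device 0000:03:04.1 is in use by driver QEMU, domain instance-00000001",
    "an unrelated log line",
]
for address, domain in find_pci_in_use(sample):
    print(address, domain)
```

In practice the lines would come from the nova-compute log on the source host. The log location depends on the deployment, so it is left out of the sketch.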
[Expected]
VM connectivity is restored, even if there is a brief disconnection
As in non-SRIOV scenarios, no leftovers remain after a failure (port bindings and instance path files)
[Observed]
VM loses connectivity, which is only restored after the VM status is set to ERROR and the VM is power cycled
Port bindings are not removed
[Environment]
Focal Ussuri with Mellanox ConnectX-5 cards
[1] https://paste.ubuntu.com/p/PzBM7y6Dbr/
[2] https://paste.ubuntu.com/p/ThQmDYtdSS/
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1944619/+subscriptions