[Bug 1798690] Re: Live migrate of iscsi-backed VM loses internal network connectivity
I have not looked into this closely, but my guess is that it could be related to the ARP suppression rules used in the DVR case not being updated correctly. That said, it is just a guess, so there may be something more going on here.
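One way to check that guess (assuming the l2population ARP responder is in use and its flows live in table 21 of br-tun, which can vary by deployment and network type) is to dump those flows on the destination compute host after the migration and look for an entry matching the peer VM's fixed IP, for example:
ovs-ofctl dump-flows br-tun table=21 | grep <peer-vm-fixed-ip>
If no matching ARP responder entry shows up on the destination host, that would point in this direction.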
Eric, can you try doing a hard reboot of the migrated instance and see if
that corrects the connectivity to the internal network IPs?
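For example, using the instance UUID from the report:
openstack server reboot --hard d3d45afb-e913-4cb7-89df-a1c1d51d6339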
It would also be helpful to know whether you are using the iptables firewall or the openvswitch firewall,
and whether, following the migration, the port status and port admin status are ACTIVE/UP.
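Something along these lines should show the port state after the migration (the port ID is whatever the instance actually has):
openstack port list --server d3d45afb-e913-4cb7-89df-a1c1d51d6339
openstack port show <port-id> -c status -c admin_state_up
The firewall driver is set in the openvswitch agent config, typically /etc/neutron/plugins/ml2/openvswitch_agent.ini on a plain install (under Kolla the file lives in the agent's container config directory):
grep firewall_driver /etc/neutron/plugins/ml2/openvswitch_agent.ini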
I may not have time to help further, but I'll try to check in on this bug again in a few days.
From a Nova perspective I don't think this is a Nova bug, or an os-vif one for that matter, but I have been investigating live-migration-related issues this cycle and this is yet another edge case that appears
to need fixing.
** Changed in: nova
Status: New => Opinion
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1798690
Title:
Live migrate of iscsi-backed VM loses internal network connectivity
Status in neutron:
New
Status in OpenStack Compute (nova):
Opinion
Bug description:
Description
===========
Note that this may be a Neutron issue, but since it is happening
during live migration, I wanted to point it out to the Nova group
first, and let them decide whether to include the Neutron group on
this ticket.
Also note that this may not be related to iSCSI at all - I just don't
have access to Ceph-backed VMs at the moment to test.
Live migration of a VM that uses an iSCSI-backed volume-based boot
disk (no other disks attached) will migrate correctly, including the
volume, and DVR router functionality with floating IPs, but internal
network connectivity won't work (pings between VMs on the same Neutron
network fail).
After live migrating the "bad" VM back to the original host, internal
networking works again!
NOTE - this seems to be only reproducible if you deploy the VMs, do
"not" ping between the VMs, migrate one of the VMs, and "then" ping
between the VMs. The ping fails in this case. In the case where
pings are performed "prior" to migration, the pings succeed!
So, it appears that something in Neutron isn't being migrated.
I had tested this configuration back in the Liberty days and ran into
the same issue, and thought it was possibly a bug that was fixed by
now, but it looks like the problem still exists.
Note that I'm still looking at logs to determine whether there is good
evidence for why/when this happens, but wanted to get a bug report
placed in case it was a known issue.
Steps to reproduce
==================
Deploy 2 VMs with an internal network, each with floating IPs, with
security groups that are not very restrictive (allow everything
including pings between VMs and the Internet).
In our case, the two VMs were deployed on separate physical hosts.
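As a rough sketch, the deployment step looks something like the following (image, flavor and network names are placeholders for this environment):
openstack server create --image <image> --flavor <flavor> --network <internal-net> vm1
openstack server create --image <image> --flavor <flavor> --network <internal-net> vm2
openstack floating ip create <external-net>
openstack server add floating ip vm1 <floating-ip-1>
openstack server add floating ip vm2 <floating-ip-2>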
If VM #2 resides on physical host compute002 after deployment, live migrate this VM to physical host compute003 with:
openstack server migrate --live compute003 d3d45afb-e913-4cb7-89df-a1c1d51d6339
From VM #2, ping VM #1. There is no ping response.
If you perform all of the above, but ping between the VMs "prior" to
migration, pings work fine after migrations (hiding the issue).
Expected result
===============
Network should function correctly after a migration - pings should
work, for example, between VMs.
Actual result
=============
Testing with VM to VM pings: pings are lost and connectivity "never"
resumes. I deployed the 2 VMs, migrated one of them, and started a
ping from one VM to the other, waited 16+ minutes, and pings are still
failing.
Perform a live migrate of VM #2 back to the original host using:
openstack server migrate --live compute002 d3d45afb-e913-4cb7-89df-a1c1d51d6339
and pings start to work again.
Perform a live migrate of VM #2 to the same host as VM #1 and pings
between VMs "also" work!
Environment
===========
stable/rocky deployment with Kolla-Ansible 7.0.0.0rc3devXX (the latest
as of October 15th, 2018) and Kolla 7.0.0.0rc3devXX
CentOS 7.5 with latest updates as of October 15, 2018.
Kernel: Linux 4.18.14-1.el7.elrepo.x86_64
Hypervisor: KVM
Storage: Blockbridge (unsupported, but functions the same as other
iSCSI based backends)
Networking: DVR with OpenVSwitch
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1798690/+subscriptions