yahoo-eng-team team mailing list archive

[Bug 1798690] [NEW] Live migrate of iscsi-backed VM loses internal network connectivity

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Eric Miller <1798690@xxxxxxxxxxxxxxxxxx>
Date: Thu, 18 Oct 2018 23:20:54 -0000
Reply-to: Bug 1798690 <1798690@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

Description
===========

Note that this may be a Neutron issue, but since it is happening during
live migration, I wanted to point it out to the Nova group first, and
let them decide whether to include the Neutron group on this ticket.

Also note that this may not be related to iSCSI at all - I just don't
have access to Ceph-backed VMs at the moment to test.

Live migration of a VM that uses an iSCSI-backed volume-based boot disk
(no other disks attached) will migrate correctly, including the volume,
and DVR router functionality with floating IPs, but internal network
connectivity won't work (pings between VMs on the same Neutron network
fail).

After live migrating the "bad" VM back to the original host, internal
networking works again!

NOTE - this seems to be only reproducible if you deploy the VMs, do
"not" ping between the VMs, migrate one of the VMs, and "then" ping
between the VMs.  The ping fails in this case.  In the case where pings
are performed "prior" to migration, the pings succeed!

So, it appears that something in Neutron isn't being migrated.

I had tested this configuration back in the Liberty days and ran into
the same issue, and thought it was possibly a bug that was fixed by now,
but it looks like the problem still exists.

Note that I'm still looking at logs to determine whether there is good
evidence for why/when this happens, but wanted to get a bug report
placed in case it was a known issue.


Steps to reproduce
==================

Deploy 2 VMs on the same internal network, each with a floating IP, and
with permissive security groups (allow everything, including pings
between the VMs and to the Internet).

In our case, the two VMs were deployed on separate physical hosts.

If VM #2 resides on physical host compute002 after deployment, live
migrate this VM to physical host compute003 with:
openstack server migrate --live compute003 d3d45afb-e913-4cb7-89df-a1c1d51d6339

From VM #2, ping VM #1.  There is no ping response.

If you perform all of the above, but ping between the VMs "prior" to
migration, pings work fine after migrations (hiding the issue).
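
The migration step above could be scripted roughly as follows.  This is
only a sketch, not something taken from the deployment: the function
names and the POLL_INTERVAL variable are made up for illustration, and
it assumes an admin-scoped client that exposes the
OS-EXT-SRV-ATTR:host column.  The migrate command is the same one used
in this report; newer client releases may spell the live migration
options differently.

```shell
# Print the compute host a server is currently running on.
# Assumption: admin credentials are loaded and the client exposes
# the OS-EXT-SRV-ATTR:host column.
server_host() {
    openstack server show "$1" -f value -c OS-EXT-SRV-ATTR:host
}

# Start a live migration with the command from this report, then poll
# until the server is reported on the target host.
# Arguments: server id, target host, optional number of polls (default 60).
# POLL_INTERVAL (hypothetical variable) sets the seconds between polls.
live_migrate_and_wait() {
    local server="$1"
    local target="$2"
    local tries="${3:-60}"

    openstack server migrate --live "$target" "$server"

    while [ "$tries" -gt 0 ]; do
        if [ "$(server_host "$server")" = "$target" ]; then
            return 0
        fi
        sleep "${POLL_INTERVAL:-5}"
        tries=$((tries - 1))
    done

    echo "timed out waiting for $server to reach $target" >&2
    return 1
}
```

Example call, using the IDs from this report:
live_migrate_and_wait d3d45afb-e913-4cb7-89df-a1c1d51d6339 compute003

To reproduce the failure, do not ping between the VMs until this
function has returned.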


Expected result
===============

Network should function correctly after a migration - pings should work,
for example, between VMs.


Actual result
=============

Testing with VM to VM pings:  pings are lost and connectivity "never"
resumes.  I deployed the 2 VMs, migrated one of them, started a ping
from one VM to the other, and waited 16+ minutes; the pings were still
failing.

Perform a live migrate of VM #2 back to the original host using:
openstack server migrate --live compute002 d3d45afb-e913-4cb7-89df-a1c1d51d6339

and pings start to work again.

Perform a live migrate of VM #2 to the same host as VM #1 and pings
between VMs "also" work!
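
To check whether connectivity ever resumes after a migration, a loop
like the one below could be run inside VM #2.  It is a sketch only: the
function name is hypothetical, the 960 second default simply mirrors the
16 minute wait described above, and it assumes a Linux guest with
iputils ping (-c for count, -W for the reply timeout in seconds).

```shell
# Ping a target once per interval until it answers or the deadline passes.
# Arguments: target IP, optional deadline in seconds (default 960),
# optional interval in seconds between attempts (default 5).
# Returns 0 on the first reply, 1 if no reply arrived before the deadline.
wait_for_ping() {
    local target="$1"
    local deadline_s="${2:-960}"
    local interval_s="${3:-5}"
    local start
    local now

    start=$(date +%s)
    now=$start
    while [ $((now - start)) -lt "$deadline_s" ]; do
        if ping -c 1 -W 2 "$target" >/dev/null 2>&1; then
            echo "reply from $target after $((now - start))s"
            return 0
        fi
        sleep "$interval_s"
        now=$(date +%s)
    done

    echo "no reply from $target within ${deadline_s}s"
    return 1
}
```

Example call, where 10.0.0.5 stands in for the internal address of
VM #1 (a placeholder, not an address from this deployment):
wait_for_ping 10.0.0.5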


Environment
===========

stable/rocky deployment with Kolla-Ansible 7.0.0.0rc3devXX (the latest
as of October 15th, 2018) and Kolla 7.0.0.0rc3devXX

CentOS 7.5 with latest updates as of October 15, 2018.

Kernel:  Linux 4.18.14-1.el7.elrepo.x86_64

Hypervisor:  KVM

Storage:  Blockbridge (unsupported, but functions the same as other
iSCSI-based backends)

Networking:  DVR with Open vSwitch

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1798690

Title:
  Live migrate of iscsi-backed VM loses internal network connectivity

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1798690/+subscriptions
Follow ups

[Bug 1798690] Re: Live migrate of iscsi-backed VM loses internal network connectivity
From: sean mooney, 2018-11-06
[Bug 1798690] Re: Live migrate of iscsi-backed VM loses internal network connectivity
From: Eric Miller, 2018-10-28