

[Bug 1880389] Re: lost net connection when live migration

 

Regarding the VM resuming before the 'brctl addif':

the libvirt XML we generate contains the name of the Linux bridge the tap should be added
to, so the Linux Bridge agent does not need to actually run the 'brctl addif' command for
the tap itself; the tap should already be a member of that bridge when the VM is resumed.
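
For illustration, the relevant part of the domain XML looks roughly like
this (the MAC, bridge and tap names are taken from the os-vif log below;
the element layout is the standard libvirt bridge-type interface, shown
here as a sketch rather than the exact XML nova produces):

    <interface type="bridge">
      <mac address="fa:16:3e:e1:50:ac"/>
      <source bridge="brq49b34298-a8"/>
      <target dev="tapb3526533-dc"/>
    </interface>

With a bridge-type interface libvirt creates the tap and attaches it to
brq49b34298-a8 itself when the domain is started on the destination.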

It looks like the VM paused on the source compute node at
2020-05-26 14:56:05.246 7 INFO nova.compute.manager [req-fe4495ae-f1a7-4e93-871e-4d034098babd - - - - -] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] VM Paused (Lifecycle Event)

and it resumed on the dest at 
2020-05-26 14:56:05.303 6 INFO nova.compute.manager [req-8cd4842d-4783-42b3-9a8e-c4e757b8e6f0 - - - - -] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d] VM Resumed (Lifecycle Event)

previously at  2020-05-26 14:55:54.639 6 DEBUG nova.virt.libvirt.driver
[req-28bba2f2-89d9-4cbb-8b6a-7a9690469c86
b114d7969c0e465fbd15c2911ca4bb23 28e6517b7d6d4064be1bc878b590c40c -
default default] [instance: 7f050d9a-413c-4143-849b-75f931a2c07d]
Plugging VIFs before live migration. pre_live_migration
/var/lib/kolla/venv/lib/python2.7/site-
packages/nova/virt/libvirt/driver.py:7621

On the dest node we had started pre-plugging the network backend, which
successfully completed at

2020-05-26 14:55:58.769 6 INFO os_vif [req-28bba2f2-89d9-4cbb-8b6a-
7a9690469c86 b114d7969c0e465fbd15c2911ca4bb23
28e6517b7d6d4064be1bc878b590c40c - default default] Successfully plugged
vif
VIFBridge(active=True,address=fa:16:3e:e1:50:ac,bridge_name='brq49b34298-a8',has_traffic_filtering=True,id=b3526533-dc6a-4174
-bd3b-c300e78eda62,network=Network(49b34298-a85a-
42a9-b264-b3a9242fef8f),plugin='linux_bridge',port_profile=,preserve_on_delete=False,vif_name='tapb3526533-dc')

This happens before we call libvirt to migrate the instance, so at this
point os-vif has ensured the Linux bridge "brq49b34298-a8" is created so
that when the VM starts the tap is created directly in the correct
bridge.
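
A minimal sketch of what that pre-plug step amounts to (illustrative
only, not os-vif's actual code; the helper name and the use of the ip
command here are my own):

    # Python sketch: make sure the per-network Linux bridge exists on the
    # destination before libvirt defines the domain, so the tap created by
    # libvirt lands in the right bridge.
    import subprocess

    def ensure_bridge(bridge_name):
        # 'ip link show' exits non-zero when the device does not exist.
        exists = subprocess.call(['ip', 'link', 'show', bridge_name]) == 0
        if not exists:
            subprocess.check_call(
                ['ip', 'link', 'add', bridge_name, 'type', 'bridge'])
        subprocess.check_call(['ip', 'link', 'set', bridge_name, 'up'])

    ensure_bridge('brq49b34298-a8')  # bridge name from the log above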

at 2020-05-26 14:56:02.172 7 DEBUG nova.compute.manager [req-
40866f62-6362-4e9a-910e-365c3452d29f 030ec97d13dd4d9698209595a7ac01c4
ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-
413c-4143-849b-75f931a2c07d] Received event network-
changed-b3526533-dc6a-4174-bd3b-c300e78eda62 external_instance_event
/var/lib/kolla/venv/lib/python2.7/site-
packages/nova/compute/manager.py:8050

i.e. we received a network-changed event.

and then at 2020-05-26 14:56:04.314 7 DEBUG nova.compute.manager [req-
379d2410-5959-45ae-89ce-649dca3ed666 030ec97d13dd4d9698209595a7ac01c4
ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-
413c-4143-849b-75f931a2c07d] Received event network-vif-
plugged-b3526533-dc6a-4174-bd3b-c300e78eda62 external_instance_event
/var/lib/kolla/venv/lib/python2.7/site-
packages/nova/compute/manager.py:8050

we receive a network-vif-plugged event, which should ideally only be sent
by the ML2 driver when the L2 agent has finished wiring up the
networking on the destination node.

As you pointed out, the L2 agent does not finish adding the VLAN
sub-interface to the correct bridge until

2020-05-26 14:56:25.743 6 DEBUG neutron.agent.linux.utils [req-
fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Running command:
['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'brctl',
'addif', 'brq49b34298-a8', 'p2p2.64'] create_process
/var/lib/kolla/venv/lib/python2.7/site-
packages/neutron/agent/linux/utils.py:87


19 seconds later 

So I think the issue is that the Linux Bridge ML2 driver is not sending
plug-time network-vif-plugged events but is instead sending bind-time
events.

We wait for the networking to be configured here
https://github.com/openstack/nova/blob/stable/queens/nova/compute/manager.py#L6420-L6425
which waits for the network-vif-plugged event I showed in the log
https://github.com/openstack/nova/blob/bea91b8d58d909852949726296149d93f2c639d5/nova/compute/manager.py#L6352-L6362

before actually starting the migration here
https://github.com/openstack/nova/blob/stable/queens/nova/compute/manager.py#L6467-L6470
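
The shape of that wait is roughly the following (a simplified sketch of
the pattern, not Nova's actual implementation; the class and method
names are made up):

    # Python sketch: hold the live migration until Neutron reports the
    # VIF as plugged on the destination, or give up after the timeout.
    import threading

    class VifPluggedWaiter(object):
        def __init__(self):
            self._plugged = threading.Event()

        def on_external_event(self, event_name, port_id):
            # Called when Neutron delivers an external event to Nova.
            if event_name == 'network-vif-plugged':
                self._plugged.set()

        def wait_then_migrate(self, start_migration, timeout=120):
            # timeout mirrors vif_plugging_timeout=120 from the reporter's
            # config; whether a timeout aborts the migration is itself
            # configurable.
            if not self._plugged.wait(timeout):
                raise RuntimeError('timed out waiting for network-vif-plugged')
            start_migration()

The whole point of that wait is defeated if the event is emitted at bind
time rather than once the L2 agent has finished wiring up the port.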

The Linux Bridge L2 agent should only notify Nova that the interface is
plugged when the tap is fully wired up:

https://github.com/openstack/neutron/blob/4acc6843e849e98cd04a6d01861555c3e120f081/neutron/plugins/ml2/drivers/agent/_common_agent.py#L303-L306
but as the comment suggests, this behavior is racy:

https://github.com/openstack/neutron/blob/4acc6843e849e98cd04a6d01861555c3e120f081/neutron/plugins/ml2/drivers/agent/_common_agent.py#L259-L296

In this case it started ensuring the bridge had connectivity to the physical network at
2020-05-26 14:56:17.576 6 DEBUG neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent [req-fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Creating subinterface p2p2.64 for VLAN 64 on interface p2p2 ensure_vlan /var/lib/kolla/venv/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:303

which is long after the vm resumed at 2020-05-26 14:56:05.303

So with that in mind I think this is a Neutron bug: it is reporting
the network is plugged before it actually is.

I should note that the network-vif-plugged event I noted in the log was
"unexpected":

2020-05-26 14:56:04.316 7 WARNING nova.compute.manager [req-
379d2410-5959-45ae-89ce-649dca3ed666 030ec97d13dd4d9698209595a7ac01c4
ef210c7d6b2146139a9c94ef790081d8 - default default] [instance: 7f050d9a-
413c-4143-849b-75f931a2c07d] Received unexpected event network-vif-
plugged-b3526533-dc6a-4174-bd3b-c300e78eda62 for instance with vm_state
active and task_state migrating.

meaning that Nova had already received a network-vif-plugged event
previously, when it was waiting for the networking on the destination to
be completed, but I think that unexpected plug event was the actual event
that should have started the migration.

That is still significantly before Neutron actually finished wiring up
the networking, which completed at 2020-05-26 14:56:21.061 6 DEBUG
neutron.plugins.ml2.drivers.linuxbridge.agent.linuxbridge_neutron_agent
[req-fcca2dcc-5578-4827-ae05-d10935d35223 - - - - -] Done creating
subinterface p2p2.64 ensure_vlan /var/lib/kolla/venv/lib/python2.7/site-
packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:317

Neutron should not have sent any network-vif-plugged event until at
least the physical interface's VLAN sub-interface was added to the bridge.
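
In other words the ordering the agent should guarantee is roughly this
(an illustrative Python sketch only; the helper names and the simplified
signature of update_device_up are mine, with update_device_up standing
in for the RPC that ultimately results in the network-vif-plugged
notification):

    # Sketch of the ordering argued for above.
    def wire_up_port(agent, device, network):
        bridge = agent.ensure_bridge(network)              # create brqXXXX
        agent.ensure_physical_in_bridge(bridge, network)   # e.g. add p2p2.64
        agent.add_tap_to_bridge(bridge, device)            # brctl addif ... tap
        # Only after the port can actually pass traffic should Neutron be
        # told the device is up, which is what triggers network-vif-plugged
        # towards Nova.
        agent.plugin_rpc.update_device_up(device)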

I don't think this is actually a Nova bug but suspect it's a Neutron bug,
likely due to the changes that were introduced for multiple port binding.

** Also affects: neutron
   Importance: Undecided
       Status: New

** Changed in: nova
       Status: Incomplete => New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1880389

Title:
  lost net connection  when live migration

Status in neutron:
  New
Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========
  I find the VM loses its net connection during live migration.

  I enabled the live_migration_wait_for_vif_plug and set
  vif_plugging_timeout=120.

  My OpenStack version is Queens, and I use Linux Bridge as my L2
  plugin. The physical adapters of my hosts are bonded (bond0).

  When the live migration is done, I find the ping to the VM is broken,
  and when I use tcpdump to capture the packets, I find the packets from
  the switch still reach the destination host.

  
  Steps to reproduce
  ==================
  1. enable live_migration_wait_for_vif_plug = True and vif_plugging_timeout = 120 (remember to restart the linuxbridge-agent on the source and destination hosts)
  2. create a new VM, and then start pinging the VM created in this step
  3. do the live migration. (before the live migration you need to make sure the VLAN sub-interface of the network the VM is attached to is not present on the dest host.)
  4. during the live migration, you will find the ping to the VM is broken.

  Expected result
  ===============
  the ping should not break during the live migration.

  Actual result
  =============
  the ping was broken, and the ping packets from the physical switch were still sent to the source.

  
  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
    list for all releases: http://docs.openstack.org/releases/

     queens

  2. Which hypervisor did you use?
      
     Libvirt + KVM

  3. Which storage type did you use?
     ceph
     

  4. Which networking type did you use?
     Neutron with LinuxBridge
     the network type of the VM is VLAN.


  And I found that the 5 RARP packets from the VM are sent before the
  VLAN sub-interface is inserted into the Linux bridge during the live
  migration; maybe this will help confirm the problem I found.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1880389/+subscriptions

